wordle pt.2 - entropy

Jul 05 2025

(this post requires some knowledge of the information theory that was covered in a previous post)

okay! let's take a look at two games from two different friends of mine, whom we will completely randomly name felix and pam.

pam's game

felix's game

both games took 4 guesses, but do you see how much luckier of a guess moldy was for felix than it was for pam? let's look at the information gained at each guess

pam's game

felix's game

you can see how in pam's game, the final guess yielded only 1 Sh, while in felix's it gave a whopping 5.49 - felix had a much wider space of possible answers left. but wouldn't you also say that worth is a better guess than foldy? if we didn't know the answer, you'd suppose that letters like t and h are much more likely to appear than d and y, yet the information-gain reading says that pam's guess was far better than felix's. pam got lucky. is there any way to measure that?

entropy

yeah, what'd you think lol. anyways, if you recall how our system works, you might notice a flaw - we only run the checks on the resulting pattern, but the player can't actually see that pattern before they guess! we have to figure out a way to measure the expected future information gain. the way this is done is actually quite simple - you just run the guess against every possible answer word and average the information you'd get back. that average is the entropy (at least in information theory). a neat thing you can do after that is find out how much information you actually gained, and see how that differs from what you expected. that's basically how lucky you got.
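if code is easier to follow than prose, here's a rough python sketch of the idea. the function names and the simplified pattern function are mine (real wordle handles repeated letters a bit more carefully), so treat it as an illustration rather than the real thing:

```python
from collections import Counter
from math import log2

def pattern(guess: str, answer: str) -> tuple:
    # simplified wordle feedback: 2 = green, 1 = yellow, 0 = gray
    # (real wordle treats repeated letters more carefully - good enough for a sketch)
    return tuple(
        2 if answer[i] == letter else 1 if letter in answer else 0
        for i, letter in enumerate(guess)
    )

def entropy(guess: str, candidates: list[str]) -> float:
    # expected information: the average of I(pattern) over every candidate answer
    counts = Counter(pattern(guess, answer) for answer in candidates)
    total = len(candidates)
    return sum((count / total) * log2(total / count) for count in counts.values())

def information(guess: str, actual: tuple, candidates: list[str]) -> float:
    # actual information: how rare the pattern we really got was
    matching = sum(pattern(guess, answer) == actual for answer in candidates)
    return log2(len(candidates) / matching)

# "luck" is then just information(...) minus entropy(...)
```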

to recap, in this scenario, the entropy of your guess is how much information gain you can expect, the information is how much you actually got, and the difference is how "lucky" you got (better or worse than expected). cue the charts

pam's game

felix's game

each chart has a distribution of the probabilities of the possible resulting patterns, the actual information gained, the expected information (the entropy), and the difference between the two.

you can see how this time, even though foldy got more information, worth was expected to give more - it just got unlucky. you could say that felix's guess was more skilled than pam's, and that pam was actually consistently luckier than felix. on that guess in particular, felix got 1.78 Sh less than expected, while pam got 1.57 Sh more!

the block thingy above the text

the weird shape above the actual information, or the "distribution of the probabilities of possible resulting patterns" as i called it 2 paragraphs ago, is, well, pretty much that. to get the entropy of a guess, you just have to run it against all possible answers (since you don't actually know the answer in this scenario). back to our example where we only had to guess between phone, cobra, sling, and night, and we put radar as a guess: to calculate its entropy, we just have to average the information gain for each of the possible words. so, for phone, sling, and night, the pattern will be all grays. this means that we would've gained ~0.42 Sh if the answer was any of those words, but if the word was cobra (and only cobra), we'd have a yellow letter on the first a, giving us a whopping 2 Sh of information. now we just average it! (3*0.42 + 2)/4 = ~0.81 Sh of entropy.
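as a quick sanity check, here's that same average written out as a tiny bit of python (just the arithmetic from above, nothing more):

```python
from math import log2

# 3 of the 4 candidates (phone, sling, night) give the all-gray pattern,
# and only cobra gives its own unique pattern
p_gray, p_cobra = 3 / 4, 1 / 4
entropy = p_gray * log2(1 / p_gray) + p_cobra * log2(1 / p_cobra)
print(round(entropy, 2))  # 0.81 Sh
```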

that shape is a histogram of every possible resulting pattern from every possible answer word, and the highlighted part is what we actually got. but because lower probability means higher information gain (if we got a rare pattern, we'd get more information out of it), we just display the patterns from lowest (most probable) to highest (least probable) information (there's a small code sketch of this after the walkthrough below). to build up some more intuition, here's also my game run through this technique

let's start at the bottom. there are 2 bars, the taller one being highlighted. in this case the possible words left are corse, dorse, gorse, worse, and zorse. guessing worse basically has 2 outcomes (just like the previous example scenario) - either the answer is worse, or it isn't. the former gives us more information (it's a rarer outcome), so it's the taller bar, and it's highlighted because that's what we got. going up, the smaller bar is highlighted because we got the higher-probability (lower-information) outcome - that the word is not horse. going up, there's an even more complicated distribution, and it seems like we got one of the two lowest-information patterns. the other one looked like this

you'd think that having one less green square highlighted would give less information, but if you remember, we're looking at how many words were eliminated (or remain, rather), and in both cases 6 possible words remain. then there's another bar, for a pattern that leaves only 2 possible words, and the flat bit at the end is all the patterns that would give maximum information - the ones that leave only one word, for instance this pattern

would only leave mouse as a possible word. actually, another curious thing is that by all of our logic, a pattern that's all greens lands in the same category, since it also leaves only one word (morse). going up again, now the distribution is smaller, yielding less entropy. you can think of it like this: because the distribution is smaller, everything in it has a higher probability, so on average everything will give lower information. the top two distributions are pretty self-explanatory.
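to put rough numbers on that very first (bottom) distribution, going off the description above rather than the actual plot data:

```python
from math import log2

# 5 words left (corse, dorse, gorse, worse, zorse), and worse is guessed:
# either the answer is worse (1 word), or it isn't (the other 4)
print(log2(5 / 1))  # ≈ 2.32 Sh - the tall, rarer bar
print(log2(5 / 4))  # ≈ 0.32 Sh - the short, likely bar
# expected: 1/5 * 2.32 + 4/5 * 0.32 ≈ 0.72 Sh of entropy for that guess
```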

a small thing to note is that because our word list is much bigger than the set of possible answers, our luck is going to be skewed a bit downwards, since the size of the word list gives more room for variation (you can sort of think about this through the lens of the law of large numbers).
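oh, and that code sketch i promised for the shapes - here's roughly how you'd build one of those histograms. pattern_distribution is my name for it, and feedback is the same sort of pattern function as in the earlier sketch:

```python
from collections import Counter
from math import log2

def pattern_distribution(guess, candidates, feedback):
    # feedback(guess, answer) -> the colour pattern, like the pattern() sketch earlier
    counts = Counter(feedback(guess, answer) for answer in candidates)
    total = len(candidates)
    bars = [(pat, count / total, log2(total / count)) for pat, count in counts.items()]
    # display from most probable (least information) to least probable (most information)
    bars.sort(key=lambda bar: bar[2])
    return bars  # list of (pattern, probability, information in Sh)
```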

maths

muehehe i installed a maths extension to my blog engine so you will now get to see some formulae :3

to start off, we define some message x (remember, in information theory we talk about messages/events, in our case the wordle patterns) and its probability p(x). to calculate the information content I(x) in bits of this message, you just have to take the log of the inverse of the probability

$$I(x) = -\log_2 p(x) = \log_2\frac{1}{p(x)}$$

actually, that base could be anything. so far i've been using bits because they're the most common, but also they're a bit more intuitive if you know binary. given a number with 8 bits, let's say, having 256 values, the probability of any message (number) is $p(x) = \frac{1}{256}$, meaning that the information content of any message would be $\log_2\left(\frac{1}{1/256}\right) = \log_2(256) = 8$ bits (shannons)!

the entropy H of a message space X (that would be all the possible resulting patterns in our case) is the expected value (weighted average) 𝔼 of the individual information content I(x), and we define it like this

$$H(X) = \mathbb{E}[I(x)] = \frac{1}{n}\sum_{x \in X} n \cdot p(x) \cdot I(x)$$

where n is the size of X. n·p(x) is just a way to say "the amount of times this message has occurred", and 1/n is just a way to divide the whole thing by n. but you might've already noticed, those cancel out!

$$\frac{1}{n}\sum_{x \in X} n \cdot p(x) \cdot I(x) = \sum_{x \in X} p(x) \cdot I(x) = \sum_{x \in X} p(x)\log\frac{1}{p(x)} = -\sum_{x \in X} p(x)\log p(x)$$

behold, the formula for entropy, as defined by claude shannon himself! (see here, page 13). readers are free to re-check all of this for errors - i'm proud of my LaTeX, but still not that confident in my maths skills x3
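and just to tie it back to wordle, here's that formula as a couple of lines of python, fed the two distributions from earlier (a little sketch of mine, nothing rigorous):

```python
from math import log2

def H(probabilities):
    # shannon entropy: the sum of p * log2(1/p) over the whole distribution
    return sum(p * log2(1 / p) for p in probabilities if p > 0)

print(round(H([3/4, 1/4]), 2))     # the radar example from earlier: 0.81 Sh
print(round(H([1/256] * 256), 2))  # the 8-bit example: 8.0 Sh
```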

anyways, that's been all from me, toodles!


© nicole