A Short Introduction to Entropy, Cross-Entropy and KL-Divergence | Aurelien Geron

March 10, 2018, 8:50 a.m. By: Kirti Bakshi


In this video by Aurélien Géron, you will learn what entropy, cross-entropy and KL-Divergence actually are. In Machine Learning, cross-entropy is very commonly used as a cost function when training classifiers, and we will see why that is.

These concepts come from Claude Shannon's Information Theory. Shannon was an American mathematician, electrical engineer, and cryptographer. In his 1948 paper, "A Mathematical Theory of Communication", he founded what is now known as information theory. The link to the paper is given at the end.

The goal is to reliably and efficiently transmit a message from a sender to a recipient. In our digital age, messages are composed of bits. A bit is a number that is equal to 0 or 1. But not all bits are useful: some are redundant, some are errors, and so on. So when we communicate a message, we want as much useful information as possible to get through.

To give you a sense of what the video covers, we will go through entropy in detail so that you can better understand its counterparts in the video.

Moving onto what Entropy is:

In Shannon's theory, to transmit one bit of information means to reduce the recipient's uncertainty by a factor of 2. We will understand this with the help of an example.

Let's say the weather is completely random, with a 50/50 chance of being either sunny or rainy every day. If a weather station tells you that it's going to be rainy tomorrow, then they have actually reduced your uncertainty by a factor of two. There were two equally likely options, and now there is just one.

So the weather station did actually send you a single bit of useful information. And this is true no matter how they encoded this information. If they encoded it as a string of 5 characters, each encoded on 1 byte, then they actually sent you a 40-bit message, but they still only communicated 1 bit of useful information.

Now suppose the weather actually has 8 possible states, all equally likely. When the weather station gives you tomorrow's weather, they are dividing your uncertainty by a factor of 8, which is 2 to the power of 3. So they sent you 3 bits of useful information. It is easy to find the number of bits of information that were actually communicated: compute the binary logarithm of the uncertainty reduction factor, which in this example is 8. But what if the possibilities are not equally likely? Say there is a 75% chance of sun and a 25% chance of rain. If the weather station tells you it is going to be rainy tomorrow, then your uncertainty has dropped by a factor of 4, which is 2 bits of information.

The uncertainty reduction is just the inverse of the event's probability, in this case, the inverse of 25% is 4.

Now, since the log of 1/x is equal to -log(x), the equation to compute the number of bits simplifies to minus the binary log of the probability, in this case 25%. If instead the weather station tells you it's going to be sunny tomorrow, your uncertainty hasn't dropped by much: you get just over 0.41 bits of information. So how much information are you actually going to get from the weather station, on average?
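The calculation above can be sketched in a few lines of Python (the function name `bits_of_information` is just an illustrative choice):

```python
import math

def bits_of_information(p):
    """Bits of information conveyed by an event with probability p.

    The uncertainty reduction factor is 1/p, so the number of bits
    is log2(1/p), which simplifies to -log2(p).
    """
    return -math.log2(p)

print(bits_of_information(0.25))  # rainy, 25% likely: 2.0 bits
print(bits_of_information(0.75))  # sunny, 75% likely: ~0.415 bits
```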

There is a 75% chance that it will be sunny tomorrow, so that is what the weather station will tell you, and that gives you about 0.41 bits of information.

Then there is also a 25% chance that it will be rainy, in which case the weather station will tell you so, and this will give you 2 bits of information. So, on average, you will get about 0.81 bits of information from the weather station every day.

So what we just computed is the entropy: it is a nice measure of how uncertain the events are.

The entropy equation should now make complete sense. It measures the average amount of information that you get when you learn the weather each day, or, more generally, the average amount of information that you get from one sample drawn from a given probability distribution p. It tells you how unpredictable that probability distribution is.
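As a minimal sketch, the entropy of a distribution is the probability-weighted average of the bits of information of each outcome, which reproduces the 0.81 bits from the weather example:

```python
import math

def entropy(probs):
    """Entropy in bits: sum over outcomes of -p * log2(p),
    i.e. the average information of one sample from the distribution."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.75, 0.25]))  # ~0.811 bits: the 75/25 weather example
print(entropy([0.5, 0.5]))    # 1.0 bit: 50/50 sunny or rainy
print(entropy([1/8] * 8))     # 3.0 bits: 8 equally likely weather states
```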

That was all about entropy. Now that you know what it is, follow the link below to see where cross-entropy and KL-Divergence come from and why we use them in ML.

"A mathematical theory of communication", Claude E. Shannon, 1948: Click Here


Video Source: Aurélien Géron