From Entropy to KL Divergence: A Comprehensive Guide
KL Divergence, one of the most important concepts in information theory and statistics, measures the difference between two probability distributions. It quantifies how much information is lost when one distribution is used to approximate another. It is widely used in various fields, including machine learning, data science, and artificial intelligence. In this blog post, we will explore the concept of KL Divergence, its mathematical formulation, and its applications.
First, let’s start with the concept of entropy, which is the foundation of KL Divergence.
1 Entropy
Entropy is a measure of uncertainty or randomness in a probability distribution. It quantifies the average amount of information required to describe the outcome of a random variable. The entropy \(H(P)\) of a discrete probability distribution \(P\) is defined as:
\[ H(P) = -\sum_{x} P(x) \log P(x) \tag{1}\]
where the sum is taken over all possible outcomes \(x\) of the random variable.
For the continuous case, the (differential) entropy is defined as: \[ H(P) = -\int P(x) \log P(x) dx \tag{2}\]
where the integral is taken over the entire support of the random variable.
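As a quick sanity check (a minimal sketch; the distribution values below are just an illustrative example), the discrete entropy in Equation 1 can be computed directly with NumPy, and SciPy's `entropy` function gives the same result:

```python
import numpy as np
from scipy.stats import entropy

# An example distribution over four outcomes.
p = np.array([0.5, 0.25, 0.125, 0.125])

# Direct implementation of Equation 1 (natural log, so the result is in nats).
h_manual = -np.sum(p * np.log(p))

# scipy.stats.entropy uses the natural log by default as well.
h_scipy = entropy(p)

print(h_manual, h_scipy)  # both ~1.2130 nats
```

Using `np.log2` instead would report the entropy in bits; for this example the value is exactly 1.75 bits.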
2 Cross-Entropy
Cross-entropy measures the average amount of information (in bits or nats) needed to encode outcomes drawn from the true distribution \(P\) when the code is based on a predicted distribution \(Q\) instead. The cross-entropy \(H(P, Q)\) is defined as: \[ H(P, Q) = -\sum_{x} P(x) \log Q(x) \tag{3}\]
For the continuous case, the cross-entropy is defined as: \[ H(P, Q) = -\int P(x) \log Q(x) dx \tag{4}\]
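Continuing the same toy example (again a sketch, with an arbitrary model distribution \(Q\)), we can check that the cross-entropy of Equation 3 is never smaller than the entropy of \(P\):

```python
import numpy as np

# True distribution P and a (hypothetical) model distribution Q.
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

# Cross-entropy H(P, Q) = -sum_x P(x) log Q(x), in nats.
h_pq = -np.sum(p * np.log(q))

# Entropy H(P) for comparison: H(P, Q) >= H(P), with equality iff Q = P.
h_p = -np.sum(p * np.log(p))

print(h_pq, h_p)  # ~1.3863 vs ~1.2130 nats
```

The gap between the two, about 0.173 nats here, is exactly the KL Divergence defined in the next section.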
3 KL Divergence
KL Divergence measures the difference between two probability distributions \(P\) and \(Q\). It quantifies the amount of information lost when \(Q\) is used to approximate \(P\). The KL Divergence \(D_{KL}(P || Q)\) is defined as: \[ D_{KL}(P || Q) = \mathbb{E}_{x \sim P} \left[ \log \frac{P(x)}{Q(x)} \right] \]
For the discrete case, the KL Divergence is defined as: \[ D_{KL}(P || Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \tag{5}\]
For the continuous case, the KL Divergence is defined as:
\[ D_{KL}(P || Q) = \int P(x) \log \frac{P(x)}{Q(x)} dx \tag{6}\]
One thing to note about KL Divergence is that it is asymmetric, meaning that \(D_{KL}(P || Q) \neq D_{KL}(Q || P)\). This asymmetry reflects the fact that KL Divergence measures the information lost when approximating \(P\) with \(Q\), and not vice versa.

Another important property of KL Divergence is that it is always non-negative, i.e., \(D_{KL}(P || Q) \geq 0\), with equality if and only if \(P = Q\) almost everywhere. This property is known as Gibbs’ inequality, and we can prove it using Jensen’s inequality applied to the convex function \(-\log\): \[ D_{KL}(P || Q) = \mathbb{E}_{x \sim P} \left[ -\log \frac{Q(x)}{P(x)} \right] \geq -\log \mathbb{E}_{x \sim P} \left[ \frac{Q(x)}{P(x)} \right] = -\log \sum_{x} Q(x) = -\log 1 = 0 \]
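Both properties are easy to verify numerically (a minimal sketch with arbitrary example distributions; `scipy.stats.entropy` computes the KL Divergence when given two arguments):

```python
import numpy as np
from scipy.stats import entropy

p = np.array([0.8, 0.1, 0.1])
q = np.array([0.4, 0.4, 0.2])

# Equation 5, implemented directly.
kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))

print(kl_pq, entropy(p, q))    # ~0.3466, same value from SciPy
print(kl_qp)                   # ~0.4159: D_KL(P || Q) != D_KL(Q || P)
print(kl_pq >= 0, kl_qp >= 0)  # non-negativity (Gibbs' inequality)
```

Note also that \(D_{KL}(P || Q) = H(P, Q) - H(P)\), which connects the three quantities introduced so far.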
3.1 Monte Carlo Estimation of KL Divergence
In practice, we often cannot compute the expectation defining \(D_{KL}(P || Q)\) in closed form, but we can draw samples from \(P\) and evaluate the density ratio \(P(x)/Q(x)\) (or its logarithm) pointwise. In this case, we can estimate the KL Divergence using Monte Carlo sampling: \[ D_{KL}(P || Q) \approx \frac{1}{N} \sum_{i=1}^{N} \log \frac{P(x_i)}{Q(x_i)}, \qquad x_i \sim P \]
This is an unbiased estimator of the KL Divergence based on samples from \(P\). However, it may have high variance, depending on the number of samples and on how different the two distributions are.
To reduce variance, we can use importance sampling: \[ D_{KL}(P || Q) = \mathbb{E}_{x \sim Q} \left[ \frac{P(x)}{Q(x)} \log \frac{P(x)}{Q(x)} \right] \approx \frac{1}{N} \sum_{i=1}^{N} \frac{P(x_i)}{Q(x_i)} \log \frac{P(x_i)}{Q(x_i)}, \qquad x_i \sim Q \]
This estimator uses samples from distribution \(Q\) and weights them by the ratio of the densities under \(P\) and \(Q\); when \(Q\) covers the regions that dominate the divergence, this can yield a lower-variance estimate.
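The sketch below illustrates both estimators on two hypothetical 1-D Gaussians, for which the KL Divergence has a closed form we can compare against:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical example: P = N(0, 1), Q = N(1, 1).
p, q = norm(0.0, 1.0), norm(1.0, 1.0)

# Closed-form KL between two Gaussians, used only as a reference value:
# log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2 = 0.5 here.
kl_exact = 0.5

N = 100_000

# Naive Monte Carlo: average log P(x)/Q(x) over samples x ~ P.
x_p = rng.normal(0.0, 1.0, N)
kl_mc = np.mean(p.logpdf(x_p) - q.logpdf(x_p))

# Importance sampling: samples x ~ Q, weighted by P(x)/Q(x).
x_q = rng.normal(1.0, 1.0, N)
w = np.exp(p.logpdf(x_q) - q.logpdf(x_q))
kl_is = np.mean(w * (p.logpdf(x_q) - q.logpdf(x_q)))

print(kl_exact, kl_mc, kl_is)  # all close to 0.5
```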
Another variance reduction technique is the use of control variates, described next.
3.2 Control Variates
Control variates is a variance reduction technique that uses an auxiliary function \(g\), correlated with the quantity being estimated and with a known expected value, to reduce the variance of an estimator. In the context of KL Divergence estimation, with samples \(x_i \sim P\) we can use:
\[ D_{KL}(P || Q) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \log \frac{P(x_i)}{Q(x_i)} - c \left( g(x_i) - \mathbb{E}_{x \sim P}[g(x)] \right) \right) \]
where \(c\) is a coefficient chosen to minimize the variance. Because the subtracted term has mean zero, the estimator remains unbiased for any fixed \(c\).
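Continuing the Gaussian example, here is a minimal sketch with the specific (common, but not unique) choice \(g(x) = Q(x)/P(x)\), whose expectation under \(P\) is exactly 1, and \(c = -1\):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p, q = norm(0.0, 1.0), norm(1.0, 1.0)    # same P and Q as above; true KL = 0.5

N = 100_000
x = rng.normal(0.0, 1.0, N)              # samples x ~ P
log_ratio = p.logpdf(x) - q.logpdf(x)    # log P(x)/Q(x): the naive per-sample terms

# Control variate g(x) = Q(x)/P(x); E_P[g] = 1 is known exactly.
g = np.exp(-log_ratio)

# With c = -1 each term becomes Q/P - 1 - log(Q/P), which is always >= 0.
c = -1.0
cv_terms = log_ratio - c * (g - 1.0)

print(np.mean(log_ratio), np.mean(cv_terms))  # both estimate KL ~0.5
print(np.var(log_ratio), np.var(cv_terms))    # variance drops (~1.0 -> ~0.72 here)
```

The variance reduction tends to be much larger when \(P\) and \(Q\) are close, which is the typical regime in applications such as policy optimization.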
4 Applications of KL Divergence
KL Divergence has numerous applications in various fields, including:
- Generative Models: KL Divergence is used in training generative models such as Variational Autoencoders (VAEs) to measure the difference between the learned distribution and the true data distribution.
- Model Distillation: KL Divergence is used to transfer knowledge from a large model (teacher) to a smaller model (student) by minimizing the KL Divergence between their output distributions.
- Reinforcement Learning: KL Divergence is used in policy optimization algorithms to ensure that the updated policy does not deviate too much from the previous policy.
4.1 KL Divergence in Generative Models
4.1.1 Variational Autoencoders (VAEs)
In VAEs, KL Divergence is used to regularize the latent space by minimizing the divergence between the approximate posterior distribution and the prior distribution.
\[ \mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) || p(z)) \]
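For the common case where the approximate posterior is a diagonal Gaussian \(q(z|x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))\) and the prior is \(p(z) = \mathcal{N}(0, I)\), this KL term has a closed form. The sketch below (PyTorch, with hypothetical encoder outputs) shows how it is typically computed:

```python
import torch

def vae_kl_term(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Closed-form D_KL(q(z|x) || p(z)) for a diagonal Gaussian posterior
    N(mu, diag(exp(logvar))) against a standard normal prior N(0, I)."""
    # Per-dimension KL: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2),
    # summed over latent dimensions and averaged over the batch.
    kl = 0.5 * (logvar.exp() + mu.pow(2) - 1.0 - logvar).sum(dim=-1)
    return kl.mean()

# Hypothetical encoder outputs for a batch of 8 examples, 16 latent dimensions.
mu = torch.randn(8, 16)
logvar = torch.randn(8, 16)
print(vae_kl_term(mu, logvar))

# The training loss would then be: reconstruction loss + (optionally weighted) KL term,
# i.e. the negative of the objective above, minimized by gradient descent.
```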
4.2 Model Distillation
In model distillation, KL Divergence is used to align the output distributions of the teacher and student models.
\[ \mathcal{L} = D_{KL}(P_{teacher} || P_{student}) \]
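A minimal PyTorch sketch of this loss on a batch of logits; the temperature softening is a common practical addition (following Hinton et al.'s distillation recipe) rather than part of the formula above, and all tensor names here are hypothetical:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """D_KL(P_teacher || P_student) on temperature-softened output distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits / temperature, dim=-1)
    p_teacher = log_p_teacher.exp()
    # D_KL(teacher || student) = sum_x P_t(x) * (log P_t(x) - log P_s(x)),
    # averaged over the batch; the T^2 factor rescales gradients.
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(dim=-1).mean()
    return kl * temperature ** 2

# Hypothetical logits for a batch of 4 examples over 10 classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student
```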
4.3 Reinforcement Learning
In reinforcement learning, KL Divergence is used to constrain policy updates to ensure stability. For example, reinforcement learning from human feedback (RLHF) for LLMs adds a KL penalty to keep the updated policy close to a reference policy.
\[ \mathcal{L} = \mathbb{E}_{(s, a) \sim \pi_{ref}} \left[ \frac{\pi_{new}(a|s)}{\pi_{ref}(a|s)} A(s, a) \right] - \beta D_{KL}(\pi_{new} || \pi_{ref}) \]
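A rough sketch of how this objective might be evaluated on a batch of sampled actions (for example, tokens generated by \(\pi_{ref}\)); `logp_new`, `logp_ref`, `advantages`, and `beta` are hypothetical names, and the KL term is estimated from the same samples rather than computed exactly:

```python
import torch

def kl_regularized_objective(logp_new: torch.Tensor,   # log pi_new(a|s) for sampled (s, a)
                             logp_ref: torch.Tensor,   # log pi_ref(a|s) for the same (s, a)
                             advantages: torch.Tensor, # advantage estimates A(s, a)
                             beta: float = 0.1) -> torch.Tensor:
    ratio = (logp_new - logp_ref).exp()        # pi_new(a|s) / pi_ref(a|s)
    policy_term = (ratio * advantages).mean()  # importance-weighted advantage
    # Importance-weighted estimate of D_KL(pi_new || pi_ref) from samples drawn
    # under pi_ref (see Section 3.1); lower-variance estimators such as the
    # control-variate form from Section 3.2 can be substituted here.
    kl_term = (ratio * (logp_new - logp_ref)).mean()
    return policy_term - beta * kl_term        # objective to maximize

# Hypothetical per-token quantities for a batch of 32 samples.
logp_ref = torch.randn(32)
logp_new = logp_ref + 0.1 * torch.randn(32)
advantages = torch.randn(32)
print(kl_regularized_objective(logp_new, logp_ref, advantages))
```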
5 Conclusion
KL Divergence is a powerful tool for measuring the difference between probability distributions. It has wide-ranging applications in machine learning, data science, and artificial intelligence. Understanding KL Divergence and its applications can help you build better models and improve your understanding of probabilistic systems.