From Entropy to KL Divergence: A Comprehensive Guide

Mathematics
KL Divergence, short for Kullback-Leibler Divergence, is a fundamental concept in information theory and statistics. In this blog post, we will explore the concept of KL Divergence, its mathematical formulation, and its applications in fields such as machine learning, data science, and artificial intelligence. By the end of this post, you will have a solid understanding of KL Divergence and how to apply it in your own projects.
Author

Yuyang Zhang

Published

2025-12-02

Last modified

2025-12-02

KL Divergence, one of the most important concepts in information theory and statistics, measures the difference between two probability distributions. It quantifies how much information is lost when one distribution is used to approximate another. It is widely used in various fields, including machine learning, data science, and artificial intelligence. In this blog post, we will explore the concept of KL Divergence, its mathematical formulation, and its applications.

First, let’s start with the concept of entropy, which is the foundation of KL Divergence.

1 Entropy

Entropy is a measure of uncertainty or randomness in a probability distribution. It quantifies the average amount of information required to describe the outcome of a random variable. The entropy \(H(P)\) of a discrete probability distribution \(P\) is defined as:

\[ H(P) = -\sum_{x} P(x) \log P(x) \tag{1}\]

where the sum is taken over all possible outcomes \(x\) of the random variable.

For the continuous case, the entropy is defined as: \[ H(P) = -\int P(x) \log P(x) dx \tag{2}\]

where the integral is taken over the entire support of the random variable.
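As a quick illustration, here is a minimal sketch of Equation 1 in Python (the helper name `entropy` and the small example distributions are mine, chosen only for illustration):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution (Equation 1), in nats."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps))

# A fair coin is maximally uncertain; a biased coin carries less surprise on average.
print(entropy([0.5, 0.5]))  # ~0.693 nats (= log 2)
print(entropy([0.9, 0.1]))  # ~0.325 nats
```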

2 Cross-Entropy

Cross-entropy measures the difference between two probability distributions. It quantifies the average number of bits needed to identify an event from a set of possibilities, given a predicted probability distribution \(Q\) instead of the true distribution \(P\). The cross-entropy \(H(P, Q)\) is defined as: \[ H(P, Q) = -\sum_{x} P(x) \log Q(x) \tag{3}\]

For the continuous case, the cross-entropy is defined as: \[ H(P, Q) = -\int P(x) \log Q(x) dx \tag{4}\]
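A minimal sketch of Equation 3, reusing the same style of illustrative distributions (again, the names are mine, not from any library):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(P, Q) for discrete distributions (Equation 3), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q + eps))

p = [0.5, 0.5]  # true distribution
q = [0.9, 0.1]  # predicted distribution
print(cross_entropy(p, p))  # ~0.693, equals H(P) when the prediction is exact
print(cross_entropy(p, q))  # ~1.204, the extra cost of coding P with Q's code
```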

3 KL Divergence

KL Divergence measures the difference between two probability distributions \(P\) and \(Q\). It quantifies the amount of information lost when \(Q\) is used to approximate \(P\). The KL Divergence \(D_{KL}(P || Q)\) is defined as: \[ D_{KL}(P || Q) = \mathbb{E}_{x \sim P} \left[ \log \frac{P(x)}{Q(x)} \right] \]

For the discrete case, the KL Divergence is defined as: \[ D_{KL}(P || Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \tag{5}\]

For the continuous case, the KL Divergence is defined as:

\[ D_{KL}(P || Q) = \int P(x) \log \frac{P(x)}{Q(x)} dx \tag{6}\]

One thing to note about KL Divergence is that it is asymmetric, meaning that \(D_{KL}(P || Q) \neq D_{KL}(Q || P)\). This asymmetry reflects the fact that KL Divergence measures the information lost when approximating \(P\) with \(Q\), and not vice versa.

Another important property of KL Divergence is that it is always non-negative, i.e., \(D_{KL}(P || Q) \geq 0\), with equality if and only if \(P = Q\) almost everywhere. This property is a consequence of Gibbs’ inequality. We can prove it using Jensen’s inequality, since \(-\log\) is convex: \[ D_{KL}(P || Q) = \mathbb{E}_{x \sim P} \left[ -\log \frac{Q(x)}{P(x)} \right] \geq -\log \mathbb{E}_{x \sim P} \left[ \frac{Q(x)}{P(x)} \right] = -\log 1 = 0 \]
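Note also that, for discrete distributions, \(D_{KL}(P || Q) = H(P, Q) - H(P)\): the divergence is exactly the extra cross-entropy cost of using \(Q\) in place of \(P\). A minimal sketch of Equation 5 that also illustrates the asymmetry and non-negativity discussed above (the small \(\epsilon\) only guards against \(\log 0\)):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for discrete distributions (Equation 5), in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log((p + eps) / (q + eps)))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # ~0.511 = H(P, Q) - H(P) = 1.204 - 0.693
print(kl_divergence(q, p))  # ~0.368 -> asymmetric: D_KL(P||Q) != D_KL(Q||P)
print(kl_divergence(p, p))  # 0.0   -> zero if and only if P = Q
```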

3.1 Monte Carlo Estimation of KL Divergence

In practice, we often cannot compute the expectation over \(P\) in closed form, but we can draw samples from it and evaluate both densities pointwise. In this case, we can estimate the KL Divergence with Monte Carlo sampling:

\[ D_{KL}(P || Q) \approx \frac{1}{N} \sum_{i=1}^{N} \log \frac{P(x_i)}{Q(x_i)}, \quad x_i \sim P \]

This is an unbiased estimator of KL Divergence based on samples from distribution \(P\). However, it may have high variance, depending on the number of samples and the distributions involved.
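A minimal sketch of this estimator, using two Gaussians whose exact KL is known so the estimate can be sanity-checked (the specific distributions are just an example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two Gaussians with a known closed-form KL: D_KL(N(0,1) || N(1,1)) = 0.5.
p = stats.norm(loc=0.0, scale=1.0)
q = stats.norm(loc=1.0, scale=1.0)

# Naive Monte Carlo estimator: average log P(x) - log Q(x) over samples x ~ P.
x = p.rvs(size=100_000, random_state=rng)
kl_mc = np.mean(p.logpdf(x) - q.logpdf(x))

print(kl_mc)  # close to 0.5
```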

To reduce variance, we can instead use importance sampling with samples drawn from \(Q\):

\[ D_{KL}(P || Q) \approx \frac{1}{N} \sum_{i=1}^{N} \frac{P(x_i)}{Q(x_i)} \log \frac{P(x_i)}{Q(x_i)}, \quad x_i \sim Q \]

This method uses samples from distribution \(Q\) and weights each term by the ratio of probabilities under \(P\) and \(Q\). When \(Q\) places mass where the log-ratio is large, this can yield a lower-variance estimate of KL Divergence.
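The same toy setup, now drawing samples from \(Q\) and reweighting them (a sketch of the importance-sampling estimator above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

p = stats.norm(loc=0.0, scale=1.0)
q = stats.norm(loc=1.0, scale=1.0)

# Importance-sampling estimator: draw x ~ Q, weight each log-ratio by P(x)/Q(x).
x = q.rvs(size=100_000, random_state=rng)
log_ratio = p.logpdf(x) - q.logpdf(x)
kl_is = np.mean(np.exp(log_ratio) * log_ratio)

print(kl_is)  # also close to the true value of 0.5
```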

Another variance-reduction technique is control variates, described in the next subsection.

3.2 Control Variates

Control variates are a variance-reduction technique that pairs an estimator with a correlated quantity whose expected value is known. In the context of KL Divergence estimation, we can use a control variate \(g\) to improve the Monte Carlo estimate:

\[ D_{KL}(P || Q) \approx \frac{1}{N} \sum_{i=1}^{N} \left( \log \frac{P(x_i)}{Q(x_i)} - c \left( g(x_i) - \mathbb{E}[g(X)] \right) \right), \quad x_i \sim P \]

where \(g\) is a function whose expectation under \(P\) is known and \(c\) is a coefficient chosen to minimize the variance of the estimator.
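As a concrete sketch, one common choice (an assumption here, not dictated by the formula above) is \(g(x) = Q(x)/P(x)\), whose expectation under \(P\) equals 1, together with \(c = -1\). The resulting per-sample term \(\log \frac{P(x)}{Q(x)} + \frac{Q(x)}{P(x)} - 1\) is always non-negative and often has noticeably lower variance than the naive estimator:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

p = stats.norm(loc=0.0, scale=1.0)
q = stats.norm(loc=1.0, scale=1.0)

x = p.rvs(size=100_000, random_state=rng)
log_ratio = p.logpdf(x) - q.logpdf(x)   # log P(x)/Q(x) with x ~ P
ratio_qp = np.exp(-log_ratio)           # g(x) = Q(x)/P(x), so E_P[g(X)] = 1

kl_naive = np.mean(log_ratio)                  # plain Monte Carlo
kl_cv = np.mean(log_ratio + (ratio_qp - 1.0))  # control variate with c = -1

print(kl_naive, kl_cv)                                         # both near 0.5
print(np.var(log_ratio), np.var(log_ratio + ratio_qp - 1.0))   # the second is smaller here
```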

4 Applications of KL Divergence

KL Divergence has numerous applications in various fields, including:

  • Generative Models: KL Divergence is used in training generative models such as Variational Autoencoders (VAEs) to measure the difference between the learned distribution and the true data distribution.
  • Model Distillation: KL Divergence is used to transfer knowledge from a large model (teacher) to a smaller model (student) by minimizing the KL Divergence between their output distributions.
  • Reinforcement Learning: KL Divergence is used in policy optimization algorithms to ensure that the updated policy does not deviate too much from the previous policy.

4.1 KL Divergence in Generative Models

4.1.1 Variational Autoencoders (VAEs)

In VAEs, KL Divergence is used to regularize the latent space by minimizing the divergence between the approximate posterior distribution and the prior distribution.

\[ \mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) || p(z)) \]
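For the common case where \(q(z|x)\) is a diagonal Gaussian \(\mathcal{N}(\mu, \sigma^2)\) and the prior \(p(z)\) is a standard normal, the KL term has a closed form. A minimal PyTorch sketch (the `mu` and `logvar` tensors stand in for encoder outputs):

```python
import torch

def vae_kl_term(mu, logvar):
    """Closed-form D_KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims.

    mu, logvar: (batch, latent_dim) tensors, with logvar = log(sigma^2).
    """
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)

mu = torch.zeros(4, 8)
logvar = torch.zeros(4, 8)
print(vae_kl_term(mu, logvar))  # all zeros: the posterior already matches the prior
```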

4.2 Model Distillation

In model distillation, KL Divergence is used to align the output distributions of the teacher and student models.

\[ \mathcal{L} = D_{KL}(P_{teacher} || P_{student}) \]
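A minimal PyTorch sketch of this loss on softmax outputs; the temperature softening is a common distillation detail assumed here, not part of the formula above:

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """D_KL(P_teacher || P_student), averaged over the batch."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # F.kl_div takes log-probabilities as input and probabilities as target,
    # and computes KL(target || input) -- here KL(teacher || student).
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

teacher_logits = torch.randn(16, 10)
student_logits = torch.randn(16, 10)
print(distillation_loss(teacher_logits, student_logits))
```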

4.3 Reinforcement Learning

In reinforcement learning, KL Divergence is used to constrain policy updates to ensure stability. For example, reinforcement learning from human feedback (RLHF) for LLMs uses a KL penalty to keep the updated policy close to the reference policy.

\[ \mathcal{L} = \mathbb{E}_{(s, a) \sim \pi_{ref}} \left[ \frac{\pi_{new}(a|s)}{\pi_{ref}(a|s)} A(s, a) \right] - \beta D_{KL}(\pi_{new} || \pi_{ref}) \]
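In practice, the KL penalty is usually estimated per token from the sampled responses rather than computed exactly. A sketch of one such estimate, assuming the tokens were sampled from \(\pi_{new}\) and reusing the non-negative estimator from Section 3.2 (actual RLHF implementations differ in the exact estimator and where the penalty is applied):

```python
import torch

def per_token_kl(logprobs_new, logprobs_ref):
    """Per-token estimate of D_KL(pi_new || pi_ref) from sampled tokens.

    logprobs_new, logprobs_ref: log-probabilities of the sampled tokens under the
    new and reference policies, shape (batch, seq_len). Uses
        log r + exp(-log r) - 1,  with r = pi_new / pi_ref,
    which is non-negative for every token.
    """
    log_ratio = logprobs_new - logprobs_ref
    return log_ratio + torch.exp(-log_ratio) - 1.0

new_lp = torch.log(torch.tensor([[0.60, 0.30], [0.50, 0.90]]))
ref_lp = torch.log(torch.tensor([[0.50, 0.40], [0.50, 0.80]]))
print(per_token_kl(new_lp, ref_lp))  # zero wherever the two policies agree exactly
```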

5 Conclusion

KL Divergence is a powerful tool for measuring the difference between probability distributions. It has wide-ranging applications in machine learning, data science, and artificial intelligence. Understanding KL Divergence and its applications can help you build better models and improve your understanding of probabilistic systems.
