<aside> 📌 By Dr. Nir Regev

</aside>

<aside> 📌 Join Circuit of Knowledge for more content

</aside>

In the field of information theory and machine learning, cross entropy and Kullback-Leibler divergence (KLD) are fundamental concepts that play central roles in the evaluation of probabilistic models. Though they are closely related, their specific purposes and interpretations can sometimes lead to confusion. This blog post aims to demystify these concepts and elucidate the relationship between them.

What is Cross Entropy?

Cross entropy is defined as follows: for two distributions $P$ (the true distribution) and $Q$ (the approximate distribution), the cross entropy $H(P, Q)$ is

$$ H(P, Q) = -\sum_{x} P(x) \log Q(x) $$

Here, $P(x)$ represents the true probability (PMF or PDF) of event $x$, and $Q(x)$ represents the predicted probability of event $x$. Cross entropy quantifies the average number of bits (when using $\log_2(\cdot)$; nats, as is common in machine learning, when using $\ln(\cdot)$) needed to encode data drawn from $P$ when the encoding is based on the distribution $Q$.
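As a concrete illustration, here is a minimal sketch of how cross entropy could be computed for two discrete distributions, assuming NumPy and base-2 logarithms (bits); the distributions `p` and `q` below are hypothetical examples, not taken from the text.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross entropy H(P, Q) = -sum_x P(x) * log2(Q(x)), in bits."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Clip q away from zero so log2 never sees 0 (the model assigning zero probability).
    return -np.sum(p * np.log2(np.clip(q, eps, 1.0)))

p = [0.7, 0.2, 0.1]   # hypothetical true distribution P
q = [0.5, 0.3, 0.2]   # hypothetical model distribution Q
print(cross_entropy(p, q))   # average bits to encode samples from P with a code built for Q
```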

What is Kullback-Leibler Divergence?

Kullback-Leibler divergence, also known as relative entropy and often abbreviated as KL divergence or KLD, measures how one probability distribution diverges from a second, reference probability distribution. Note that this divergence is NOT a distance metric: it is not symmetric, and it does not satisfy the triangle inequality. For the same distributions $P$ and $Q$, the KL divergence $D_{KL}(P \parallel Q)$ is given by:

$$ D_{KL}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} $$

Alternatively, KL divergence can also be expressed as:

$$ D_{KL}(P \parallel Q) = \sum_{x} P(x) \log P(x) - \sum_{x} P(x) \log Q(x) $$
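Under the same assumptions as the sketch above (NumPy, base-2 logarithms, hypothetical `p` and `q`), the KL divergence can be computed directly from its definition; comparing $D_{KL}(P \parallel Q)$ with $D_{KL}(Q \parallel P)$ also makes the lack of symmetry visible.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence D_KL(P || Q) = sum_x P(x) * log2(P(x) / Q(x)), in bits."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return np.sum(p * (np.log2(p) - np.log2(q)))

p = [0.7, 0.2, 0.1]   # hypothetical true distribution P
q = [0.5, 0.3, 0.2]   # hypothetical model distribution Q
print(kl_divergence(p, q))   # D_KL(P || Q)
print(kl_divergence(q, p))   # generally a different value: KLD is not symmetric
```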

The Relationship Between Cross Entropy and KL Divergence

The relationship between cross entropy and KL divergence becomes clear when we decompose the cross entropy term to read:

$$ H(P, Q) = H(P) + D_{KL}(P \parallel Q) $$

where $H(P)$ is the entropy of the true distribution $P$. This term is fixed in machine learning optimization problems, since we optimize/learn the approximate (model) distribution $Q$ while $P$ is given by the data:

$$ H(P) = -\sum_{x} P(x) \log P(x) $$

From this decomposition, it is evident that the cross entropy between $P$ and $Q$ consists of two parts:

  1. Entropy $H(P)$: The intrinsic entropy of the distribution $P$, which represents the minimum average number of bits required to encode events drawn from $P$ using an optimal code.
  2. KL Divergence $D_{KL}(P \parallel Q)$: The extra number of bits required, on average, to encode events from $P$ when using a code optimized for $Q$ instead of the optimal code for $P$, as the short numerical check below illustrates.
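
As a quick sanity check of the decomposition, the following sketch (again assuming NumPy, base-2 logarithms, and hypothetical distributions) computes all three quantities directly and verifies that they satisfy $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # hypothetical true distribution P
q = np.array([0.5, 0.3, 0.2])   # hypothetical model distribution Q

h_p  = -np.sum(p * np.log2(p))       # entropy H(P): optimal average code length for P
h_pq = -np.sum(p * np.log2(q))       # cross entropy H(P, Q): average code length using Q's code
d_kl =  np.sum(p * np.log2(p / q))   # KL divergence D_KL(P || Q): the extra bits

print(np.isclose(h_pq, h_p + d_kl))  # True: H(P, Q) = H(P) + D_KL(P || Q)
```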