<aside> 📌 By Nir Regev, and Guy Regev, alephzero.ai, May 26, 2023

</aside>

1 Introduction

Maximum Likelihood estimation is a widely used classical estimation method in which the parameter estimate is obtained by maximizing the likelihood function. The likelihood function is the probability density function (PDF) $f(x; \theta)$ of the observed data $x$, viewed as a function of the parameter $\theta$. In this paper, we explore the intriguing relationship between Maximum Likelihood Estimation and Information Theory, specifically the Kullback–Leibler divergence.
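
As a quick illustration, here is a minimal Python sketch of that idea; the Gaussian model, the simulated data, and the optimizer settings are assumptions made for this example, not part of the original text. It maximizes the Gaussian log-likelihood numerically and checks the result against the closed-form MLE, the sample mean.

```python
# Minimal sketch (assumed setup): i.i.d. samples from N(2, 1), unknown mean theta,
# known unit variance. The MLE is found by numerically maximizing the log-likelihood
# and compared with the closed-form answer (the sample mean).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=500)   # observed data (simulated)

def neg_log_likelihood(theta):
    # Negative log-likelihood of the sample under f(x; theta) = N(theta, 1)
    return -np.sum(norm.logpdf(x, loc=theta, scale=1.0))

res = minimize_scalar(neg_log_likelihood, bounds=(-10, 10), method="bounded")
print(res.x, x.mean())   # the numerical MLE should match the sample mean closely
```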

2 The Kullback–Leibler Divergence

The Kullback–Leibler divergence (KLD) is a measure of the difference between two probability distributions. Given two probability distributions $P$ and $Q$ with densities $p(x)$ and $q(x)$, the KLD from $P$ to $Q$ is defined as:

$$ D_{KL}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx, $$

where the integral is taken over the support of the distributions.
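
As a sanity check on this definition, the following Python sketch approximates the integral on a finite grid; the choice of two univariate Gaussians and the grid itself are assumptions for illustration, and the known closed-form KLD between Gaussians is used only to verify the numerical value.

```python
# Minimal sketch: numerical D_KL(P || Q) on a grid vs. the Gaussian closed form.
import numpy as np
from scipy.stats import norm

mu_p, sig_p = 0.0, 1.0   # P = N(0, 1)   (assumed for the example)
mu_q, sig_q = 1.0, 2.0   # Q = N(1, 4)   (assumed for the example)

x = np.linspace(-15, 15, 20001)          # grid standing in for the support
dx = x[1] - x[0]
p = norm.pdf(x, mu_p, sig_p)
q = norm.pdf(x, mu_q, sig_q)

# Riemann-sum approximation of the integral defining the KLD
kld_numeric = np.sum(p * np.log(p / q)) * dx

# Closed form for two Gaussians, used only as a cross-check
kld_closed = np.log(sig_q / sig_p) + (sig_p**2 + (mu_p - mu_q)**2) / (2 * sig_q**2) - 0.5

print(kld_numeric, kld_closed)           # the two values should agree closely
```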

3 Minimizing the KLD and Maximum Likelihood Estimation

Now, let's consider the problem of estimating the parameter $\theta$ from a set of observed data points $x_1, \dots, x_N$ drawn from $f(x; \theta)$. We denote the empirical probability density function estimate built from the data as $\hat{f}(x)$. The goal is to find the value of $\theta$ that maximizes the likelihood function $f(x; \theta)$.

Interestingly, it can be shown that minimizing the Kullback–Leibler divergence between the empirical PDF estimate $\hat{f}(x)$ and the true PDF $f(x; \theta)$ with respect to $\theta$ leads to the Maximum Likelihood Estimator (MLE) for $\theta$. Mathematically, we have:

$$ \hat{\theta}_{KL} = \arg \min_{\theta} D_{KL}\big(\hat{f}(x) \,\|\, f(x; \theta)\big) = \hat{\theta}_{ML}. $$
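
The Python sketch below illustrates this claim numerically; the histogram-based $\hat{f}(x)$, the Gaussian family $f(x; \theta)$ with unknown mean and unit variance, and the grid search over $\theta$ are all choices made for this example rather than part of the original argument.

```python
# Minimal sketch: the theta minimizing D_KL(f_hat || f(.; theta)) should be
# close to the MLE. Assumed setup: samples from N(2, 1), model family N(theta, 1),
# f_hat taken as a normalized histogram.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=5000)

# Empirical PDF estimate: a normalized histogram evaluated at bin centers
counts, edges = np.histogram(x, bins=100, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
width = edges[1] - edges[0]

def kld_to_model(theta):
    # D_KL(f_hat || f(.; theta)) approximated by a sum over non-empty bins
    # (where f_hat = 0 the integrand vanishes)
    mask = counts > 0
    f_hat = counts[mask]
    f_model = norm.pdf(centers[mask], loc=theta, scale=1.0)
    return np.sum(f_hat * np.log(f_hat / f_model)) * width

thetas = np.linspace(0.0, 4.0, 401)            # simple grid search over theta
theta_kl = thetas[np.argmin([kld_to_model(t) for t in thetas])]
theta_ml = x.mean()                            # closed-form MLE for the Gaussian mean

print(theta_kl, theta_ml)                      # the two estimates should nearly coincide
```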

4 Proof

To prove this relationship, we start by expanding the KLD as follows:

$$ D_{KL}\big(\hat{f}(x) \,\|\, f(x; \theta)\big) = \int \hat{f}(x) \log \frac{\hat{f}(x)}{f(x; \theta)}\, dx $$

$$ = \int \hat{f}(x) \log \hat{f}(x)\, dx - \int \hat{f}(x) \log f(x; \theta)\, dx. $$

We observe that minimizing the KLD over $\theta$ is equivalent to maximizing the second integral alone, since the first term is independent of $\theta$. Hence, defining

$J(\theta) = \int \hat{f}(x) \log f(x; \theta)\, dx,$ the KLD-minimizing estimator is $\hat{\theta}_{KL} = \arg \max_{\theta} J(\theta)$.

Now, we use the definition of the empirical PDF:

$$ \hat{f}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i). $$
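
To see where this definition leads, a brief worked step: substituting this estimate into $J(\theta)$ and using the sifting property of the Dirac delta, $\int \delta(x - x_i) \log f(x; \theta)\, dx = \log f(x_i; \theta)$, gives

$$ J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log f(x_i; \theta), $$

which is the normalized log-likelihood of the sample, so $\arg \max_{\theta} J(\theta)$ coincides with the Maximum Likelihood Estimator.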