Nir Regev, Guy Regev, alephzero.ai

May 26, 2023

1 Introduction

Maximum Likelihood estimation is a widely used classical estimation method in which the parameter estimate is obtained by maximizing the likelihood function. The likelihood function is the probability density function (PDF) $f(x; \theta)$, viewed as a function of the parameter $\theta$ for fixed observed data $x$. In this paper, we explore the intriguing relationship between Maximum Likelihood Estimation and Information Theory, specifically the Kullback–Leibler divergence.

2 The Kullback–Leibler Divergence

The Kullback–Leibler divergence (KLD) is a measure of the difference between two probability distributions. Given two probability distributions $P$ and $Q$, the KLD from $P$ to $Q$ is defined as:

$$ D_{KL}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx, $$

where $p(x)$ and $q(x)$ denote the densities of $P$ and $Q$, respectively, and the integral is taken over the support of the distributions.
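
To make the definition concrete, the following minimal sketch evaluates this integral numerically for two univariate Gaussians and checks the result against the known closed form. The particular distributions, libraries, and variable names are illustrative choices, not taken from the paper.

```python
# Minimal sketch of the KLD definition, assuming P = N(0, 1) and Q = N(1, 2^2)
# (illustrative choices, not from the paper).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

p = norm(loc=0.0, scale=1.0)   # P
q = norm(loc=1.0, scale=2.0)   # Q

# D_KL(P || Q) = \int p(x) log(p(x)/q(x)) dx over the common support
integrand = lambda x: p.pdf(x) * (p.logpdf(x) - q.logpdf(x))
kld_numeric, _ = quad(integrand, -np.inf, np.inf)

# Closed form for two univariate Gaussians, used only as a sanity check
mu_p, s_p, mu_q, s_q = 0.0, 1.0, 1.0, 2.0
kld_closed = np.log(s_q / s_p) + (s_p**2 + (mu_p - mu_q)**2) / (2 * s_q**2) - 0.5

print(kld_numeric, kld_closed)  # both evaluate to approximately 0.4431
```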

3 Minimizing the KLD and Maximum Likelihood Estimation

Now, let's consider the problem of estimating the parameter $\theta$ based on a set of observed data points. We denote the empirical probability density function estimate of the data as $\hat{f}(x)$. The goal is to find the value of $\theta$ that maximizes the likelihood function $f(x; \theta)$.

Interestingly, it can be shown that minimizing the Kullback–Leibler divergence between the empirical PDF estimate $\hat{f}(x)$ and the true PDF $f(x; \theta)$ with respect to $\theta$ leads to the Maximum Likelihood Estimator (MLE) for $\theta$. Mathematically, we have:

$$ \hat{\theta}_{KL} = \arg\min_{\theta} D_{KL}\big(\hat{f}(x) \,\|\, f(x;\theta)\big) = \hat{\theta}_{MLE}. $$

4 Proof

To prove this relationship, we start by expanding the KLD as follows:

$$ D_{KL}\big(\hat{f}(x) \,\|\, f(x;\theta)\big) = \int \hat{f}(x) \log \frac{\hat{f}(x)}{f(x;\theta)} \, dx $$

$$ = \int \hat{f}(x) \log \hat{f}(x) \, dx - \int \hat{f}(x) \log f(x;\theta) \, dx. $$
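
As a side remark, this decomposition has a standard information-theoretic reading: the first term is the negative differential entropy of $\hat{f}$ and the second is the cross-entropy between $\hat{f}$ and $f(\cdot;\theta)$. The notation $H(\cdot)$ below is introduced here only for illustration and is not used elsewhere in the paper:

$$ D_{KL}\big(\hat{f} \,\|\, f(\cdot;\theta)\big) = -H(\hat{f}) + H\big(\hat{f}, f(\cdot;\theta)\big), \qquad H\big(\hat{f}, f(\cdot;\theta)\big) = -\int \hat{f}(x) \log f(x;\theta) \, dx. $$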

We observe that the first term is independent of $\theta$, so minimizing the KLD over $\theta$ is equivalent to maximizing the integral appearing in the second term. Hence, we define

$$ J(\theta) \triangleq \int \hat{f}(x) \log f(x;\theta) \, dx, $$

so that $\hat{\theta}_{KL} = \arg\max_{\theta} J(\theta)$.
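
The following minimal sketch illustrates this result numerically, assuming a Gaussian model $f(x;\theta) = \mathcal{N}(\theta, 1)$ with unknown mean; the model, sample size, and variable names are illustrative choices rather than part of the paper.

```python
# Minimal sketch: maximizing J(theta) recovers the MLE, assuming the model
# f(x; theta) = N(theta, 1) (an illustrative choice, not from the paper).
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=2.5, scale=1.0, size=1000)  # observed data

# With the empirical PDF taken as a sum of Dirac deltas at the samples
# (the standard choice), J(theta) reduces to the average log-likelihood.
def J(theta):
    return np.mean(norm.logpdf(x, loc=theta, scale=1.0))

# Minimizing the KLD  <=>  maximizing J(theta)
res = minimize_scalar(lambda t: -J(t), bounds=(-10.0, 10.0), method="bounded")
theta_kl = res.x

theta_mle = x.mean()  # closed-form MLE of the mean for this model
print(theta_kl, theta_mle)  # the two estimates coincide up to solver tolerance
```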

Now, we use the definition of the empirical PDF