<aside> 📌 By Nir Regev and Guy Regev, alephzero.ai, May 26, 2023
</aside>
Maximum Likelihood estimation is a widely used classical estimation method in which the parameter estimate is obtained by maximizing the likelihood function. The likelihood function is the probability density function (PDF) $f(x; \theta)$ viewed as a function of the parameter $\theta$ for fixed data $x$. In this paper, we explore the intriguing relationship between Maximum Likelihood Estimation and Information Theory, specifically the Kullback–Leibler divergence.
The Kullback–Leibler divergence (KLD) is a measure of the difference between two probability distributions. Given two distributions $P$ and $Q$ with densities $p(x)$ and $q(x)$, the KLD from $P$ to $Q$ is defined as:
$$ D_{KL}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx $$
where the integral is taken over the support of the distributions.
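As a quick sanity check of this definition, the sketch below numerically approximates the KLD between two unit-variance Gaussians on a grid; the specific densities, grid, and Riemann-sum approximation are illustrative assumptions, and the result can be compared with the known closed-form value $(\mu_p - \mu_q)^2 / (2\sigma^2)$ for equal-variance Gaussians.

```python
import numpy as np
from scipy.stats import norm

# Illustrative densities (assumption): p = N(0, 1), q = N(1, 1).
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]
p = norm.pdf(x, loc=0.0, scale=1.0)
q = norm.pdf(x, loc=1.0, scale=1.0)

# Riemann-sum approximation of the integral of p(x) * log(p(x) / q(x)) dx.
kld = np.sum(p * np.log(p / q)) * dx

# Closed form for equal-variance Gaussians: (mu_p - mu_q)^2 / (2 sigma^2) = 0.5.
print(kld)  # approximately 0.5
```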
Now, let's consider the problem of estimating the parameter $\theta$ based on a set of observed data points. We denote the empirical probability density function estimate of the data as $\hat{f}(x)$. The goal is to find the value of $\theta$ that maximizes the likelihood function $f(x; \theta)$.
Interestingly, it can be shown that minimizing the Kullback–Leibler divergence between the empirical PDF estimate $\hat{f}(x)$ and the true PDF $f(x; \theta)$ with respect to $\theta$ leads to the Maximum Likelihood Estimator (MLE) for $\theta$. Mathematically, we have:
$$ \hat{\theta}_{ML} = \arg \min_{\theta} D_{KL}\left(\hat{f}(x) \,\|\, f(x; \theta)\right) $$
To prove this relationship, we start by expanding the KLD as follows:
$$ D_{KL}\left(\hat{f}(x) \,\|\, f(x; \theta)\right) = \int \hat{f}(x) \log \frac{\hat{f}(x)}{f(x; \theta)} \, dx $$
$$ = \int \hat{f}(x) \log \hat{f}(x) \, dx - \int \hat{f}(x) \log f(x; \theta) \, dx $$
We observe that minimizing the KLD over $\theta$ is equivalent to maximizing the second integral, which enters with a negative sign, since the first term is independent of $\theta$. Hence, defining
$J(\theta) = \int \hat{f}(x) \log f(x; \theta)\,dx,$ the KL-based estimator is $\hat{\theta}_{KL} = \arg \max_{\theta} J(\theta)$.
Now, we use the definition of the empirical PDF,
$$ \hat{f}(x) = \frac{1}{N} \sum_{n=1}^{N} \delta(x - x_n), $$
where $x_1, \dots, x_N$ are the observed data points and $\delta(\cdot)$ is the Dirac delta function.
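To see the equivalence numerically, here is a minimal sketch assuming a Gaussian model with known unit variance and unknown mean (the model, sample size, and grid search are illustrative choices, not part of the derivation). With the empirical PDF above, $J(\theta)$ collapses to the average log-likelihood $\frac{1}{N}\sum_{n=1}^{N} \log f(x_n; \theta)$, so its maximizer should coincide with the MLE, which for the Gaussian mean is simply the sample average.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative assumption: samples from a Gaussian with unknown mean and known unit variance.
x = rng.normal(loc=2.0, scale=1.0, size=500)

# J(theta) = integral of f_hat(x) * log f(x; theta) dx; with the empirical PDF
# (a sum of Dirac deltas) this collapses to the average log-likelihood of the samples.
def J(theta):
    return np.mean(norm.logpdf(x, loc=theta, scale=1.0))

# Minimizing the KLD over theta is the same as maximizing J(theta); here via a simple grid search.
thetas = np.linspace(0.0, 4.0, 4001)
theta_kl = thetas[np.argmax([J(t) for t in thetas])]

# Closed-form Gaussian MLE for the mean: the sample average.
theta_ml = x.mean()

print(theta_kl, theta_ml)  # the two estimates agree up to the grid resolution (0.001)
```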