<aside> 📌 By Dr. Nir Regev

</aside>

<aside> 📌 Sign up to Circuit of Knowledge blog for unlimited tutorials and content

</aside>

<aside> 📌 If it’s knowledge you’re after, join our growing Slack community!

</aside>

July 5th 2024


The reparameterization trick is a key technique in variational inference that enables the optimization of variational autoencoders (VAEs) and other models with continuous latent variables. It addresses the challenge of backpropagating gradients through stochastic sampling operations, which are non-differentiable. By reformulating the sampling process as a deterministic function of the parameters and a separate source of randomness, the reparameterization trick allows gradients to flow through the sampling operation and enables end-to-end training of VAEs using standard gradient-based optimization methods.

In a VAE, the objective is to maximize the evidence lower bound (ELBO), which involves an expectation over the variational distribution $q_{\phi}(\mathbf{z}|\mathbf{x})$. The ELBO can be written as:

$$ \mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}|\mathbf{z})\right] - D_{\mathrm{KL}}\left(q_{\phi}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right) $$
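For a concrete instance of the KL term: assuming (as later in this post) a diagonal-Gaussian $q_{\phi}(\mathbf{z}|\mathbf{x})$ with mean $\boldsymbol{\mu}_{\phi}(\mathbf{x})$ and variances $\boldsymbol{\sigma}^2_{\phi}(\mathbf{x})$, and additionally assuming a standard normal prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ (a common choice, not stated above), the KL divergence has a closed form and only the reconstruction expectation has to be estimated by sampling:

$$ D_{\mathrm{KL}}\left(q_{\phi}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_{\phi,j}^2(\mathbf{x}) + \sigma_{\phi,j}^2(\mathbf{x}) - \log \sigma_{\phi,j}^2(\mathbf{x}) - 1 \right) $$

That sampled expectation is exactly where the gradient problem below arises.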

To optimize the ELBO with respect to the variational parameters $\phi$, we need to compute the gradient:

$$ \nabla_{\phi} \mathcal{L}(\theta, \phi; \mathbf{x}) = \nabla_{\phi} \, \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}|\mathbf{z})\right] - \nabla_{\phi} D_{\mathrm{KL}}\left(q_{\phi}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right) $$

However, the expectation is taken with respect to the variational distribution $q_{\phi}(\mathbf{z}|\mathbf{x})$, which itself depends on the parameters $\phi$. This makes the gradient computation challenging because the sampling operation $\mathbf{z} \sim q_{\phi}(\mathbf{z}|\mathbf{x})$ is non-differentiable. Why? Because drawing a sample from a probability distribution is a stochastic operation with no well-defined gradient with respect to the distribution's parameters. In a VAE, $q_{\phi}(\mathbf{z}|\mathbf{x})$ is typically parameterized by a neural network that outputs the mean and variance of a Gaussian, and the goal is to optimize these parameters to maximize the ELBO and learn a meaningful latent representation. The reparameterization trick makes this possible: it rewrites the sample as a deterministic function of the mean vector and (diagonal) covariance matrix, so we can differentiate with respect to these deterministic parameters rather than through a random variable.
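To make the non-differentiability concrete, here is a minimal sketch in PyTorch (the framework choice is an assumption; the post's own snippet appears at the end). `sample()` cuts the computation graph at the draw, while the reparameterized `rsample()` keeps it intact:

```python
import torch
from torch.distributions import Normal

# Toy encoder outputs for one data point (stand-ins for a network's mean / log-variance heads).
mu = torch.tensor([0.5, -1.0], requires_grad=True)
log_var = torch.tensor([0.1, 0.2], requires_grad=True)
sigma = torch.exp(0.5 * log_var)

q = Normal(mu, sigma)

# Plain sampling: the draw is treated as a constant,
# so there is no gradient path back to mu or log_var.
z_plain = q.sample()
print(z_plain.grad_fn)      # None -- the graph is cut at the sampling step

# Reparameterized sampling: internally z = mu + sigma * eps with eps ~ N(0, I),
# so the draw is a differentiable function of mu and sigma.
z_rep = q.rsample()
print(z_rep.grad_fn)        # not None -- gradients can flow to mu and log_var

z_rep.sum().backward()
print(mu.grad, log_var.grad)  # both populated with nonzero gradients
```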

The reparameterization trick overcomes this challenge by expressing the sampling operation as a deterministic function of the variational parameters and a separate source of randomness. Instead of sampling directly from $q_{\phi}(\mathbf{z}|\mathbf{x})$, we introduce a differentiable transformation $g_{\phi}(\boldsymbol{\epsilon}, \mathbf{x})$ and a noise variable $\boldsymbol{\epsilon}$ such that:

$$ \mathbf{z} = g_{\phi}(\boldsymbol{\epsilon}, \mathbf{x}), \quad \boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon}) $$

The transformation $g_{\phi}$ is chosen such that the resulting $\mathbf{z}$ has the desired distribution $q_{\phi}(\mathbf{z}|\mathbf{x})$. For example, if $q_{\phi}(\mathbf{z}|\mathbf{x})$ is a Gaussian distribution with mean $\boldsymbol{\mu}_{\phi}(\mathbf{x})$ and diagonal covariance $\boldsymbol{\sigma}^2_{\phi}(\mathbf{x})$, the reparameterization trick can be applied as follows:

$$ \mathbf{z} = \boldsymbol{\mu}_{\phi}(\mathbf{x}) + \boldsymbol{\sigma}_{\phi}(\mathbf{x}) \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) $$

Here, $\odot$ denotes element-wise multiplication. By expressing $\mathbf{z}$ in this way, the sampling operation becomes deterministic with respect to $\phi$, and the randomness is isolated in the noise variable $\boldsymbol{\epsilon}$.

With the reparameterization trick, the gradient of the ELBO with respect to $\phi$ can be computed as:

$$ \nabla_{\phi} \mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{p(\boldsymbol{\epsilon})}\left[\nabla_{\phi} \log p_{\theta}\big(\mathbf{x} \mid g_{\phi}(\boldsymbol{\epsilon}, \mathbf{x})\big)\right] - \nabla_{\phi} D_{\mathrm{KL}}\left(q_{\phi}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right) $$

Now, the expectation is taken with respect to the distribution of the noise variable $p(\boldsymbol{\epsilon})$, which is independent of $\phi$. This allows the gradient to be estimated using Monte Carlo samples of $\boldsymbol{\epsilon}$ and backpropagated through the deterministic function $g_{\phi}$.
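Below is a minimal sketch of this Monte Carlo estimator, again assuming PyTorch; the quadratic `f` is a made-up stand-in for $\log p_{\theta}(\mathbf{x}|\mathbf{z})$, and `mu`/`log_var` stand in for an encoder's outputs:

```python
import torch

torch.manual_seed(0)

# Toy variational parameters for a 2-D latent (stand-ins for an encoder's outputs).
mu = torch.tensor([0.5, -0.5], requires_grad=True)
log_var = torch.tensor([0.0, 0.0], requires_grad=True)
sigma = torch.exp(0.5 * log_var)

def f(z):
    # Made-up stand-in for log p_theta(x | z); any differentiable function of z works.
    return -0.5 * ((z - 1.0) ** 2).sum(dim=-1)

# Monte Carlo estimate of E_{p(eps)}[ f(g_phi(eps)) ] using K noise draws.
K = 1000
eps = torch.randn(K, 2)        # eps ~ N(0, I), independent of phi
z = mu + sigma * eps           # z = g_phi(eps): deterministic in (mu, log_var)
estimate = f(z).mean()         # average over the K samples

# Backpropagating through the estimate gives a Monte Carlo estimate of the gradient.
estimate.backward()
print(mu.grad)       # roughly 1 - mu = [0.5, 1.5] for this quadratic f
print(log_var.grad)  # roughly -0.5 in each coordinate
```

Averaging over more noise draws reduces the variance of the gradient estimate; in practice, VAE training often uses a single draw per data point per step.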

Here's a simple Python code snippet that demonstrates the reparameterization trick for a Gaussian distribution: