I’m a big fan of Feynman’s technique of learning something new by trying to explain it to someone else. So in this post, I’ll try to explain normalizing flows (NF), a relatively simple yet powerful new tool in statistics for constructing expressive probability distributions from a simple base distribution using smooth bijective transformations. It’s a hot topic right now and has exciting applications in the context of probabilistic modeling and inference.
First, let’s take a step back and look at which developments preceded NFs and how they fit into the larger setting of a machine learning subfield known as deep generative models (DGM). There are currently three types of DGMs:
- variational autoencoders (VAE) [Kingma et al. in 2013]
- generative adversarial networks (GAN) [Goodfellow et al. in 2014]
- flow-based generative networks (FGN) [Rezende et al. in 2015]
Here’s a brief explanation for each:
- VAEs consist of two trapezoidal neural networks connected at their narrow sides. The first one, known as the encoder, goes from wide to narrow and learns a latent, lower-dimensional representation of the input. The second one, known as the decoder, goes from narrow back to wide. Together, they indirectly maximize the likelihood of the data they generate through variational inference (hence the name VAE). Their objective is to maximize the evidence lower bound (ELBO) of generated data given the observed data (which is tractable unlike the likelihood itself). This implicitly drives the approximate posterior to resemble the true one since the difference between log-likelihood and ELBO is the Kullback-Leibler divergence between approximate and true posterior. And this divergence becomes ever smaller as the ELBO is being raised.
- GANs are an example of likelihood-free inference. They cleverly recast the unsupervised learning problem of trying to generate realistic new data into a supervised one by introducing an adversarial discriminator network whose goal it is to learn to distinguish real data from samples produced by the generator model. The two models constantly try to outwit each other as they engage in this zero-sum game (a.k.a. minimax objective).
- FGNs consist of a sequence of invertible transformations. Unlike VAEs and GANs, FGNs explicitly attempt the difficult problem of density estimation where the goal is to learn the underlying probability distribution of the observed data. To do so, they use normalizing flows. The loss function to minimize is the negative log-likelihood of the observed data which (unlike for the VAE) can be made tractable by the change of variables formula as we’ll see below.
The cool thing about FGNs is that a sufficiently accurate proxy for the posterior enables a host of useful downstream applications such as
- data generation: drawing statistically independent (i.e. new and unobserved) one-shot samples of $\mathbf{x}$ (important for Bayesian inference),
- density estimation: predicting the rareness of future events or inferring latent variables,
- sample completion: filling in incomplete data (e.g. image or audio samples).
The problem that normalizing flows aim to address is turning a simple distribution into a complex, multi-modal one in an invertible manner. Why would we want to do that? Training a machine learning model usually means tuning its parameters to maximize the probability of observed training data under the model. To quantify this probability, we have to assume some probability distribution as the model’s output. In classification, this is typically a categorical distribution and in regression usually a Gaussian, mostly because it’s the only non-uniform continuous distribution we really know how to deal with. However, assuming the model output to be distributed according to a Gaussian is problematic because the world is complicated and the true probability density function (PDF) of actual data will in general be completely unlike a Gaussian.
Luckily, we can take a simple distribution like a Gaussian, sample from it and then transform those samples using smooth bijective functions, which essentially amounts to a change of variables in probability distributions. Repeating this process multiple times can quickly result in a complex PDF for the transformed variable. Danilo Rezende formalized this in his 2015 paper on normalizing flows.
Let $\mathbf{x} \in \mathbb{R}^D$ be a continuous, real random variable. We would like to know its joint distribution $p_x(\mathbf{x})$. However, we know from the get-go that this is too complicated an expression to write down or manipulate exactly. So we content ourselves with indirectly constructing as accurate an approximation as possible.
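For that we can use flow-based modeling. The main idea behind that is to express $\mathbf{x}$ as a transformation $T$ of a real vector $\mathbf{u}$ sampled from a simpler distribution $p_u(\mathbf{u})$:

$$\mathbf{x} = T(\mathbf{u}) \quad \text{where} \quad \mathbf{u} \sim p_u(\mathbf{u}). \tag{1}$$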
$p_u(\mathbf{u})$ is called the _base distribution_ of the model. It’s usually a diagonal-covariance Gaussian. (To connect terminology with latent-variable models such as VAEs, $p_u(\mathbf{u})$ is also sometimes referred to as the prior and $\mathbf{u}$ as the latent variable. However, as Papamakarios et al. rightly point out in their excellent review, $\mathbf{u}$ is not actually latent (unobserved) since knowing $\mathbf{x}$ uniquely determines the corresponding $\mathbf{u} = T^{-1}(\mathbf{x})$.)
The transformation $T$ and base distribution $p_u(\mathbf{u})$ are the _only_ two defining properties of a flow model. Both may have their own parameters denoted $\phi$ and $\psi$, respectively. Together, $T$ and $p_u(\mathbf{u})$ induce a family of distributions $p_x(\mathbf{x}; \theta)$ over $\mathbf{x}$ parameterized by $\theta = \{\phi, \psi\}$ (as usual in variational inference). If we’re doing posterior inference, we can only get as close to the true posterior as the most similar member of this family. (We won’t always indicate the dependence on these parameters to simplify notation but in the following, $T$ actually refers to $T(\,\cdot\,; \phi)$ and $p_u(\mathbf{u})$ to $p_u(\mathbf{u}; \psi)$.)
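For flow models to be tractable, $T$ must be a diffeomorphism. This is just a fancy way of saying it has to be invertible (bijective) and that both $T$ and its inverse $T^{-1}$ have to be differentiable (smooth). As a corollary, $\mathbf{u}$ and $\mathbf{x}$ must be of equal dimension (else $T$ could not be invertible). Given these constraints, the generated density $p_x(\mathbf{x})$ is well-defined and computable in practice from the base distribution $p_u(\mathbf{u})$ by a change of variables:

$$p_x(\mathbf{x}) = p_u(\mathbf{u}) \, \big|\det J_T(\mathbf{u})\big|^{-1} \quad \text{with} \quad \mathbf{u} = T^{-1}(\mathbf{x}), \tag{2}$$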
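where $J_T(\mathbf{u})$ denotes the Jacobian of $T$ at $\mathbf{u}$, that is the $D \times D$-matrix of first-order partial derivatives,

$$J_T(\mathbf{u}) = \begin{pmatrix} \frac{\partial T_1}{\partial u_1} & \cdots & \frac{\partial T_1}{\partial u_D} \\ \vdots & \ddots & \vdots \\ \frac{\partial T_D}{\partial u_1} & \cdots & \frac{\partial T_D}{\partial u_D} \end{pmatrix}.$$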
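Eq. (2) follows from the change of variables theorem (a.k.a. integration by substitution, integration’s counterpart to the chain rule of differentiation). By definition of probability,

$$\int p_x(\mathbf{x}) \, \mathrm{d}\mathbf{x} = \int p_u(\mathbf{u}) \, \mathrm{d}\mathbf{u} = 1.$$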
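If under the transformation $T$ an infinitesimal neighborhood $\mathrm{d}\mathbf{u}$ around $\mathbf{u}$ is mapped to an infinitesimal neighborhood $\mathrm{d}\mathbf{x}$ around $\mathbf{x}$, then, due to conservation of probability mass, this equality has to hold locally, i.e. $p_x(\mathbf{x}) \, \mathrm{d}\mathbf{x} = p_u(\mathbf{u}) \, \mathrm{d}\mathbf{u}$. We can rewrite this as

$$p_x(\mathbf{x}) = p_u(\mathbf{u}) \left|\det \frac{\partial \mathbf{u}}{\partial \mathbf{x}}\right| = p_u\big(T^{-1}(\mathbf{x})\big) \, \big|\det J_{T^{-1}}(\mathbf{x})\big| = p_u(\mathbf{u}) \, \big|\det J_T(\mathbf{u})\big|^{-1}.$$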
In the last step we used two things:
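- the inverse function theorem which states that for continuously differentiable $T$ (which it is, in our case, by definition), we have $J_{T^{-1}}(\mathbf{x}) = J_T(\mathbf{u})^{-1}$,
- and $\det(A^{-1}) = \det(A)^{-1}$ because $\det(A) \det(A^{-1}) = \det(A A^{-1}) = \det(\mathbb{1}) = 1$.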
Intuitively, think of $T$ as expanding and contracting the space $\mathbb{R}^D$ so as to mold $p_u(\mathbf{u})$ into $p_x(\mathbf{x})$. The absolute Jacobian determinant $|\det J_T(\mathbf{u})|$ then measures the change of volume of a small neighborhood around $\mathbf{u}$ under $T$, i.e. the ratio of volumes of corresponding neighborhoods in $\mathbf{x}$- and $\mathbf{u}$-parametrization of space. Hence we divide by it to account for the stretching or contracting of space under $T$.
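To make the "divide by the volume change" rule concrete, here is a minimal numerical sketch. It assumes a one-dimensional affine map $T(u) = a u + b$ with a standard normal base density; the values of `a` and `b` and all function names are made up for illustration. For this map the flow density should coincide with the Gaussian $\mathcal{N}(b, a^2)$, which gives us something to check against.

```python
import numpy as np

def std_normal_pdf(u):
    """Base density p_u: a standard normal."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

# Hypothetical flow parameters for x = T(u) = a * u + b
a, b = 2.0, 1.0

def T_inv(x):
    """Inverse transformation u = T^{-1}(x)."""
    return (x - b) / a

def abs_det_jacobian(u):
    """|det J_T(u)|; for a 1D affine map this is |a| everywhere."""
    return np.full_like(u, abs(a))

def flow_pdf(x):
    """Change of variables: p_x(x) = p_u(u) / |det J_T(u)| with u = T^{-1}(x)."""
    u = T_inv(x)
    return std_normal_pdf(u) / abs_det_jacobian(u)

def gaussian_pdf(x, mean, std):
    """Closed-form Gaussian density used as a reference."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

xs = np.linspace(-5.0, 7.0, 13)
max_abs_err = float(np.max(np.abs(flow_pdf(xs) - gaussian_pdf(xs, b, abs(a)))))
print(f"max |flow density - analytic density| = {max_abs_err:.2e}")
```

Stretching space by a factor of $|a|$ spreads the same probability mass over a larger volume, so the density has to drop by that same factor.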
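The cool thing about diffeomorphisms is that they form a group under composition. Again, this is a fancy way of saying that successively applying multiple diffeomorphisms always results in a new transformation that is itself a diffeomorphism. Given $T_1$ and $T_2$, the inverse and Jacobian determinant of their composition are

$$(T_2 \circ T_1)^{-1} = T_1^{-1} \circ T_2^{-1}, \qquad \det J_{T_2 \circ T_1}(\mathbf{u}) = \det J_{T_2}\big(T_1(\mathbf{u})\big) \cdot \det J_{T_1}(\mathbf{u}).$$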
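Therefore, we can apply a series of transformations $T_k$, $k \in \{1, \dots, K\}$, to generate a normalizing flow,

$$\mathbf{z}_K = T_K \circ \dots \circ T_1(\mathbf{z}_0),$$

where $\mathbf{z}_0 = \mathbf{u}$ and $\mathbf{z}_K = \mathbf{x}$. The probability distributions $p_u(\mathbf{u})$ and $p_x(\mathbf{x})$ of the random variables at both ends of the flow can be related by repeatedly applying eq. (2),

$$p_x(\mathbf{x}) = p_u(\mathbf{u}) \prod_{k=1}^K \big|\det J_{T_k}(\mathbf{z}_{k-1})\big|^{-1} \quad \text{with} \quad \mathbf{z}_k = T_k(\mathbf{z}_{k-1}).$$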
This flow can transform a simple base distribution such as a multivariate Gaussian into a complicated multi-modal one as illustrated below.
A normalizing flow transforming a simple distribution $p_u(\mathbf{u})$ step by step into a complex one $p_x(\mathbf{x})$ approximating some target distribution $p_x^*(\mathbf{x})$. Source: Lilian Weng
You may have already guessed that the word ‘flow’ in normalizing flow refers to the trajectories that a collection of samples from $p_u(\mathbf{u})$ move along as the transformations $T_1, \dots, T_K$ are applied to them, sort of like the particles in a compressible fluid undergoing stirring motions to achieve a certain pattern. The modifier ‘normalizing’ comes from the fact that we renormalize the resulting density after every transformation by its Jacobian determinant in order to retain a normalized probability distribution at every step.
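The bookkeeping for a composed flow fits in a few lines. The sketch below is only an illustration under assumed ingredients: a standard normal base density and two hand-picked one-dimensional layers (an affine map followed by `exp`), each stored as a hypothetical `(forward, inverse, log_abs_det)` triple. To evaluate the density we walk through the layers in reverse and add up the log-determinants. Since affine-then-exp of a standard normal is a log-normal, we can compare against its closed-form log-density.

```python
import numpy as np

def log_std_normal(u):
    """Log-density of the standard normal base distribution."""
    return -0.5 * u**2 - 0.5 * np.log(2.0 * np.pi)

# Hypothetical layer parameters (chosen for illustration only)
mu, sigma = 0.5, 0.8

# Each layer: (forward T_k, inverse T_k^{-1}, log|det J_{T_k}| at the layer's input)
affine = (
    lambda z: sigma * z + mu,
    lambda x: (x - mu) / sigma,
    lambda z: np.full_like(z, np.log(sigma)),
)
exp_layer = (np.exp, np.log, lambda z: z)  # d/dz exp(z) = exp(z), so log|det| = z

layers = [affine, exp_layer]

def flow_forward(u):
    """x = T_K(...T_1(u)...)."""
    z = np.asarray(u, dtype=float)
    for forward, _, _ in layers:
        z = forward(z)
    return z

def flow_log_prob(x):
    """log p_x(x) = log p_u(u) - sum_k log|det J_{T_k}(z_{k-1})|."""
    z = np.asarray(x, dtype=float)
    log_det_sum = np.zeros_like(z)
    for _, inverse, log_abs_det in reversed(layers):
        z = inverse(z)                 # z is now the input of this layer
        log_det_sum += log_abs_det(z)
    return log_std_normal(z) - log_det_sum

def log_normal_ref(x):
    """Closed-form log-density of a log-normal with parameters mu, sigma."""
    return (-np.log(x) - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)
            - 0.5 * ((np.log(x) - mu) / sigma) ** 2)

xs = np.array([0.3, 1.0, 2.5, 6.0])
print(np.max(np.abs(flow_log_prob(xs) - log_normal_ref(xs))))
```

Working with log-determinants turns the product over layers into a sum, which is also what one would do in practice for numerical stability.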
There are two ways we might want to use a flow model.
- We can generate samples using eq. (1).
- We can evaluate the model’s density at a given point using eq. (2).
For practical applications, depending on which of these we need (and we might need both), we must ensure that certain computational operations can be performed efficiently.
- To draw samples, we need the ability to sample $p_u(\mathbf{u})$ and to compute the forward transformation $T$.
- To evaluate the model’s density, we need to perform the inverse transformation $T^{-1}$, calculate its Jacobian determinant and finally evaluate the density $p_u(\mathbf{u})$.
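These two modes of use map naturally onto two methods. Below is a rough sketch of what such an interface could look like, assuming a $D$-dimensional affine flow $\mathbf{x} = A\mathbf{u} + \mathbf{b}$ with a standard normal base distribution. The class `AffineFlow` and its method names are invented for this example and not taken from any library. Because this particular flow produces a Gaussian with covariance $A A^\top$, the density can be cross-checked against the closed form.

```python
import numpy as np

class AffineFlow:
    """Toy flow x = A u + b with a standard normal base distribution."""

    def __init__(self, A, b):
        self.A = np.asarray(A, dtype=float)
        self.b = np.asarray(b, dtype=float)
        self.dim = self.b.shape[0]

    def sample(self, n, rng):
        """Sampling needs the base sampler and the forward transformation T."""
        u = rng.standard_normal((n, self.dim))
        return u @ self.A.T + self.b

    def log_prob(self, x):
        """Density evaluation needs T^{-1}, its Jacobian determinant and p_u."""
        u = np.linalg.solve(self.A, (x - self.b).T).T          # u = T^{-1}(x)
        log_pu = -0.5 * np.sum(u**2, axis=1) - 0.5 * self.dim * np.log(2.0 * np.pi)
        _, log_abs_det = np.linalg.slogdet(self.A)              # log|det J_T|
        return log_pu - log_abs_det

rng = np.random.default_rng(0)
flow = AffineFlow(A=[[1.5, 0.0], [0.7, 0.5]], b=[1.0, -2.0])
samples = flow.sample(20_000, rng)
log_p = flow.log_prob(samples[:5])

# Reference: multivariate Gaussian with mean b and covariance A A^T
cov = flow.A @ flow.A.T
diff = samples[:5] - flow.b
quad = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
log_p_ref = -0.5 * quad - 0.5 * np.linalg.slogdet(2.0 * np.pi * cov)[1]
print(np.max(np.abs(log_p - log_p_ref)))
```

Note how `sample` never touches the inverse and `log_prob` never touches the forward map, so a flow that is only needed for one of the two can get away with making just that direction cheap.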
Of course, before using it, we first need to fit the model.
As with many probabilistic models, we fit a normalizing flow by minimizing the divergence between its output $p_x(\mathbf{x}; \theta)$ and the target distribution $p_x^*(\mathbf{x})$. The knobs and dials at our disposal are the model’s parameters $\theta = \{\phi, \psi\}$ for the base distribution $p_u(\mathbf{u}; \psi)$ and the transformation $T(\mathbf{u}; \phi)$. By far the most common measure of discrepancy between $p_x(\mathbf{x}; \theta)$ and $p_x^*(\mathbf{x})$ is the Kullback–Leibler (KL) divergence. Since the KL divergence is not symmetric in its arguments, there are two ways of setting it up. Which one to choose depends on whether it’s easier to sample from the target or the base distribution.
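The first is the forward KL divergence,

$$\begin{aligned} \mathcal{L}(\theta) &= D_\text{KL}\big[p_x^*(\mathbf{x}) \,\|\, p_x(\mathbf{x}; \theta)\big] = -\mathbb{E}_{p_x^*(\mathbf{x})}\big[\log p_x(\mathbf{x}; \theta)\big] + \text{const} \\ &= -\mathbb{E}_{p_x^*(\mathbf{x})}\big[\log p_u\big(T^{-1}(\mathbf{x}; \phi); \psi\big) + \log \big|\det J_{T^{-1}}(\mathbf{x}; \phi)\big|\big] + \text{const}. \end{aligned}$$

This loss function is useful in cases where we have or can generate samples from the target distribution $p_x^*(\mathbf{x})$.
Given a set of samples $\{\mathbf{x}_n\}_{n=1}^N$ from $p_x^*(\mathbf{x})$, we can obtain an unbiased Monte Carlo estimate of $\mathcal{L}(\theta)$ as

$$\mathcal{L}(\theta) \approx -\frac{1}{N} \sum_{n=1}^N \Big[\log p_u\big(T^{-1}(\mathbf{x}_n; \phi); \psi\big) + \log \big|\det J_{T^{-1}}(\mathbf{x}_n; \phi)\big|\Big] + \text{const}. \tag{5}$$

We can then iteratively minimize $\mathcal{L}(\theta)$ using stochastic gradient descent by differentiating eq. (5) with respect to $\theta$, yielding

$$\begin{aligned} \nabla_\phi \mathcal{L}(\theta) &\approx -\frac{1}{N} \sum_{n=1}^N \Big[\nabla_\phi \log p_u\big(T^{-1}(\mathbf{x}_n; \phi); \psi\big) + \nabla_\phi \log \big|\det J_{T^{-1}}(\mathbf{x}_n; \phi)\big|\Big], \\ \nabla_\psi \mathcal{L}(\theta) &\approx -\frac{1}{N} \sum_{n=1}^N \nabla_\psi \log p_u\big(T^{-1}(\mathbf{x}_n; \phi); \psi\big). \end{aligned}$$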
Finding a minimum of eq. (5) is equivalent to finding the maximum likelihood estimates for the flow model’s parameters $\theta$ given the samples $\{\mathbf{x}_n\}_{n=1}^N$. Thus to fit a flow by maximum likelihood, besides having samples from $p_x^*(\mathbf{x})$, we need to be able to compute and differentiate the inverse transformation $T^{-1}$, its Jacobian determinant and the base density $p_u(\mathbf{u}; \psi)$.
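As a toy illustration of maximum likelihood fitting, the sketch below fits a one-dimensional affine flow $x = e^{s} u + b$ to samples by plain gradient descent on the negative log-likelihood. Everything here is an assumption made for the example: the synthetic "target" samples, the parameter names `log_scale` and `shift`, the learning rate, and the use of hand-derived gradients instead of automatic differentiation. For this model the maximum likelihood solution is known (sample mean and standard deviation), so we can check that the fit lands there.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for samples from the target p_x^*(x); here a Gaussian with mean 3, std 2
x = rng.normal(loc=3.0, scale=2.0, size=5_000)

# Flow parameters phi = (log_scale, shift) for T(u) = exp(log_scale) * u + shift
log_scale, shift = 0.0, 0.0
lr = 0.1

def negative_log_likelihood(log_scale, shift):
    """Monte Carlo estimate of the forward KL loss (up to a constant)."""
    u = (x - shift) * np.exp(-log_scale)             # u = T^{-1}(x)
    log_pu = -0.5 * u**2 - 0.5 * np.log(2.0 * np.pi)
    log_abs_det_inv = -log_scale                     # log|det J_{T^{-1}}|
    return float(-np.mean(log_pu + log_abs_det_inv))

initial_loss = negative_log_likelihood(log_scale, shift)

for step in range(2_000):
    u = (x - shift) * np.exp(-log_scale)
    # Hand-derived gradients of mean(0.5 * u^2) + log_scale
    grad_shift = -np.mean(u) * np.exp(-log_scale)    # du/dshift = -exp(-log_scale)
    grad_log_scale = 1.0 - np.mean(u**2)             # du/dlog_scale = -u
    shift -= lr * grad_shift
    log_scale -= lr * grad_log_scale

final_loss = negative_log_likelihood(log_scale, shift)
print(f"shift = {shift:.3f}, scale = {np.exp(log_scale):.3f}")
print(f"sample mean = {x.mean():.3f}, sample std = {x.std():.3f}")
```

In a real flow the gradients would come from an autodiff framework, but the structure of the loss (base log-density of the inverted samples plus the log-determinant of the inverse) stays the same.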
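Alternatively, we can swap the order of arguments of the KL divergence to obtain a slightly different loss function, the reverse KL divergence:

$$\begin{aligned} \mathcal{L}(\theta) &= D_\text{KL}\big[p_x(\mathbf{x}; \theta) \,\|\, p_x^*(\mathbf{x})\big] = \mathbb{E}_{p_x(\mathbf{x}; \theta)}\big[\log p_x(\mathbf{x}; \theta) - \log p_x^*(\mathbf{x})\big] \\ &= \mathbb{E}_{p_u(\mathbf{u}; \psi)}\big[\log p_u(\mathbf{u}; \psi) - \log \big|\det J_T(\mathbf{u}; \phi)\big| - \log p_x^*\big(T(\mathbf{u}; \phi)\big)\big]. \end{aligned}$$

We used eqs. (1) and (2) in the last step to replace $\mathbf{x}$ with $\mathbf{u}$. This loss function comes in handy if we can sample from the base density $p_u(\mathbf{u}; \psi)$ and are able to evaluate the target density $p_x^*(\mathbf{x})$ (at least up to a normalizing constant which becomes an additive constant under the log and drops out in the gradient).
Given a set of samples $\{\mathbf{u}_n\}_{n=1}^N$ from $p_u(\mathbf{u}; \psi)$, the gradient of $\mathcal{L}(\theta)$ with respect to $\phi$ can be estimated as

$$\nabla_\phi \mathcal{L}(\theta) \approx -\frac{1}{N} \sum_{n=1}^N \nabla_\phi \Big[\log \big|\det J_T(\mathbf{u}_n; \phi)\big| + \log p_x^*\big(T(\mathbf{u}_n; \phi)\big)\Big].$$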
The gradient with respect to $\psi$ isn’t strictly necessary since any adjustment to the parameters of the base distribution can be absorbed into the transformation $T$. We can fit with respect to $\phi$ only without loss of generality.
This loss function requires that we be able to differentiate $T$ and its Jacobian determinant. It works even if we can neither evaluate the base density $p_u(\mathbf{u}; \psi)$ nor perform the inverse transform $T^{-1}$. (Of course, we will still need these if we want to evaluate the fitted model’s density.)
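Here is a small sketch of fitting with the reverse KL divergence, again for a one-dimensional affine flow $x = e^{s} u + b$ with fixed standard normal base. The target is assumed to be known only up to its normalizing constant; the specific unnormalized target (a Gaussian bump centered at 3 with width 2), the parameter names and the step size are all choices made for this example. Only forward passes through $T$ are used: we never invert the flow and never evaluate the base density.

```python
import numpy as np

def log_target_unnormalized(x):
    """Assumed target, known only up to an additive constant in log space."""
    return -0.5 * ((x - 3.0) / 2.0) ** 2

def grad_log_target(x):
    """d/dx of the unnormalized log target (the constant drops out)."""
    return -(x - 3.0) / 4.0

def reverse_kl_estimate(log_scale, shift, u):
    """E_u[-log|det J_T(u)| - log p*(T(u))], dropping terms constant in phi."""
    x = np.exp(log_scale) * u + shift
    return float(np.mean(-log_scale - log_target_unnormalized(x)))

rng = np.random.default_rng(0)
log_scale, shift = 0.0, 0.0      # phi; base parameters psi are kept fixed
lr = 0.05
u_eval = rng.standard_normal(10_000)
initial_loss = reverse_kl_estimate(log_scale, shift, u_eval)

for step in range(3_000):
    u = rng.standard_normal(1_000)           # fresh base samples each step
    x = np.exp(log_scale) * u + shift        # forward pass x = T(u)
    dlogp_dx = grad_log_target(x)
    grad_shift = -np.mean(dlogp_dx)                            # dx/dshift = 1
    grad_log_scale = -1.0 - np.mean(dlogp_dx * (x - shift))    # dx/dlog_scale = x - shift
    shift -= lr * grad_shift
    log_scale -= lr * grad_log_scale

final_loss = reverse_kl_estimate(log_scale, shift, u_eval)
print(f"shift = {shift:.2f}, scale = {np.exp(log_scale):.2f}")
```

Because the gradients are estimated from fresh samples at every step, the fitted values fluctuate slightly around the optimum (shift 3, scale 2) rather than hitting it exactly.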
This loss function is chosen when using normalizing flows in the context of e.g. variational inference or model distillation. The latter is an exciting application in which a flow model is trained to replace a target model whose density can be evaluated but is otherwise inconvenient, e.g. difficult to sample. Two examples of model distillation with flows are Parallel WaveNets by van den Oord et al. and Boltzmann Generators by Noé et al.
In their original paper, Rezende et al. introduced two simple families of transformations known as planar and radial flows that satisfy these constraints. These are, however, only useful for low-dimensional random variables and so we will not cover them here. Since the original paper was published in 2015, a host of additional flows have been and continue to be developed. A class of these that are widely applicable yet still have decent expressivity are known as affine coupling flows.
To use normalizing flows in actual computation, we are constrained to transformations whose Jacobian determinant is easy to calculate. This makes developing flows with greater expressivity non-trivial. It turns out, though, that by introducing dependencies between different dimensions of the input variable, versatile flows with a tractable Jacobian are possible. Specifically, if dimension $i$ of the transformed variable depends only on dimensions $1, \dots, i$ of the input variable, then the Jacobian of this transformation is triangular. And a triangular matrix has a determinant given simply by the product of its diagonal entries.
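The triangular-Jacobian argument is easy to verify numerically. The sketch below uses a made-up three-dimensional map in which output $i$ depends only on inputs $1, \dots, i$, approximates its Jacobian with central finite differences (a choice made purely for illustration), and compares the full determinant with the product of the diagonal entries.

```python
import numpy as np

def T(u):
    """Hypothetical map where output i depends only on inputs 1..i."""
    u1, u2, u3 = u
    return np.array([
        2.0 * u1,
        u2 * np.exp(u1),
        u3 * (1.0 + u1**2) + np.sin(u2),
    ])

def numerical_jacobian(f, u, eps=1e-6):
    """Central finite-difference approximation of the Jacobian of f at u."""
    dim = u.size
    J = np.zeros((dim, dim))
    for j in range(dim):
        step = np.zeros(dim)
        step[j] = eps
        J[:, j] = (f(u + step) - f(u - step)) / (2.0 * eps)
    return J

u0 = np.array([0.3, -1.2, 0.8])
J = numerical_jacobian(T, u0)

det_full = np.linalg.det(J)          # general determinant, O(D^3)
det_diag = np.prod(np.diag(J))       # product of diagonal entries, O(D)
analytic_diag = np.array([2.0, np.exp(u0[0]), 1.0 + u0[0] ** 2])

print(np.round(J, 4))
print(det_full, det_diag)
```

The entries above the diagonal vanish because no output depends on a later input, and the off-diagonal entries below it, however complicated, never enter the determinant.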
Two types of transformation that follow this line of thought are known as RNVP and NICE. They form the heart of a Boltzmann generator, which hopefully finally explains how all of the above ties into the actual topic of this post. So let’s take a closer look at these before diving into Boltzmann generators themselves.
RNVP stands for real-valued non-volume preserving and was introduced by Dinh et al. in 2017. It’s also referred to as an affine coupling layer. It splits the $D$ input dimensions into two parts:
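- The first $d$ dimensions of the input variable remain unaltered.
- The remaining $D - d$ dimensions undergo an affine transformation, that is they are scaled and translated. Both scaling and translation are functions of the first $d$ dimensions.

$$\begin{aligned} \mathbf{x}_{1:d} &= \mathbf{u}_{1:d}, \\ \mathbf{x}_{d+1:D} &= \mathbf{u}_{d+1:D} \odot \exp\big(s(\mathbf{u}_{1:d})\big) + t(\mathbf{u}_{1:d}), \end{aligned} \tag{4}$$

where $\odot$ is the element-wise Hadamard product and $s$ and $t$ are the scale and translation functions, respectively, both mapping $\mathbb{R}^d \to \mathbb{R}^{D-d}$. The crucial point here is that since $\mathbf{x}_{1:d}$ and $\mathbf{u}_{1:d}$ are identical, $s$ and $t$ don’t have to be invertible themselves for eq. (4) to be easily invertible as a whole. We can simply get back from $\mathbf{x}$ to $\mathbf{u}$ by computing

$$\begin{aligned} \mathbf{u}_{1:d} &= \mathbf{x}_{1:d}, \\ \mathbf{u}_{d+1:D} &= \big(\mathbf{x}_{d+1:D} - t(\mathbf{x}_{1:d})\big) \oslash \exp\big(s(\mathbf{x}_{1:d})\big), \end{aligned}$$

with $\oslash$ denoting Hadamard division. Moreover, since the Jacobian is lower triangular,

$$J_T(\mathbf{u}) = \begin{pmatrix} \mathbb{1}_d & \mathbf{0}_{d \times (D-d)} \\ \frac{\partial \mathbf{x}_{d+1:D}}{\partial \mathbf{u}_{1:d}} & \operatorname{diag}\big(\exp(s(\mathbf{u}_{1:d}))\big) \end{pmatrix},$$

its determinant is easy to compute:

$$\det J_T(\mathbf{u}) = \prod_{j=1}^{D-d} \exp\big(s(\mathbf{u}_{1:d})_j\big) = \exp\Big(\sum_{j=1}^{D-d} s(\mathbf{u}_{1:d})_j\Big).$$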
It’s worth emphasizing that since $T^{-1}$ does not require computing $s^{-1}$ or $t^{-1}$ and $\det J_T$ does not involve computing the Jacobians of $s$ or $t$, those functions can be arbitrarily complex and thus are usually implemented as deep non-invertible neural networks.
Since a single affine coupling layer leaves some dimensions (channels) of the random variable unchanged, such layers usually appear in pairs with their channels reversed so that the combined block transforms all channels.
While there are more expressive flows, RNVP is still the most generally applicable because both sampling and evaluating probabilities of external samples are efficient. This is because all elements of $s$ and $t$ can be computed in parallel since all inputs are available from the start. $\mathbf{x}$ (or $\mathbf{u}$, when running the layer in reverse) therefore pops out in a single pass.
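To see the pieces of an affine coupling layer working together, here is a minimal NumPy sketch. It is not the reference RNVP implementation: the scale and translation functions are stand-in two-layer networks with fixed random weights, the dimensions `D = 4` and `d = 2` are arbitrary, and all function names are invented for the example. It checks that the layer inverts exactly even though the networks themselves are not invertible, and that the cheap log-determinant (the sum of the scale outputs) agrees with a brute-force finite-difference Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, hidden = 4, 2, 8   # total dims, untouched dims, hidden width (all arbitrary)

def make_mlp(n_in, n_out):
    """Tiny fixed-weight network standing in for a learned s or t."""
    W1 = rng.normal(scale=0.5, size=(n_in, hidden))
    W2 = rng.normal(scale=0.5, size=(hidden, n_out))
    return lambda h: np.tanh(h @ W1) @ W2

s_net = make_mlp(d, D - d)   # scale function s: R^d -> R^(D-d)
t_net = make_mlp(d, D - d)   # translation function t: R^d -> R^(D-d)

def coupling_forward(u):
    """x = T(u) together with log|det J_T(u)| for a batch of inputs."""
    u1, u2 = u[:, :d], u[:, d:]
    s = s_net(u1)
    x2 = u2 * np.exp(s) + t_net(u1)
    return np.concatenate([u1, x2], axis=1), s.sum(axis=1)

def coupling_inverse(x):
    """u = T^{-1}(x); s and t are only ever evaluated, never inverted."""
    x1, x2 = x[:, :d], x[:, d:]
    u2 = (x2 - t_net(x1)) / np.exp(s_net(x1))
    return np.concatenate([x1, u2], axis=1)

u = rng.standard_normal((5, D))
x, log_det = coupling_forward(u)
u_recovered = coupling_inverse(x)

# Brute-force check of the log-determinant for the first sample
eps = 1e-6
J = np.zeros((D, D))
for j in range(D):
    step = np.zeros(D)
    step[j] = eps
    plus = coupling_forward((u[0] + step)[None, :])[0][0]
    minus = coupling_forward((u[0] - step)[None, :])[0][0]
    J[:, j] = (plus - minus) / (2.0 * eps)
log_det_numeric = np.linalg.slogdet(J)[1]

print(np.max(np.abs(u_recovered - u)), log_det[0], log_det_numeric)
```

To transform all channels, one would stack a second such layer after reversing the channel order (e.g. `x[:, ::-1]`), a permutation whose Jacobian determinant has absolute value 1 and therefore adds nothing to the log-determinant.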
Developed two years earlier by the same guys (Dinh, et al. 2015), NICE stands for non-linear independent component estimation. It’s an additive coupling layer, i.e. simply RNVP without the scaling factor.
Normalizing flows seem to really be taking off at the moment. Here’s an incomplete list of awesome resources on the topic.
- Ari Seff created a super helpful 3blue1brown-style video explaining the basics of normalizing flows.
- Some of the guys at DeepMind involved in the development of NFs published a very thorough and very readable review article on the subject just days after I published this post. (I updated this post with some of the insights I gained there.)
- Andrej Karpathy created a repo with PyTorch implementations of the most commonly used flows (also just days after this post).
- PyMC3 has a very helpful notebook showcasing how to work with flows in practice and comparing it to their NUTS-based HMC implementation.
I’ve also started compiling a repo of helpful resources on NFs. Feel free to submit PRs to gather even more sources and advice on this topic.