
Normalizing Flows

I’m a big fan of Feynman’s technique of learning something new by trying to explain it to someone else. So in this post, I’ll try to explain normalizing flows (NF), a relatively simple yet powerful new tool in statistics for constructing expressive probability distributions from a simple base distribution using smooth bijective transformations. It’s a hot topic right now and has exciting applications in the context of probabilistic modeling and inference.

Before we get started, huge thanks to Papamakarios et al., Lilian Weng and Adam Kosiorek from whom I learned most of what follows.

The Big Picture

First, let’s take a step back and look at which developments preceded NFs and how they fit into the larger setting of a machine learning subfield known as deep generative models (DGM). There are currently three main types of DGMs:

  • generative adversarial networks (GAN), which pit a generator against a discriminator that learns to tell generated samples from real ones,
  • variational autoencoders (VAE), which train an encoder and a decoder to maximize a lower bound on the data likelihood, and
  • flow-based generative networks (FGN), which push a simple base distribution through a chain of invertible transformations and are the subject of this post.

The cool thing about FGNs is that they come with a tractable density: a sufficiently accurate proxy for the data distribution p(\vec x) enables a host of useful downstream applications such as generating new samples, scoring the likelihood of unseen data (e.g. for anomaly detection), and imputing missing values.

What are Normalizing Flows?

The problem that normalizing flows aim to address is turning a simple distribution into a complex, multi-modal one in an invertible manner. Why would we want to do that? Training a machine learning model usually means tuning its parameters to maximize the probability of observed training data under the model. To quantify this probability, we have to assume some probability distribution as the model’s output. In classification, this is typically a categorical distribution and in regression usually a Gaussian, mostly because it’s the only non-uniform continuous distribution we really know how to deal with. However, assuming the model output to be distributed according to a Gaussian is problematic because the world is complicated and the true probability density function (PDF) of actual data will in general be completely unlike a Gaussian.

Luckily, we can take a simple distribution like a Gaussian, sample from it and then transform those samples using smooth bijective functions, which essentially amounts to a change of variables in a probability distribution. Repeating this process multiple times can quickly result in a complex PDF for the transformed variable. Danilo Rezende formalized this in his 2015 paper on normalizing flows.
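
To get a feel for this, here’s a minimal numpy sketch (the bijection f(z) = z + 3 tanh(3z) is just an illustrative choice, not from any paper): pushing Gaussian samples through a single smooth bijection already produces a bimodal distribution.

```python
import numpy as np

# A quick numerical illustration: push standard-normal samples through the
# smooth bijection f(z) = z + 3*tanh(3z). f is strictly increasing
# (f'(z) = 1 + 9/cosh(3z)^2 > 0), hence invertible, yet the strong stretching
# around z = 0 thins the density out there and the result is bimodal.
rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)
x = z + 3 * np.tanh(3 * z)

# Fraction of samples near the center vs. near one of the two resulting modes:
print(np.mean(np.abs(x) < 0.5), np.mean(np.abs(x - 4.0) < 0.5))
```

The first number comes out much smaller than the second, even though the base samples are most concentrated around zero; the change-of-variables formula introduced below makes this volume effect precise.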

Definition

Let \vec x \in \reals^d be a continuous, real random variable. We would like to know its joint distribution p_x(\vec x). However, we know from the get-go that this is too complicated an expression to write down or manipulate exactly. So we content ourselves with indirectly constructing as accurate an approximation as possible.

For that, we can use flow-based modeling. The main idea is to express \vec x as a transformation f of a real vector \vec z sampled from a simpler distribution p_z(\vec z):

\vec x = f(\vec z) \quad\text{with}\quad \vec z \sim p_z(\vec z). \tag{1}

p_z(\vec z) is called the _base distribution_ of the model. It’s usually a diagonal-covariance Gaussian. (To connect terminology with latent-variable models such as VAEs, p_z(\vec z) is also sometimes referred to as the prior and \vec z as the latent variable. However, as Papamakarios et al. rightly point out in their excellent review, \vec z is not actually latent (unobserved) since knowing \vec x uniquely determines the corresponding \vec z = f^{-1}(\vec x).)

The transformation f and base distribution p_z(\vec z) are the _only_ two defining properties of a flow model. Both may have their own parameters, denoted \vec\phi and \vec\psi, respectively. Together, f_\vec\phi and p_z(\vec z; \vec\psi) induce a family of distributions over \vec x parameterized by \vec\theta = \{\vec\phi, \vec\psi\} (as usual in variational inference). If we’re doing posterior inference, we can only get as close to the true posterior as the most similar member of this family. (We won’t always indicate the dependence on these parameters to simplify notation, but in the following f actually refers to f_\vec\phi and p_z(\vec z) to p_z(\vec z; \vec\psi).)

Change of Variables for Probability Distributions

For flow models to be tractable, f must be a diffeomorphism. This is just a fancy way of saying it has to be invertible (bijective) and differentiable (smooth), with a differentiable inverse. As a corollary, \vec z and \vec x must be of equal dimension (else f could not be invertible). Given these constraints, the generated density p_x(\vec x) is well-defined and computable in practice from the base distribution p_z(\vec z) by a change of variables:

p_x(\vec x) = p_z(\vec z) \ \bigl|\det J_f(\vec z)\bigr|^{-1}, \tag{2}

where J_f(\vec z) denotes the Jacobian of f at \vec z, that is the d \times d matrix of first-order partial derivatives,

J_f(\vec z) = \frac{\partial f(\vec z)}{\partial \vec z} = \begin{pmatrix} \frac{\partial f_1}{\partial z_1} & \cdots & \frac{\partial f_1}{\partial z_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_d}{\partial z_1} & \cdots & \frac{\partial f_d}{\partial z_d} \end{pmatrix}.

Derivation

Eq. (2) follows from the change of variables theorem (a.k.a. integration by substitution, integration’s counterpart to the chain rule of differentiation). Since probability densities integrate to one, we have

\int_{\reals^d} p_z(\vec z) \ \dif\vec z = 1 = \int_{\reals^d} p_x(\vec x) \ \dif\vec x,

If under the transformation f an infinitesimal neighborhood \dif\vec z around \vec z is mapped to an infinitesimal neighborhood \dif\vec x around \vec x = f(\vec z), then, due to conservation of probability mass, this equality has to hold locally, i.e. p_z(\vec z) \ \dif\vec z = p_x(\vec x) \ \dif\vec x. We can rewrite this as

p_x(\vec x) = p_z(\vec z) \left|\det\frac{\dif \vec z}{\dif \vec x}\right| = p_z(\vec z) \left|\det\frac{\dif f^{-1}(\vec x)}{\dif \vec x}\right| = p_z(\vec z) \left|\det\frac{\dif f(\vec z)}{\dif \vec z}\right|^{-1}.

In the last step we used two things:

  • the inverse function theorem, which states that for f continuously differentiable (which it is, in our case, by definition), we have

    \frac{\dif f^{-1}(\vec x)}{\dif \vec x} = \frac{\dif \vec z}{\dif \vec x} = \left[\frac{\dif \vec x}{\dif \vec z}\right]^{-1} = \left[\frac{\dif f(\vec z)}{\dif \vec z}\right]^{-1},
  • and \det(\mat M^{-1}) = \det(\mat M)^{-1}, because \det(\mat M) \det(\mat M^{-1}) = \det(\mat M \ \mat M^{-1}) = \det(\mat I) = 1.

Intuitively, think of f as expanding and contracting the space \reals^d so as to mold p_z(\vec z) into p_x(\vec x). The absolute Jacobian determinant |\det J_f(\vec z)| measures the change of volume of a small neighborhood around \vec z under f, i.e. the ratio of volumes of corresponding neighborhoods in the \vec x- and \vec z-parametrizations of space. Hence we divide by it to account for the stretching or contracting of space under f.
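
As a quick sanity check of eq. (2), here’s a minimal numpy sketch in one dimension (the bijection f(z) = exp(z) is just an illustrative choice): the histogram of transformed samples should agree with the density predicted by the change-of-variables formula.

```python
import numpy as np

# With z ~ N(0, 1) and x = f(z) = exp(z), eq. (2) predicts
# p_x(x) = p_z(log x) * |d exp(z)/dz|^{-1} = p_z(log x) / x.
rng = np.random.default_rng(1)
z = rng.standard_normal(2_000_000)
x = np.exp(z)

def p_x(xs):
    p_z = np.exp(-np.log(xs) ** 2 / 2) / np.sqrt(2 * np.pi)  # base density at z = f^{-1}(x)
    return p_z / xs                                          # divide by |det J_f(z)| = exp(z) = x

hist, edges = np.histogram(x, bins=np.linspace(0, 20, 201), density=True)
centers = (edges[:-1] + edges[1:]) / 2
print(hist[10], p_x(centers[10]))  # the two values agree up to histogram noise (bin near x = 1.05)
```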

The cool thing about diffeomorphisms is that they form a group under composition. Again, this is a fancy way of saying that successively applying multiple diffeomorphisms always results in a new transformation that is itself a diffeomorphism. Given f_1 and f_2, the inverse and Jacobian determinant of their composition f_2 \circ f_1 are

\begin{aligned} (f_2 \circ f_1)^{-1} &= f_1^{-1} \circ f_2^{-1} \\[1ex] \det J_{f_2 \circ f_1}(\vec z) &= \det J_{f_2}\bigl(f_1(\vec z)\bigr) \cdot \det J_{f_1}(\vec z). \end{aligned}
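
As a small numerical check of the second identity (the two toy diffeomorphisms of \reals^2 below are made up for illustration), the Jacobian determinant of the composition, estimated by finite differences, matches the product of the individual determinants:

```python
import numpy as np

# Check det J_{f2∘f1}(z) = det J_{f2}(f1(z)) * det J_{f1}(z) for two toy maps of R^2.
def f1(z): return np.array([np.sinh(z[0]), z[1] + z[0] ** 2])
def f2(y): return np.array([2 * y[0] + y[1], y[1] ** 3 + y[1]])
def J1(z): return np.array([[np.cosh(z[0]), 0.0], [2 * z[0], 1.0]])
def J2(y): return np.array([[2.0, 1.0], [0.0, 3 * y[1] ** 2 + 1]])

def num_jacobian(f, z, eps=1e-6):
    """Central finite-difference Jacobian of f at z."""
    cols = [(f(z + eps * e) - f(z - eps * e)) / (2 * eps) for e in np.eye(len(z))]
    return np.stack(cols, axis=1)

z = np.array([0.3, -1.2])
det_composed = np.linalg.det(num_jacobian(lambda v: f2(f1(v)), z))
det_product = np.linalg.det(J2(f1(z))) * np.linalg.det(J1(z))
print(np.isclose(det_composed, det_product))  # True (up to finite-difference error)
```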

Therefore, we can apply a series of transformations f_i, i \in \{1, \dots, k\}, to generate a normalizing flow,

\vec z_k = \underbrace{f_k \circ \dots \circ f_1}_{f_{xz}}(\vec z_0), \qquad \vec z_0 \sim p_0(\vec z_0), \tag{3}

where \vec z = \vec z_0 and \vec x = \vec z_k. The probability distributions p_x(\vec x) = p_k(\vec z_k) and p_z(\vec z) = p_0(\vec z_0) of the random variables at both ends of the flow can be related by repeatedly applying eq. (2),

\vec z_k \sim p_k(\vec z_k) = p_0(\vec z_0) \ \prod_{i=1}^k \left| \det \frac{\partial f_i(\vec z_{i-1})}{\partial \vec z_{i-1}} \right|^{-1}. \tag{4}

This flow can transform a simple base distribution such as a multivariate Gaussian into a complicated multi-modal one as illustrated below.
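
To make eqs. (3) and (4) concrete, here’s a minimal numpy sketch of a toy flow built from elementwise bijections of the form sinh(a \odot z + b) (the layer type and its randomly drawn parameters are arbitrary choices for illustration, not a trained model). Elementwise maps have diagonal Jacobians, so each log-determinant is just a sum over dimensions.

```python
import numpy as np

def flow_forward(z0, params):
    """Map base samples z0 of shape (n, d) through the flow; return x and log|det J_f(z0)|."""
    z, log_det = z0, np.zeros(len(z0))
    for a, b in params:                    # one layer: z -> sinh(a * z + b), elementwise
        y = a * z + b
        log_det += np.sum(np.log(np.abs(a)) + np.log(np.cosh(y)), axis=-1)
        z = np.sinh(y)
    return z, log_det

def flow_log_density(z0, log_det):
    """log p_k(z_k) = log p_0(z_0) - log|det J_f(z_0)|, cf. eq. (4)."""
    d = z0.shape[1]
    log_p0 = -0.5 * np.sum(z0 ** 2, axis=-1) - 0.5 * d * np.log(2 * np.pi)
    return log_p0 - log_det

rng = np.random.default_rng(2)
d, k = 2, 3
params = [(rng.uniform(0.5, 1.5, d), rng.normal(0.0, 1.0, d)) for _ in range(k)]
z0 = rng.standard_normal((5, d))
x, log_det = flow_forward(z0, params)
print(flow_log_density(z0, log_det))       # log-density of the flow model at the generated x
```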

Figure: a normalizing flow transforming a simple distribution p_0(\vec z_0) step by step into a complex one p_k(\vec z_k) approximating some target distribution p_x(\vec x). Source: Lilian Weng

You may have already guessed that the word ‘flow’ in normalizing flow refers to the trajectories that a collection of samples from p_z(\vec z) move along as the transformations are applied to them, sort of like the particles of a compressible fluid undergoing stirring motions to achieve a certain pattern. The modifier ‘normalizing’ comes from the fact that we renormalize the resulting density p_i(\vec z_i) after every transformation f_i by its Jacobian determinant |\det J_{f_i}(\vec z_{i-1})| in order to retain a normalized probability distribution at every step.

Usage

There are two ways we might want to use a flow model.

  1. We can generate samples using eq. (1).
  2. We can evaluate the model’s density at a given point using eq. (2).

For practical applications, depending on which of these we need (and we might need both), we must ensure that certain computational operations can be performed efficiently.

  1. To draw samples, we need the ability to sample p_z(\vec z) and to compute the forward transformation f.
  2. To evaluate the model’s density, we need to perform the inverse transformation f^{-1}, compute the Jacobian determinant \det J_f(\vec z), and finally evaluate the density p_z(\vec z).

Of course, before using it, we first need to fit the model.

Fitting a Normalizing Flow

As with many probabilistic models, we fit a normalizing flow by minimizing the divergence between its density p_x(\vec x; \vec\theta) and the target distribution p_x^*(\vec x). The knobs and dials at our disposal are the model’s parameters \vec\theta = \{ \vec\phi, \vec\psi \}, i.e. those of the transformation f_\vec\phi and the base distribution p_z(\vec z; \vec\psi). By far the most common measure of discrepancy between p_x(\vec x; \vec\theta) and p_x^*(\vec x) is the Kullback–Leibler (KL) divergence. Since the KL divergence is asymmetric, there are two ways of applying it. Which one to choose depends on whether it’s easier to sample from the target or the base distribution.

Sampling the target density

This loss function is useful in cases where we have or can generate samples from the target distribution p_x^*(\vec x).

\begin{aligned} \Lcal(\vec\theta) &= D_\text{KL}\bigl(p_x^*(\vec x) \ || \ p_x(\vec x; \vec\theta)\bigr) \\[1ex] &= \int_{\reals^d} p_x^*(\vec x) \ \log\left(\frac{p_x^*(\vec x)}{p_x(\vec x; \vec\theta)}\right) \ \dif\vec x \\[1ex] &= -\int_{\reals^d} p_x^*(\vec x) \ \log p_x(\vec x; \vec\theta) \ \dif\vec x + \int_{\reals^d} p_x^*(\vec x) \ \log p_x^*(\vec x) \ \dif\vec x \\[1ex] &= -\expec_{p_x^*}\Bigl(\log p_x(\vec x; \vec\theta)\Bigr) + \const \\[1ex] &= -\expec_{p_x^*}\left[\log p_z\!\left(f_{\vec\phi}^{-1}(\vec x); \vec\psi\right) + \log\left|\det J_{f_{\vec\phi}^{-1}}(\vec x)\right|\right] + \const. \end{aligned}

Given a set of samples \{\vec x_n\}_{n=1}^N from p_x^*(\vec x), we can obtain an unbiased Monte Carlo estimate of \Lcal(\vec\theta) as

\Lcal(\vec\theta) \approx -\frac{1}{N} \sum_{n=1}^N \left[\log p_z\left(f_{\vec\phi}^{-1}(\vec x_n); \vec\psi\right) + \log\left| \det J_{f_{\vec\phi}^{-1}}(\vec x_n) \right|\right] + \const. \tag{5}

We can then iteratively minimize \Lcal(\vec\theta) using stochastic gradient descent by differentiating eq. (5) with respect to \vec\theta, yielding

\begin{aligned} \nabla_{\vec\phi} \Lcal(\vec\theta) &\approx -\frac{1}{N} \sum_{n=1}^N \nabla_{\vec\phi}\log p_z\left(f_{\vec\phi}^{-1}(\vec x_n); \vec\psi\right) + \nabla_{\vec\phi} \log\left| \det J_{f_{\vec\phi}^{-1}}(\vec x_n)\right| \\[1ex] \nabla_{\vec\psi} \Lcal(\vec\theta) &\approx -\frac{1}{N} \sum_{n=1}^N \nabla_{\vec\psi} \log p_z\left(f_{\vec\phi}^{-1}(\vec x_n); \vec\psi\right). \end{aligned}

Finding a minimum of eq. (5) is equivalent to finding the maximum likelihood estimates of the flow model’s parameters given the samples \{\vec x_n\}_{n=1}^N. Thus to fit a flow by maximum likelihood, besides being able to sample from p_x^*(\vec x), we need to be able to compute and differentiate the inverse transformation f_{\vec\phi}^{-1}(\vec x), its Jacobian determinant \det J_{f_{\vec\phi}^{-1}}(\vec x) and the base density p_z(\vec z; \vec\psi).
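
As a minimal illustration of eq. (5), here’s a hedged PyTorch sketch that fits a toy one-dimensional flow x = f(z) = a z + b by maximum likelihood (the names a, b, x_data and the Gaussian stand-in for samples from p_x^*(\vec x) are made up for illustration):

```python
import torch

# Toy flow x = f(z) = a * z + b with learnable a, b and a standard-normal base.
a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
base = torch.distributions.Normal(0.0, 1.0)

x_data = 2.5 * torch.randn(1024) + 3.0      # stand-in for samples from p_x^*

opt = torch.optim.Adam([a, b], lr=1e-2)
for step in range(2000):
    z = (x_data - b) / a                    # f^{-1}(x)
    log_det_inv = -torch.log(a.abs())       # log|det J_{f^{-1}}(x)| = -log|a|
    loss = -(base.log_prob(z) + log_det_inv).mean()   # eq. (5), constant dropped
    opt.zero_grad()
    loss.backward()
    opt.step()

print(a.item(), b.item())                   # should approach roughly 2.5 and 3.0
```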

Sampling the base density

Alternatively, we can swap the order of arguments of the KL divergence to obtain a slightly different loss function:

\begin{aligned} \Lcal(\vec\theta) &= D_\text{KL}\bigl(p_x(\vec x; \vec\theta) \ || \ p_x^*(\vec x)\bigr) \\ &= \expec_{p_x(\vec x; \vec\theta)} \Bigl(\log p_x(\vec x; \vec\theta) - \log p_x^*(\vec x)\Bigr) \\ &= \expec_{p_z(\vec z; \vec\psi)} \Bigl(\log p_z(\vec z; \vec\psi) - \log\bigl|\det J_{f_{\vec\phi}}(\vec z)\bigr| - \log p_x^*\bigl(f_{\vec\phi}(\vec z)\bigr)\Bigr). \end{aligned}

We used eqs. (1) and (2) in the last step to replace \vec x with \vec z. This loss function comes in handy if we can sample from the base density p_z(\vec z) and are able to evaluate the target density p_x^*(\vec x) (at least up to a normalizing constant, which becomes an additive constant under the log and drops out of the gradient).

By generating a set of samples \{\vec z_n\}_{n=1}^N from p_z(\vec z; \vec\psi), the gradient of \Lcal(\vec\theta) can be estimated as

\nabla_{\vec\phi} \Lcal(\vec\theta) \approx -\frac{1}{N} \sum_{n=1}^N \nabla_{\vec\phi} \log\left| \det J_{f_{\vec\phi}}(\vec z_n) \right| + \nabla_{\vec\phi}\log p_x^*\bigl(f_{\vec\phi}(\vec z_n)\bigr).

The gradient with respect to \vec\psi isn’t strictly necessary since any adjustment to the parameters of the base distribution p_z(\vec z; \vec\psi) can be absorbed into the transformation f_\vec\phi. We can fit with respect to \vec\phi only without loss of generality.

This loss function requires that we be able to differentiate f_\vec\phi and its Jacobian determinant. It works even if we can neither evaluate the base density nor perform the inverse transformation f_{\vec\phi}^{-1}. (Of course, we will still need both if we want to evaluate the fitted model’s density.)
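
And here’s the corresponding hedged sketch for the reverse-KL loss above, again for the toy affine flow from before (the unnormalized target log-density log_p_target is made up for illustration). Note that it only needs samples from the base, the forward map f_\vec\phi and its Jacobian determinant:

```python
import torch

a = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)
base = torch.distributions.Normal(0.0, 1.0)

def log_p_target(x):                        # unnormalized: the constant drops out of the gradient
    return -(x - 3.0) ** 2 / (2 * 2.5 ** 2)

opt = torch.optim.Adam([a, b], lr=1e-2)
for step in range(2000):
    z = base.sample((1024,))                # samples from p_z
    x = a * z + b                           # f_phi(z)
    log_det = torch.log(a.abs())            # log|det J_{f_phi}(z)| for the affine map
    loss = (base.log_prob(z) - log_det - log_p_target(x)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(a.item(), b.item())                   # again approaches roughly 2.5 and 3.0
```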

This loss function is the one to choose when using normalizing flows in the context of e.g. variational inference or model distillation. The latter is an exciting application in which a flow model is trained to replace a target model whose density p_x^*(\vec x) can be evaluated but is otherwise inconvenient, e.g. difficult to sample. Two examples of model distillation with flows are Parallel WaveNet by van den Oord et al. and Boltzmann Generators by Noé et al.


In their original paper, Rezende and Mohamed introduced two simple families of transformations, known as planar and radial flows, that satisfy these requirements of invertibility and a cheap Jacobian determinant. They are, however, only useful for low-dimensional random variables and so we will not cover them here. Since the original paper was published in 2015, a host of additional flows have been and continue to be developed. A class of these that is widely applicable yet still has decent expressivity is known as affine coupling flows.

Affine Coupling Flows

To use normalizing flows in actual computation, we are constrained to transformations whose Jacobian determinant is easy to calculate. This makes developing flows with greater expressivity non-trivial. It turns out, though, that by introducing dependencies between different dimensions of the input variable, versatile flows with a tractable Jacobian are possible. Specifically, if dimension i of the transformed variable depends only on dimensions \{1, \dots, i\} of the input variable, then the Jacobian of this transformation is triangular, and a triangular matrix has a determinant given simply by the product of its diagonal entries.
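
A two-line numerical check of that last fact (illustration only):

```python
import numpy as np

# For a (lower) triangular matrix, the determinant is the product of the diagonal.
J = np.tril(np.random.default_rng(3).normal(size=(5, 5)))
print(np.allclose(np.linalg.det(J), np.prod(np.diag(J))))  # True
```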

Two types of transformation that follow this line of thought are known as RNVP and NICE. They form the heart of a Boltzmann generator, which hopefully explains how all of the above ties into the actual topic of this post. So let’s take a closer look at these two before diving into Boltzmann generators themselves.

RNVP

RNVP stands for real-valued non-volume preserving and was introduced by Dinh et al. in 2017. It’s also referred to as an affine coupling layer. It splits the input dimensions into two parts:

f_\text{RNVP} = \begin{cases} \vec x_{1:j} = \vec z_{1:j} \\ \vec x_{j+1:d} = \vec z_{j+1:d} \odot \exp\bigl(S(\vec z_{1:j})\bigr) + T(\vec z_{1:j}) \end{cases} \tag{6}

where \odot is the element-wise Hadamard product and S(\cdot) and T(\cdot) are the scale and translation functions, respectively, both mapping \reals^j \to \reals^{d-j}. The crucial point here is that since \vec x_{1:j} and \vec z_{1:j} are identical, S and T don’t have to be invertible themselves for eq. (6) to be easily invertible as a whole. We can simply get back from \vec x to \vec z by computing

f_\text{RNVP}^{-1} = \begin{cases} \vec z_{1:j} = \vec x_{1:j} \\ \vec z_{j+1:d} = \bigl(\vec x_{j+1:d} - T(\vec x_{1:j})\bigr) \oslash \exp\bigl(S(\vec x_{1:j})\bigr), \end{cases}

with \oslash denoting Hadamard division. Moreover, since the Jacobian is lower triangular,

\vec J_\text{RNVP} = \begin{pmatrix} \unity_j & \mat 0_{j\times(d-j)} \\[1ex] \frac{\partial \vec x_{j+1:d}}{\partial \vec z_{1:j}} & \diag\bigl(\exp(S(\vec z_{1:j}))\bigr) \end{pmatrix},

its determinant is easy to compute:

\det(\vec J_\text{RNVP}) = \prod_{i=1}^{d-j} \exp\bigl(S(\vec z_{1:j})\bigr)_i = \exp\left(\sum_{i=1}^{d-j} S(\vec z_{1:j})_i\right).

It’s worth emphasizing that since f_\text{RNVP}^{-1} does not require computing S^{-1} or T^{-1}, and \det(\vec J_\text{RNVP}) does not involve the Jacobian determinants \det(J_S) or \det(J_T), those functions can be arbitrarily complex and thus are usually implemented as deep non-invertible neural networks.

Since a single affine coupling layer leaves some dimensions (channels) of the random variable unchanged, such layers usually come in pairs with the two channel groups swapped, so that the combined block transforms all channels.

While there are more expressive flows, RNVP remains among the most generally applicable because both sampling and evaluating the probability of external samples are efficient. This is because all elements of S and T can be computed in parallel, since all of their inputs are available from the start; \vec x therefore pops out in a single forward pass, and the same holds for \vec z in the inverse direction.
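
To make this concrete, here’s a hedged PyTorch sketch of a single affine coupling layer implementing eq. (6) (the two-layer MLPs for S and T and the hidden size are arbitrary choices, not from the paper):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Hedged sketch of a single RNVP coupling layer, cf. eq. (6)."""

    def __init__(self, d, j, hidden=64):
        super().__init__()
        self.j = j                              # dimensions 1..j pass through unchanged
        self.S = nn.Sequential(nn.Linear(j, hidden), nn.Tanh(), nn.Linear(hidden, d - j))
        self.T = nn.Sequential(nn.Linear(j, hidden), nn.Tanh(), nn.Linear(hidden, d - j))

    def forward(self, z):
        z1, z2 = z[:, :self.j], z[:, self.j:]
        s = self.S(z1)
        x2 = z2 * torch.exp(s) + self.T(z1)     # eq. (6)
        log_det = s.sum(dim=1)                  # log|det J| = sum_i S(z_{1:j})_i
        return torch.cat([z1, x2], dim=1), log_det

    def inverse(self, x):
        x1, x2 = x[:, :self.j], x[:, self.j:]
        z2 = (x2 - self.T(x1)) * torch.exp(-self.S(x1))  # needs neither S^{-1} nor T^{-1}
        return torch.cat([x1, z2], dim=1)

layer = AffineCoupling(d=4, j=2)
z = torch.randn(8, 4)
x, log_det = layer(z)
print(torch.allclose(z, layer.inverse(x), atol=1e-5))    # True: the layer inverts exactly
```

Stacking several such layers with the channel split swapped between them, as described above, yields a full RNVP flow that can be trained with the maximum-likelihood loss of eq. (5).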

NICE

Developed two years earlier by the same authors (Dinh et al., 2015), NICE stands for non-linear independent component estimation. It’s an additive coupling layer, i.e. simply RNVP without the scaling factor:

f_\text{NICE} = \begin{cases} \vec x_{1:j} = \vec z_{1:j} \\ \vec x_{j+1:d} = \vec z_{j+1:d} + T(\vec z_{1:j}) \end{cases}

and

f_\text{NICE}^{-1} = \begin{cases} \vec z_{1:j} = \vec x_{1:j} \\ \vec z_{j+1:d} = \vec x_{j+1:d} - T(\vec x_{1:j}). \end{cases}
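
For completeness, here’s a hedged sketch of the corresponding additive coupling layer (the MLP for T is again an arbitrary choice). Since the scale term is gone, log|det J| = 0, i.e. the transformation is volume preserving:

```python
import torch
import torch.nn as nn

class AdditiveCoupling(nn.Module):
    """Hedged sketch of a NICE coupling layer: RNVP without the scale, so |det J| = 1."""

    def __init__(self, d, j, hidden=64):
        super().__init__()
        self.j = j
        self.T = nn.Sequential(nn.Linear(j, hidden), nn.Tanh(), nn.Linear(hidden, d - j))

    def forward(self, z):
        z1, z2 = z[:, :self.j], z[:, self.j:]
        return torch.cat([z1, z2 + self.T(z1)], dim=1)

    def inverse(self, x):
        x1, x2 = x[:, :self.j], x[:, self.j:]
        return torch.cat([x1, x2 - self.T(x1)], dim=1)
```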

Further Reading

To learn more about normalizing flows, here are some resources to take you further.

I also started this much more up-to-date collection on NFs that’s received an unexpected amount of attention. Feel free to submit PRs to gather even more sources/advice/applications.