Commit 15e64aa: Add Bijectors.jl section
1 parent 9ae004d

1 file changed: +120 -16 lines

src/transforms.qmd (+120 -16)
@@ -8,7 +8,12 @@ import Random
 Random.seed!(468);
 ```

-This article is about transforming distributions and Bijectors.jl.
+This article seeks to motivate Bijectors.jl and to explain how distributions are transformed in the Turing.jl probabilistic programming language.
+
+It assumes:
+
+- some basic knowledge of probability distributions (the notions of sampling from them and calculating the probability density function for a given distribution); and
+- some calculus (the chain and product rules for differentiation, and changes of variables in integrals).

 ## Sampling from a distribution

@@ -152,30 +157,32 @@ This equation is (11.5) in Bishop's textbook.

 ::: {.callout-note}
 The absolute value here takes care of the case where $f$ is decreasing, i.e., the distribution is flipped.
-You can try this out with the transformation $y = -\exp(x)$: you will have to flip the integration limits round to ensure that the integral comes out positive.
+You can try this out with the transformation $y = -\exp(x)$.
+If $a < b$, then $-\exp(a) > -\exp(b)$, and so you will have to swap the integration limits to ensure that the integral comes out positive.
 :::

 ## The Jacobian

 In general, we may have transforms that act on multivariate distributions, for example something mapping $p(x_1, x_2)$ to $q(y_1, y_2)$.
 In this case, the rule above has to be extended by replacing the derivative $\mathrm{d}x/\mathrm{d}y$ with the determinant of the Jacobian matrix:

-$$\mathbf{J} = \begin{pmatrix}
+$$\mathcal{J} = \begin{pmatrix}
 \partial x_1/\partial y_1 & \partial x_1/\partial y_2 \\
 \partial x_2/\partial y_1 & \partial x_2/\partial y_2
 \end{pmatrix}.$$

 and specifically,

-$$q(y_1, y_2) = p(x_1, x_2) \left| \det(\mathbf{J}) \right|.$$
+$$q(y_1, y_2) = p(x_1, x_2) \left| \det(\mathcal{J}) \right|.$$

-This is the same as equation (11.9) in Bishop, except that he denotes the absolute value of the determinant with just $|\mathbf{J}|$.
+This is the same as equation (11.9) in Bishop, except that he denotes the absolute value of the determinant with just $|\mathcal{J}|$.

 ::: {.callout-note}
 In different contexts the Jacobian can have different 'numerators' and 'denominators' in the partial derivatives.
-For example, if $\mathbf{y} = f(\mathbf{x})$, then it's common to write $\mathbf{J}_f$ as a matrix of partial derivatives of elements of $y$ with respect to elements of $x$.
-(For example, $\mathbf{J}_f$ is used in [Newton's method](https://en.wikipedia.org/wiki/Newton%27s_method) to find the zeroes of $f$, i.e. the values of $\mathbf{x}$ such that $\mathbf{y} = \mathbf{0}$.)
-However, it is always the case that the elements of the 'numerator' vary with rows and the elements of the 'denominator' vary with columns.
+For example, if $\mathbf{y} = f(\mathbf{x})$, then it's common to write $\mathbf{J}$ as a matrix of partial derivatives of elements of $\mathbf{y}$ with respect to elements of $\mathbf{x}$.
+Indeed, later in this article we will see that Bijectors.jl uses this convention.
+
+It is always the case, though, that the elements of the 'numerator' vary with rows and the elements of the 'denominator' vary with columns.
 :::

 The rest of this section will be devoted to an example to show that this works, and contains some slightly less pretty mathematics.
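The change-of-variables rule discussed in this hunk is easy to check numerically. As a rough sketch (in Python with SciPy here, rather than the article's Julia), take the decreasing transformation $y = -\exp(x)$ from the callout: the inverse is $x = \log(-y)$, so the transformed density is $q(y) = p(\log(-y)) \cdot 1/|y|$, and it should still integrate to 1 over the new support $(-\infty, 0)$:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def q(y):
    # Density of y = -exp(x) where x ~ Normal(0, 1).
    # Inverse: x = log(-y); |dx/dy| = 1/|y| (the absolute value
    # handles the fact that the transformation is decreasing).
    return norm.pdf(np.log(-y)) / np.abs(y)

# The transformed density should integrate to 1 over (-inf, 0).
total, _ = quad(q, -np.inf, 0)
print(total)
```

The integral comes out to approximately 1, confirming that taking the absolute value of the derivative yields a valid density even for a decreasing transformation.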
@@ -240,7 +247,7 @@ $$\frac{\partial x_2}{\partial y_1} = \frac{1}{2\pi} \left(\frac{1}{1 + (y_2/y_1

 Putting together the Jacobian matrix, we have:

-$$\mathbf{J} = \begin{pmatrix}
+$$\mathcal{J} = \begin{pmatrix}
 -y_1 x_1 & -y_2 x_1 \\
 -cy_2/y_1^2 & c/y_1 \\
 \end{pmatrix},$$
@@ -249,7 +256,7 @@ where $c = [2\pi(1 + (y_2/y_1)^2)]^{-1}$.
 The determinant of this matrix is

 $$\begin{align}
-\det(\mathbf{J}) &= -cx_1 - cx_1(y_2/y_1)^2 \\
+\det(\mathcal{J}) &= -cx_1 - cx_1(y_2/y_1)^2 \\
 &= -cx_1\left[1 + \left(\frac{y_2}{y_1}\right)^2\right] \\
 &= -\frac{1}{2\pi} x_1 \\
 &= -\frac{1}{2\pi}\exp{\left(-\frac{y_1^2}{2}\right)}\exp{\left(-\frac{y_2^2}{2}\right)},
@@ -258,7 +265,7 @@ $$\begin{align}
 Coming right back to our probability density, we have that

 $$\begin{align}
-q(y_1, y_2) &= p(x_1, x_2) \cdot |\det(\mathbf{J})| \\
+q(y_1, y_2) &= p(x_1, x_2) \cdot |\det(\mathcal{J})| \\
 &= \frac{1}{2\pi}\exp{\left(-\frac{y_1^2}{2}\right)}\exp{\left(-\frac{y_2^2}{2}\right)},
 \end{align}$$
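The determinant derived in these hunks can be cross-checked with finite differences. The sketch below (Python/NumPy rather than the article's Julia) reconstructs the inverse maps $x_1 = \exp(-(y_1^2 + y_2^2)/2)$ and $x_2 = \arctan(y_2/y_1)/2\pi$ from the partial derivatives quoted above (the additive constant of integration in $x_2$ is an assumption on my part, but it is irrelevant to the Jacobian), and compares a numerical Jacobian determinant against the closed form $-\frac{1}{2\pi}\exp(-y_1^2/2)\exp(-y_2^2/2)$:

```python
import numpy as np

def inverse_transform(y):
    # x1, x2 as functions of (y1, y2), reconstructed from the
    # partial derivatives in the derivation above.
    y1, y2 = y
    x1 = np.exp(-(y1**2 + y2**2) / 2)
    x2 = np.arctan(y2 / y1) / (2 * np.pi)
    return np.array([x1, x2])

def jacobian_fd(f, y, h=1e-6):
    # Central-difference Jacobian: J[i, j] = d f_i / d y_j,
    # i.e. rows vary with the 'numerator' x, columns with y.
    J = np.zeros((2, 2))
    for j in range(2):
        e = np.zeros(2)
        e[j] = h
        J[:, j] = (f(y + e) - f(y - e)) / (2 * h)
    return J

y = np.array([0.7, -1.2])
det_fd = np.linalg.det(jacobian_fd(inverse_transform, y))
det_closed = -np.exp(-y[0]**2 / 2) * np.exp(-y[1]**2 / 2) / (2 * np.pi)
print(det_fd, det_closed)
```

The two values agree to numerical precision, which is a reassuring check on the sign and magnitude of the determinant.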

@@ -289,7 +296,7 @@ Since bijections are a one-to-one mapping between elements, we can also reverse

 In the case of $y = \exp(x)$, the inverse function is $x = \log(y)$.

 ::: {.callout-note}
-Technically, Bijectors.jl is concerned with functions $f: X \to Y$ for which:
+Technically, the bijections in Bijectors.jl are functions $f: X \to Y$ for which:

 - $f$ is continuously differentiable, i.e. the derivative $\mathrm{d}f(x)/\mathrm{d}x$ exists and is continuous (over the domain of interest $X$);
 - If $f^{-1}: Y \to X$ is the inverse of $f$, then that is also continuously differentiable (over _its_ own domain, i.e. $Y$).
@@ -301,12 +308,109 @@ For example, taking the inverse function $\log(y)$ from above, its derivative is

 However, we specified that the bijection $y = \exp(x)$ maps values of $x \in (-\infty, \infty)$ to $y \in (0, \infty)$, so the point $y = 0$ is not within the domain of the inverse function.
 :::

-It's still unclear to me how the term biject**or** was adopted over biject**ion**, which is the common mathematical term.
-As far as I can tell, it's only used in this specific context of transforming distributions.
+It's not entirely clear to me who first coined the term biject**or** (as opposed to biject**ion**, which is the usual mathematical term).
+As far as I can tell, it's only used in this specific context of transforming probability distributions; apart from Bijectors.jl itself, it is also used in [the TensorFlow deep learning framework](https://www.tensorflow.org/probability/api_docs/python/tfp/bijectors).
+
+Specifically, one of the primary purposes of Bijectors.jl is to construct _bijections which map constrained distributions to unconstrained ones_.
+For example, the log-normal distribution which we saw above is constrained: its _support_, i.e. the range over which $p(x) > 0$, is $(0, \infty)$.
+However, we can transform that to an unconstrained distribution (the normal distribution) using the transformation $y = \log(x)$.
+The `bijector` function, when applied to a distribution, returns a bijection $f$ that can be used to map the constrained distribution to an unconstrained one.
+
+```{julia}
+import Bijectors as B
+
+f = B.bijector(LogNormal())
+```
+
+We can apply this transformation to samples from the original distribution, for example:

-TODO: describe and illustrate API of Bijectors
+```{julia}
+samples_lognormal = rand(LogNormal(), 5)
+
+samples_normal = f(samples_lognormal)
+```
+
+We can also obtain the inverse of a bijection, $f^{-1}$:
+
+```{julia}
+f_inv = B.inverse(f)
+
+f_inv(samples_normal) ≈ samples_lognormal
+```
+
+We know that the transformation $y = \log(x)$ changes the log-normal distribution to the normal distribution.
+Bijectors.jl also gives us a way to access that transformed distribution:
+
+```{julia}
+transformed_dist = B.transformed(LogNormal(), f)
+```
+
+This type doesn't immediately look like a `Normal()`, but it behaves in exactly the same way.
+For example, we can sample from it and plot a histogram:
+
+```{julia}
+samples_plot = rand(transformed_dist, 5000)
+histogram(samples_plot, bins=50)
+```

-Maybe TODO: describe how logabsdetjac is calculated (or can be calculated) via AD
+We can also obtain the logpdf of the transformed distribution and check that it is the same as that of a normal distribution:
+
+```{julia}
+println("Sample: $(samples_plot[1])")
+println("Expected: $(logpdf(Normal(), samples_plot[1]))")
+println("Actual: $(logpdf(transformed_dist, samples_plot[1]))")
+```
+
+Given the discussion in the previous sections, you might not be surprised to find that the transformed distribution is implemented using the Jacobian of the transformation.
+Recall that
+
+$$q(\mathbf{y}) = p(\mathbf{x}) \left| \det(\mathcal{J}) \right|,$$
+
+where (if we assume that both $\mathbf{x}$ and $\mathbf{y}$ have length 2)
+
+$$\mathcal{J} = \begin{pmatrix}
+\partial x_1/\partial y_1 & \partial x_1/\partial y_2 \\
+\partial x_2/\partial y_1 & \partial x_2/\partial y_2
+\end{pmatrix}.$$
+
+Slightly annoyingly, the convention in Bijectors.jl is the opposite way round compared to that in Bishop's book.
+(Or perhaps it's annoying that Bishop's book uses the opposite convention!)
+In Bijectors.jl, the Jacobian is defined as
+
+$$\mathbf{J} = \begin{pmatrix}
+\partial y_1/\partial x_1 & \partial y_1/\partial x_2 \\
+\partial y_2/\partial x_1 & \partial y_2/\partial x_2
+\end{pmatrix}$$
+
+(note the partial derivatives have been flipped upside-down), and we have that
+
+$$q(\mathbf{y})\left| \det(\mathbf{J}) \right| = p(\mathbf{x}),$$
+
+or equivalently
+
+$$\log(q(\mathbf{y})) = \log(p(\mathbf{x})) - \log(|\det(\mathbf{J})|).$$
+
+You can access $\log(|\det(\mathbf{J})|)$ (evaluated at the point $\mathbf{x}$) using the `logabsdetjac` function:
+
+```{julia}
+# Reiterating the setup, just to be clear
+x = rand(LogNormal())
+f = B.bijector(LogNormal())
+y = f(x)
+transformed_dist = B.transformed(LogNormal(), f)
+
+println("log(q(y))     : $(logpdf(transformed_dist, y))")
+println("log(p(x))     : $(logpdf(LogNormal(), x))")
+println("log(|det(J)|) : $(B.logabsdetjac(f, x))")
+```
+
+from which you can see that the equation above holds.
+There are more functions available in the Bijectors.jl API; for full details do check out the [documentation](https://turinglang.org/Bijectors.jl/stable/).
+For example, `logpdf_with_trans` can directly give us $\log(q(\mathbf{y}))$:
+
+```{julia}
+B.logpdf_with_trans(LogNormal(), x, true)
+```

 ## Why is this useful for sampling anyway?
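The identity $\log(q(\mathbf{y})) = \log(p(\mathbf{x})) - \log(|\det(\mathbf{J})|)$ that `logabsdetjac` supports in the hunk above can also be verified outside Julia. A minimal sketch in Python with SciPy, for the same log-normal example: with $y = \log(x)$, the Bijectors.jl-convention Jacobian is $\mathrm{d}y/\mathrm{d}x = 1/x$, so $\log|\det \mathbf{J}| = -\log(x)$:

```python
import numpy as np
from scipy.stats import lognorm, norm

rng = np.random.default_rng(468)

# x ~ LogNormal(0, 1); the bijection to unconstrained space is y = log(x).
x = lognorm.rvs(s=1, random_state=rng)
y = np.log(x)

log_p_x = lognorm.logpdf(x, s=1)   # log p(x), the constrained density
log_q_y = norm.logpdf(y)           # log q(y), the transformed (normal) density
logabsdetjac = -np.log(x)          # log|dy/dx| for y = log(x)

# log q(y) = log p(x) - log|det J|
print(log_q_y, log_p_x - logabsdetjac)
```

The two printed values agree, mirroring the three-way check performed with `logpdf`, `logabsdetjac`, and `transformed` in the Julia code above.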
