Commit 15e64aa: Add Bijectors.jl section
1 parent 9ae004d

1 file changed: +120 -16 lines

src/transforms.qmd (+120 -16)
@@ -8,7 +8,12 @@ import Random
 Random.seed!(468);
 ```

-This article is about transforming distributions and Bijectors.jl.
+This article seeks to motivate Bijectors.jl and to explain how distributions are transformed in the Turing.jl probabilistic programming language.
+
+It assumes:
+
+- some basic knowledge of probability distributions (the notions of sampling from them and calculating the probability density function for a given distribution); and
+- some calculus (the chain and product rules for differentiation, and changes of variables in integrals).

 ## Sampling from a distribution

@@ -152,30 +157,32 @@ This equation is (11.5) in Bishop's textbook.

 ::: {.callout-note}
 The absolute value here takes care of the case where $f$ is decreasing, i.e., the distribution is flipped.
-You can try this out with the transformation $y = -\exp(x)$: you will have to flip the integration limits round to ensure that the integral comes out positive.
+You can try this out with the transformation $y = -\exp(x)$.
+If $a < b$, then $-\exp(a) > -\exp(b)$, and so you will have to swap the integration limits to ensure that the integral comes out positive.
 :::

 ## The Jacobian

 In general, we may have transforms that act on multivariate distributions, for example something mapping $p(x_1, x_2)$ to $q(y_1, y_2)$.
 In this case, the rule above has to be extended by replacing the derivative $\mathrm{d}x/\mathrm{d}y$ with the determinant of the Jacobian matrix:

-$$\mathbf{J} = \begin{pmatrix}
+$$\mathcal{J} = \begin{pmatrix}
 \partial x_1/\partial y_1 & \partial x_1/\partial y_2 \\
 \partial x_2/\partial y_1 & \partial x_2/\partial y_2
 \end{pmatrix}.$$

 and specifically,

-$$q(y_1, y_2) = p(x_1, x_2) \left| \det(\mathbf{J}) \right|.$$
+$$q(y_1, y_2) = p(x_1, x_2) \left| \det(\mathcal{J}) \right|.$$

-This is the same as equation (11.9) in Bishop, except that he denotes the absolute value of the determinant with just $|\mathbf{J}|$.
+This is the same as equation (11.9) in Bishop, except that he denotes the absolute value of the determinant with just $|\mathcal{J}|$.

 ::: {.callout-note}
 In different contexts the Jacobian can have different 'numerators' and 'denominators' in the partial derivatives.
-For example, if $\mathbf{y} = f(\mathbf{x})$, then it's common to write $\mathbf{J}_f$ as a matrix of partial derivatives of elements of $y$ with respect to elements of $x$.
-(For example, $\mathbf{J}_f$ is used in [Newton's method](https://en.wikipedia.org/wiki/Newton%27s_method) to find the zeroes of $f$, i.e. the values of $\mathbf{x}$ such that $\mathbf{y} = \mathbf{0}$.)
-However, it is always the case that the elements of the 'numerator' vary with rows and the elements of the 'denominator' vary with columns.
+For example, if $\mathbf{y} = f(\mathbf{x})$, then it's common to write $\mathbf{J}$ as a matrix of partial derivatives of elements of $\mathbf{y}$ with respect to elements of $\mathbf{x}$.
+Indeed, later in this article we will see that Bijectors.jl uses this convention.
+
+It is always the case, though, that the elements of the 'numerator' vary with rows and the elements of the 'denominator' vary with columns.
 :::

 The rest of this section will be devoted to an example to show that this works, and contains some slightly less pretty mathematics.
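The change-of-variables rule discussed in this hunk is easy to check numerically. As a rough sketch (in Python with SciPy here, rather than the article's Julia), take the decreasing transformation $y = -\exp(x)$ from the callout: the inverse is $x = \log(-y)$, so the transformed density is $q(y) = p(\log(-y)) \cdot 1/|y|$, and it should still integrate to 1 over the new support $(-\infty, 0)$:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def q(y):
    # Density of y = -exp(x) where x ~ Normal(0, 1).
    # Inverse: x = log(-y); |dx/dy| = 1/|y| (the absolute value
    # handles the fact that the transformation is decreasing).
    return norm.pdf(np.log(-y)) / np.abs(y)

# The transformed density should integrate to 1 over (-inf, 0).
total, _ = quad(q, -np.inf, 0)
print(total)
```

The integral comes out to approximately 1, confirming that taking the absolute value of the derivative yields a valid density even for a decreasing transformation.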
@@ -240,7 +247,7 @@ $$\frac{\partial x_2}{\partial y_1} = \frac{1}{2\pi} \left(\frac{1}{1 + (y_2/y_1

 Putting together the Jacobian matrix, we have:

-$$\mathbf{J} = \begin{pmatrix}
+$$\mathcal{J} = \begin{pmatrix}
 -y_1 x_1 & -y_2 x_1 \\
 -cy_2/y_1^2 & c/y_1 \\
 \end{pmatrix},$$
@@ -249,7 +256,7 @@ where $c = [2\pi(1 + (y_2/y_1)^2)]^{-1}$.
 The determinant of this matrix is

 $$\begin{align}
-\det(\mathbf{J}) &= -cx_1 - cx_1(y_2/y_1)^2 \\
+\det(\mathcal{J}) &= -cx_1 - cx_1(y_2/y_1)^2 \\
 &= -cx_1\left[1 + \left(\frac{y_2}{y_1}\right)^2\right] \\
 &= -\frac{1}{2\pi} x_1 \\
 &= -\frac{1}{2\pi}\exp{\left(-\frac{y_1^2}{2}\right)}\exp{\left(-\frac{y_2^2}{2}\right)},
@@ -258,7 +265,7 @@ $$\begin{align}
 Coming right back to our probability density, we have that

 $$\begin{align}
-q(y_1, y_2) &= p(x_1, x_2) \cdot |\det(\mathbf{J})| \\
+q(y_1, y_2) &= p(x_1, x_2) \cdot |\det(\mathcal{J})| \\
 &= \frac{1}{2\pi}\exp{\left(-\frac{y_1^2}{2}\right)}\exp{\left(-\frac{y_2^2}{2}\right)},
 \end{align}$$
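The determinant derived in these hunks can be cross-checked with finite differences. The sketch below (Python/NumPy rather than the article's Julia) reconstructs the inverse maps $x_1 = \exp(-(y_1^2 + y_2^2)/2)$ and $x_2 = \arctan(y_2/y_1)/2\pi$ from the partial derivatives quoted above (the additive constant of integration in $x_2$ is an assumption on my part, but it is irrelevant to the Jacobian), and compares a numerical Jacobian determinant against the closed form $-\frac{1}{2\pi}\exp(-y_1^2/2)\exp(-y_2^2/2)$:

```python
import numpy as np

def inverse_transform(y):
    # x1, x2 as functions of (y1, y2), reconstructed from the
    # partial derivatives in the derivation above.
    y1, y2 = y
    x1 = np.exp(-(y1**2 + y2**2) / 2)
    x2 = np.arctan(y2 / y1) / (2 * np.pi)
    return np.array([x1, x2])

def jacobian_fd(f, y, h=1e-6):
    # Central-difference Jacobian: J[i, j] = d f_i / d y_j,
    # i.e. rows vary with the 'numerator' x, columns with y.
    J = np.zeros((2, 2))
    for j in range(2):
        e = np.zeros(2)
        e[j] = h
        J[:, j] = (f(y + e) - f(y - e)) / (2 * h)
    return J

y = np.array([0.7, -1.2])
det_fd = np.linalg.det(jacobian_fd(inverse_transform, y))
det_closed = -np.exp(-y[0]**2 / 2) * np.exp(-y[1]**2 / 2) / (2 * np.pi)
print(det_fd, det_closed)
```

The two values agree to numerical precision, which is a reassuring check on the sign and magnitude of the determinant.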

@@ -289,7 +296,7 @@ Since bijections are a one-to-one mapping between elements, we can also reverse

 In the case of $y = \exp(x)$, the inverse function is $x = \log(y)$.

 ::: {.callout-note}
-Technically, Bijectors.jl is concerned with functions $f: X \to Y$ for which:
+Technically, the bijections in Bijectors.jl are functions $f: X \to Y$ for which:

 - $f$ is continuously differentiable, i.e. the derivative $\mathrm{d}f(x)/\mathrm{d}x$ exists and is continuous (over the domain of interest $X$);
 - If $f^{-1}: Y \to X$ is the inverse of $f$, then that is also continuously differentiable (over _its_ own domain, i.e. $Y$).
@@ -301,12 +308,109 @@ For example, taking the inverse function $\log(y)$ from above, its derivative is

 However, we specified that the bijection $y = \exp(x)$ maps values of $x \in (-\infty, \infty)$ to $y \in (0, \infty)$, so the point $y = 0$ is not within the domain of the inverse function.
 :::

-It's still unclear to me how the term biject**or** was adopted over biject**ion**, which is the common mathematical term.
-As far as I can tell, it's only used in this specific context of transforming distributions.
+It's not entirely clear to me who first coined the term biject**or** (as opposed to biject**ion**, which is the usual mathematical term).
+As far as I can tell, it's only used in this specific context of transforming probability distributions; apart from Bijectors.jl itself, it is also used in [the TensorFlow deep learning framework](https://www.tensorflow.org/probability/api_docs/python/tfp/bijectors).
+
+Specifically, one of the primary purposes of Bijectors.jl is to construct _bijections which map constrained distributions to unconstrained ones_.
+For example, the log-normal distribution which we saw above is constrained: its _support_, i.e. the range over which $p(x) > 0$, is $(0, \infty)$.
+However, we can transform that to an unconstrained distribution (the normal distribution) using the transformation $y = \log(x)$.
+The `bijector` function, when applied to a distribution, returns a bijection $f$ that can be used to map the constrained distribution to an unconstrained one.
+
+```{julia}
+import Bijectors as B
+
+f = B.bijector(LogNormal())
+```
+
+We can apply this transformation to samples from the original distribution, for example:

-TODO: describe and illustrate API of Bijectors
+```{julia}
+samples_lognormal = rand(LogNormal(), 5)
+
+samples_normal = f(samples_lognormal)
+```
+
+We can also obtain the inverse of a bijection, $f^{-1}$:
+
+```{julia}
+f_inv = B.inverse(f)
+
+f_inv(samples_normal) ≈ samples_lognormal
+```
+
+We know that the transformation $y = \log(x)$ changes the log-normal distribution to the normal distribution.
+Bijectors.jl also gives us a way to access that transformed distribution:
+
+```{julia}
+transformed_dist = B.transformed(LogNormal(), f)
+```
+
+This type doesn't immediately look like a `Normal()`, but it behaves in exactly the same way.
+For example, we can sample from it and plot a histogram:
+
+```{julia}
+samples_plot = rand(transformed_dist, 5000)
+histogram(samples_plot, bins=50)
+```

-Maybe TODO: describe how logabsdetjac is calculated (or can be calculated) via AD
+We can also obtain the logpdf of the transformed distribution and check that it is the same as that of a normal distribution:
+
+```{julia}
+println("Sample: $(samples_plot[1])")
+println("Expected: $(logpdf(Normal(), samples_plot[1]))")
+println("Actual: $(logpdf(transformed_dist, samples_plot[1]))")
+```
+
+Given the discussion in the previous sections, you might not be surprised to find that the transformed distribution is implemented using the Jacobian of the transformation.
+Recall that
+
+$$q(\mathbf{y}) = p(\mathbf{x}) \left| \det(\mathcal{J}) \right|,$$
+
+where (if we assume that both $\mathbf{x}$ and $\mathbf{y}$ have length 2)
+
+$$\mathcal{J} = \begin{pmatrix}
+\partial x_1/\partial y_1 & \partial x_1/\partial y_2 \\
+\partial x_2/\partial y_1 & \partial x_2/\partial y_2
+\end{pmatrix}.$$
+
+Slightly annoyingly, the convention in Bijectors.jl is the opposite way round compared to that in Bishop's book.
+(Or perhaps it's annoying that Bishop's book uses the opposite convention!)
+In Bijectors.jl, the Jacobian is defined as
+
+$$\mathbf{J} = \begin{pmatrix}
+\partial y_1/\partial x_1 & \partial y_1/\partial x_2 \\
+\partial y_2/\partial x_1 & \partial y_2/\partial x_2
+\end{pmatrix}$$
+
+(note the partial derivatives have been flipped upside-down), and we have that
+
+$$q(\mathbf{y})\left| \det(\mathbf{J}) \right| = p(\mathbf{x}),$$
+
+or equivalently
+
+$$\log(q(\mathbf{y})) = \log(p(\mathbf{x})) - \log(|\det(\mathbf{J})|).$$
+
+You can access $\log(|\det(\mathbf{J})|)$ (evaluated at the point $\mathbf{x}$) using the `logabsdetjac` function:
+
+```{julia}
+# Reiterating the setup, just to be clear
+x = rand(LogNormal())
+f = B.bijector(LogNormal())
+y = f(x)
+transformed_dist = B.transformed(LogNormal(), f)
+
+println("log(q(y))     : $(logpdf(transformed_dist, y))")
+println("log(p(x))     : $(logpdf(LogNormal(), x))")
+println("log(|det(J)|) : $(B.logabsdetjac(f, x))")
+```
+
+from which you can see that the equation above holds.
+There are more functions available in the Bijectors.jl API; for full details do check out the [documentation](https://turinglang.org/Bijectors.jl/stable/).
+For example, `logpdf_with_trans` can directly give us $\log(q(\mathbf{y}))$:
+
+```{julia}
+B.logpdf_with_trans(LogNormal(), x, true)
+```

 ## Why is this useful for sampling anyway?
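The identity $\log(q(\mathbf{y})) = \log(p(\mathbf{x})) - \log(|\det(\mathbf{J})|)$ that `logabsdetjac` supports in the hunk above can also be verified outside Julia. A minimal sketch in Python with SciPy, for the same log-normal example: with $y = \log(x)$, the Bijectors.jl-convention Jacobian is $\mathrm{d}y/\mathrm{d}x = 1/x$, so $\log|\det \mathbf{J}| = -\log(x)$:

```python
import numpy as np
from scipy.stats import lognorm, norm

rng = np.random.default_rng(468)

# x ~ LogNormal(0, 1); the bijection to unconstrained space is y = log(x).
x = lognorm.rvs(s=1, random_state=rng)
y = np.log(x)

log_p_x = lognorm.logpdf(x, s=1)   # log p(x), the constrained density
log_q_y = norm.logpdf(y)           # log q(y), the transformed (normal) density
logabsdetjac = -np.log(x)          # log|dy/dx| for y = log(x)

# log q(y) = log p(x) - log|det J|
print(log_q_y, log_p_x - logabsdetjac)
```

The two printed values agree, mirroring the three-way check performed with `logpdf`, `logabsdetjac`, and `transformed` in the Julia code above.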
