---
title: "Bayesian Conditional Transformation Models in Liesel"
format: gfm
---

[![pre-commit](https://github.com/liesel-devs/liesel-ctm/actions/workflows/pre-commit.yml/badge.svg)](https://github.com/liesel-devs/liesel-ctm/actions/workflows/pre-commit.yml) [![pytest](https://github.com/liesel-devs/liesel-ctm/actions/workflows/pytest.yml/badge.svg)](https://github.com/liesel-devs/liesel-ctm/actions/workflows/pytest.yml) [![pytest-cov](tests/coverage.svg)](https://github.com/liesel-devs/liesel-ctm/actions/workflows/pytest.yml)

`bctm` is a Python library for building Bayesian Conditional
Transformation Models and sampling from the posterior distribution of their
parameters via MCMC. It is built on top of the probabilistic programming
framework Liesel.

For more on Liesel, see

- [The Liesel documentation](https://docs.liesel-project.org/en/latest/)
- [The Liesel GitHub repository](https://github.com/liesel-devs/liesel)

For more on Bayesian Conditional Transformation Models, see

- [Carlan, Kneib & Klein (2022). Bayesian Conditional Transformation Models](https://arxiv.org/abs/2012.11016)
- [Carlan & Kneib (2022). Bayesian discrete conditional transformation models](https://journals.sagepub.com/doi/10.1177/1471082X221114177)


## Example usage

For a start, we import the relevant packages.

```{python}
#| label: Import Python packages
import numpy as np
import liesel.model as lsl
import liesel.goose as gs
import liesel_bctm as bctm
```

### Data generation

Next, we generate some data. We do that by defining a data-generating function,
which will become useful later for comparing our model's performance to the
actual distribution.

```{python}
#| label: Generate data
rng = np.random.default_rng(seed=3)

def dgf(x, n):
    """Data-generating function."""
    return rng.normal(loc=1. + 5*np.cos(x), scale=2., size=n)

n = 300
x = np.sort(rng.uniform(-2, 2, size=n))

y = dgf(x, n)
data = dict(y=y, x=x)
```
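
Because the data-generating function is a normal distribution with mean
`1 + 5*cos(x)` and standard deviation `2`, the true conditional density is
available in closed form. A minimal sketch of it, using the hypothetical helper
names `true_mean` and `true_pdf` (they are not part of `bctm`):

```python
import numpy as np

def true_mean(x):
    """Conditional mean of y given x under the data-generating function."""
    return 1.0 + 5.0 * np.cos(x)

def true_pdf(y, x, scale=2.0):
    """Normal density of y given x, written out with NumPy only."""
    z = (y - true_mean(x)) / scale
    return np.exp(-0.5 * z**2) / (scale * np.sqrt(2.0 * np.pi))

# At x = 0 the mean is 1 + 5*cos(0) = 6, which is where the density peaks.
peak = true_pdf(6.0, 0.0)
```

Such a closed form is only available here because the example is simulated; the
comparison plots below use samples from the data-generating function instead.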

Let's have a look at a scatter plot for these data. The solid blue line shows
the expected value.

```{r}
#| label: Plot data
#| echo: false
library(ggplot2)
library(reticulate)
ggplot() +
    geom_point(aes(py$x, py$y)) +
    geom_line(aes(py$x, (1 + 5*cos(py$x))), color = "#56B4E9", size=1) +
    labs(x="x", y="y") +
    NULL
```

### Model building

Now, we can build a model:


```{python}
#| label: Build model
ctmb = (
    bctm.CTMBuilder(data)
    .add_intercept()
    .add_trafo("y", nparam=15, a=2.0, b=0.5, name="y")
    .add_pspline("x", nparam=15, a=2.0, b=0.5, name="x")
    .add_response("y")
)

model = ctmb.build_model()
```
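
As a reading aid, the model above can be written down as follows. This is a
sketch based on the section headings further below (a transformation part
$h_0(y)$ and a location shift part $h_1(x)$) and on the standard normal
reference distribution used in the diagnostics; the symbol $\beta_0$ for the
intercept is my own notation.

$$
\begin{aligned}
h(y \mid x) &= \beta_0 + h_0(y) + h_1(x), \\
P(Y \le y \mid x) &= \Phi\bigl(h(y \mid x)\bigr), \\
f(y \mid x) &= \varphi\bigl(h(y \mid x)\bigr)\, \frac{\partial h(y \mid x)}{\partial y}.
\end{aligned}
$$

Here $\Phi$ and $\varphi$ are the standard normal CDF and PDF, and $h$ has to be
increasing in $y$ for the density to be valid.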

Let's have a look at the model graph:

```{python}
#| label: Model graph
lsl.plot_vars(model)
```

### MCMC sampling

With `bctm.ctm_mcmc`, it's easy to set up an [EngineBuilder](https://docs.liesel-project.org/en/latest/generated/liesel.goose.builder.EngineBuilder.html) with predefined MCMC kernels for our model parameters. We just
need to define the warmup duration and the number of desired posterior samples,
then we are good to go.

```{python}
#| label: Sample model
nchains = 2
nsamples = 1000
eb = bctm.ctm_mcmc(model, seed=1, num_chains=nchains)
eb.positions_included += ["z"]
eb.set_duration(warmup_duration=500, posterior_duration=nsamples)

engine = eb.build()
engine.sample_all_epochs()

results = engine.get_results()
samples = results.get_posterior_samples()
```

### Posterior predictive conditional distribution

To interpret our model, we now define some fixed covariate values for which we
are going to evaluate the conditional distribution. We also define an even grid
of response values to use in the evaluations. The function `bctm.grid`
helps us to create arrays of matching length.

```{python}
#| label: Define grid for posterior predictions
yvals = np.linspace(np.min(y), np.max(y), 100)
xvals = np.linspace(np.min(x), np.max(x), 6).round(2)[1:-1]
ygrid, xgrid = bctm.grid(yvals, xvals)
```
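
To illustrate what "arrays of matching length" means here, the following sketch
pairs every response value with every covariate value using `np.meshgrid`. The
function `grid_sketch` is a hypothetical stand-in; I am assuming a full
Cartesian product, and the ordering used by `bctm.grid` may differ.

```python
import numpy as np

def grid_sketch(*arrays):
    """Pair every value of each input with every value of the others.

    Hypothetical stand-in for illustration, not the bctm implementation.
    """
    mesh = np.meshgrid(*arrays, indexing="ij")
    return tuple(m.ravel() for m in mesh)

y_demo = np.linspace(-5.0, 10.0, 100)       # 100 response values
x_demo = np.array([-1.2, -0.4, 0.4, 1.2])   # 4 covariate values
yg, xg = grid_sketch(y_demo, x_demo)

# Both outputs have 100 * 4 = 400 entries, one per (y, x) pair.
print(yg.shape, xg.shape)
```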

To get a sense of how well our model captures the true distribution of the
response, even for complex data-generating mechanisms, `bctm` offers the
helper function `bctm.sample_dgf`. This function gives us a data frame with
our desired number of samples for each value in a grid of covariate values. We
just need to pass in the data-generating function that we defined above.

```{python}
#| label: Samples from true data-generating function
dgf_samples = bctm.sample_dgf(dgf, 5000, x=xvals)
dgf_samples
```
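
The idea behind this helper can be sketched in a few lines of NumPy: call the
data-generating function once per covariate value and stack the draws in long
format. The function `sample_dgf_sketch` and its dictionary output are
assumptions for illustration, not the actual `bctm.sample_dgf`.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=3)

def dgf_demo(x, n):
    """Same data-generating function as above, with its own generator."""
    return rng_demo.normal(loc=1.0 + 5.0 * np.cos(x), scale=2.0, size=n)

def sample_dgf_sketch(dgf, n, x):
    """Draw n responses per covariate value; return long-format columns."""
    ys = np.concatenate([dgf(xi, n) for xi in x])
    xs = np.repeat(x, n)  # each covariate value repeated for its n draws
    return {"y": ys, "x": xs}

x_fixed = np.array([-1.2, -0.4, 0.4, 1.2])
draws = sample_dgf_sketch(dgf_demo, 5000, x_fixed)
```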

#### Uncertainty visualization via quantiles

Next, we evaluate the transformation density for our fixed covariate values. The
function `bctm.cdist_quantiles` does that for us. It evaluates the
transformation density for each individual MCMC sample and returns a summary
data frame that contains the mean and some quantiles of our choice (default:
0.1, 0.5, 0.9).

```{python}
#| label: Create prediction dataframe
dist_df = bctm.cdist_quantiles(samples, ctmb, y=ygrid, x=xgrid)
dist_df
```
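
The summary step can be pictured as pooling all MCMC draws and taking the mean
and quantiles at each grid point. A minimal sketch with synthetic numbers
standing in for the per-draw density evaluations; the array layout
`(chains, draws, grid points)` and the name `pdf_draws` are assumptions for
illustration, not the internals of `bctm.cdist_quantiles`.

```python
import numpy as np

rng_summary = np.random.default_rng(0)

# Synthetic stand-in for density evaluations: (chains, draws, grid points).
pdf_draws = rng_summary.gamma(shape=2.0, scale=0.1, size=(2, 1000, 400))

# Pool chains and draws into one axis, then summarize along it.
pooled = pdf_draws.reshape(-1, pdf_draws.shape[-1])
summary_sketch = {
    "mean": pooled.mean(axis=0),
    "q0.1": np.quantile(pooled, 0.1, axis=0),
    "q0.5": np.quantile(pooled, 0.5, axis=0),
    "q0.9": np.quantile(pooled, 0.9, axis=0),
}
```

Each entry of `summary_sketch` holds one value per grid point, which corresponds
to one row per grid point and summary statistic in a long-format data frame.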

Now it is time to plot the estimated distribution. For this part, I like to
switch to R, because `ggplot2` is just amazing. I load `reticulate` here, too,
because that lets us access Python objects from within R.

```{r}
#| label: Import R packages
library(reticulate)
library(tidyverse)
```

For easy handling during plotting, I transform the data to wide format.

```{r}
#| label: Transport dist_df to R and make wide
dist_df <- py$dist_df

dist_df_wide <- dist_df |>
    pivot_wider(
        names_from = "summary",
        values_from = c("pdf", "cdf", "log_prob")
    )
```

Next, it is time for our plot:

```{r}
#| label: Plot cPDF with quantiles
dist_df_wide |>
    ggplot() +
    geom_ribbon(aes(x=y, ymin=`pdf_q0.1`, ymax=`pdf_q0.9`),
                fill = "#56B4E9",
                alpha = 0.5) +
    geom_density(data = py$dgf_samples,
                 aes(y),
                 linetype = "dotted",
                 adjust=2) +
    geom_line(aes(y, `pdf_q0.5`)) +
    facet_wrap(~x, labeller = label_both) +
    labs(
        x = "Response",
        y = "Density",
        title = "Posterior estimates of the response distribution",
        subtitle = "Shaded area shows 0.1 - 0.9 quantiles.\n
        Dotted line is KDE of samples from the data-generating function."
    )
```

#### Uncertainty visualization via random samples

```{python}
#| label: Create sample dataframe
rdist_df = bctm.cdist_rsample(samples, ctmb, size=100, y=ygrid, x=xgrid)
rdist_df
```
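
Instead of summarizing the draws, we can also show a random subset of them as
individual curves. The sketch below picks 100 pooled draws without replacement
and attaches an `id` to each curve, which is the column the plot below groups
by. The synthetic array and all names are assumptions for illustration, not the
internals of `bctm.cdist_rsample`.

```python
import numpy as np

rng_draws = np.random.default_rng(0)

# Synthetic stand-in for density evaluations: (chains, draws, grid points).
pdf_all = rng_draws.gamma(shape=2.0, scale=0.1, size=(2, 1000, 400))
pooled_all = pdf_all.reshape(-1, pdf_all.shape[-1])

size = 100
idx = rng_draws.choice(pooled_all.shape[0], size=size, replace=False)
curves = pooled_all[idx]                      # one row per selected draw

# Long format: one id per curve, repeated for every grid point.
curve_id = np.repeat(np.arange(size), curves.shape[1])
pdf_long = curves.ravel()
```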


```{r}
#| label: Plot cPDF with samples
py$rdist_df |>
    ggplot() +
    geom_line(aes(y, pdf, group = id), alpha = 0.15, color = "#56B4E9") +
    geom_line(data = dist_df_wide, aes(y, `pdf_q0.5`)) +
    geom_density(data = py$dgf_samples, aes(y), linetype = "dotted", adjust=2) +
    facet_wrap(~x, labeller = label_both) +
    labs(
        x = "Response",
        y = "Density",
        title = "Posterior estimates of the response distribution",
        subtitle = "Blue: 100 samples from the posterior predictions. Solid: Median.\n
        Dotted line is KDE of samples from the data-generating function."
    )
```

### Plot the transformation function

#### Transformation part

```{python}
#| label: Extract h0(y)
tf_quantiles = bctm.partial_ctrans_quantiles(samples, ctmb, y=yvals)
tf_samples = bctm.partial_ctrans_rsample(samples, ctmb, y=yvals, size=50)
```

```{r}
#| label: Plot h0(y)
tf_wide <- py$tf_quantiles |> pivot_wider(
    names_from = "summary", values_from = c("value", "value_d")
)

tf_wide |>
    ggplot() +
    geom_line(
        data = py$tf_samples,
        aes(y, value, group = id),
        color = "#56B4E9",
        alpha = 0.2
    ) +
    geom_line(aes(y, `value_q0.5`)) +
    labs(
        x = "Response",
        y = expression(h[0](y)),
        title = "Conditional transformation function",
        subtitle = "Black: Median, Blue: 50 random samples from the posterior"
    )
```

#### Location shift part

```{python}
#| label: Extract h1(x)
xvals = np.linspace(np.min(x), np.max(x), 100)
tf_quantiles = bctm.partial_ctrans_quantiles(samples, ctmb, x=xvals)
tf_samples = bctm.partial_ctrans_rsample(samples, ctmb, x=xvals, size=50)
```

```{r}
#| label: Plot h1(x)
tf_wide <- py$tf_quantiles |>
    pivot_wider(names_from = "summary",
                values_from = c("value"),
                names_prefix="value_")

tf_wide |>
    ggplot() +
    geom_line(
        data = py$tf_samples,
        aes(x, value, group = id),
        color = "#56B4E9",
        alpha = 0.2
    ) +
    geom_line(aes(x, `value_q0.5`)) +
    labs(
        x = "x",
        y = expression(h[1](x)),
        title = "Conditional transformation function",
        subtitle = "Black: Median, Blue: 50 random samples from the posterior"
    )
```


### Diagnostics

#### Distribution of the latent variable

```{python}
#| label: Obtain latent variable
z = bctm.ConditionalPredictions(samples, ctmb).ctrans().mean(axis=(0,1))
```
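
For the simulated data in this example, we know what the latent variable should
look like: since `y` given `x` is normal with mean `1 + 5*cos(x)` and standard
deviation `2`, the true transformation is simply the standardized response. A
small sketch that regenerates comparable data (under separate, hypothetical
names) and checks this:

```python
import numpy as np

rng_latent = np.random.default_rng(seed=3)
n_sim = 300
x_sim = np.sort(rng_latent.uniform(-2, 2, size=n_sim))
y_sim = rng_latent.normal(loc=1.0 + 5.0 * np.cos(x_sim), scale=2.0, size=n_sim)

# True transformation under the normal data-generating function.
z_true = (y_sim - (1.0 + 5.0 * np.cos(x_sim))) / 2.0

# Should be close to 0 and 1, up to sampling noise.
print(z_true.mean(), z_true.std(ddof=1))
```

If the fitted model is adequate, the estimated `z` should behave similarly,
which is what the plots below check.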

```{r}
#| label: Plot estimated density of latent variable
#| fig-width: 8
#| fig-height: 3.5
kde_plot <- ggplot() +
    stat_function(fun = dnorm, linetype = "dotted") +
    geom_density(aes(py$z), adjust = 1) +
    labs(x = "h(y|x)", title = "Kernel density estimator for h(y|x)",
         subtitle = "Dotted line: True Standard Normal distribution") +
    NULL

qq_plot <- ggplot() +
    aes(sample = py$z, color="h(y|x)") +
    stat_qq() +
    stat_qq(aes(sample = py$y, color="y"), alpha=0.6) +
    stat_qq_line(color="black") +
    labs(x = "Theoretical quantile", y="Empirical quantile",
         title = "QQ-plot for latent variable h(y|x) and y")

ggpubr::ggarrange(qq_plot, kde_plot, ncol = 2, common.legend = TRUE, legend = "top")
```


#### rhat

```{python}
#| label: Goose summary
summary = gs.Summary(results)
summary_df = summary._param_df().reset_index()
error_df = summary._error_df().reset_index()
error_df = error_df.astype({"count": "int32", "relative": "float32"})
```

```{r}
#| label: Plot rhat
#| fig-height: 11
py$summary_df |>
    filter(parameter != "z") |>
    group_by(parameter) |>
    mutate(index = seq(0, n()-1)) |>
    mutate(parameter_i = paste0(parameter, "[", sprintf("%02d", index), "]")) |>
    relocate(parameter_i) |>
    ggplot() +
    geom_point(aes(parameter_i, rhat)) +
    geom_line(aes(parameter_i, rhat, group=1), alpha = 0.5) +
    geom_hline(yintercept = 1.01, linetype = "dotted") +
    coord_flip() +
    NULL
```
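
For intuition about what is plotted here, the following sketch computes the
classic split R-hat from the within-chain and between-chain variances. This is
the textbook version; the values in the Goose summary may come from a different
variant (for example a rank-normalized one), so the numbers need not match
exactly. The function name `split_rhat` is my own.

```python
import numpy as np

def split_rhat(chains):
    """Classic split R-hat for an array of shape (n_chains, n_draws)."""
    chains = np.asarray(chains, dtype=float)
    half = chains.shape[1] // 2
    # Split each chain in two so that within-chain drift also shows up.
    splits = np.concatenate(
        [chains[:, :half], chains[:, half:2 * half]], axis=0
    )
    n = splits.shape[1]
    within = splits.var(axis=1, ddof=1).mean()
    between = n * splits.mean(axis=1).var(ddof=1)
    var_plus = (n - 1) / n * within + between / n
    return np.sqrt(var_plus / within)

rng_rhat = np.random.default_rng(1)
mixed = rng_rhat.normal(size=(2, 1000))      # two well-mixed chains
stuck = mixed + np.array([[0.0], [3.0]])     # second chain shifted by 3

print(split_rhat(mixed), split_rhat(stuck))
```

Well-mixed chains give values close to 1, while chains that sit in different
regions give clearly larger values.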

#### Effective sample size

```{r}
#| label: Plot ess_bulk
#| fig-height: 11
py$summary_df |>
    filter(parameter != "z") |>
    group_by(parameter) |>
    mutate(index = seq(0, n()-1)) |>
    mutate(parameter_i = paste0(parameter, "[", sprintf("%02d", index), "]")) |>
    relocate(parameter_i) |>
    ggplot() +
    aes(parameter_i, ess_bulk) +
    geom_point() +
    geom_line(aes(group = 1), alpha = 0.5) +
    geom_hline(yintercept = py$nchains*py$nsamples, linetype = "dotted") +
    coord_flip() +
    labs(title = "Effective sample sizes (ESS Bulk)",
         subtitle = "Dotted line: Actual number of MCMC draws.") +
    NULL
```

```{r}
#| label: Plot ess_tail
#| fig-height: 11
py$summary_df |>
    filter(parameter != "z") |>
    group_by(parameter) |>
    mutate(index = seq(0, n()-1)) |>
    mutate(parameter_i = paste0(parameter, "[", sprintf("%02d", index), "]")) |>
    relocate(parameter_i) |>
    ggplot() +
    aes(parameter_i, ess_tail) +
    geom_point() +
    geom_line(aes(group = 1), alpha = 0.5) +
    geom_hline(yintercept = py$nchains*py$nsamples, linetype = "dotted") +
    coord_flip() +
    labs(title = "Effective sample sizes (ESS Tail)",
         subtitle = "Dotted line: Actual number of MCMC draws.") +
    NULL
```

#### Transition errors

```{r}
#| label: Plot error summary
py$error_df |>
    ggplot() +
    aes(error_msg, relative) +
    geom_bar(aes(fill = error_msg), stat = "identity",
             position = position_dodge2(preserve = "single")) +
    facet_wrap(~phase) +
    theme(legend.position = "top") +
    labs(title = "Error Summary") +
    NULL

```


#### Variance parameter plots

```{python}
#| label: Diagnostic plots for variance of trafo
gs.plot_param(results, "y_igvar")
```

```{python}
#| label: Diagnostic plots for variance of pspline
gs.plot_param(results, "x_igvar")
```