---
title: "More Models!"
author: "Adrien Osakwe"
format: html
editor: visual
---
## Multiparameter Models - INCOMPLETE
Most of the examples we have explored so far have focused on modelling the posterior for **single-parameter models**, that is, models where $\theta$ is a scalar rather than a vector of length $d$. In practice, however, many phenomena require multiple parameters. We previously saw this with the Gaussian distribution, which has two parameters, and in multivariate settings.
Although it is necessary to formulate a multivariate generative process in terms of all of its parameters, we are not always interested in the full posterior. Oftentimes, we may only be interested in the **marginal posterior** for specific parameters in the model (recall the multivariate Gaussian in [chapter 8](./PosteriorApproximation.qmd)). You can imagine a scenario where we model an observation such as the temperature in a fridge as a Gaussian but are only interested in the posterior of the mean.
We will therefore explore how **marginalizing** can allow us to separate our parameters of interest from **nuisance parameters** in our analysis of the posterior.
### Marginal Posterior Distribution
Let's imagine a scenario where we are building a model with parameters $\theta = (\theta_1,\theta_2)$. We may view $\theta_2$ as a nuisance parameter and need a way of expressing the posterior solely for $\theta_1$, that is, $p(\theta_1|y)$. This can be achieved by integrating out $\theta_2$ as follows:
$$
p(\theta_1|y) = \int p(\theta|y)d\theta_2
$$
Which can be expanded as:
$$
p(\theta_1|y) = \int p(\theta_1,\theta_2|y)d\theta_2 \\
= \int p(\theta_1|\theta_2,y)p(\theta_2|y)d\theta_2
$$
Where $p(\theta_1|\theta_2,y)$ is the **conditional posterior distribution** of $\theta_1$ given $\theta_2$ and $y$.
### Example #1: Normal distribution with known variance
The normal distribution is a great example of a situation where a marginalized posterior may be of interest. In many cases, we may see the variance parameter in the normal distribution as a nuisance parameter, and may only be interested in the posterior distribution for the mean. We can therefore use it as a practical way of learning how to derive a marginal posterior.
We can start with a simpler setting where the variance is assumed to be known. We did something similar in a (\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_) previous chapter where the mean was known and the variance unknown. These models provide a simple approach for deriving the conditional posteriors in cases where neither parameter is known.
#### One observation
Consider a setting with one observation $Y$ coming from a normal distribution with known variance $\sigma_0^2 > 0$ and unknown mean $\theta$. The conjugate prior in this setting is another normal distribution:
$$
Y \sim Normal(\theta,\sigma_0^2) \\
\theta \sim Normal(\mu_0,\tau_0^2)
$$
With this formulation we have the following likelihood and prior pdfs:
$$
p(y|\theta) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left(-\frac{(y - \theta)^2}{2\sigma_0^2}\right) \propto \exp\left(-\frac{(y - \theta)^2}{2\sigma_0^2}\right) \\
$$
$$
p(\theta) = \frac{1}{\sqrt{2\pi\tau_0^2}} \exp\left(-\frac{(\theta - \mu_0)^2}{2\tau_0^2}\right) \propto \exp\left(-\frac{(\theta - \mu_0)^2}{2\tau_0^2}\right)
$$
with the unnormalized posterior being:
$$
p(\theta|y)\propto \exp\left(-\frac{(y - \theta)^2}{2\sigma_0^2}-\frac{(\theta - \mu_0)^2}{2\tau_0^2} \right) \\
\propto \exp\left(-\frac{\tau_0^2(y - \theta)^2 + \sigma_0^2(\theta - \mu_0)^2}{2\tau_0^2\sigma_0^2}\right) \\
\propto \exp\left(-\frac{\tau_0^2(\theta^2-2y\theta) + \sigma_0^2(\theta^2 -2\mu_0\theta)}{2\tau_0^2\sigma_0^2}\right) \\
\propto \exp\left(-\frac{(\sigma_0^2 + \tau_0^2)\theta^2 -2(\sigma_0^2\mu_0 + \tau_0^2y)\theta}{2\tau_0^2\sigma_0^2}\right) \\
\propto \exp\left(-\frac{\theta^2 -2\mu_1\theta}{2\tau_1^2}\right)
$$
Where
$$
\mu_1 = \frac{(\sigma_0^2\mu_0 + \tau_0^2y)}{(\sigma_0^2 + \tau_0^2)} \\
\tau_1^2 = \frac{\tau_0^2\sigma_0^2}{(\sigma_0^2 + \tau_0^2)}
$$
This is a posterior in the form of a normal distribution with mean $\mu_1$ and variance $\tau_1^2$. The posterior parameters can also be written in terms of **precision**, the inverse of the variance: the posterior precision is the sum of the prior and sampling precisions.
$$
\frac{1}{\tau_1^2} = \frac{1}{\tau_0^2} + \frac{1}{\sigma_0^2}
$$
Similarly, the posterior mean can be expressed as a **convex combination** of the prior mean and the observed data, with weights proportional to the prior and sampling precisions.
$$
\mu_1 = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{1}{\sigma_0^2}y}{\frac{1}{\tau_0^2} + \frac{1}{\sigma_0^2}}
$$
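As a quick sanity check, here is a minimal R sketch of this update. The prior parameters, known variance, and observation below are arbitrary illustrative values, not quantities from the text.

```{r}
# Single-observation conjugate update (illustrative values only)
mu0 <- 0        # prior mean
tau0_sq <- 4    # prior variance
sigma0_sq <- 1  # known sampling variance
y1 <- 2.5       # the single observation

# Posterior precision = prior precision + sampling precision
tau1_sq <- 1 / (1 / tau0_sq + 1 / sigma0_sq)
# Posterior mean = precision-weighted average of prior mean and observation
mu1 <- tau1_sq * (mu0 / tau0_sq + y1 / sigma0_sq)
c(posterior_mean = mu1, posterior_var = tau1_sq)
```

With a vague prior (large $\tau_0^2$) the posterior mean sits close to the observation, while a tight prior pulls it toward $\mu_0$.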
#### Many observations
The derivation for multiple observations works in a similar fashion, except that we now use a joint likelihood:
$$
p(y|\theta) = \prod_{i = 1}^np(y_i|\theta)
$$
To make the calculation simple, we can take advantage of the fact that the sample mean of normally distributed variables is itself normally distributed:
$$
\overline{Y} \sim Normal(\theta,\sigma_0^2/n)
$$
We can then use the solution for the single observation example (and now you see why we started with that :) ) to derive the posterior parameters as:
$$
\mu_n = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma_0^2}\overline{y}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}}
$$
and
$$
\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}
$$
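The same update, sketched in R for simulated data. The "true" parameters, sample size, and prior below are arbitrary illustrative choices.

```{r}
# Conjugate update for the mean with n observations and known variance
set.seed(1)
mu_true <- 2
sigma0 <- 1.5
n_obs <- 30
y_sim <- rnorm(n_obs, mu_true, sigma0)

mu0 <- 0
tau0_sq <- 10  # a fairly vague prior
# Posterior precision = prior precision + n * sampling precision
tau_n_sq <- 1 / (1 / tau0_sq + n_obs / sigma0^2)
# Posterior mean = precision-weighted average of prior mean and sample mean
mu_n <- tau_n_sq * (mu0 / tau0_sq + n_obs * mean(y_sim) / sigma0^2)
c(posterior_mean = mu_n, posterior_sd = sqrt(tau_n_sq))
```

As $n$ grows, the data term dominates and the posterior concentrates around $\overline{y}$.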
Now, if we had not derived the single-observation posterior first, we could have also done this the hard way. But since you're already here, I say why not both?
We start with expanding the joint likelihood:
$$
p(y|\theta) = \prod_{i = 1}^np(y_i|\theta) = \left(\frac{1}{\sqrt{2\pi\sigma_0^2}}\right)^n\exp\left(-\frac{\sum_{i=1}^n(y_i - \theta)^2}{2\sigma_0^2}\right)
$$
$$
\propto \exp\left(-\frac{\sum_{i=1}^n(y_i - \theta)^2}{2\sigma_0^2}\right)
$$
At first glance it may look daunting to find a closed-form solution to our problem, and identifying a sufficient statistic seems unlikely. However, we can rely on the achievements of our predecessors to get through this. Here, we use the **sum of squares decomposition**:
$$\sum_{i=1}^n(y_i - \theta)^2 = n(\overline{y} - \theta)^2 + \sum_{i=1}^n(y_i - \overline{y})^2$$
Where $\overline{y}$ is the mean of the observed values.
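Because the identity is purely algebraic, it is easy to verify numerically. A minimal sketch with arbitrary toy values (any choice of $\theta$ works):

```{r}
# Numerical check of the sum-of-squares decomposition (toy values)
set.seed(2)
y_check <- rnorm(10)
theta0 <- 1.7  # an arbitrary value of theta
lhs <- sum((y_check - theta0)^2)
rhs <- length(y_check) * (mean(y_check) - theta0)^2 +
  sum((y_check - mean(y_check))^2)
all.equal(lhs, rhs)
```

Substituting the decomposition into the likelihood, we now have the following: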
$$
p(y|\theta)\propto \exp\left(-\frac{n(\overline{y} - \theta)^2 + \sum_{i=1}^n(y_i - \overline{y})^2}{2\sigma_0^2}\right)
$$
$$
\propto \exp\left(-\frac{n(\overline{y} - \theta)^2}{2\sigma_0^2}\right)
$$
As the second term does not depend on $\theta$, it can be dropped. All of a sudden we have a pdf that depends on the data only through a sufficient statistic! We can now multiply this by the prior and see that it matches the form of the posterior we saw earlier:
$$
p(\theta|y) \propto \exp\left(-\frac{n(\overline{y} - \theta)^2}{2\sigma_0^2}\right)\exp\left(-\frac{(\theta - \mu_0)^2}{2\tau_0^2}\right)
$$
$$
= \exp\left(-\frac{n\tau_0^2(\overline{y} - \theta)^2 + \sigma_0^2(\theta - \mu_0)^2}{2\tau_0^2\sigma_0^2}\right)
$$
From here we can see that the posterior has the same form as in the single-observation example, except that the sample size $n$ now also appears.
### Example #2: Non-informative prior
We assumed that the mean followed a normal distribution but what if this is not the case? Recall that in [chapter 2](./Conjugate_Dist.qmd) we looked at different choices of priors. In particular, we considered the non-informative prior for cases where we don't have intuition on the shape of the prior.
Imagine a situation where both the mean **and** variance are unknown. Determining the joint prior may be more difficult, so the use of a non-informative prior may be a better choice.
Let's use the following improper prior:
$$
p(\theta,\sigma^2) \propto \frac{1}{\sigma^2}
$$
As the variance is unknown, our likelihood from example 1 will look as follows:
$$
p(y|\theta,\sigma^2) \propto \frac{1}{\sigma^n}\exp\left(-\frac{n(\overline{y} - \theta)^2 + \sum_{i=1}^n(y_i - \overline{y})^2}{2\sigma^2}\right)
$$
$$
= \frac{1}{\sigma^n}\exp\left(-\frac{(n-1)s^2 +n(\overline{y} - \theta)^2}{2\sigma^2}\right)
$$
Notice again that we must retain terms involving $\sigma$, as they are no longer constant. Here, we had to retain the sum of squared errors between the observations and the sample mean, but we can express it through another sufficient statistic, $s^2$, the **sample variance**:
$$
s^2 = \frac{\sum_{i=1}^n(y_i - \overline{y})^2}{n-1}
$$
This then gives us the following un-normalized posterior:
$$
p(\theta,\sigma^2|y) \propto \frac{1}{\sigma^{(n+2)}}\exp\left(-\frac{(n-1)s^2 + n(\overline{y} - \theta)^2}{2\sigma^2}\right)
$$
That can be entirely expressed in terms of two sufficient statistics!
We can now generate some samples from a normal distribution and see if this posterior can help us recover the true estimates.
```{r}
mu <- 0      # true mean
sigma <- 3   # true standard deviation
n <- 50      # number of observations
y <- rnorm(n, mu, sigma)

# Grids over which to evaluate the un-normalized posterior
mu_range <- seq(-5, 5, length = 50)
s_range <- seq(0.001, 5, length = 30)

# Evaluate the un-normalized posterior p(theta, sigma^2 | y) on the grid,
# using only the two sufficient statistics (sample mean and sample variance)
post_approx <- function(y, n, mu_range, s_range){
  mu_y <- mean(y)
  s_y <- var(y)
  post <- matrix(0, nrow = length(mu_range), ncol = length(s_range))
  for(i in 1:length(mu_range)){
    for(j in 1:length(s_range)){
      post[i, j] <- (s_range[j]^(-(n + 2))) *
        exp(-((n - 1) * s_y + n * (mu_y - mu_range[i])^2) / (2 * s_range[j]^2))
    }
  }
  return(post)
}

library(fields)
# Plot the posterior surface as more and more samples are included
for(n1 in seq(5, n, by = 15)){
  estimates <- post_approx(y[1:n1], n1, mu_range, s_range)
  # Normalize posteriors as densities are quite small
  estimates <- estimates / sum(estimates)
  image.plot(mu_range, s_range, estimates,
             xlab = expression(theta),
             ylab = expression(sigma),
             main = paste(n1, " Samples"))
  # contour(mu_range, s_range, estimates, add = TRUE)
}
```
We can see that, once enough samples are included, the posterior under this improper prior does a reasonable job of recovering the true parameters from the data.
### Example #3: Marginal posterior distribution for the mean
We can now marginalize the posterior to obtain the marginal posteriors for $\theta$ and $\sigma^2$. In practice this is (usually) quite simple: all that is required is to integrate the full posterior over the parameter space of the nuisance parameter (in this case, $\sigma^2$).
$$
p(\theta|y) = \int_0^\infty p(\theta,\sigma^2|y)d\sigma^2
$$
$$
\propto \int_0^\infty p(y|\theta,\sigma^2)p(\theta,\sigma^2)d\sigma^2
$$
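Before doing this analytically, here is a numerical illustration that reuses the grid approximation from the previous chunk: on a discrete grid, integrating out the nuisance parameter amounts to summing the normalized joint posterior over that axis. (The chunk grids $\sigma$ rather than $\sigma^2$, so this is a sketch of the idea rather than an exact evaluation of the integral above.)

```{r}
# Marginalize the grid posterior over sigma to approximate p(theta | y)
joint <- post_approx(y, n, mu_range, s_range)
joint <- joint / sum(joint)       # normalize over the grid
theta_marginal <- rowSums(joint)  # sum out the sigma axis
plot(mu_range, theta_marginal, type = "l",
     xlab = expression(theta), ylab = "approximate marginal posterior")
```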
In this setting, we define the joint prior as follows:
$$
p(\theta,\sigma^2) = p(\theta |\sigma^2)p(\sigma^2)
$$
Where
$$
\theta|\sigma^2 \sim Normal(\mu,\sigma^2/\lambda), \qquad p(\theta |\sigma^2) = \frac{1}{\sqrt{2\pi \frac{\sigma^2}{\lambda}}} \exp\left(-\frac{(\theta - \mu)^2}{2 \frac{\sigma^2}{\lambda}}\right)
$$
$$
\sigma^2 \sim InvGamma(\alpha,\beta), \qquad p(\sigma^2) = \frac{\beta^\alpha}{\Gamma(\alpha)} (\sigma^2)^{-(\alpha + 1)} \exp\left(-\frac{\beta}{\sigma^2}\right)
$$
$$
p(\theta,\sigma^2) \propto (\sigma^2)^{-\left(\alpha + \frac{3}{2}\right)}\exp\left(-\frac{\lambda(\theta - \mu)^2 + 2\beta}{2\sigma^2}\right)
$$
The full posterior is therefore
$$
p(\theta,\sigma^2 | y) \propto (\sigma^2)^{-\left(\frac{n}{2} + \alpha + \frac{3}{2}\right)}\exp\left(-\frac{(n-1)s^2 + n(\overline{y} - \theta)^2 + \lambda(\theta - \mu)^2 + 2\beta}{2\sigma^2}\right)
$$
$$
p(\theta|y) = \int_0^\infty p(\theta,\sigma^2|y)d\sigma^2 \propto \int_0^\infty (\sigma^2)^{-\left(\frac{n}{2} + \alpha + \frac{3}{2}\right)}\exp\left(-\frac{(n-1)s^2 + n(\overline{y} - \theta)^2 + \lambda(\theta - \mu)^2 + 2\beta}{2\sigma^2}\right)d\sigma^2
$$
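To sketch how the calculation finishes: the integrand is an un-normalized inverse-gamma density in $\sigma^2$, so the integral can be read off from the inverse-gamma normalizing constant, leaving

$$
p(\theta|y) \propto \left[(n-1)s^2 + n(\overline{y} - \theta)^2 + \lambda(\theta - \mu)^2 + 2\beta\right]^{-\left(\frac{n}{2} + \alpha + \frac{1}{2}\right)}
$$

Completing the square in $\theta$ shows that this is the kernel of a (non-standardized) Student-$t$ distribution with $n + 2\alpha$ degrees of freedom, so the marginal posterior for the mean has heavier tails than a normal.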
## Hierarchical Models