Curse of the Long Tail - Example of t-test failure
========================================================
author: Charles Allen
date: 2016-06-25
autosize: true
Educating the public on the dangers of non-normal distributions in t-tests
*DOCUMENTATION IS INLINE IN THE APP AT http://allen-net.shinyapps.io/TTestExamples/ *
t-test
========================================================
The t-test is one of the most common statistical tests performed
- Can determine whether metrics from two populations differ in a statistically significant manner
- Works best with *normal* distributions
- More information available on [wikipedia](https://en.wikipedia.org/wiki/Student%27s_t-test)
- Key metric is the "t-statistic", which is the difference in means divided by the standard error
- Depends on sample size ($n$ per group, assumed equal for both groups)
$$
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{(\mathrm{var}_1 + \mathrm{var}_2)/n}}
$$
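As a rough sketch of the formula above, the statistic can be computed by hand and compared with R's built-in `t.test`. The sample data and the names `a`, `b`, `t_manual` and `t_builtin` are illustrative assumptions, and the hand calculation only holds when both groups have the same size `n`.

```{r}
set.seed(42)                          # arbitrary seed, for reproducibility only
n <- 100                              # per-group sample size (assumed equal)
a <- rnorm(n, mean = 5000, sd = 500)  # illustrative sample for group 1
b <- rnorm(n, mean = 5100, sd = 500)  # illustrative sample for group 2

# Difference in sample means divided by the standard error
t_manual <- (mean(a) - mean(b)) / sqrt((var(a) + var(b)) / n)

# t.test() defaults to Welch's test, which reduces to the same formula for equal n
t_builtin <- unname(t.test(a, b)$statistic)
all.equal(t_manual, t_builtin)
```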
Normal distribution t-test
========================================================
A simple t-test is often performed against normally distributed data of the type below:
```{r}
gaussian_fn <- function(x, mu, sigma) {
  dnorm(x, mean = mu, sd = sigma)
}
mu <- 5000
sigma <- 500
x <- 1:10000
```
***
```{r, fig.width=6, fig.height=4, fig.align='center'}
plot(x, y = gaussian_fn(x, mu, sigma), xlab = "", ylab = "Density", type = 'l')
```
The t-test performs best against this type of distribution.
Non-normal distribution
========================================================
As an example of a distribution with a long tail, we can use the shape of [Planck's law](https://en.wikipedia.org/wiki/Planck%27s_law). Ignoring scaling factors, it takes the general form:
```{r}
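planck_fn <- function(x, a) {
  y <- x
  positive <- x > 0
  # The density is zero for non-positive x
  y[!positive] <- 0
  y[positive] <- 1 / (exp(a / x[positive]) - 1) / x[positive]^5
  # Return the full vector explicitly, not the value of the last assignment
  y
}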
x <- 1:10000
```
***
```{r, fig.width=6, fig.height=4, fig.align='center'}
plot(x, y = planck_fn(x / 500, 10), xlab = "", ylab = "Density", type = 'l')
```
The long tail at higher x-values can badly distort a simple t-test, yielding results that differ sharply from what visual inspection suggests.
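One way to see why the tail matters: the t-test compares means, and a long tail should pull the mean well away from the bulk of the curve. The sketch below is a minimal illustration under stated assumptions. `planck_shape`, `grid`, `draw`, the sample size of 200 and the second shape parameter of 11 are hypothetical choices, not taken from the app.

```{r}
# Compact restatement of the long-tail shape above (hypothetical helper name)
planck_shape <- function(x, a) ifelse(x > 0, 1 / (exp(a / x) - 1) / x^5, 0)

grid <- 1:10000
w <- planck_shape(grid / 500, a = 10)
w <- w / sum(w)                       # normalise the weights to sum to 1

# Mean and median of the discretised shape
pop_mean <- sum(grid * w)
pop_median <- grid[which(cumsum(w) >= 0.5)[1]]
c(mean = pop_mean, median = pop_median)

# Draw two samples with slightly different shape parameters and compare them
set.seed(1)                           # arbitrary seed, for reproducibility only
draw <- function(n, a) {
  sample(grid, n, replace = TRUE, prob = planck_shape(grid / 500, a))
}
p_value <- t.test(draw(200, a = 10), draw(200, a = 11))$p.value
p_value
```

Changing the shape parameters or the sample size and re-running gives a feel for how the p-value behaves on long-tailed samples.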
TTestExamples
========================================================
The t-test examples are [hosted online](http://allen-net.shinyapps.io/TTestExamples/) as an educational tool. The app allows easy manipulation of a normal and a long-tail distribution to show the effect on the p-value (loosely, a measure of how compatible the data are with both groups having the same mean) for visually similar distributions.
The sources for the app and this presentation are hosted on [github](http://github.com/drcrallen/TTestExamples).
For data with long tails, t-tests often yield misleading results when evaluating differences between two groups of data points. This can be mitigated in some circumstances by relying on the central limit theorem and applying the t-test to the *means* of groups of data. For example, if data are collected from various sites, taking the mean per site and then comparing the distribution of site means in group **A** against that in group **B** should yield a viable t-test.
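The per-site approach can be sketched as follows. This is a minimal illustration under stated assumptions: the log-normal draws stand in for long-tailed data, and `sites_per_group`, `obs_per_site`, `site_means` and the chosen parameters are hypothetical, not taken from the app.

```{r}
set.seed(7)                           # arbitrary seed, for reproducibility only

# Hypothetical layout: 30 sites per group, 50 observations per site
sites_per_group <- 30
obs_per_site <- 50

# One mean per site; log-normal draws are an assumed stand-in for long-tailed data
site_means <- function(meanlog) {
  replicate(sites_per_group,
            mean(rlnorm(obs_per_site, meanlog = meanlog, sdlog = 1)))
}

means_a <- site_means(meanlog = 0.0)  # group A
means_b <- site_means(meanlog = 0.2)  # group B

# t-test on the per-site means instead of the raw observations
site_test <- t.test(means_a, means_b)
site_test
```

A histogram of `means_a` (for example `hist(means_a)`) is a quick way to check whether the site means look closer to normal than the raw observations do.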