draft vignette on inference for paired data

simonpcouch · simonpcouch · commit 426ae8209b1d · 2023-05-09T14:17:30.000-04:00
diff --git a/vignettes/observed_stat_examples.Rmd b/vignettes/observed_stat_examples.Rmd
@@ -206,7 +206,7 @@ set.seed(1)
 
 gss_paired <- gss %>%
    mutate(
-      hours_previous = hours + 1.8 - rpois(nrow(.), 2),
+      hours_previous = hours + 5 - rpois(nrow(.), 4.8),
       diff = hours - hours_previous
    )
 
diff --git a/vignettes/paired.Rmd b/vignettes/paired.Rmd
@@ -0,0 +1,157 @@
+---
+title: "Tidy inference for paired data"
+description: "Conducting tests for paired independence on tidy data with infer."
+output: rmarkdown::html_vignette
+vignette: |
+  %\VignetteIndexEntry{Tidy inference for paired data}
+  %\VignetteEngine{knitr::rmarkdown}
+  \usepackage[utf8]{inputenc}
+---
+
+```{r settings, include=FALSE}
+knitr::opts_chunk$set(fig.width = 6, fig.height = 4.5) 
+options(digits = 4)
+```
+
+```{r load-packages, echo = FALSE, message = FALSE, warning = FALSE}
+library(ggplot2)
+library(dplyr)
+library(infer)
+```
+
+### Introduction
+
+In this vignette, we'll walk through conducting a randomization-based paired test of independence with infer.
+
+Throughout this vignette, we'll make use of the `gss` dataset supplied by `infer`, which contains a sample of data from the General Social Survey. See `?gss` for more information on the variables included and their source. Note that this data (and our examples on it) are for demonstration purposes only, and will not necessarily provide accurate estimates unless weighted properly. For these examples, let's suppose that this dataset is a representative sample of a population we want to learn about: American adults. The data looks like this:
+
+```{r glimpse-gss-actual, warning = FALSE, message = FALSE}
+dplyr::glimpse(gss)
+```
+
+Two sets of observations are paired if each observation in one column has a special correspondence or connection with exactly one observation in the other. For the purposes of this vignette, we'll simulate an additional data variable with a natural pairing: suppose that each of these survey respondents had provided the number of `hours` worked per week when surveyed 5 years prior, encoded as `hours_previous`.
+
+```{r}
+set.seed(1)
+
+gss_paired <- gss %>%
+   mutate(
+      hours_previous = hours + 5 - rpois(nrow(.), 4.8),
+      diff = hours - hours_previous
+   )
+
+gss_paired %>%
+   select(hours, hours_previous, diff)
+```
+
+The number of `hours` worked per week by a particular respondent has a special correspondence with the number of hours worked 5 years prior `hours_previous` by that same respondent. We'd like to test the null hypothesis that the `"mean"` hours worked per week did not change between the sampled time and five years prior.
+
+To carry out inference on paired data with infer, we pre-compute the difference between paired values at the beginning of the analysis, and use those differences as our values of interest.
+
+Here, we pre-compute the difference between paired observations as `diff`. The distribution of `diff` in the observed data looks like this:
+
+```{r plot-diff, echo = FALSE}
+unique_diff <- unique(gss_paired$diff)
+gss_paired %>%
+  ggplot2::ggplot() +
+  ggplot2::aes(x = diff) +
+  ggplot2::geom_histogram(bins = diff(range(unique_diff))) +
+  ggplot2::labs(x = "diff: Difference in Number of Hours Worked",
+                y = "Number of Responses") +
+  ggplot2::scale_x_continuous(breaks = c(range(unique_diff), 0))
+```
+
+From the looks of the distribution, most respondents worked a similar number of hours worked per week as they had 5 hours prior, though it seems like there may be a slight decline of number of hours worked per week in aggregate. (We know that the true effect is -.2 since we've simulated this data.)
+
+We calculate the observed statistic in the paired setting in the same way that we would outside of the paired setting. Using `specify()` and `calculate()`:
+
+```{r calc-obs-mean}
+# calculate the observed statistic
+observed_statistic <- 
+   gss_paired %>% 
+   specify(response = diff) %>% 
+   calculate(stat = "mean")
+```
+
+The observed statistic is `r observed_statistic`. Now, we want to compare this statistic to a null distribution, generated under the assumption that the true difference was actually zero, to get a sense of how likely it would be for us to see this observed difference if there were truly no change in hours worked per week in the population.
+
+Tests for paired data are carried out via the `null = "paired independence"` argument to `hypothesize()`.
+
+```{r generate-null}
+# generate the null distribution
+null_dist <- 
+   gss_paired %>% 
+   specify(response = diff) %>% 
+   hypothesize(null = "paired independence") %>%
+   generate(reps = 1000, type = "permute") %>%
+   calculate(stat = "mean")
+   
+null_dist
+```
+
+For each replicate, `generate()` carries out `type = "permute"` with `null = "paired independence"` by:
+
+* Randomly sampling a vector of signs (i.e. -1 or 1), probability .5 for either, with length equal to the input data, and
+* Multiplying the response variable by the vector of signs, "flipping" the observed values for a random subset of value in each replicate
+
+To get a sense for what this distribution looks like, and where our observed statistic falls, we can use `visualize()`:
+
+```{r visualize}
+# visualize the null distribution and test statistic
+null_dist %>%
+  visualize() + 
+  shade_p_value(observed_statistic,
+                direction = "two-sided")
+```
+
+It looks like our observed mean of `r observed_statistic` would be relatively unlikely if there were truly no change in mean number of hours worked per week over this time period.
+
+More exactly, we can calculate the p-value:
+
+```{r p-value}
+# calculate the p value from the test statistic and null distribution
+p_value <- null_dist %>%
+  get_p_value(obs_stat = observed_statistic,
+              direction = "two-sided")
+
+p_value
+```
+
+Thus, if the change in mean number of hours worked per week over this time period were truly zero, our approximation of the probability that we would see a test statistic as or more extreme than `r observed_statistic` is approximately `r p_value`.
+
+We can also generate a bootstrap confidence interval for the mean paired difference using `type = "bootstrap"` in `generate()`. As before, we use the pre-computed differences when generating bootstrap resamples:
+
+```{r generate-boot}
+# generate a bootstrap distribution
+boot_dist <- 
+   gss_paired %>% 
+   specify(response = diff) %>% 
+   hypothesize(null = "paired independence") %>%
+   generate(reps = 1000, type = "bootstrap") %>%
+   calculate(stat = "mean")
+   
+visualize(boot_dist)
+```
+
+Note that, unlike the null distribution of test statistics generated earlier with `type = "permute"`, this distribution is centered at `observed_statistic`. 
+
+Calculating a confidence interval:
+
+```{r confidence-interval}
+# calculate the confidence from the bootstrap distribution
+confidence_interval <- boot_dist %>%
+  get_confidence_interval(level = .95)
+
+confidence_interval
+```
+
+By default, `get_confidence_interval()` constructs the lower and upper bounds by taking the observations at the $(1 - .95) / 2$ and $1 - ((1-.95) / 2)$th percentiles. To instead build the confidence interval using the standard error of the bootstrap distribution, we can write:
+
+```{r}
+boot_dist %>%
+  get_confidence_interval(type = "se",
+                          point_estimate = observed_statistic,
+                          level = .95)
+```
+
+To learn more about randomization-based inference for paired observations, see the relevant chapter in [Introduction to Modern Statistics](https://openintro-ims.netlify.app/inference-paired-means.html).
diff --git a/vignettes/t_test.Rmd b/vignettes/t_test.Rmd
@@ -121,13 +121,6 @@ pt(observed_statistic, df = nrow(gss) - 1, lower.tail = FALSE)*2
 
 Note that the resulting $t$-statistics from these two theory-based approaches are the same.
 
-### Paired t-Test
-
-You might be interested in running a paired $t$-test. Paired $t$-tests can be used in situations when there is a natural pairing between values in distributions---a common example would be two columns, `before` and `after`, say, that contain measurements from a patient before and after some treatment. To compare these two distributions, then, we're not necessarily interested in how the two distributions look different altogether, but how these two measurements from each individual change across time. (Pairings don't necessarily have to be over time; another common usage is measurements from two married people, for example.) Thus, we can create a new column (see `mutate()` from the `dplyr` package if you're not sure how to do this) that is the difference between the two: `difference = after - before`, and then examine _this_ distribution to see how each individuals' measurements changed over time.
-
-Once we've `mutate()`d that new `difference` column, we can run a 1-sample $t$-test on it, where our null hypothesis is that `mu = 0` (i.e. the difference between these measurements before and after treatment is, on average, 0). To do so, we'd use the procedure outlined in the above section.
-
-
 ### 2-Sample t-Test
 
 2-Sample $t$-tests evaluate the difference in mean values of two populations using data randomly-sampled from the population that approximately follows a normal distribution. As an example, we'll test if Americans work the same number of hours a week regardless of whether they have a college degree or not using data from the `gss`. The `college` and `hours` variables allow us to do so:

Original file line number	Diff line number	Diff line change
`@@ -206,7 +206,7 @@ set.seed(1)`
`206`	`206`
`207`	`207`	`gss_paired <- gss %>%`
`208`	`208`	`mutate(`
`209`		`- hours_previous = hours + 1.8 - rpois(nrow(.), 2),`
	`209`	`+ hours_previous = hours + 5 - rpois(nrow(.), 4.8),`
`210`	`210`	`diff = hours - hours_previous`
`211`	`211`	`)`
`212`	`212`