merge pr #417: improve docs re: set.seed and reproducibility

simonpcouch · web-flow · commit 53a9666ec8ac · 2021-09-07T18:49:05.000-04:00
diff --git a/.Rbuildignore b/.Rbuildignore
@@ -26,3 +26,4 @@ README_files/
 ^pkgdown$
 ^\.github$
 ^LICENSE\.md$
+^man-roxygen$
diff --git a/R/calculate.R b/R/calculate.R
@@ -37,6 +37,7 @@
 #' cases will be NaN. The package will omit non-finite values from
 #' visualizations (with a warning) and raise an error in p-value calculations.
 #'
+#' @includeRmd man-roxygen/seeds.Rmd
 #' 
 #' @examples
 #'
diff --git a/R/fit.R b/R/fit.R
@@ -72,6 +72,8 @@ generics::fit
 #' multivariate analysis of variance and regression" (Marti J. Anderson,
 #' 2001), \doi{10.1139/cjfas-58-3-626}.
 #' 
+#' @includeRmd man-roxygen/seeds.Rmd
+#' 
 #' @examples
 #' # fit a linear model predicting number of hours worked per
 #' # week using respondent age and degree status.
diff --git a/R/generate.R b/R/generate.R
@@ -41,6 +41,8 @@
 #'   generation type was previously called `"simulate"`, which has been
 #'   superseded.
 #' }
+#' 
+#' @includeRmd man-roxygen/seeds.Rmd
 #'
 #' @examples
 #' # generate a null distribution by taking 200 bootstrap samples
diff --git a/man-roxygen/seeds.Rmd b/man-roxygen/seeds.Rmd
@@ -0,0 +1,32 @@
+# Reproducibility
+
+When using the infer package for research, or in other cases when exact reproducibility is a priority, be sure to set the seed for R's random number generator. infer will respect the random seed specified in the `set.seed()` function, returning the same result when `generate()`ing data given an identical seed. For instance, we can calculate the difference in mean `age` by `college` degree status from 5 permutation resamples of the `gss` dataset using the following code.
+
+```{r, include = FALSE}
+library(infer)
+```
+
+```{r}
+set.seed(1)
+
+gss %>%
+  specify(age ~ college) %>%
+  hypothesize(null = "independence") %>%
+  generate(reps = 5, type = "permute") %>%
+  calculate("diff in means", order = c("degree", "no degree"))
+```
+
+Setting the seed to the same value again and rerunning the same code will produce the same result.
+
+```{r}
+# set the seed
+set.seed(1)
+
+gss %>%
+  specify(age ~ college) %>%
+  hypothesize(null = "independence") %>%
+  generate(reps = 5, type = "permute") %>%
+  calculate("diff in means", order = c("degree", "no degree"))
+```
+
+Please keep this in mind when writing infer code that resamples with `generate()`.
diff --git a/man/calculate.Rd b/man/calculate.Rd
diff --git a/man/fit.infer.Rd b/man/fit.infer.Rd
diff --git a/man/generate.Rd b/man/generate.Rd
diff --git a/vignettes/infer.Rmd b/vignettes/infer.Rmd
@@ -116,6 +116,8 @@ Once we've asserted our null hypothesis using `hypothesize()`, we can construct
 Continuing on with our example above, about the average number of hours worked a week, we might write:
 
 ```{r generate-point, warning = FALSE, message = FALSE}
+set.seed(1)
+
 gss %>%
   specify(response = hours) %>%
   hypothesize(null = "point", mu = 40) %>%
@@ -124,6 +126,8 @@ gss %>%
 
 In the above example, we take 1000 bootstrap samples to form our null distribution.
 
+Note that we set the seed for random number generation with the `set.seed()` function before `generate()`ing. This is good practice when using the infer package for research, or in other cases when exact reproducibility is a priority. infer will respect the random seed specified in `set.seed()`, returning the same result when `generate()`ing data given an identical seed.
+
 To generate a null distribution for the independence of two variables, we could also randomly reshuffle the pairings of explanatory and response variables to break any existing association. For instance, to generate 1000 replicates that can be used to create a null distribution under the assumption that political party affiliation is not affected by age:
 
 ```{r generate-permute, warning = FALSE, message = FALSE}