Commit 53a9666

merge pr #417: improve docs re: set.seed and reproducibility
2 parents cd82f0e + 6e5c076 commit 53a9666

9 files changed, +204 -0 lines changed

.Rbuildignore

+1
@@ -26,3 +26,4 @@ README_files/
 ^pkgdown$
 ^\.github$
 ^LICENSE\.md$
+^man-roxygen$

R/calculate.R

+1
@@ -37,6 +37,7 @@
 #' cases will be NaN. The package will omit non-finite values from
 #' visualizations (with a warning) and raise an error in p-value calculations.
 #'
+#' @includeRmd man-roxygen/seeds.Rmd
 #'
 #' @examples
 #'

R/fit.R

+2
@@ -72,6 +72,8 @@ generics::fit
 #' multivariate analysis of variance and regression" (Marti J. Anderson,
 #' 2001), \doi{10.1139/cjfas-58-3-626}.
 #'
+#' @includeRmd man-roxygen/seeds.Rmd
+#'
 #' @examples
 #' # fit a linear model predicting number of hours worked per
 #' # week using respondent age and degree status.

R/generate.R

+2
@@ -41,6 +41,8 @@
 #' generation type was previously called `"simulate"`, which has been
 #' superseded.
 #' }
+#'
+#' @includeRmd man-roxygen/seeds.Rmd
 #'
 #' @examples
 #' # generate a null distribution by taking 200 bootstrap samples

man-roxygen/seeds.Rmd

+32
@@ -0,0 +1,32 @@
+# Reproducibility
+
+When using the infer package for research, or in other cases when exact reproducibility is a priority, be sure to set the seed for R's random number generator. infer will respect the random seed specified in the `set.seed()` function, returning the same result when `generate()`ing data given an identical seed. For instance, we can calculate the difference in mean `age` by `college` degree status from 5 permutation resamples of the `gss` dataset using the following code.
+
+```{r, include = FALSE}
+library(infer)
+```
+
+```{r}
+set.seed(1)
+
+gss %>%
+  specify(age ~ college) %>%
+  hypothesize(null = "independence") %>%
+  generate(reps = 5, type = "permute") %>%
+  calculate("diff in means", order = c("degree", "no degree"))
+```
+
+Setting the seed to the same value again and rerunning the same code will produce the same result.
+
+```{r}
+# set the seed
+set.seed(1)
+
+gss %>%
+  specify(age ~ college) %>%
+  hypothesize(null = "independence") %>%
+  generate(reps = 5, type = "permute") %>%
+  calculate("diff in means", order = c("degree", "no degree"))
+```
+
+Please keep this in mind when writing infer code that utilizes resampling with `generate()`.
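
The behaviour documented in `seeds.Rmd` can be illustrated with base R alone, for readers who want to check it without infer installed. The sketch below is not part of this commit: `permuted_diffs()` and the toy data are hypothetical, and the helper only loosely mimics a permutation-based difference in means rather than reproducing infer's internals.

```r
# Hypothetical helper (not part of infer): reshuffle the group labels `reps`
# times and return the difference in group means for each reshuffle.
permuted_diffs <- function(response, group, reps) {
  vapply(seq_len(reps), function(i) {
    shuffled <- sample(group)  # permute labels, breaking any association
    mean(response[shuffled == "a"]) - mean(response[shuffled == "b"])
  }, numeric(1))
}

# Toy data, made up purely for illustration
response <- c(23, 35, 41, 29, 52, 47, 38, 60)
group <- rep(c("a", "b"), each = 4)

set.seed(1)
first <- permuted_diffs(response, group, reps = 5)

# Resetting the seed to the same value replays the same random draws
set.seed(1)
second <- permuted_diffs(response, group, reps = 5)

identical(first, second)  # TRUE

# Without resetting the seed, the generator continues from its current
# state, so this run is not guaranteed to match the first two.
third <- permuted_diffs(response, group, reps = 5)
```

Because `sample()` draws from R's global random number generator, the seed has to be set immediately before each run that should be reproduced, not once per session.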

man/calculate.Rd

+54
Generated file; diff not rendered.

man/fit.infer.Rd

+54
Generated file; diff not rendered.

man/generate.Rd

+54
Generated file; diff not rendered.

vignettes/infer.Rmd

+4
@@ -116,6 +116,8 @@ Once we've asserted our null hypothesis using `hypothesize()`, we can construct
 Continuing on with our example above, about the average number of hours worked a week, we might write:
 
 ```{r generate-point, warning = FALSE, message = FALSE}
+set.seed(1)
+
 gss %>%
   specify(response = hours) %>%
   hypothesize(null = "point", mu = 40) %>%
@@ -124,6 +126,8 @@ gss %>%
 
 In the above example, we take 1000 bootstrap samples to form our null distribution.
 
+Note that, before `generate()`ing, we've set the seed for random number generation with the `set.seed()` function. When using the infer package for research, or in other cases when exact reproducibility is a priority, this is good practice. infer will respect the random seed specified in the `set.seed()` function, returning the same result when `generate()`ing data given an identical seed.
+
 To generate a null distribution for the independence of two variables, we could also randomly reshuffle the pairings of explanatory and response variables to break any existing association. For instance, to generate 1000 replicates that can be used to create a null distribution under the assumption that political party affiliation is not affected by age:
 
 ```{r generate-permute, warning = FALSE, message = FALSE}
