
Commit 378c8a7

Modify vignettes and articles to match tidy formatting
Added missing commas and fixed formatting issues throughout the vignettes. Backticks around package names were removed, and missing parentheses were added after function names.
1 parent 1d069b3 commit 378c8a7

6 files changed: +167 -119 lines changed

vignettes/anova.Rmd (+4 -4)

@@ -19,9 +19,9 @@ library(dplyr)
 library(infer)
 ```
 
-In this vignette, we'll walk through conducting an analysis of variance (ANOVA) test using `infer`. ANOVAs are used to analyze differences in group means.
+In this vignette, we'll walk through conducting an analysis of variance (ANOVA) test using infer. ANOVAs are used to analyze differences in group means.
 
-Throughout this vignette, we'll make use of the `gss` dataset supplied by `infer`, which contains a sample of data from the General Social Survey. See `?gss` for more information on the variables included and their source. Note that this data (and our examples on it) are for demonstration purposes only, and will not necessarily provide accurate estimates unless weighted properly. For these examples, let's suppose that this dataset is a representative sample of a population we want to learn about: American adults. The data looks like this:
+Throughout this vignette, we'll make use of the `gss` dataset supplied by infer, which contains a sample of data from the General Social Survey. See `?gss` for more information on the variables included and their source. Note that this data (and our examples on it) are for demonstration purposes only, and will not necessarily provide accurate estimates unless weighted properly. For these examples, let's suppose that this dataset is a representative sample of a population we want to learn about: American adults. The data looks like this:
 
 ```{r glimpse-gss-actual, warning = FALSE, message = FALSE}
 dplyr::glimpse(gss)

@@ -57,7 +57,7 @@ observed_f_statistic <- gss %>%
 
 The observed $F$ statistic is `r observed_f_statistic`. Now, we want to compare this statistic to a null distribution, generated under the assumption that age and political party affiliation are not actually related, to get a sense of how likely it would be for us to see this observed statistic if there were actually no association between the two variables.
 
-We can `generate` an approximation of the null distribution using randomization. The randomization approach permutes the response and explanatory variables, so that each person's party affiliation is matched up with a random age from the sample in order to break up any association between the two.
+We can `generate()` an approximation of the null distribution using randomization. The randomization approach permutes the response and explanatory variables, so that each person's party affiliation is matched up with a random age from the sample in order to break up any association between the two.
 
 ```{r generate-null-f, warning = FALSE, message = FALSE}
 # generate the null distribution using randomization

@@ -116,7 +116,7 @@ p_value
 
 Thus, if there were really no relationship between age and political party affiliation, our approximation of the probability that we would see a statistic as or more extreme than `r observed_f_statistic` is approximately `r p_value`.
 
-To calculate the p-value using the true $F$ distribution, we can use the `pf` function from base R. This function allows us to situate the test statistic we calculated previously in the $F$ distribution with the appropriate degrees of freedom.
+To calculate the p-value using the true $F$ distribution, we can use the `pf()` function from base R. This function allows us to situate the test statistic we calculated previously in the $F$ distribution with the appropriate degrees of freedom.
 
 ```{r}
 pf(observed_f_statistic$stat, 3, 496, lower.tail = FALSE)
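
The `generate-null-f` chunk is cut off at its opening comment in the hunk above. A minimal sketch of the permutation pipeline it builds, assuming the vignette's `age ~ partyid` specification and an illustrative `reps = 1000` (neither is confirmed by this hunk):

```r
library(dplyr)
library(infer)

# permute party affiliation against age to approximate the null distribution;
# age ~ partyid and reps = 1000 are assumptions, not shown in this hunk
null_dist <- gss %>%
  specify(age ~ partyid) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "F")
```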

vignettes/chi_squared.Rmd (+67 -43)

@@ -21,9 +21,9 @@ library(infer)
 
 ### Introduction
 
-In this vignette, we'll walk through conducting a $\chi^2$ (chi-squared) test of independence and a chi-squared goodness of fit test using `infer`. We'll start out with a chi-squared test of independence, which can be used to test the association between two categorical variables. Then, we'll move on to a chi-squared goodness of fit test, which tests how well the distribution of one categorical variable can be approximated by some theoretical distribution.
+In this vignette, we'll walk through conducting a $\chi^2$ (chi-squared) test of independence and a chi-squared goodness of fit test using infer. We'll start out with a chi-squared test of independence, which can be used to test the association between two categorical variables. Then, we'll move on to a chi-squared goodness of fit test, which tests how well the distribution of one categorical variable can be approximated by some theoretical distribution.
 
-Throughout this vignette, we'll make use of the `gss` dataset supplied by `infer`, which contains a sample of data from the General Social Survey. See `?gss` for more information on the variables included and their source. Note that this data (and our examples on it) are for demonstration purposes only, and will not necessarily provide accurate estimates unless weighted properly. For these examples, let's suppose that this dataset is a representative sample of a population we want to learn about: American adults. The data looks like this:
+Throughout this vignette, we'll make use of the `gss` dataset supplied by infer, which contains a sample of data from the General Social Survey. See `?gss` for more information on the variables included and their source. Note that this data (and our examples on it) are for demonstration purposes only, and will not necessarily provide accurate estimates unless weighted properly. For these examples, let's suppose that this dataset is a representative sample of a population we want to learn about: American adults. The data looks like this:
 
 ```{r glimpse-gss-actual, warning = FALSE, message = FALSE}
 dplyr::glimpse(gss)

@@ -41,10 +41,14 @@ gss %>%
   ggplot2::aes(x = finrela, fill = college) +
   ggplot2::geom_bar(position = "fill") +
   ggplot2::scale_fill_brewer(type = "qual") +
-  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45,
-                                                     vjust = .5)) +
-  ggplot2::labs(x = "finrela: Self-Identification of Income Class",
-                y = "Proportion")
+  ggplot2::theme(axis.text.x = ggplot2::element_text(
+    angle = 45,
+    vjust = .5
+  )) +
+  ggplot2::labs(
+    x = "finrela: Self-Identification of Income Class",
+    y = "Proportion"
+  )
 ```
 
 If there were no relationship, we would expect to see the purple bars reaching to the same height, regardless of income class. Are the differences we see here, though, just due to random noise?

@@ -61,7 +65,7 @@ observed_indep_statistic <- gss %>%
 
 The observed $\chi^2$ statistic is `r observed_indep_statistic`. Now, we want to compare this statistic to a null distribution, generated under the assumption that these variables are not actually related, to get a sense of how likely it would be for us to see this observed statistic if there were actually no association between education and income.
 
-We can `generate` the null distribution in one of two ways---using randomization or theory-based methods. The randomization approach approximates the null distribution by permuting the response and explanatory variables, so that each person's educational attainment is matched up with a random income from the sample in order to break up any association between the two.
+We can `generate()` the null distribution in one of two ways---using randomization or theory-based methods. The randomization approach approximates the null distribution by permuting the response and explanatory variables, so that each person's educational attainment is matched up with a random income from the sample in order to break up any association between the two.
 
 ```{r generate-null-indep, warning = FALSE, message = FALSE}
 # generate the null distribution using randomization
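
This chunk is likewise truncated at its opening comment. A minimal sketch of the pipeline it builds, assuming the `college ~ finrela` specification used elsewhere in this vignette and an illustrative `reps = 1000`:

```r
library(dplyr)
library(infer)

# permute degree status against income class to approximate the null
# distribution; reps = 1000 is an assumption, not shown in this hunk
null_dist_sim <- gss %>%
  specify(college ~ finrela) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq")
```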
@@ -86,9 +90,10 @@ To get a sense for what these distributions look like, and where our observed st
 ```{r visualize-indep, warning = FALSE, message = FALSE}
 # visualize the null distribution and test statistic!
 null_dist_sim %>%
-  visualize() + 
+  visualize() +
   shade_p_value(observed_indep_statistic,
-                direction = "greater")
+    direction = "greater"
+  )
 ```
 
 We could also visualize the observed statistic against the theoretical null distribution. To do so, use the `assume()` verb to define a theoretical null distribution and then pass it to `visualize()` like a null distribution outputted from `generate()` and `calculate()`.

@@ -98,28 +103,32 @@ We could also visualize the observed statistic against the theoretical null dist
 gss %>%
   specify(college ~ finrela) %>%
   assume(distribution = "Chisq") %>%
-  visualize() + 
+  visualize() +
   shade_p_value(observed_indep_statistic,
-                direction = "greater")
+    direction = "greater"
+  )
 ```
 
 To visualize both the randomization-based and theoretical null distributions to get a sense of how the two relate, we can pipe the randomization-based null distribution into `visualize()`, and further provide `method = "both"`.
 
 ```{r visualize-indep-both, warning = FALSE, message = FALSE}
 # visualize both null distributions and the test statistic!
 null_dist_sim %>%
-  visualize(method = "both") + 
+  visualize(method = "both") +
   shade_p_value(observed_indep_statistic,
-                direction = "greater")
+    direction = "greater"
+  )
 ```
 
 Either way, it looks like our observed test statistic would be quite unlikely if there were actually no association between education and income. More exactly, we can approximate the p-value with `get_p_value()`:
 
 ```{r p-value-indep, warning = FALSE, message = FALSE}
 # calculate the p value from the observed statistic and null distribution
 p_value_independence <- null_dist_sim %>%
-  get_p_value(obs_stat = observed_indep_statistic,
-              direction = "greater")
+  get_p_value(
+    obs_stat = observed_indep_statistic,
+    direction = "greater"
+  )
 
 p_value_independence
 ```

@@ -149,8 +158,10 @@ gss %>%
   ggplot2::aes(x = finrela) +
   ggplot2::geom_bar() +
   ggplot2::geom_hline(yintercept = 466.3, col = "red") +
-  ggplot2::labs(x = "finrela: Self-Identification of Income Class",
-                y = "Number of Responses")
+  ggplot2::labs(
+    x = "finrela: Self-Identification of Income Class",
+    y = "Number of Responses"
+  )
 ```
 
 It seems like a uniform distribution may not be the most appropriate description of the data--many more people describe their income as average than any of the other options. Let's now test whether this difference in distributions is statistically significant.

@@ -161,13 +172,17 @@ First, to carry out this hypothesis test, we would calculate our observed statis
 # calculate the observed statistic
 observed_gof_statistic <- gss %>%
   specify(response = finrela) %>%
-  hypothesize(null = "point",
-              p = c("far below average" = 1/6,
-                    "below average" = 1/6,
-                    "average" = 1/6,
-                    "above average" = 1/6,
-                    "far above average" = 1/6,
-                    "DK" = 1/6)) %>%
+  hypothesize(
+    null = "point",
+    p = c(
+      "far below average" = 1 / 6,
+      "below average" = 1 / 6,
+      "average" = 1 / 6,
+      "above average" = 1 / 6,
+      "far above average" = 1 / 6,
+      "DK" = 1 / 6
+    )
+  ) %>%
   calculate(stat = "Chisq")
 ```

@@ -178,13 +193,17 @@ The observed statistic is `r observed_gof_statistic`. Now, generating a null dis
 # generating a null distribution, assuming each income class is equally likely
 null_dist_gof <- gss %>%
   specify(response = finrela) %>%
-  hypothesize(null = "point",
-              p = c("far below average" = 1/6,
-                    "below average" = 1/6,
-                    "average" = 1/6,
-                    "above average" = 1/6,
-                    "far above average" = 1/6,
-                    "DK" = 1/6)) %>%
+  hypothesize(
+    null = "point",
+    p = c(
+      "far below average" = 1 / 6,
+      "below average" = 1 / 6,
+      "average" = 1 / 6,
+      "above average" = 1 / 6,
+      "far above average" = 1 / 6,
+      "DK" = 1 / 6
+    )
+  ) %>%
   generate(reps = 1000, type = "draw") %>%
   calculate(stat = "Chisq")
 ```

@@ -194,9 +213,10 @@ Again, to get a sense for what these distributions look like, and where our obse
 ```{r visualize-indep-gof, warning = FALSE, message = FALSE}
 # visualize the null distribution and test statistic!
 null_dist_gof %>%
-  visualize() + 
+  visualize() +
   shade_p_value(observed_gof_statistic,
-                direction = "greater")
+    direction = "greater"
+  )
 ```
 
 This statistic seems like it would be quite unlikely if income class self-identification actually followed a uniform distribution! How unlikely, though? Calculating the p-value:

@@ -205,7 +225,8 @@ This statistic seems like it would be quite unlikely if income class self-identi
 # calculate the p-value
 p_value_gof <- null_dist_gof %>%
   get_p_value(observed_gof_statistic,
-              direction = "greater")
+    direction = "greater"
+  )
 
 p_value_gof
 ```

@@ -218,17 +239,20 @@ To calculate the p-value using the true $\chi^2$ distribution, we can use the `p
 pchisq(observed_gof_statistic$stat, 5, lower.tail = FALSE)
 ```
 
-Again, equivalently to the theory-based approach shown above, the package supplies a wrapper function, `chisq_test`, to carry out Chi-Squared goodness of fit tests on tidy data. The syntax goes like this:
+Again, equivalently to the theory-based approach shown above, the package supplies a wrapper function, `chisq_test()`, to carry out Chi-Squared goodness of fit tests on tidy data. The syntax goes like this:
 
 ```{r chisq-gof-wrapper, message = FALSE, warning = FALSE}
-chisq_test(gss,
-           response = finrela,
-           p = c("far below average" = 1/6,
-                 "below average" = 1/6,
-                 "average" = 1/6,
-                 "above average" = 1/6,
-                 "far above average" = 1/6,
-                 "DK" = 1/6))
+chisq_test(gss,
+  response = finrela,
+  p = c(
+    "far below average" = 1 / 6,
+    "below average" = 1 / 6,
+    "average" = 1 / 6,
+    "above average" = 1 / 6,
+    "far above average" = 1 / 6,
+    "DK" = 1 / 6
+  )
+)
 ```
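
The same wrapper also covers the independence test from earlier in the vignette; a minimal usage sketch, assuming the formula interface documented for infer's `chisq_test()`:

```r
library(infer)

# theory-based chi-squared test of independence in one call;
# the response ~ explanatory formula form is assumed here
chisq_test(gss, college ~ finrela)
```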