HarvardX PH125.4x -- Data Science Inference and Modeling.Rmd
---
title: 'HarvardX PH125.4x Data Science: Inference and Modeling'
author: "John HHU"
date: '2022-07-02'
output: html_document
---
## Course / Section 1: Parameters and Estimates / Parameters and Estimates
# Sampling Model Parameters and Estimates
To help us understand the connection between polls and the probability theory that we have learned, let's construct a scenario that we can work through together and that is similar to the one that pollsters face. We will use an urn instead of voters. And because pollsters are competing with other pollsters for media attention, we will imitate that by having our competition with a $25 prize. The challenge is to guess the spread between the proportion of blue and red balls in this urn. Before making a prediction, you can take a sample, with replacement, from the urn.

To mimic the fact that running polls is expensive, it will cost you $0.10 per bead you sample. So if your sample size is 250 and you win, you'll break even, as you'll have to pay me $25 to collect your $25.

Your entry into the competition can be an interval. If the interval you submit contains the true proportion, you get half what you paid and pass to the second phase of the competition. In the second phase of the competition, the entry with the smallest interval is selected as the winner.


The dslabs package includes a function that shows a random draw from the urn that we just saw. Here's the code that you can write to see a sample. And here is a sample with 25 beads. OK, now that you know the rules, think about how you would construct your interval. How many beads would you sample, et cetera? Notice that we have just described a simple sampling model for opinion polls.



The beads inside the urn represent the individuals that will vote on election day. Those that will vote Republican are represented with red beads and the Democrats with blue beads. For simplicity, assume there are no other colors, that there are just two parties. ***We want to predict the proportion of blue beads in the urn***. Let's call this quantity p, which in turn tells us the proportion of red beads, 1 minus p, and the spread, p minus (1 minus p), which simplifies to 2p minus 1.



In statistical textbooks, *the beads in the urn are called the population*. *The proportion of blue beads in the population, p, is called a parameter*. *The 25 beads that we saw in an earlier plot after we sampled, that's called a sample*. [][**The task of statistical inference is to predict the parameter, p, using the observed data in the sample**].

Now, can we do this with just the 25 observations we showed you? Well, they are certainly informative. For example, given that we see 13 red and 12 blue, it is unlikely that p is bigger than 0.9 or smaller than 0.1, because if it were, it would be improbable to see 13 red and 12 blue. But are we ready to predict with certainty that there are more red beads than blue?
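This intuition can be checked directly with the binomial distribution (a quick sketch, not part of the course code): if p were really 0.9 or 0.1, an outcome like 12 blue beads out of 25 would be extraordinarily unlikely.

```{r}
# Probability of observing 12 or fewer blue beads in 25 draws if p = 0.9
pbinom(12, size = 25, prob = 0.9)
# Compare with p = 0.1: observing 13 or more blue beads is just as unlikely
1 - pbinom(12, size = 25, prob = 0.1)
```

Both probabilities are vanishingly small, which is why such extreme values of p can be ruled out even from a sample of 25.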

OK, **what we want to do is construct an estimate of p using only the information we observe**. An estimate can be thought of as a summary of the observed data that we think is informative about the parameter of interest. It seems intuitive to think that the proportion of blue beads in the sample, which in this case is 0.48, must be at least related to the actual proportion p. But do we simply predict p to be 0.48?

[][**First, note that the sample proportion is a random variable**]. If we run the command take_poll(25), say four times, we get four different answers. Each time the sample is different and the sample proportion is different. The sample proportion is a random variable. Note that in the four random samples we show, the sample proportion ranges from 0.44 to 0.6. **By describing the distribution of this random variable, we'll be able to gain insights into how good this estimate is and how we can make it better**.
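This variability is easy to see by simulation (a sketch: `take_poll()` draws from the fixed dslabs urn and plots the result, so here we instead sample directly from an urn with an assumed p = 0.45):

```{r}
# Four simulated polls of 25 beads each from an urn with assumed p = 0.45
set.seed(1)
p <- 0.45
x_bar <- replicate(4, mean(sample(c(0, 1), 25, replace = TRUE, prob = c(1 - p, p))))
x_bar  # four different sample proportions: X_bar is a random variable
```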
[][Textbook link]
This video matches the textbook sections on the sampling model for polls and the first part of populations, samples, parameters and estimates.
https://rafalab.github.io/dsbook/inference.html#the-sampling-model-for-polls
https://rafalab.github.io/dsbook/inference.html#populations-samples-parameters-and-estimates
[][Key points]
[][* The task of statistical inference is to estimate an unknown population parameter using observed data from a sample. *]
In a sampling model, the collection of elements in the urn is called the population.
[][* A parameter is a number that summarizes data for an entire population. *]
A sample is observed data from a subset of the population.
An estimate is a summary of the observed data about a parameter that we believe is informative. It is a data-driven guess of the population parameter.
We want to predict the proportion of the blue beads in the urn, the parameter p . The proportion of red beads in the urn is 1 - p and the spread is 2p - 1.
The sample proportion is a random variable. Sampling gives random results drawn from the population distribution.
Code: Function for taking a random draw from a specific urn
The dslabs package includes a function for taking a random draw of size n from the urn described in the video:
```{r}
# take a random draw of 25 beads from the urn
library(tidyverse)
library(dslabs)
take_poll(25)  # draw 25 beads
```
# The Sample Average
[][*Taking an opinion poll is being modeled as taking a random sample from an urn*]. **We are proposing the use of the proportion of blue beads in our sample as an estimate of the parameter p**. Once we have this estimate, we can easily report an estimate of the spread, 2p minus 1.

But for simplicity, we will illustrate the concept of statistical inference for estimating p. [][We will use our knowledge of probability to defend our use of the sample proportion, and quantify how close we think it is from the population proportion p]. *We start by defining the random variable X*. X is going to be 1 if we pick a blue bead at random, and 0 if it's red.


[][*This implies that we're assuming that the population, the beads in the urn, are a list of 0s and 1s*]. If we sample N beads, then the average of the draws X_1 through X_N is equivalent to the proportion of blue beads in our sample. *This is because adding the Xs is equivalent to counting the blue beads, and dividing by the total N turns this into a proportion*. We use the symbol X-bar to represent this average. In general, in statistics textbooks, a bar on top of a symbol means the average.


[][***The theory we just learned about the sum of draws becomes useful, because we know the distribution of the sum N times X-bar, and therefore the distribution of the average X-bar, because N is a non-random constant***]. For simplicity, let's assume that the draws are independent: after we see each sampled bead, we return it to the urn. It's a sample with replacement. *In this case, what do we know about the distribution of the sum of draws?* First, we know that the expected value of the sum of draws is N times the average of the values in the urn. We know that the average of the 0s and 1s in the urn must be the proportion p, the value we want to estimate.

Here, we encounter an important difference with what we did in the probability module. We don't know what is in the urn. We know there are blue and red beads, but we don't know how many of each. This is what we're trying to find out. We're trying to estimate p. *Just like we use variables to define unknowns in systems of equations, in statistical inference, we define parameters to define unknown parts of our models*. In the urn model we are using to mimic an opinion poll, we do not know the proportion of blue beads in the urn. We define the parameter p to represent this quantity. We are going to estimate this parameter.

Note that the ideas presented here, on how we estimate parameters and provide insights into how good these estimates are, extrapolate to many data science tasks. For example, we may ask, [][what is the difference in health improvement between patients receiving treatment and a control group]? We may ask, what are the health effects of smoking on a population? What are the differences in racial groups of fatal shootings by police? What is the rate of change in life expectancy in the US during the last 10 years? All these questions can be framed as a task of estimating a parameter from a sample.



[][Textbook ]
This video matches the textbook section on the sample average and the textbook section on parameters.
https://rafalab.github.io/dsbook/inference.html#the-sample-average
https://rafalab.github.io/dsbook/inference.html#parameters
[][Key points]
[][ Many common data science tasks can be framed as estimating a parameter from a sample. ]
We illustrate statistical inference by walking through the process to estimate p. From the estimate of p, we can easily calculate an estimate of the spread, 2p-1.
[][* Consider the random variable X that is 1 if a blue bead is chosen and 0 if a red bead is chosen. The proportion of blue beads in the draws is the average of the draws X_1, ..., X_N. *]
X_bar is the sample average. In statistics, a bar on top of a symbol denotes the average. X_bar is a random variable because it is the average of random draws: each time we take a sample, X_bar is different. (X_1, X_2, ..., X_N are the individual draws, each 1 or 0.)
X_bar = (X_1 + X_2 + ... + X_N)/N
[][* The expected value of the number of blue beads drawn in N draws, N X_bar, is N times the proportion of blue beads in the urn, Np. However, we do not know the true proportion: we are trying to estimate this parameter p.]
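The identity between the sample average and the sample proportion is easy to confirm on a toy vector of 0/1 draws:

```{r}
# The sample average of 0/1 draws is exactly the sample proportion
x <- c(1, 0, 1, 1, 0)   # five draws: 1 = blue bead, 0 = red bead
sum(x)                  # counts the blue beads: 3
mean(x)                 # X_bar, the sample proportion: 0.6
```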
```{python}
# Aside: tallying occurrences of each distinct element in a list
list_a = ["a", "a", "a", "b", "b", "c", "d", "d", "d", "c", "e", "f", "f"]
list_a_dist = list(set(list_a))                  # the distinct elements
num = {i: list_a.count(i) for i in list_a_dist}  # count each element
print(num)  # e.g. {'d': 3, 'c': 2, 'b': 2, 'f': 2, 'a': 3, 'e': 1}
```
# Polling versus Forecasting
Before we continue, let's make an important clarification related to the practical problem of forecasting the election. If a poll is conducted 4 months before the election, it is estimating the p for that moment, not for election day.


But, note that the p for election night might be different since people's opinions fluctuate through time. The polls provided the night before the election tend to be the most accurate since opinions don't change that much in a couple of days. *However, forecasters try to build tools that model how opinions vary across time and try to predict the election day result, taking into consideration the fact that opinions fluctuate*. We'll describe some approaches for doing this in a later section.
[][Textbook link]
This video corresponds to the textbook section on polling versus forecasting.
https://rafalab.github.io/dsbook/inference.html#polling-versus-forecasting
[][Key points]
A poll taken in advance of an election estimates p for that moment, not for election day.
In order to predict election results, forecasters try to use early estimates of p to predict p on election day. We discuss some approaches in later sections.
# Properties of Our Estimate
*To understand how good our estimate is, we'll describe the statistical properties of the random variable we just defined, the sample proportion*.

Note that if we multiply by N, [][**N times X bar is the sum of independent draws, so the rules we covered in the probability module apply**]. Using what we have learned, the expected value of the sum N times X bar is N times the average of the urn, p. So dividing by the non-random constant N gives us that the expected value of the average X bar is p. {*X bar is the random variable here; its expected value is the fixed, unknown parameter p*}



We can write this using our mathematical notation as E(X_bar) = p. We can also use what we learned to figure out the standard error. [][***We know that the standard error of the sum is the square root of N times the standard deviation of the values in the urn***]. Can we compute the standard deviation of the urn? We learned a formula that tells us it is (1 - 0) times the square root of p times (1 - p), which is simply the square root of p(1 - p).

Because we are dividing the sum by N, we arrive at the following formula for the standard error of the average. [][***The standard error of the average is the square root of p times (1 - p), divided by the square root of N***]. {*because we multiply by the square root of N and then divide by N, we end up dividing by the square root of N*}
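Plugging in numbers makes the formula concrete (a sketch; p = 0.51 is assumed here, since in practice the true value is unknown):

```{r}
# SE of the sample proportion for an assumed p = 0.51 at two poll sizes
p <- 0.51
sqrt(p * (1 - p) / 25)    # N = 25: about 0.1
sqrt(p * (1 - p) / 1000)  # N = 1000: about 0.016
```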

This result reveals the power of polls. *The expected value of the sample proportion X bar is the parameter of interest p*, and we can make the standard error as small as we want by increasing the sample size N. The law of large numbers tells us that with a large enough poll, our estimate converges to p. If we take a large enough poll to make our standard error, say, about 0.01, we'll be quite certain about who will win. But how large does the poll have to be for the standard error to be this small? One problem is that we do not know p, so we can't actually compute the standard error.

For illustrative purposes, let's assume that p is 0.51 and make a plot of the standard error versus a sample size N. Here it is. We can see that, obviously, it's dropping. From the plot we also see that we would need a poll of over 10,000 people to get the standard error as low as we want it to be.
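The plot described above can be reproduced, and the required poll size read off it. Here "as small as we want" is interpreted as a standard error of the spread (2 times the SE of X bar) of about one percentage point, an assumption made for illustration:

```{r}
# SE of X_bar versus sample size N, for assumed p = 0.51
p <- 0.51
N <- seq(100, 20000, by = 100)
se <- sqrt(p * (1 - p) / N)
plot(N, se, type = "l")
# smallest N for which the SE of the spread, 2*SE(X_bar), is at most 0.01
min(N[2 * se <= 0.01])  # 10000
```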

We rarely see polls of this size, due in part to cost. We'll give other reasons later. From the RealClearPolitics table we saw earlier, we learn that the sample sizes in opinion polls range from 500 to 3,500. For a sample size of 1,000, if we set p to be 0.51, the standard error is about 0.015, or 1.5 percentage points.

So even with large polls for close elections, X bar can lead us astray if we don't realize it's a random variable. But we can actually say more about how close we can get to the parameter p. We'll do that in the next video.
[][Textbook link]
This video corresponds to the textbook section on properties of our estimate.
https://rafalab.github.io/dsbook/inference.html#properties-of-our-estimate-expected-value-and-standard-error
[][Key points]
[][* When interpreting values of X_bar, it is important to remember that X_bar is a random variable representing the sample proportion; it has an expected value and a standard error. *]
The expected value of X_bar is the parameter of interest p. This follows from the fact that X_bar is the sum of independent draws of a random variable times a constant 1/N.
E(X_bar) = p
As the number of draws N increases, the standard error of our estimate X_bar decreases. The standard error of the sample average X_bar of N draws is:
SE(X_bar) = sqrt(p(1-p)/N)
In theory, we can get more accurate estimates of p by increasing N. In practice, there are limits on the size of N due to costs, as well as other factors we discuss later.
[][* We can also use other random variable equations to determine the expected value of the sum of draws E(S) and standard error of the sum of draws SE(S). *]
E(S) = Np
SE(S) = sqrt(Np(1-p))
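A Monte Carlo simulation corroborates both formulas (a sketch with an assumed p = 0.45 and N = 25):

```{r}
# Monte Carlo check of E(S) = Np and SE(S) = sqrt(Np(1-p))
set.seed(2)
N <- 25
p <- 0.45
S <- replicate(10000, sum(sample(c(0, 1), N, replace = TRUE, prob = c(1 - p, p))))
mean(S)  # close to N * p = 11.25
sd(S)    # close to sqrt(N * p * (1 - p)), about 2.49
```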
# Assessment 1.1: Parameters and Estimates
DataCamp due Jul 8, 2022 02:35 AWST
In this assessment, you will learn about parameters and estimates using the example of election polling.
By clicking OK, you agree to DataCamp's privacy policy: https://www.datacamp.com/privacy-policy. Note that you might need to disable your pop-up blocker, or allow "www.datacamp.com" in your pop-up blocker allowed list. When you have completed the exercises, return to edX to continue your learning.
Assessment 1.1: Parameters and Estimates (External resource) (7.0 points possible)
Ask your questions about parameters and estimates or the related DataCamp assessment here. Remember to search the discussion board before posting to see if someone else has asked the same thing before asking a new question! You're also encouraged to answer each other's questions to help further your own learning.
Some reminders:
Please be specific in the title and body of your post regarding which question you're asking about to facilitate answering your question.
Posting snippets of code is okay, but posting full code solutions is not.
If you do post snippets of code, please format it as code for readability. If you're not sure how to do this, there are instructions in a pinned post in the "general" discussion forum.
Discussion: Assessment 1.1
Topic: Section 1 / Assessment 1.1: Parameters and Estimates
Selected threads:
- (unanswered) DataCamp/edX linkage: finished the course exercises but can't see the grades; the same happens in sections 1, 2, 4, and 6.
- (unanswered) Why is X_bar "the sum of independent draws"? The textbook (16.2.4) says so, but shouldn't it by definition be the sum of independent draws divided by N (or multiplied by 1/N)?
- (unanswered) Could someone explain the question where we had to calculate the SE of the spread?
- (unanswered) Which lesson covered the expectation algebra for random variables? A couple of questions derive results from E[X(1-X)], including a key step where a constant drops out of the SE.
- (discussion) Issue with Exercise 6: the posted solution divides the upper limit of the y-axis by sqrt(25), which is not indicated or requested anywhere in the question.
- (answered) Why does subtracting 1 not affect the standard error? In Exercise 8 (standard error of d), a derivation that kept the constant -1 was marked incorrect.
- (discussion) Very nice introduction section! It defines the difference between actual probabilities and sampling estimates (e.g., polls), and shows how X_bar and its SE indicate whether a sample size is sufficient for a good estimate.
## Exercise 1. Polling - expected value of S
# ========================================================================================================================
Suppose you poll a population in which a proportion p of voters are Democrats and 1-p are Republicans. Your sample size is N=25. [][Consider the random variable S, which is the total number of Democrats in your sample.]
What is the expected value of this random variable S?
Instructions
50 XP
Possible Answers
E(S) = 25(1-p)
E(S) = 25p
E(S) = sqrt(25p(1-p))
E(S) = p
## Exercise 2. Polling - standard error of S
# ======================================================================================================================
Again, consider the random variable S, which is the total number of Democrats in your sample of 25 voters. [][The variable p describes the proportion of Democrats in the sample,] whereas 1-p describes the proportion of Republicans.
What is the standard error of S?
Instructions
50 XP
Possible Answers
SE(S) = 25p(1-p)
SE(S) = sqrt(25p)
SE(S) = 25(1-p)
SE(S) = sqrt(25*p(1-p))
## Exercise 3. Polling - expected value of X-bar
# =====================================================================================================================
Consider the random variable S/N, which is equivalent to the sample average that we have been denoting as X_bar. The variable N represents the sample size and p is the proportion of Democrats in the population.
What is the expected value of X_bar?
Instructions
50 XP
Possible Answers
E(X_bar) = p
E(X_bar) = Np
E(X_bar) = N(1-p)
E(X_bar) = 1-p
## Exercise 4. Polling - standard error of X-bar
# ========================================================================================================================
What is the standard error of the sample average, X_bar?
The variable N represents the sample size and p is the proportion of Democrats in the population.
Instructions
50 XP
Possible Answers
SE(X_bar) = sqrt(Np(1-p))
SE(X_bar) = sqrt(p(1-p)/N)
SE(X_bar) = sqrt(p(1-p))
SE(X_bar) = sqrt(N)
## Exercise 5. se versus p
Write a line of code that calculates the standard error se of a sample average when you poll 25 people in the population. Generate a sequence of 100 proportions of Democrats p that vary from 0 (no Democrats) to 1 (all Democrats).
Plot se versus p for the 100 different proportions.
Instructions
100 XP
Use the seq function to generate a vector of 100 values of p that range from 0 to 1.
Use the sqrt function to generate a vector of standard errors for all values of p.
Use the plot function to generate a plot with p on the x-axis and se on the y-axis.
```{r}
# `N` represents the number of people polled
N <- 25
# Create a variable `p` that contains 100 proportions ranging from 0 to 1 using the `seq` function
p <- seq(0, 1, 1/100)  # (incorrect attempt: this gives 101 values, not 100)
p
# ==============================================================================================================
# Create a variable `se` that contains the standard error of each sample average
se <- sqrt(N*p*(1-p))  # (incorrect attempt: this is the SE of the sum, not the average)
# Plot `p` on the x-axis and `se` on the y-axis
plot(p, se)
```
Incorrect submission
Check your call of seq(). Did you specify the argument length.out?
Incorrect submission
Use sqrt to calculate the standard error and save it as se. Make sure to specify the correct formula for standard error.
```{r}
# `N` represents the number of people polled
N <- 25
# Create a variable `p` that contains 100 proportions ranging from 0 to 1 using the `seq` function
p <- seq(0, 1, length.out=100)
# Create a variable `se` that contains the standard error of each sample average
se <- sqrt(p*(1-p)/N)
# Plot `p` on the x-axis and `se` on the y-axis
plot(p, se)
```
## Exercise 6. Multiple plots of se versus p
Using the same code as in the previous exercise, create a for-loop that generates three plots of p versus se when the sample sizes equal N = 25, N = 100, N = 1000.
Instructions
100 XP
Your for-loop should contain two lines of code to be repeated for three different values of N.
The first line within the for-loop should use the sqrt function to generate a vector of standard errors se for all values of p.
The second line within the for-loop should use the plot function to generate a plot with p on the x-axis and se on the y-axis.
Use the ylim argument to keep the y-axis limits constant across all three plots. The lower limit should be equal to 0 and the upper limit should equal 0.1 (it can be shown that this value is the highest calculated standard error across all values of p and N).
```{r}
# The vector `p` contains 100 proportions of Democrats ranging from 0 to 1 using the `seq` function
p <- seq(0, 1, length = 100)
# The vector `sample_sizes` contains the three sample sizes
sample_sizes <- c(25, 100, 1000)
# =======================================================================================================
# Write a for-loop that calculates the standard error `se` for every value of `p` for each of the three samples sizes `N` in the vector `sample_sizes`. Plot the three graphs, using the `ylim` argument to standardize the y-axis across all three plots.
se <- sqrt(p*(1-p))
plot(p, se, ylim=c(0, 0.1))
```
```{r}
# The vector `p` contains 100 proportions of Democrats ranging from 0 to 1 using the `seq` function
p <- seq(0, 1, length = 100)
# The vector `sample_sizes` contains the three sample sizes
sample_sizes <- c(25, 100, 1000)
# Write a for-loop that calculates the standard error `se` for every value of `p` for each of the three samples sizes `N` in the vector `sample_sizes`. Plot the three graphs, using the `ylim` argument to standardize the y-axis across all three plots.
se <- sqrt(p*(1-p)/sample_sizes)
plot(p, se, ylim=c(0, 0.1))
```
Incorrect submission
Make sure to write a for-loop using for.
```{r}
# The vector `p` contains 100 proportions of Democrats ranging from 0 to 1 using the `seq` function
p <- seq(0, 1, length = 100)
# The vector `sample_sizes` contains the three sample sizes
sample_sizes <- c(25, 100, 1000)
# Write a for-loop that calculates the standard error `se` for every value of `p` for each of the three samples sizes `N` in the vector `sample_sizes`. Plot the three graphs, using the `ylim` argument to standardize the y-axis across all three plots.
# =========================================================================================================================
for (N in sample_sizes) {
se <- sqrt(p*(1-p)/N)
plot(p, se, ylim=c(0, 0.1))
}
# for loop in R script ========================================================================
```
## Exercise 7. Expected value of d
# =======================================================================================================================
Our estimate for the difference in proportions of Democrats and Republicans is d = X_bar - (1-X_bar).
Which derivation correctly uses the rules we learned about sums of random variables and scaled random variables to derive the expected value of d?
Instructions
50 XP
Possible Answers
E[X_bar-(1-X_bar)] = E[2X_bar-1] = 2E[X_bar]-1 = N(2p-1) = Np-N(1-p)
E[X_bar-(1-X_bar)] = E[X_bar-1] = E[X_bar]-1 = p-1
E[X_bar-(1-X_bar)] = E[2X_bar-1] = 2sqrt(p(1-p))-1 = p-(1-p)
E[X_bar-(1-X_bar)] = E[2X_bar-1] = 2p-1 = p-(1-p) O
## Exercise 8. Standard error of d
# =======================================================================================================================
Our estimate for the difference in proportions of Democrats and Republicans is d = X_bar - (1-X_bar).
Which derivation correctly uses the rules we learned about sums of random variables and scaled random variables to derive the standard error of d?
Instructions
50 XP
Possible Answers
SE[X_bar-(1-X_bar)] = SE[2X_bar-1] = 2SE[X_bar] = 2sqrt(p/N)
SE[X_bar-(1-X_bar)] = SE[2X_bar-1] = 2SE[X_bar-1] = 2sqrt(p(1-p)/N)-1
SE[X_bar-(1-X_bar)] = SE[2X_bar-1] = 2SE[X_bar] = 2sqrt(p(1-p)/N) O
SE[X_bar-(1-X_bar)] = SE[X_bar-1] = SE[X_bar] = sqrt(p(1-p)/N)
Incorrect submission
[][Try again. Subtracting 1 does not affect the standard error. ]
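Both the expected-value and standard-error derivations can be corroborated by simulation (a sketch with p = 0.45 and N = 25 assumed): subtracting 1 shifts the distribution of d but leaves its spread unchanged.

```{r}
# Monte Carlo check: E(d) = 2p - 1 and SE(d) = 2 * sqrt(p(1-p)/N)
set.seed(4)
N <- 25
p <- 0.45
x_bar <- replicate(100000, mean(sample(c(0, 1), N, replace = TRUE, prob = c(1 - p, p))))
d <- 2 * x_bar - 1   # same as X_bar - (1 - X_bar)
mean(d)              # close to 2p - 1 = -0.1
sd(d)                # close to 2 * sqrt(p * (1 - p) / N), about 0.199
```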
## Exercise 9. Standard error of the spread
# =======================================================================================================================
Say the actual proportion of Democratic voters is p=0.45. In this case, the Republican party is winning by a relatively large margin of d=-0.1, or a 10% margin of victory. What is the standard error of the spread 2X_bar-1 in this case?
Instructions
100 XP
Use the sqrt function to calculate the standard error of the spread 2X_bar-1.
```{r}
# `N` represents the number of people polled
N <- 25
# `p` represents the proportion of Democratic voters
p <- 0.45
# =========================================================================================================================
# Calculate the standard error of the spread. Print this value to the console.
2*sqrt(p*(1-p)/N)
```
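Comparing this standard error with the spread itself (same assumed p and N) previews the next exercise:

```{r}
N <- 25
p <- 0.45
d <- 2 * p - 1                     # true spread: -0.1
se_d <- 2 * sqrt(p * (1 - p) / N)  # about 0.199
abs(d) < se_d                      # TRUE: with N = 25 the SE exceeds the spread
```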
## Exercise 10. Sample size
# =====================================================================================================================
[][So far we have said that the difference between the proportion of Democratic voters and Republican voters is about 10% and that the standard error of this spread is about 0.2 when N=25. Select the statement that explains why this sample size is sufficient or not.]
Instructions
50 XP
Possible Answers
This sample size is sufficient because the expected value of our estimate 2X_bar-1 is d so our prediction will be right on.
[][* This sample size is too small because the standard error is larger than the spread. O*]
This sample size is sufficient because the standard error of about 0.2 is much smaller than the spread of 10%.
Without knowing p, we have no way of knowing that increasing our sample size would actually improve our standard error.
# =======================================================================================================
# ========================================================= The "spread" mentioned here and above is d = 2p - 1, the difference between the Democratic and Republican proportions
## End of Assessment
This is the end of the programming assignment for this section. Please DO NOT click through to additional assessments from this page. If you do click through, your scores may NOT be recorded.
Click "Got it!" and submit to get the "points" for this question.
You can close this window and return to Data Science: Inference.
## Course / Section 2: The Central Limit Theorem in Practice / Section 2 Overview
# Section 2 Overview
In Section 2, you will look at the Central Limit Theorem in practice.
After completing Section 2, you will be able to:
Use the Central Limit Theorem to calculate the probability that a sample estimate X_bar is close to the population proportion p.
Run a Monte Carlo simulation to corroborate theoretical results built using probability theory.
Estimate the spread based on estimates of X_bar and SE_hat(X_bar).
Understand why bias can mean that larger sample sizes aren't necessarily better.
There is 1 assignment that uses the DataCamp platform for you to practice your coding skills.
We encourage you to use R to interactively test out your answers and further your learning.
## Course / Section 2: The Central Limit Theorem in Practice / Central Limit Theorem in Practice
# The Central Limit Theorem in Practice




**The central limit theorem tells us that the distribution function for a sum of draws is approximately normal. We also learned that when dividing a normally distributed random variable by a nonrandom constant, the resulting random variable is also normally distributed. This implies that the distribution of X-bar is approximately normal**.


So in summary, we have that X-bar has an approximately normal distribution. And in a previous video, we determined that the expected value is p, and the standard error is the square root of p times 1 minus p divided by the sample size N. Now, how does this help us? Let's ask an example question.




*Suppose we want to know what is the probability that we are within one percentage point from p--that we made a very, very good estimate*? So we're basically asking, what's the probability that the distance between X-bar and p, the absolute value of X-bar minus p, is less than 0.01, 1 percentage point. We can use what we've learned to see that this is the same as asking, what is the probability of X-bar being less than or equal to p plus 0.01 minus the probability of X-bar being less than or equal to p minus 0.01. Now, can we answer the question now? Can we compute that probability?


Note that we can use the mathematical trick that we learned in the previous module. What was that trick? [][*We subtract the expected value and divide by the standard error on both sides of the equation*]. What this does is it gives us a standard normal variable, which we have been calling capital Z, on the left side. And we know how to make calculations for that.



Since p is the expected value, and the standard error of X-bar is the square root of p times 1 minus p divided by N, we get that the probability that we were just calculating is equivalent to probability of Z, our standard normal variable, being less than 0.01 divided by the standard error of X-bar minus the probability of Z being less than negative 0.01 divided by that standard error of X-bar. *OK, now can we compute this probability?* Not yet. Our problem is that we don't know p. So we can't actually compute the standard error of X-bar using just the data. **But it turns out--and this is something new we're showing you--that the CLT still works if we use an estimate of the standard error that, instead of p, uses X-bar in its place**. We call this a [][*plug-in estimate*]. Our estimate of the standard error is therefore the square root of X-bar times 1 minus X-bar divided by N. Notice, we changed the p for the X-bar. In the mathematical formula we're showing you, you can see a hat on top of the SE.


[][**In statistics textbooks, we use a little hat like this to denote estimates**]. This is an estimate of the standard error, not the actual standard error. But like we said, the central limit theorem still works. Note that, importantly, that this estimate can actually be constructed using the observed data. Now, let's continue our calculations. But now *instead of dividing by the standard error, we're going to divide by this estimate of the standard error*. Let's compute this estimate of the standard error for the first sample that we took, in which we had 12 blue beads and 13 red beads. In that case, X-bar was 0.48.


So to compute the standard error, we simply write this code. And we get that it's about 0.1. So now, we can answer the question. Now, we can compute the probability of being as close to p as we wanted. We wanted to be 1 percentage point away. The answer is simply pnorm of 0.01--that's 1 percentage point--divided by this estimated se minus pnorm of negative 0.01 divided by the estimated se. We plug that into R, and we get the answer. The answer is that the probability of this happening is about 8%. So there is a very small chance that we'll be as close as this to the actual proportion.

Now, that wasn't very useful, but what it's going to do, what we're going to be able to do with the central limit theorem is determine what sample sizes are better. And once we have those larger sample sizes, we'll be able to provide a very good estimate and some very informative probabilities.
[][Textbook link]
This video corresponds to the textbook section on the Central Limit Theorem in practice.
https://rafalab.github.io/dsbook/inference.html#clt
[][Key points]
[][* Because X_bar is the sum of random draws divided by a constant, the distribution of X_bar is approximately normal. *]
We can convert X_bar to a standard normal random variable Z (standardizing lets us use the known properties of the standard normal distribution to compute probabilities):
Z = (X_bar - E(X_bar))/SE(X_bar)
[][* The probability that X_bar is within .01 of the actual value of p is: *]
Pr(Z <= 0.01/sqrt(p(1-p)/N)) - Pr(Z <= -0.01/sqrt(p(1-p)/N))
The Central Limit Theorem (CLT) still works if X_bar is used in place of p. This is called a plug-in estimate. Hats over values denote estimates. Therefore:
SE_hat(X_bar) = sqrt(X_bar(1-X_bar)/N)
Using the CLT, the probability that X_bar is within .01 of the actual value of p is:
[][ Pr(Z <= 0.01/sqrt(X_bar(1-X_bar)/N)) - Pr(Z <= -0.01/sqrt(X_bar(1-X_bar)/N)) ]
Code: Computing the probability of X_bar being within .01 of p
X_hat <- 0.48
se <- sqrt(X_hat*(1-X_hat)/25)
pnorm(0.01/se) - pnorm(-0.01/se)
# Margin of Error

So a poll of only 25 people is not really very useful, at least for a close election. *Earlier we mentioned the margin of error. Now we can define it because it is simply 2 times the standard error, which we can now estimate*. In our case it was 2 times se, which is about 0.2. [][Why do we multiply by 2]?

This is because if you ask what is the probability that we're within 2 standard errors from p, using the same previous equations, we end up with an equation like this one. This one simplifies out, and [][**we're simply asking what is the probability that a standard normal random variable, with expected value 0 and standard error 1, is within 2 of 0, and we know that this is about 95%**]. So there's a 95% chance that X-bar will be within 2 standard errors, the margin of error in our case, of p.

Now why do we use 95%? This is somewhat arbitrary. But traditionally, that's what's been used. It's the most common value used to define margins of error. In summary, the central limit theorem tells us that our poll based on a sample of just 25 is not very useful. *We don't really learn much when the margin of error is this large* {Think: we have to [][compare the margin of error with the expected value]; recall the lm() model output, where estimates are judged against their standard errors}. All we can really say is that the popular vote will not be won by a large margin.

This is why pollsters tend to use larger sample sizes. From the table that we showed earlier from RealClearPolitics, we saw that a typical sample size was between 700 and 3,500.

To see how this gives us a much more practical result, note that *if we had obtained an X-bar of 0.48, but with a sample size of 2,000, the estimated standard error would have been about 0.01* (sqrt(0.48*(1-0.48)/2000)). So our result is an estimate of 48% blue beads with a margin of error of 2%. In this case, the result is much more informative and would make us think that there are more red beads than blue beads. But keep in mind, this is just hypothetical. We did not take a poll of 2,000 beads since we don't want to ruin the competition.


>>> p = 0.48
>>> N = 2000
>>> import math
>>> se = math.sqrt(p*(1-p)/N)
>>> se
0.011171392035015153
>>>
[][Textbook link]
The margin of error is discussed within the textbook section on the Central Limit Theorem in practice.
https://rafalab.github.io/dsbook/inference.html#clt
[][Key points]
The margin of error is defined as 2 times the standard error of the estimate X_bar.
There is about a 95% chance that X_bar will be within two standard errors of the actual parameter p.
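These key points can be sketched in R using the first-sample values from the transcript above (X_hat = 0.48, N = 25):

```{r}
# Margin of error is 2 times the estimated standard error
X_hat <- 0.48
se <- sqrt(X_hat*(1-X_hat)/25)
2*se                   # margin of error, about 0.2
# Probability that a standard normal falls within 2 of 0: about 95%
pnorm(2) - pnorm(-2)
```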
# A Monte Carlo Simulation for the CLT
Suppose we want to use a Monte Carlo simulation to corroborate that the tools we've been using to build estimates and margins of error with probability theory actually work. To create the simulation, we would need to write code like this. We would simply [][*write the urn model, use replicate to construct a Monte Carlo simulation*]. The problem is, of course, that we don't know p. We can't run the code we just showed you because we don't know what p is.

However, we could construct an urn like the one we showed in a previous video and actually run an analog simulation. It would take a long time because you would be picking beads and counting them, but you could take 10,000 samples, count the beads each time, and keep track of the proportions that you see. We can use the function `take_poll` with N of 1,000 instead of actually drawing from an urn, but it would still take time because you would have to count the beads and enter the results into R. So one thing we can do to corroborate theoretical results is to pick a value of p, or several values of p, and then run simulations using those.

As an example, let's set p to 0.45. *We can simulate one poll of 1,000 beads or people using this simple code*. Now we can take that into a Monte Carlo simulation. Do it 10,000 times, each time returning the proportion of blue beads that we get in our sample.

>>> math.sqrt(0.45*(1-0.45)/1000)
0.015732132722552274
To review, the theory tells us that X-bar has an approximately normal distribution with expected value 0.45 and a standard error of about 1.5%. The simulation confirms this. If we take the mean of the X-hats that we created, we indeed get a value of about 0.45. And if we compute the sd of the values that we just created, we get a value of about 1.5%.


A histogram and a qq plot of this X-hat data confirms that the normal approximation is accurate as well. Again, note that in real life, we would never be able to run such an experiment because we don't know p. But we could run it for various values of p and sample sizes N and see that the theory does indeed work well for most values. You can easily do this yourself by rerunning the code we showed you after changing p and N.
[][Textbook link]
This video corresponds to the textbook section on a Monte Carlo simulation for the CLT.
https://rafalab.github.io/dsbook/inference.html#a-monte-carlo-simulation
[][Key points]
We can run Monte Carlo simulations to compare with theoretical results assuming a value of p.
In practice, p is unknown. We can corroborate theoretical results by running Monte Carlo simulations with one or several values of p.
[][* One practical choice for p when modeling is X_hat, the observed sample proportion. *]
Code: Monte Carlo simulation using a set value of p
p <- 0.45 # unknown p to estimate
N <- 1000
# simulate one poll of size N and determine x_hat
x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
x_hat <- mean(x)
# simulate B polls of size N and determine average x_hat
B <- 10000 # number of replicates
N <- 1000 # sample size per replicate
x_hat <- replicate(B, {
x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
mean(x)
})
Code: Histogram and QQ-plot of Monte Carlo results
library(tidyverse)
library(gridExtra)
p1 <- data.frame(x_hat = x_hat) %>%
ggplot(aes(x_hat)) +
geom_histogram(binwidth = 0.005, color = "black")
p2 <- data.frame(x_hat = x_hat) %>%
ggplot(aes(sample = x_hat)) +
stat_qq(dparams = list(mean = mean(x_hat), sd = sd(x_hat))) +
geom_abline() +
ylab("X_hat") +
xlab("Theoretical normal")
grid.arrange(p1, p2, nrow=1)
```{r}
B <- 10000
N <- 1000
p <- 0.48 # in practice we do not know p, the population parameter; here we fix a value to corroborate the theory
X_hat <- replicate(B, {
X <- sample(c(0, 1), size=N, replace=T, prob=c(1-p, p))
mean(X)
})
mean(X_hat)
sd(X_hat) # should be close to the theoretical SE, sqrt(p*(1-p)/N), about 0.0158
```
```{r}
library(gridExtra)
library(tidyverse)
p1 <- data.frame(X_hat=X_hat) %>%
ggplot(aes(X_hat)) +
geom_histogram(bins=30, color="black")
# ======================================================================================================================
# ======================================================================================================================
p2 <- data.frame(X_hat=X_hat) %>%
ggplot(aes(sample=X_hat)) +
stat_qq(dparams=list(mean=mean(X_hat), sd=sd(X_hat))) +
geom_abline() +
ylab("X_hat") +
xlab("Theoretical normal")
grid.arrange(p1, p2, nrow=1)
```
# The Spread
[][*The competition is to predict the spread, not the proportion p*]. However, because we are assuming there are only two parties, we know that the spread is just p minus (1 minus p), which is equal to 2p minus 1.

So everything we have done can easily be adapted to estimate 2p minus 1. Once we have our estimate, X-bar, and our estimate of the standard error of X-bar, we estimate the spread by 2 times X-bar minus 1, *just plugging in the X-bar where you should have a p*.


***And, since we're multiplying a random variable by 2, we know that the standard error is also multiplied by 2***. So the standard error of this new random variable is 2 times the standard error of X-bar [][2*sqrt(p*(1-p)/N)]. Note that subtracting the 1 does not add any variability, so it does not affect the standard error.

[][*Earlier we mentioned the margin of error. Now we can define it because it is simply 2 times the standard error*]



So, for our first example, with just the 25 beads, our estimate of p was 0.48 with a margin of error of 0.2. *This means that our estimate of the spread is -4 percentage points, -0.04, with a margin of error of 40%, 0.4*. Again, not a very useful sample size. But the point is that once we have an estimate and standard error for p, we have it for the spread 2p minus 1.
[][Textbook link]
This video corresponds to the textbook section on the spread.
https://rafalab.github.io/dsbook/inference.html#the-spread
[][Key points]
The spread between two outcomes with probabilities p and 1-p is 2p-1.
The estimate of the spread is 2X_bar-1, which has expected value 2p-1.
[][* The standard error of the spread is 2SE_hat(X_bar). *]
[][* The margin of error of the spread is 2 times the margin of error of X_bar. *]
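Putting these key points together for the 25-bead example (X_hat = 0.48 from the first sample in the transcript):

```{r}
# Spread estimate and its margin of error from X_hat
X_hat <- 0.48
N <- 25
spread_hat <- 2*X_hat - 1                 # about -0.04
se_spread  <- 2*sqrt(X_hat*(1-X_hat)/N)   # about 0.2
moe_spread <- 2*se_spread                 # about 0.4
c(spread_hat, se_spread, moe_spread)
```

The margin of error (0.4) dwarfs the spread estimate (-0.04), which is exactly why N = 25 is not useful.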
# Bias: Why Not Run a Very Large Poll?
Note that for realistic values of p, say between 0.35 and 0.65 for the popular vote, if we run a very large poll with say 100,000 people, theory would tell us that we would predict the election almost perfectly, since *the largest possible margin of error is about 0.3%*. Here are the calculations that were used to determine that. We can see a graph showing us the standard error for several values of p if we fix N to be 100,000.


So why are there no pollsters that are conducting polls this large? One reason is that running polls with a sample size of 100,000 is very expensive. But [][***perhaps a more important reason is that theory has its limitations***]. Polling is much more complicated than picking beads from an urn. For example, while the beads are either red or blue, and you can see it with your eyes, people, when you ask them, might lie to you. Also, because you're conducting these polls usually by phone, you might miss people that don't have phones, and they might vote differently than those that do. But *perhaps the most important way an actual poll differs from our urn model is that we actually don't know for sure who is in our population and who is not*.
How do we know who is going to vote? Are we reaching all possible voters? So, even if our margin of error is very small, it may not be exactly right that our expected value is p. [][*We call this bias*]. Historically, we observe that polls are, indeed, biased, although not by that much. The typical bias appears to be between 1% and 2%. This makes election forecasting a bit more interesting. And we'll talk about that in a later video.
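A sketch of why a larger N does not fix bias (the 2% bias value here is hypothetical, chosen from the 1-2% range mentioned above): if every poll systematically overcounts one party, increasing N shrinks the standard error, but the estimate stays centered at p + bias rather than p.

```{r}
# Hypothetical illustration: sampling with a systematic 2% bias
set.seed(1)
p <- 0.45       # true proportion
bias <- 0.02    # assumed systematic bias
B <- 2000
for (N in c(25, 1000, 10000)) {
  X_hat <- replicate(B, mean(sample(c(1, 0), size = N, replace = TRUE,
                                    prob = c(p + bias, 1 - p - bias))))
  cat(sprintf("N = %5d: mean error = %+.3f, SE = %.4f\n",
              N, mean(X_hat) - p, sd(X_hat)))
}
```

The SE column shrinks roughly like 1/sqrt(N), but the mean error stays near the bias of 0.02 no matter how large N gets.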
[][Textbook link]
This video corresponds to the textbook section on bias.
https://rafalab.github.io/dsbook/inference.html#bias-why-not-run-a-very-large-poll
[][Key points]
An extremely large poll would theoretically be able to predict election results almost perfectly.
These sample sizes are not practical. In addition to cost concerns, polling doesn't reach everyone in the population (eventual voters) with equal probability, and it also may include data from outside our population (people who will not end up voting).
These systematic errors in polling are called bias. We will learn more about bias in the future.
Code: Plotting margin of error in an extremely large poll over a range of values of p
library(tidyverse)
N <- 100000
p <- seq(0.35, 0.65, length = 100)
SE <- sapply(p, function(x) 2*sqrt(x*(1-x)/N))
data.frame(p = p, SE = SE) %>%
ggplot(aes(p, SE)) +
geom_line()
```{r}
library(tidyverse)
N <- 100000
p <- seq(0.35, 0.65, length=100)
SE <- sapply(p, function(x) 2*sqrt(x*(1-x)/N))
data.frame(p=p, SE=SE) %>%
ggplot(aes(p, SE)) +
geom_line()
```
# Assessment 2.1: Introduction to Inference
DataCamp due Jul 14, 2022 07:55 AWST
In this assessment, you will learn about the central limit theorem in practice.
By clicking OK, you agree to DataCamp's privacy policy: https://www.datacamp.com/privacy-policy. Note that you might need to disable your pop-up blocker, or allow "www.datacamp.com" in your pop-up blocker allowed list. When you have completed the exercises, return to edX to continue your learning.
Assessment 2.1: Introduction to Inference (External resource) (12.5 points possible)
By clicking OK, you agree to DataCamp's privacy policy: https://www.datacamp.com/privacy-policy.
Ask your questions about the central limit theorem for inference or the related DataCamp assessment here. Remember to search the discussion board before posting to see if someone else has asked the same thing before asking a new question! You're also encouraged to answer each other's questions to help further your own learning.
Some reminders:
Please be specific in the title and body of your post regarding which question you're asking about to facilitate answering your question.
Posting snippets of code is okay, but posting full code solutions is not.
If you do post snippets of code, please format it as code for readability. If you're not sure how to do this, there are instructions in a pinned post in the "general" discussion forum.
## Exercise 1. Sample average
Write a function called take_sample that takes the proportion of Democrats p and the sample size N as arguments and returns the sample average of Democrats (1) and Republicans (0).
Calculate the sample average if the proportion of Democrats equals 0.45 and the sample size is 100.
Instructions
100 XP
Define a function called take_sample that takes p and N as arguments.
Use the sample function as the first statement in your function to sample N elements from a vector of options where Democrats are assigned the value '1' and Republicans are assigned the value '0' in that order.
Use the mean function as the second statement in your function to find the average value of the random sample.
```{r}
# Write a function called `take_sample` that takes `p` and `N` as arguments and returns the average value of a randomly sampled population.
take_sample <- function(p, N) mean(sample(c(1, 0), size=N, prob=c(p, 1-p), replace=TRUE))
# Calculate the sample average with `p` = 0.45 and `N` = 100
take_sample(0.45, 100)
# ===========================================================================================================================