HarvardX PH125.4x -- Data Science Inference and Modeling.Rmd
---
title: 'HarvardX PH125.4x Data Science: Inference and Modeling'
author: "John HHU"
date: '2022-07-02'
output: html_document
---
## Course / Section 1: Parameters and Estimates / Parameters and Estimates
# Sampling Model Parameters and Estimates
To help us understand the connection between polls and the probability theory that we have learned, let's construct a scenario that we can work through together and that is similar to the one that pollsters face. We will use an urn instead of voters. And because pollsters are competing with other pollsters for media attention, we will imitate that by having our competition with a $25 prize. The challenge is to guess the spread between the proportion of blue and red balls in this urn. Before making a prediction, you can take a sample, with replacement, from the urn.

To mimic the fact that running polls is expensive, it will cost you $0.10 per bead you sample. So if your sample size is 250 and you win, you'll break even, as you'll have to pay me $25 to collect your $25.

Your entry into the competition can be an interval. If the interval you submit contains the true proportion, you get half what you paid and pass to the second phase of the competition. In the second phase of the competition, the entry with the smallest interval is selected as the winner.


The dslabs package includes a function that shows a random draw from the urn that we just saw. Here's the code that you can write to see a sample. And here is a sample with 25 beads. OK, now that you know the rules, think about how you would construct your interval. How many beads would you sample, et cetera? Notice that we have just described a simple sampling model for opinion polls.



The beads inside the urn represent the individuals that will vote on election day. Those that will vote Republican are represented with red beads and the Democrats with blue beads. For simplicity, assume there are no other colors, that there are just two parties. ***We want to predict the proportion of blue beads in the urn***. Let's call this quantity p, which in turn tells us the proportion of red beads, 1 minus p, and the spread, p minus (1 minus p), which simplifies to 2p minus 1.



In statistical textbooks, *the beads in the urn are called the population*. *The proportion of blue beads in the population, p, is called a parameter*. *The 25 beads that we saw in an earlier plot after we sampled, that's called a sample*. [][**The task of statistical inference is to predict the parameter, p, using the observed data in the sample**].

Now, can we do this with just the 25 observations we showed you? Well, they are certainly informative. For example, given that we see 13 red and 12 blue, it is unlikely that p is bigger than 0.9 or smaller than 0.1, because if it were, it would be improbable to see 13 red and 12 blue. But are we ready to predict with certainty that there are more red beads than blue?
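This intuition can be checked directly with the binomial distribution (a quick sketch, not part of the course code): if p were really 0.9 or 0.1, an outcome like 12 blue beads out of 25 would be extraordinarily unlikely.

```{r}
# Probability of observing 12 or fewer blue beads in 25 draws if p = 0.9
pbinom(12, size = 25, prob = 0.9)
# Compare with p = 0.1: observing 13 or more blue beads is just as unlikely
1 - pbinom(12, size = 25, prob = 0.1)
```

Both probabilities are vanishingly small, which is why such extreme values of p can be ruled out even from a sample of 25.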

OK, **what we want to do is construct an estimate of p using only the information we observe**. An estimate can be thought of as a summary of the observed data that we think is informative about the parameter of interest. It seems intuitive to think that the proportion of blue beads in the sample, which in this case is 0.48, must be at least related to the actual proportion p. But do we simply predict p to be 0.48?

[][**First, note that the sample proportion is a random variable**]. If we run the command take_poll(25), say four times, we get four different answers. Each time the sample is different and the sample proportion is different. The sample proportion is a random variable. Note that in the four random samples we show, the sample proportion ranges from 0.44 to 0.6. **By describing the distribution of this random variable, we'll be able to gain insights into how good this estimate is and how we can make it better**.
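This variability is easy to see by simulation (a sketch: `take_poll()` draws from the fixed dslabs urn and plots the result, so here we instead sample directly from an urn with an assumed p = 0.45):

```{r}
# Four simulated polls of 25 beads each from an urn with assumed p = 0.45
set.seed(1)
p <- 0.45
x_bar <- replicate(4, mean(sample(c(0, 1), 25, replace = TRUE, prob = c(1 - p, p))))
x_bar  # four different sample proportions: X_bar is a random variable
```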
[][Textbook link]
This video matches the textbook sections on the sampling model for polls and the first part of populations, samples, parameters and estimates.
https://rafalab.github.io/dsbook/inference.html#the-sampling-model-for-polls
https://rafalab.github.io/dsbook/inference.html#populations-samples-parameters-and-estimates
[][Key points]
[][* The task of statistical inference is to estimate an unknown population parameter using observed data from a sample. *]
In a sampling model, the collection of elements in the urn is called the population.
[][* A parameter is a number that summarizes data for an entire population. *]
A sample is observed data from a subset of the population.
An estimate is a summary of the observed data about a parameter that we believe is informative. It is a data-driven guess of the population parameter.
We want to predict the proportion of the blue beads in the urn, the parameter p . The proportion of red beads in the urn is 1 - p and the spread is 2p - 1.
The sample proportion is a random variable. Sampling gives random results drawn from the population distribution.
Code: Function for taking a random draw from a specific urn
The dslabs package includes a function for taking a random draw of size n from the urn described in the video:
```{r}
# take a random draw of 25 beads from the urn
library(tidyverse)
library(dslabs)
take_poll(25)  # draw 25 beads
```
# The Sample Average
[][*Taking an opinion poll is being modeled as taking a random sample from an urn*]. **We are proposing the use of the proportion of blue beads in our sample as an estimate of the parameter p**. Once we have this estimate, we can easily report an estimate of the spread, 2p minus 1.

But for simplicity, we will illustrate the concept of statistical inference for estimating p. [][We will use our knowledge of probability to defend our use of the sample proportion, and quantify how close we think it is from the population proportion p]. *We start by defining the random variable X*. X is going to be 1 if we pick a blue bead at random, and 0 if it's red.


[][*This implies that we're assuming that the population, the beads in the urn, are a list of 0s and 1s*]. If we sample N beads, then the average of the draws X_1 through X_N is equivalent to the proportion of blue beads in our sample. *This is because adding the Xs is equivalent to counting the blue beads, and dividing by the total N turns this into a proportion*. We use the symbol X-bar to represent this average. In general, in statistics textbooks, a bar on top of a symbol means the average.


[][***The theory we just learned about the sum of draws becomes useful, because we know the distribution of the sum N times X-bar, and therefore the distribution of the average X-bar, because N is a non-random constant***]. For simplicity, let's assume that the draws are independent: after we see each sampled bead, we return it to the urn. It's a sample with replacement. *In this case, what do we know about the distribution of the sum of draws?* First, we know that the expected value of the sum of draws is N times the average of the values in the urn. We know that the average of the 0s and 1s in the urn must be the proportion p, the value we want to estimate.

Here, we encounter an important difference with what we did in the probability module. We don't know what is in the urn. We know there are blue and red beads, but we don't know how many of each. This is what we're trying to find out. We're trying to estimate p. *Just like we use variables to define unknowns in systems of equations, in statistical inference, we define parameters to define unknown parts of our models*. In the urn model we are using to mimic an opinion poll, we do not know the proportion of blue beads in the urn. We define the parameter p to represent this quantity. We are going to estimate this parameter.

Note that the ideas presented here, on how we estimate parameters and provide insights into how good these estimates are, extrapolate to many data science tasks. For example, we may ask, [][what is the difference in health improvement between patients receiving treatment and a control group]? We may ask, what are the health effects of smoking on a population? What are the differences in racial groups of fatal shootings by police? What is the rate of change in life expectancy in the US during the last 10 years? All these questions can be framed as a task of estimating a parameter from a sample.



[][Textbook ]
This video matches the textbook section on the sample average and the textbook section on parameters.
https://rafalab.github.io/dsbook/inference.html#the-sample-average
https://rafalab.github.io/dsbook/inference.html#parameters
[][Key points]
[][ Many common data science tasks can be framed as estimating a parameter from a sample. ]
We illustrate statistical inference by walking through the process to estimate p. From the estimate of p, we can easily calculate an estimate of the spread, 2p-1.
[][* Consider the random variable X that is 1 if a blue bead is chosen and 0 if a red bead is chosen. The proportion of blue beads in the draws is the average of the draws X_1, ..., X_N. *]
X_bar is the sample average. In statistics, a bar on top of a symbol denotes the average. X_bar is a random variable because it is the average of random draws: each time we take a sample, X_bar is different. (X_1, X_2, ..., X_N are the individual draws, each 1 or 0.)
X_bar = (X_1 + X_2 + ... + X_N)/N
[][* The expected value of the number of blue beads drawn in N draws, N X_bar, is N times the proportion of blue beads in the urn, Np. However, we do not know the true proportion: we are trying to estimate this parameter p.]
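The identity between the sample average and the sample proportion is easy to confirm on a toy vector of 0/1 draws:

```{r}
# The sample average of 0/1 draws is exactly the sample proportion
x <- c(1, 0, 1, 1, 0)   # five draws: 1 = blue bead, 0 = red bead
sum(x)                  # counts the blue beads: 3
mean(x)                 # X_bar, the sample proportion: 0.6
```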
```{python}
# Aside: tallying occurrences of each distinct element in a list
list_a = ["a", "a", "a", "b", "b", "c", "d", "d", "d", "c", "e", "f", "f"]
list_a_dist = list(set(list_a))                  # the distinct elements
num = {i: list_a.count(i) for i in list_a_dist}  # count each element
print(num)  # e.g. {'d': 3, 'c': 2, 'b': 2, 'f': 2, 'a': 3, 'e': 1}
```
# Polling versus Forecasting
Before we continue, let's make an important clarification related to the practical problem of forecasting the election. If a poll is conducted 4 months before the election, it is estimating the p for that moment, not for election day.


But, note that the p for election night might be different since people's opinions fluctuate through time. The polls provided the night before the election tend to be the most accurate since opinions don't change that much in a couple of days. *However, forecasters try to build tools that model how opinions vary across time and try to predict the election day result, taking into consideration the fact that opinions fluctuate*. We'll describe some approaches for doing this in a later section.
[][Textbook link]
This video corresponds to the textbook section on polling versus forecasting.
https://rafalab.github.io/dsbook/inference.html#polling-versus-forecasting
[][Key points]
A poll taken in advance of an election estimates p for that moment, not for election day.
In order to predict election results, forecasters try to use early estimates of p to predict p on election day. We discuss some approaches in later sections.
# Properties of Our Estimate
*To understand how good our estimate is, we'll describe the statistical properties of the random variable we just defined, the sample proportion*.

Note that if we multiply by N, [][**N times X bar is the sum of independent draws, so the rules we covered in the probability module apply**]. Using what we have learned, the expected value of the sum N times X bar is N times the average of the urn, p. So dividing by the non-random constant N gives us that the expected value of the average X bar is p. {*X bar is the random variable here; its expected value is the fixed, unknown parameter p*}



We can write this using our mathematical notation as E(X_bar) = p. We can also use what we learned to figure out the standard error. [][***We know that the standard error of the sum is the square root of N times the standard deviation of the values in the urn***]. Can we compute the standard deviation of the urn? We learned a formula that tells us it is (1 - 0) times the square root of p times (1 - p), which is simply the square root of p(1 - p).

Because we are dividing the sum by N, we arrive at the following formula for the standard error of the average. [][***The standard error of the average is the square root of p times (1 - p), divided by the square root of N***]. {*because we multiply by the square root of N and then divide by N, we end up dividing by the square root of N*}
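Plugging in numbers makes the formula concrete (a sketch; p = 0.51 is assumed here, since in practice the true value is unknown):

```{r}
# SE of the sample proportion for an assumed p = 0.51 at two poll sizes
p <- 0.51
sqrt(p * (1 - p) / 25)    # N = 25: about 0.1
sqrt(p * (1 - p) / 1000)  # N = 1000: about 0.016
```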

This result reveals the power of polls. *The expected value of the sample proportion X bar is the parameter of interest p*, and we can make the standard error as small as we want by increasing the sample size N. The law of large numbers tells us that with a large enough poll, our estimate converges to p. If we take a large enough poll to make our standard error, say, about 0.01, we'll be quite certain about who will win. But how large does the poll have to be for the standard error to be this small? One problem is that we do not know p, so we can't actually compute the standard error.

For illustrative purposes, let's assume that p is 0.51 and make a plot of the standard error versus a sample size N. Here it is. We can see that, obviously, it's dropping. From the plot we also see that we would need a poll of over 10,000 people to get the standard error as low as we want it to be.
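The plot described above can be reproduced, and the required poll size read off it. Here "as small as we want" is interpreted as a standard error of the spread (2 times the SE of X bar) of about one percentage point, an assumption made for illustration:

```{r}
# SE of X_bar versus sample size N, for assumed p = 0.51
p <- 0.51
N <- seq(100, 20000, by = 100)
se <- sqrt(p * (1 - p) / N)
plot(N, se, type = "l")
# smallest N for which the SE of the spread, 2*SE(X_bar), is at most 0.01
min(N[2 * se <= 0.01])  # 10000
```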

We rarely see polls of this size, due in part to cost. We'll give other reasons later. From the RealClearPolitics table we saw earlier, we learn that the sample sizes in opinion polls range from 500 to 3,500. For a sample size of 1,000, if we set p to be 0.51, the standard error is about 0.015, or 1.5 percentage points.

So even with large polls for close elections, X bar can lead us astray if we don't realize it's a random variable. But we can actually say more about how close we can get to the parameter p. We'll do that in the next video.
[][Textbook link]
This video corresponds to the textbook section on properties of our estimate.
https://rafalab.github.io/dsbook/inference.html#properties-of-our-estimate-expected-value-and-standard-error
[][Key points]
[][* When interpreting values of X_bar, it is important to remember that X_bar is a random variable representing the sample proportion; it has an expected value and a standard error. *]
The expected value of X_bar is the parameter of interest p. This follows from the fact that X_bar is the sum of independent draws of a random variable times a constant 1/N.
E(X_bar) = p
As the number of draws N increases, the standard error of our estimate X_bar decreases. The standard error of the sample average X_bar of N draws is:
SE(X_bar) = sqrt(p(1-p)/N)
In theory, we can get more accurate estimates of p by increasing N. In practice, there are limits on the size of N due to costs, as well as other factors we discuss later.
[][* We can also use other random variable equations to determine the expected value of the sum of draws E(S) and standard error of the sum of draws SE(S). *]
E(S) = Np
SE(S) = sqrt(Np(1-p))
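A Monte Carlo simulation corroborates both formulas (a sketch with an assumed p = 0.45 and N = 25):

```{r}
# Monte Carlo check of E(S) = Np and SE(S) = sqrt(Np(1-p))
set.seed(2)
N <- 25
p <- 0.45
S <- replicate(10000, sum(sample(c(0, 1), N, replace = TRUE, prob = c(1 - p, p))))
mean(S)  # close to N * p = 11.25
sd(S)    # close to sqrt(N * p * (1 - p)), about 2.49
```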
# Assessment 1.1: Parameters and Estimates
DataCamp due Jul 8, 2022 02:35 AWST
In this assessment, you will learn about parameters and estimates using the example of election polling.
By clicking OK, you agree to DataCamp's privacy policy: https://www.datacamp.com/privacy-policy. Note that you might need to disable your pop-up blocker, or allow "www.datacamp.com" in your pop-up blocker allowed list. When you have completed the exercises, return to edX to continue your learning.
Assessment 1.1: Parameters and Estimates (External resource) (7.0 points possible)
Ask your questions about parameters and estimates or the related DataCamp assessment here. Remember to search the discussion board before posting to see if someone else has asked the same thing before asking a new question! You're also encouraged to answer each other's questions to help further your own learning.
Some reminders:
Please be specific in the title and body of your post regarding which question you're asking about to facilitate answering your question.
Posting snippets of code is okay, but posting full code solutions is not.
If you do post snippets of code, please format it as code for readability. If you're not sure how to do this, there are instructions in a pinned post in the "general" discussion forum.
Discussion: Assessment 1.1
Topic: Section 1 / Assessment 1.1: Parameters and Estimates
Selected threads:
- (unanswered) DataCamp/edX linkage: finished the course exercises but can't see the grades; the same happens in sections 1, 2, 4, and 6.
- (unanswered) Why is X_bar "the sum of independent draws"? The textbook (16.2.4) says so, but shouldn't it by definition be the sum of independent draws divided by N (or multiplied by 1/N)?
- (unanswered) Could someone explain the question where we had to calculate the SE of the spread?
- (unanswered) Which lesson covered the expectation algebra for random variables? A couple of questions derive results from E[X(1-X)], including a key step where a constant drops out of the SE.
- (discussion) Issue with Exercise 6: the posted solution divides the upper limit of the y-axis by sqrt(25), which is not indicated or requested anywhere in the question.
- (answered) Why does subtracting 1 not affect the standard error? In Exercise 8 (standard error of d), a derivation that kept the constant -1 was marked incorrect.
- (discussion) Very nice introduction section! It defines the difference between actual probabilities and sampling estimates (e.g., polls), and shows how X_bar and its SE indicate whether a sample size is sufficient for a good estimate.
## Exercise 1. Polling - expected value of S
# ========================================================================================================================
Suppose you poll a population in which a proportion p of voters are Democrats and 1-p are Republicans. Your sample size is N=25. [][Consider the random variable S, which is the total number of Democrats in your sample.]
What is the expected value of this random variable S?
Instructions
50 XP
Possible Answers
E(S) = 25(1-p)
E(S) = 25p
E(S) = sqrt(25p(1-p))
E(S) = p
## Exercise 2. Polling - standard error of S
# ======================================================================================================================
Again, consider the random variable S, which is the total number of Democrats in your sample of 25 voters. [][The variable p describes the proportion of Democrats in the sample,] whereas 1-p describes the proportion of Republicans.
What is the standard error of S?
Instructions
50 XP
Possible Answers
SE(S) = 25p(1-p)
SE(S) = sqrt(25p)
SE(S) = 25(1-p)
SE(S) = sqrt(25*p(1-p))
## Exercise 3. Polling - expected value of X-bar
# =====================================================================================================================
Consider the random variable S/N, which is equivalent to the sample average that we have been denoting as X_bar. The variable N represents the sample size and p is the proportion of Democrats in the population.
What is the expected value of X_bar?
Instructions
50 XP
Possible Answers
E(X_bar) = p
E(X_bar) = Np
E(X_bar) = N(1-p)
E(X_bar) = 1-p
## Exercise 4. Polling - standard error of X-bar
# ========================================================================================================================
What is the standard error of the sample average, X_bar?
The variable N represents the sample size and p is the proportion of Democrats in the population.
Instructions
50 XP
Possible Answers
SE(X_bar) = sqrt(Np(1-p))
SE(X_bar) = sqrt(p(1-p)/N)
SE(X_bar) = sqrt(p(1-p))
SE(X_bar) = sqrt(N)
## Exercise 5. se versus p
Write a line of code that calculates the standard error se of a sample average when you poll 25 people in the population. Generate a sequence of 100 proportions of Democrats p that vary from 0 (no Democrats) to 1 (all Democrats).
Plot se versus p for the 100 different proportions.
Instructions
100 XP
Use the seq function to generate a vector of 100 values of p that range from 0 to 1.
Use the sqrt function to generate a vector of standard errors for all values of p.
Use the plot function to generate a plot with p on the x-axis and se on the y-axis.
```{r}
# `N` represents the number of people polled
N <- 25
# Create a variable `p` that contains 100 proportions ranging from 0 to 1 using the `seq` function
p <- seq(0, 1, 1/100)  # (incorrect attempt: this gives 101 values, not 100)
p
# ==============================================================================================================
# Create a variable `se` that contains the standard error of each sample average
se <- sqrt(N*p*(1-p))  # (incorrect attempt: this is the SE of the sum, not the average)
# Plot `p` on the x-axis and `se` on the y-axis
plot(p, se)
```
Incorrect submission
Check your call of seq(). Did you specify the argument length.out?
Incorrect submission
Use sqrt to calculate the standard error and save it as se. Make sure to specify the correct formula for standard error.
```{r}
# `N` represents the number of people polled
N <- 25
# Create a variable `p` that contains 100 proportions ranging from 0 to 1 using the `seq` function
p <- seq(0, 1, length.out=100)
# Create a variable `se` that contains the standard error of each sample average
se <- sqrt(p*(1-p)/N)
# Plot `p` on the x-axis and `se` on the y-axis
plot(p, se)
```
## Exercise 6. Multiple plots of se versus p
Using the same code as in the previous exercise, create a for-loop that generates three plots of p versus se when the sample sizes equal N = 25, N = 100, N = 1000.
Instructions
100 XP
Your for-loop should contain two lines of code to be repeated for three different values of N.
The first line within the for-loop should use the sqrt function to generate a vector of standard errors se for all values of p.
The second line within the for-loop should use the plot function to generate a plot with p on the x-axis and se on the y-axis.
Use the ylim argument to keep the y-axis limits constant across all three plots. The lower limit should be equal to 0 and the upper limit should equal 0.1 (it can be shown that this value is the highest calculated standard error across all values of p and N).
```{r}
# The vector `p` contains 100 proportions of Democrats ranging from 0 to 1 using the `seq` function
p <- seq(0, 1, length = 100)
# The vector `sample_sizes` contains the three sample sizes
sample_sizes <- c(25, 100, 1000)
# =======================================================================================================
# Write a for-loop that calculates the standard error `se` for every value of `p` for each of the three samples sizes `N` in the vector `sample_sizes`. Plot the three graphs, using the `ylim` argument to standardize the y-axis across all three plots.
se <- sqrt(p*(1-p))
plot(p, se, ylim=c(0, 0.1))
```
```{r}
# The vector `p` contains 100 proportions of Democrats ranging from 0 to 1 using the `seq` function
p <- seq(0, 1, length = 100)
# The vector `sample_sizes` contains the three sample sizes
sample_sizes <- c(25, 100, 1000)
# Write a for-loop that calculates the standard error `se` for every value of `p` for each of the three samples sizes `N` in the vector `sample_sizes`. Plot the three graphs, using the `ylim` argument to standardize the y-axis across all three plots.
se <- sqrt(p*(1-p)/sample_sizes)
plot(p, se, ylim=c(0, 0.1))
```
Incorrect submission
Make sure to write a for-loop using for.
```{r}
# The vector `p` contains 100 proportions of Democrats ranging from 0 to 1 using the `seq` function
p <- seq(0, 1, length = 100)
# The vector `sample_sizes` contains the three sample sizes
sample_sizes <- c(25, 100, 1000)
# Write a for-loop that calculates the standard error `se` for every value of `p` for each of the three samples sizes `N` in the vector `sample_sizes`. Plot the three graphs, using the `ylim` argument to standardize the y-axis across all three plots.
# =========================================================================================================================
for (N in sample_sizes) {
se <- sqrt(p*(1-p)/N)
plot(p, se, ylim=c(0, 0.1))
}
# for loop in R script ========================================================================
```
## Exercise 7. Expected value of d
# =======================================================================================================================
Our estimate for the difference in proportions of Democrats and Republicans is d = X_bar - (1-X_bar).
Which derivation correctly uses the rules we learned about sums of random variables and scaled random variables to derive the expected value of d?
Instructions
50 XP
Possible Answers
E[X_bar-(1-X_bar)] = E[2X_bar-1] = 2E[X_bar]-1 = N(2p-1) = Np-N(1-p)
E[X_bar-(1-X_bar)] = E[X_bar-1] = E[X_bar]-1 = p-1
E[X_bar-(1-X_bar)] = E[2X_bar-1] = 2sqrt(p(1-p))-1 = p-(1-p)
E[X_bar-(1-X_bar)] = E[2X_bar-1] = 2p-1 = p-(1-p) O
## Exercise 8. Standard error of d
# =======================================================================================================================
Our estimate for the difference in proportions of Democrats and Republicans is d = X_bar - (1-X_bar).
Which derivation correctly uses the rules we learned about sums of random variables and scaled random variables to derive the standard error of d?
Instructions
50 XP
Possible Answers
SE[X_bar-(1-X_bar)] = SE[2X_bar-1] = 2SE[X_bar] = 2sqrt(p/N)
SE[X_bar-(1-X_bar)] = SE[2X_bar-1] = 2SE[X_bar-1] = 2sqrt(p(1-p)/N)-1
SE[X_bar-(1-X_bar)] = SE[2X_bar-1] = 2SE[X_bar] = 2sqrt(p(1-p)/N) O
SE[X_bar-(1-X_bar)] = SE[X_bar-1] = SE[X_bar] = sqrt(p(1-p)/N)
Incorrect submission
[][Try again. Subtracting 1 does not affect the standard error. ]
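Both the expected-value and standard-error derivations can be corroborated by simulation (a sketch with p = 0.45 and N = 25 assumed): subtracting 1 shifts the distribution of d but leaves its spread unchanged.

```{r}
# Monte Carlo check: E(d) = 2p - 1 and SE(d) = 2 * sqrt(p(1-p)/N)
set.seed(4)
N <- 25
p <- 0.45
x_bar <- replicate(100000, mean(sample(c(0, 1), N, replace = TRUE, prob = c(1 - p, p))))
d <- 2 * x_bar - 1   # same as X_bar - (1 - X_bar)
mean(d)              # close to 2p - 1 = -0.1
sd(d)                # close to 2 * sqrt(p * (1 - p) / N), about 0.199
```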
## Exercise 9. Standard error of the spread
# =======================================================================================================================
Say the actual proportion of Democratic voters is p=0.45. In this case, the Republican party is winning by a relatively large margin of d=-0.1, or a 10% margin of victory. What is the standard error of the spread 2X_bar-1 in this case?
Instructions
100 XP
Use the sqrt function to calculate the standard error of the spread 2X_bar-1.
```{r}
# `N` represents the number of people polled
N <- 25
# `p` represents the proportion of Democratic voters
p <- 0.45
# =========================================================================================================================
# Calculate the standard error of the spread. Print this value to the console.
2*sqrt(p*(1-p)/N)
```
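Comparing this standard error with the spread itself (same assumed p and N) previews the next exercise:

```{r}
N <- 25
p <- 0.45
d <- 2 * p - 1                     # true spread: -0.1
se_d <- 2 * sqrt(p * (1 - p) / N)  # about 0.199
abs(d) < se_d                      # TRUE: with N = 25 the SE exceeds the spread
```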
## Exercise 10. Sample size
# =====================================================================================================================
[][So far we have said that the difference between the proportion of Democratic voters and Republican voters is about 10% and that the standard error of this spread is about 0.2 when N=25. Select the statement that explains why this sample size is sufficient or not.]
Instructions
50 XP
Possible Answers
This sample size is sufficient because the expected value of our estimate 2X_bar-1 is d so our prediction will be right on.
[][* This sample size is too small because the standard error is larger than the spread. O*]
This sample size is sufficient because the standard error of about 0.2 is much smaller than the spread of 10%.
Without knowing p, we have no way of knowing that increasing our sample size would actually improve our standard error.
# =======================================================================================================
# ========================================================= The "spread" mentioned here and above is d = 2p - 1, the difference between the Democratic and Republican proportions
## End of Assessment
This is the end of the programming assignment for this section. Please DO NOT click through to additional assessments from this page. If you do click through, your scores may NOT be recorded.
Click "Got it!" and submit to get the "points" for this question.
You can close this window and return to Data Science: Inference.
## Course / Section 2: The Central Limit Theorem in Practice / Section 2 Overview
# Section 2 Overview
In Section 2, you will look at the Central Limit Theorem in practice.
After completing Section 2, you will be able to:
Use the Central Limit Theorem to calculate the probability that a sample estimate X_bar is close to the population proportion p.
Run a Monte Carlo simulation to corroborate theoretical results built using probability theory.
Estimate the spread based on estimates of X_bar and SE_hat(X_bar).
Understand why bias can mean that larger sample sizes aren't necessarily better.
There is 1 assignment that uses the DataCamp platform for you to practice your coding skills.
We encourage you to use R to interactively test out your answers and further your learning.
## Course / Section 2: The Central Limit Theorem in Practice / Central Limit Theorem in Practice
# The Central Limit Theorem in Practice




**The central limit theorem tells us that the distribution function for a sum of draws is approximately normal. We also learned that when dividing a normally distributed random variable by a nonrandom constant, the resulting random variable is also normally distributed. This implies that the distribution of X-bar is approximately normal**.


So in summary, we have that X-bar has an approximately normal distribution. And in a previous video, we determined that the expected value is p, and the standard error is the square root of p times 1 minus p divided by the sample size N. Now, how does this help us? Let's ask an example question.




*Suppose we want to know what is the probability that we are within one percentage point from p--that we made a very, very good estimate*? So we're basically asking, what's the probability that the distance between X-bar and p, the absolute value of X-bar minus p, is less than 0.01, 1 percentage point. We can use what we've learned to see that this is the same as asking, what is the probability of X-bar being less than or equal to p plus 0.01 minus the probability of X-bar being less than or equal to p minus 0.01. Now, can we answer the question now? Can we compute that probability?


Note that we can use the mathematical trick that we learned in the previous module. What was that trick? [][*We subtract the expected value and divide by the standard error on both sides of the equation*]. What this does is it gives us a standard normal variable, which we have been calling capital Z, on the left side. And we know how to make calculations for that.



Since p is the expected value, and the standard error of X-bar is the square root of p times 1 minus p divided by N, we get that the probability that we were just calculating is equivalent to probability of Z, our standard normal variable, being less than 0.01 divided by the standard error of X-bar minus the probability of Z being less than negative 0.01 divided by that standard error of X-bar. *OK, now can we compute this probability?* Not yet. Our problem is that we don't know p. So we can't actually compute the standard error of X-bar using just the data. **But it turns out--and this is something new we're showing you--that the CLT still works if we use an estimate of the standard error that, instead of p, uses X-bar in its place**. We call this a [][*plug-in estimate*]. Our estimate of the standard error is therefore the square root of X-bar times 1 minus X-bar divided by N. Notice, we changed the p for the X-bar. In the mathematical formula we're showing you, you can see a hat on top of the SE.


[][**In statistics textbooks, we use a little hat like this to denote estimates**]. This is an estimate of the standard error, not the actual standard error. But like we said, the central limit theorem still works. Note that, importantly, that this estimate can actually be constructed using the observed data. Now, let's continue our calculations. But now *instead of dividing by the standard error, we're going to divide by this estimate of the standard error*. Let's compute this estimate of the standard error for the first sample that we took, in which we had 12 blue beads and 13 red beads. In that case, X-bar was 0.48.


So to compute the standard error, we simply write this code. And we get that it's about 0.1. So now, we can answer the question. Now, we can compute the probability of being as close to p as we wanted. We wanted to be 1 percentage point away. The answer is simply pnorm of 0.01--that's 1 percentage point--divided by this estimated se minus pnorm of negative 0.01 divided by the estimated se. We plug that into R, and we get the answer. The answer is that the probability of this happening is about 8%. So there is a very small chance that we'll be as close as this to the actual proportion.

Now, that wasn't very useful, but what it's going to do, what we're going to be able to do with the central limit theorem is determine what sample sizes are better. And once we have those larger sample sizes, we'll be able to provide a very good estimate and some very informative probabilities.
[][Textbook link]
This video corresponds to the textbook section on the Central Limit Theorem in practice.
https://rafalab.github.io/dsbook/inference.html#clt
[][Key points]
[][* Because X_bar is the sum of random draws divided by a constant, the distribution of X_bar is approximately normal. *]
We can convert X_bar to a standard normal random variable Z (standardizing lets us use the known properties of the standard normal distribution to compute probabilities):
Z = (X_bar - E(X_bar))/SE(X_bar)
[][* The probability that X_bar is within .01 of the actual value of p is: *]
Pr(Z <= 0.01/sqrt(p(1-p)/N)) - Pr(Z <= -0.01/sqrt(p(1-p)/N))
The Central Limit Theorem (CLT) still works if X_bar is used in place of p. This is called a plug-in estimate. Hats over values denote estimates. Therefore:
SE_hat(X_bar) = sqrt(X_bar(1-X_bar)/N)
Using the CLT, the probability that X_bar is within .01 of the actual value of p is:
[][ Pr(Z <= 0.01/sqrt(X_bar(1-X_bar)/N)) - Pr(Z <= -0.01/sqrt(X_bar(1-X_bar)/N)) ]
Code: Computing the probability of X_bar being within .01 of p
X_hat <- 0.48
se <- sqrt(X_hat*(1-X_hat)/25)
pnorm(0.01/se) - pnorm(-0.01/se)
# Margin of Error

So a poll of only 25 people is not really very useful, at least for a close election. *Earlier we mentioned the margin of error. Now we can define it because it is simply 2 times the standard error, which we can now estimate*. In our case it was 2 times se, which is about 0.2. [][Why do we multiply by 2]?

This is because if you ask what is the probability that we're within 2 standard errors from p, using the same previous equations, we end up with an equation like this one. This one simplifies out, and [][**we're simply asking what is the probability that a standard normal random variable, with expected value 0 and standard error 1, is within 2 of 0, and we know that this is about 95%**]. So there's a 95% chance that X-bar will be within 2 standard errors, the margin of error in our case, of p.

Now why do we use 95%? This is somewhat arbitrary. But traditionally, that's what's been used. It's the most common value used to define margins of error. In summary, the central limit theorem tells us that our poll based on a sample of just 25 is not very useful. *We don't really learn much when the margin of error is this large* {Think: we have to [][compare the margin of error with the expected value]; recall the lm() model output, where estimates are judged against their standard errors}. All we can really say is that the popular vote will not be won by a large margin.

This is why pollsters tend to use larger sample sizes. From the table that we showed earlier from RealClearPolitics, we saw that a typical sample size was between 700 and 3,500.

To see how this gives us a much more practical result, note that *if we had obtained an X-bar of 0.48, but with a sample size of 2,000, the estimated standard error would have been about 0.01* (sqrt(0.48*(1-0.48)/2000)). So our result is an estimate of 48% blue beads with a margin of error of 2%. In this case, the result is much more informative and would make us think that there are more red beads than blue beads. But keep in mind, this is just hypothetical. We did not take a poll of 2,000 beads since we don't want to ruin the competition.


>>> p = 0.48
>>> N = 2000
>>> import math
>>> se = math.sqrt(p*(1-p)/N)
>>> se
0.011171392035015153
>>>
[][Textbook link]
The margin of error is discussed within the textbook section on the Central Limit Theorem in practice.
https://rafalab.github.io/dsbook/inference.html#clt
[][Key points]
The margin of error is defined as 2 times the standard error of the estimate X_bar.
There is about a 95% chance that X_bar will be within two standard errors of the actual parameter p.
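These key points can be sketched in R using the first-sample values from the transcript above (X_hat = 0.48, N = 25):

```{r}
# Margin of error is 2 times the estimated standard error
X_hat <- 0.48
se <- sqrt(X_hat*(1-X_hat)/25)
2*se                   # margin of error, about 0.2
# Probability that a standard normal falls within 2 of 0: about 95%
pnorm(2) - pnorm(-2)
```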
# A Monte Carlo Simulation for the CLT
Suppose we want to use a Monte Carlo simulation to corroborate that the tools we've been using to build estimates and margins of error with probability theory actually work. To create the simulation, we would need to write code like this. We would simply [][*write the urn model, use replicate to construct a Monte Carlo simulation*]. The problem is, of course, that we don't know p. We can't run the code we just showed you because we don't know what p is.

However, we could construct an urn like the one we showed in a previous video and actually run an analog simulation. It would take a long time because you would be picking beads and counting them, but you could take 10,000 samples, count the beads each time, and keep track of the proportions that you see. We can use the function `take_poll` with N of 1,000 instead of actually drawing from an urn, but it would still take time because you would have to count the beads and enter the results into R. So one thing we can do to corroborate theoretical results is to pick a value of p, or several values of p, and then run simulations using those.

As an example, let's set p to 0.45. *We can simulate one poll of 1,000 beads or people using this simple code*. Now we can take that into a Monte Carlo simulation. Do it 10,000 times, each time returning the proportion of blue beads that we get in our sample.

>>> math.sqrt(0.45*(1-0.45)/1000)
0.015732132722552274
To review, the theory tells us that X-bar has an approximately normal distribution with expected value 0.45 and a standard error of about 1.5%. The simulation confirms this. If we take the mean of the X-hats that we created, we indeed get a value of about 0.45. And if we compute the sd of the values that we just created, we get a value of about 1.5%.


A histogram and a qq plot of this X-hat data confirms that the normal approximation is accurate as well. Again, note that in real life, we would never be able to run such an experiment because we don't know p. But we could run it for various values of p and sample sizes N and see that the theory does indeed work well for most values. You can easily do this yourself by rerunning the code we showed you after changing p and N.
[][Textbook link]
This video corresponds to the textbook section on a Monte Carlo simulation for the CLT.
https://rafalab.github.io/dsbook/inference.html#a-monte-carlo-simulation
[][Key points]
We can run Monte Carlo simulations to compare with theoretical results assuming a value of p.
In practice, p is unknown. We can corroborate theoretical results by running Monte Carlo simulations with one or several values of p.
[][* One practical choice for p when modeling is X_hat, the observed sample proportion. *]
Code: Monte Carlo simulation using a set value of p
p <- 0.45 # unknown p to estimate
N <- 1000
# simulate one poll of size N and determine x_hat
x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
x_hat <- mean(x)
# simulate B polls of size N and determine average x_hat
B <- 10000 # number of replicates
N <- 1000 # sample size per replicate
x_hat <- replicate(B, {
x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
mean(x)
})
Code: Histogram and QQ-plot of Monte Carlo results
library(tidyverse)
library(gridExtra)
p1 <- data.frame(x_hat = x_hat) %>%
ggplot(aes(x_hat)) +
geom_histogram(binwidth = 0.005, color = "black")
p2 <- data.frame(x_hat = x_hat) %>%
ggplot(aes(sample = x_hat)) +
stat_qq(dparams = list(mean = mean(x_hat), sd = sd(x_hat))) +
geom_abline() +
ylab("X_hat") +
xlab("Theoretical normal")
grid.arrange(p1, p2, nrow=1)
```{r}
B <- 10000
N <- 1000
p <- 0.48 # in practice we do not know p, the population parameter; here we fix a value to corroborate the theory
X_hat <- replicate(B, {
X <- sample(c(0, 1), size=N, replace=T, prob=c(1-p, p))
mean(X)
})
mean(X_hat)
sd(X_hat) # should be close to the theoretical SE, sqrt(p*(1-p)/N), about 0.0158
```
```{r}
library(gridExtra)
library(tidyverse)
p1 <- data.frame(X_hat=X_hat) %>%
ggplot(aes(X_hat)) +
geom_histogram(bins=30, color="black")
# ======================================================================================================================
# ======================================================================================================================
p2 <- data.frame(X_hat=X_hat) %>%
ggplot(aes(sample=X_hat)) +
stat_qq(dparams=list(mean=mean(X_hat), sd=sd(X_hat))) +
geom_abline() +
ylab("X_hat") +
xlab("Theoretical normal")
grid.arrange(p1, p2, nrow=1)
```
# The Spread
[][*The competition is to predict the spread, not the proportion p*]. However, because we are assuming there are only two parties, we know that the spread is just p minus (1 minus p), which is equal to 2p minus 1.

So everything we have done can easily be adapted to estimate 2p minus 1. Once we have our estimate, X-bar, and our estimate of the standard error of X-bar, we estimate the spread by 2 times X-bar minus 1, *just plugging in the X-bar where you should have a p*.


***And, since we're multiplying a random variable by 2, we know that the standard error is also multiplied by 2***. So the standard error of this new random variable is 2 times the standard error of X-bar [][2*sqrt(p*(1-p)/N)]. Note that subtracting the 1 does not add any variability, so it does not affect the standard error.

[][*Earlier we mentioned the margin of error. Now we can define it because it is simply 2 times the standard error*]



So, for our first example, with just the 25 beads, our estimate of p was 0.48 with a margin of error of 0.2. *This means that our estimate of the spread is -4 percentage points, -0.04, with a margin of error of 40%, 0.4*. Again, not a very useful sample size. But the point is that once we have an estimate and standard error for p, we have it for the spread 2p minus 1.
[][Textbook link]
This video corresponds to the textbook section on the spread.
https://rafalab.github.io/dsbook/inference.html#the-spread
[][Key points]
The spread between two outcomes with probabilities p and 1-p is 2p-1.
The estimate of the spread is 2X_bar-1, which has expected value 2p-1.
[][* The standard error of the spread is 2SE_hat(X_bar). *]
[][* The margin of error of the spread is 2 times the margin of error of X_bar. *]
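Putting these key points together for the 25-bead example (X_hat = 0.48 from the first sample in the transcript):

```{r}
# Spread estimate and its margin of error from X_hat
X_hat <- 0.48
N <- 25
spread_hat <- 2*X_hat - 1                 # about -0.04
se_spread  <- 2*sqrt(X_hat*(1-X_hat)/N)   # about 0.2
moe_spread <- 2*se_spread                 # about 0.4
c(spread_hat, se_spread, moe_spread)
```

The margin of error (0.4) dwarfs the spread estimate (-0.04), which is exactly why N = 25 is not useful.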
# Bias: Why Not Run a Very Large Poll?
Note that for realistic values of p, say between 0.35 and 0.65 for the popular vote, if we run a very large poll with say 100,000 people, theory would tell us that we would predict the election almost perfectly, since *the largest possible margin of error is about 0.3%*. Here are the calculations that were used to determine that. We can see a graph showing us the standard error for several values of p if we fix N to be 100,000.


So why are there no pollsters that are conducting polls this large? One reason is that running polls with a sample size of 100,000 is very expensive. But [][***perhaps a more important reason is that theory has its limitations***]. Polling is much more complicated than picking beads from an urn. For example, while the beads are either red or blue, and you can see it with your eyes, people, when you ask them, might lie to you. Also, because you're conducting these polls usually by phone, you might miss people that don't have phones, and they might vote differently than those that do. But *perhaps the most important way an actual poll differs from our urn model is that we actually don't know for sure who is in our population and who is not*.
How do we know who is going to vote? Are we reaching all possible voters? So, even if our margin of error is very small, it may not be exactly right that our expected value is p. [][*We call this bias*]. Historically, we observe that polls are, indeed, biased, although not by that much. The typical bias appears to be between 1% and 2%. This makes election forecasting a bit more interesting. And we'll talk about that in a later video.
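A sketch of why a larger N does not fix bias (the 2% bias value here is hypothetical, chosen from the 1-2% range mentioned above): if every poll systematically overcounts one party, increasing N shrinks the standard error, but the estimate stays centered at p + bias rather than p.

```{r}
# Hypothetical illustration: sampling with a systematic 2% bias
set.seed(1)
p <- 0.45       # true proportion
bias <- 0.02    # assumed systematic bias
B <- 2000
for (N in c(25, 1000, 10000)) {
  X_hat <- replicate(B, mean(sample(c(1, 0), size = N, replace = TRUE,
                                    prob = c(p + bias, 1 - p - bias))))
  cat(sprintf("N = %5d: mean error = %+.3f, SE = %.4f\n",
              N, mean(X_hat) - p, sd(X_hat)))
}
```

The SE column shrinks roughly like 1/sqrt(N), but the mean error stays near the bias of 0.02 no matter how large N gets.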
[][Textbook link]
This video corresponds to the textbook section on bias.
https://rafalab.github.io/dsbook/inference.html#bias-why-not-run-a-very-large-poll
[][Key points]
An extremely large poll would theoretically be able to predict election results almost perfectly.
These sample sizes are not practical. In addition to cost concerns, polling doesn't reach everyone in the population (eventual voters) with equal probability, and it also may include data from outside our population (people who will not end up voting).
These systematic errors in polling are called bias. We will learn more about bias in the future.
Code: Plotting margin of error in an extremely large poll over a range of values of p
library(tidyverse)
N <- 100000
p <- seq(0.35, 0.65, length = 100)
SE <- sapply(p, function(x) 2*sqrt(x*(1-x)/N))
data.frame(p = p, SE = SE) %>%
ggplot(aes(p, SE)) +
geom_line()
```{r}
library(tidyverse)
N <- 100000
p <- seq(0.35, 0.65, length=100)
SE <- sapply(p, function(x) 2*sqrt(x*(1-x)/N))
data.frame(p=p, SE=SE) %>%
ggplot(aes(p, SE)) +
geom_line()
```
# Assessment 2.1: Introduction to Inference
DataCamp due Jul 14, 2022 07:55 AWST
In this assessment, you will learn about the central limit theorem in practice.
By clicking OK, you agree to DataCamp's privacy policy: https://www.datacamp.com/privacy-policy. Note that you might need to disable your pop-up blocker, or allow "www.datacamp.com" in your pop-up blocker allowed list. When you have completed the exercises, return to edX to continue your learning.
Assessment 2.1: Introduction to Inference (External resource) (12.5 points possible)
By clicking OK, you agree to DataCamp's privacy policy: https://www.datacamp.com/privacy-policy.
Ask your questions about the central limit theorem for inference or the related DataCamp assessment here. Remember to search the discussion board before posting to see if someone else has asked the same thing before asking a new question! You're also encouraged to answer each other's questions to help further your own learning.
Some reminders:
Please be specific in the title and body of your post regarding which question you're asking about to facilitate answering your question.
Posting snippets of code is okay, but posting full code solutions is not.
If you do post snippets of code, please format it as code for readability. If you're not sure how to do this, there are instructions in a pinned post in the "general" discussion forum.
## Exercise 1. Sample average
Write a function called take_sample that takes the proportion of Democrats p and the sample size N as arguments and returns the sample average of Democrats (1) and Republicans (0).
Calculate the sample average if the proportion of Democrats equals 0.45 and the sample size is 100.
Instructions
100 XP
Define a function called take_sample that takes p and N as arguments.
Use the sample function as the first statement in your function to sample N elements from a vector of options where Democrats are assigned the value '1' and Republicans are assigned the value '0' in that order.
Use the mean function as the second statement in your function to find the average value of the random sample.
```{r}
# Write a function called `take_sample` that takes `p` and `N` as arguments and returns the average value of a randomly sampled population.
take_sample <- function(p, N) mean(sample(c(1, 0), size=N, prob=c(p, 1-p), replace=TRUE))
# Calculate the sample average with `p` = 0.45 and `N` = 100
take_sample(0.45, 100)
# ===========================================================================================================================