---
title: "MITx 6.431x -- Probability - The Science of Uncertainty and Data + Unit_6.Rmd"
author: "John HHU"
date: "2022-11-05"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
```{r cars}
summary(cars)
```
## Including Plots
You can also embed plots, for example:
```{r pressure, echo=FALSE}
plot(pressure)
```
Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.
## Course / Unit 6: Further topics on random variables / Lec. 11: Derived distributions
# 1. Lecture 11 overview and slides







In this lecture, we will deal with a single topic.
How to find the distribution, that is, the PMF or PDF of a
random variable that is defined as a function of one
or more other random variables with known distributions.
Why is this useful?
Quite often, we construct a model by first defining some
basic random variables.
These random variables usually have simple distributions and
often they are independent.
But we may be interested in the distribution of some more
complicated random variables that are defined in terms of
our basic random variables.
In this lecture, we will develop systematic methods for
the task at hand.
After going through a warm-up, the case of discrete random
variables, we will see that there is a general, very
systematic 2-step procedure that relies on cumulative
distribution functions.
We will pay special attention to the easier case where we
have a linear function of a single random variable.
We will also see that when the function involved is
monotonic, we can bypass CDFs and jump directly to a formula
that is easy to apply.
We will also see an example involving a function of two
random variables.
In such examples, the calculations may be more
complicated but the basic approach based on CDFs is
really the same.
Let me close with a final comment.
Finding the distribution of the function g of X is indeed
possible, but we should only do it when we really need it.
If all we care about is the expected value of g of X we
can just use the expected value rule.
Printable transcript available here.
https://courses.edx.org/assets/courseware/v1/54c9d0ee91b8434d2ab1db4c0ae7d2bb/asset-v1:MITx+6.431x+2T2022+type@asset+block/transcripts_L11-Overview.pdf
Lecture slides: [clean] [annotated]
https://courses.edx.org/assets/courseware/v1/1b8585e226baa3938df06b2408ea9f1a/asset-v1:MITx+6.431x+2T2022+type@asset+block/lectureslides_L11cleanslides.pdf
https://courses.edx.org/assets/courseware/v1/22d77661646e89b846880bb0955256a3/asset-v1:MITx+6.431x+2T2022+type@asset+block/lectureslides_L11annotatedslides.pdf
More information is given in Section 4.1 of the text.
https://courses.edx.org/courses/course-v1:MITx+6.431x+2T2022/pdfbook/0/chapter/1/31
# 2. The PMF of a function of a discrete r.v.













As a warm-up towards finding the distribution of the
function of random variables, let us start by considering
the discrete case.
So let X be a discrete random variable and let Y be defined
as a given function of X. We know the PMF of X and wish to
find the PMF of Y. Here's a simple example.
The random variable X takes the values 2, 3, 4, and 5 with
the probabilities given in the figure, and Y is the function
indicated here.
Then, for example, the probability that Y takes a
value of 4.
This is also the value of the PMF of Y evaluated at 4.
This is simply the sum of the probabilities of the possible
values of X that give rise to a value of Y
that is equal to 4.
Therefore, this expression is equal to the probability that
X is equal to 4 plus the probability that
X is equal to 5.
Or, in PMF notation, we can write it in this manner.
And in this numerical example, it would be 0.3 plus 0.4.
More generally, for any given value of little y, the
probability that the random variable capital Y takes this
particular value is the sum of the probabilities of the
little x that result in that particular value.
So the probability that the random variable capital Y,
which is the same as g of X, takes on a specific value is
the sum of the probabilities of all possible values of
little x where we only consider those values of
little x that give rise to the specific value, little y, that
you're interested in.
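The summation rule just described is mechanical enough to sketch in code. Below is a minimal illustrative sketch in Python (the lecture only states the masses 0.3 and 0.4 for x = 4 and x = 5; the masses for x = 2 and x = 3, and the exact form of g, are assumptions for illustration -- g below maps both 4 and 5 to the value 4, as in the figure):

```python
# PMF of Y = g(X) for a discrete X: p_Y(y) is the sum of p_X(x) over all x
# with g(x) = y.  The masses 0.3 and 0.4 for x = 4, 5 are the lecture's;
# the masses for x = 2, 3 and the exact form of g are illustrative assumptions.

def pmf_of_function(p_x, g):
    """Given p_x as a dict {x: P(X = x)} and a function g, return the PMF of Y = g(X)."""
    p_y = {}
    for x, p in p_x.items():
        y = g(x)
        p_y[y] = p_y.get(y, 0.0) + p
    return p_y

p_x = {2: 0.1, 3: 0.2, 4: 0.3, 5: 0.4}
g = lambda x: 4 if x >= 4 else x        # collapses x = 4 and x = 5 into y = 4

p_y = pmf_of_function(p_x, g)
print(p_y[4])                           # P(Y = 4) = P(X = 4) + P(X = 5) = 0.7
```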
Let us now look into the special case where we have a
linear function of a discrete random variable.
Suppose that X is described by the PMF shown in this diagram,
and let us consider the random variable Z, which is defined
as 2 times X. We would like to plot the PMF of Z.
First, let us note the values that Z can take.
When X is equal to minus 1, Z is going to be
equal to minus 2.
When X is equal to 1, Z is going to be equal to 2.
And when X is equal to 2, Z is going to be equal to 4.
This event that X is equal to minus 1 happens with
probability 2/6, and when that event happens, Z will take a
value of minus 2.
So this event happens with probability 2/6.
With probability 1/6, X takes a value of 1 so that Z takes a
value of 2.
And this happens with probability 1/6.
And finally, this last event here happens
with probability 3/6.
We have thus found the PMF of Z. Notice that it has the same
shape as the PMF of X, except that it is stretched or scaled
horizontally by a factor of 2.
Let us now consider the random variable Y, defined as 2X plus
3, or, equivalently, Z plus 3.
With probability 2/6, Z is equal to minus 2.
And in that case, Y is going to be equal to plus 1.
And this event happens with probability 2/6.
With probability 1/6, Z takes a value of 2 so that Y
takes a value of 5.
And finally, with probability 3/6, Z takes a value of 4 so
that Y takes a value of 7.
What we see here is that the PMF of Y has exactly the same
shape as the PMF of Z, except that it is shifted to the
right by 3.
To summarize, in order to find the PMF of a linear function
such as 2X plus 3, what we do is that we first stretch the
PMF of X by a factor of 2 and then shift it
horizontally by 3.
We can also describe the PMF of Y through a formula.
For any given value of little y, the PMF is going to be
equal to the probability that our random variable Y takes on
the specific value little y.
Then we recall that Y has been defined in our example to be
equal to 2X plus 3, so we're looking at the probability of
this event.
But this is the same as the event that X takes a value
equal to y minus 3 divided by 2.
And in PMF notation, we can write it in this form.
So what this is saying is that the probability that Y takes
on a specific value is the same as the probability that X
takes on some other specific value.
And that value here is that value of X that would give
rise to this particular value little y.
Now, we can generalize the calculation that we just did.
And more generally, if we have a linear function of a
discrete random variable X, the PMF of the random variable
Y is given by this formula in terms of the PMF of the random
variable X. The derivation is the same.
We use b instead of the specific number 3, and we have a
general constant a instead of the 2 that
we had in this example.
And this formula describes exactly what we did
graphically in our previous example.
This factor of a here serves to stretch the PMF by a factor
of a, and this term b here serves to shift the PMF by b.
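As a quick check of the formula p_Y(y) = p_X((y - b)/a), here is a small Python sketch using the example's PMF; exact arithmetic with `fractions` avoids floating-point noise:

```python
# Linear case: if Y = aX + b, then p_Y(y) = p_X((y - b) / a).
# The PMF of X below is the lecture's example: P(X = -1) = 2/6,
# P(X = 1) = 1/6, P(X = 2) = 3/6, and Y = 2X + 3.

from fractions import Fraction as F

p_x = {-1: F(2, 6), 1: F(1, 6), 2: F(3, 6)}
a, b = 2, 3

def pmf_linear(p_x, a, b, y):
    """p_Y(y) = p_X((y - b)/a), and zero when (y - b)/a is not a support point."""
    return p_x.get(F(y - b, a), F(0))

# The support of Y is {2*(-1)+3, 2*1+3, 2*2+3} = {1, 5, 7}: the PMF of X
# stretched by 2 and shifted by 3, with the same shape.
p_y = {a * x + b: p for x, p in p_x.items()}
for y in sorted(p_y):
    print(y, p_y[y])        # masses 1/3, 1/6, 1/2 at y = 1, 5, 7
```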
# 3. Exercise: Linear functions of discrete r.v.'s




# 4. A linear function of a continuous r.v.


















We now move to the case of continuous random variables.
We will start with a special case where we want to find the
PDF of a linear function of a continuous random variable.
We will start by considering a simple example, and study it
using an intuitive argument.
And afterwards, we will justify our conclusions
mathematically.
So we start with a random variable X that has a PDF over
the form shown in this figure so that it is a piecewise
constant PDF.
We then consider a random variable Z, which is defined
to be 2 times X. The random variable X takes values
between minus 1 and 1.
So z takes values between minus 2 and 2.
Now, values of X between minus 1 and 0 correspond to values
of Z between minus 2 and 0.
The different values of X in this range are, in some sense,
equally likely, because we have a constant PDF.
And that argues that the corresponding values of Z
should also be, in some sense, equally likely.
So the PDF should be constant over this range.
By a similar argument, the PDF of Z should also be constant
over the range from 0 to 2.
And the PDF must, of course, be 0 outside this range,
because these are values of Z that are impossible.
Let us now try to figure out the parameters of this PDF.
The probability that X is positive is the
area of this rectangle.
And the area of this rectangle is 2/3.
So the area of this rectangle should also be 2/3.
And that means that the height of this rectangle should be
equal to 1/3.
Similarly, the probability that X is negative is the area
of this rectangle, and the area of this rectangle is
equal to 1/3.
When X is negative, Z is also negative, so the probability
of a negative value should be equal to 1/3.
And for the area of this rectangle to be 1/3, it means
that the height of this rectangle should be 1/6.
So what happened here?
We started with a PDF of X and essentially stretched it out
by a factor of 2 while keeping the same shape.
However, we also scaled it down by a
corresponding amount.
So 2/3 became 1/3, and 1/3 became 1/6.
The reason for this scaling down is because we need the
total probability, the total area under this PDF, to be
equal to 1.
If we now add a number, let's say 3, to the random variable
Z, what is going to happen?
The random variable Y now will take values from
minus 2 plus 3--
this is plus 1--
all the way up to 2 plus 3, which is plus 5.
Values in the range from 1 to 3 correspond to values of Z in
the range from minus 2 to 0.
These values are all, in some sense, equally likely.
So they should also be equally likely here.
And by a similar argument, these values in the range from
3 to 5 should also be equally likely.
This rectangle corresponds to this rectangle here.
So the area should be the same.
And therefore, the height should also be the same.
Therefore, the height here should be 1/6.
And by the same argument, the height here
should be equal to 1/3.
So what happens here is that when we add 3 to a random
variable, the PDF just gets shifted by 3 but otherwise
retains the same shape.
So the story is entirely similar to what happened in
the discrete case.
We start with a PDF of X. We stretch it horizontally by a
factor of 2.
And then we shift it horizontally by 3.
The only difference is that here in the continuous case,
we also need to scale the plot in the vertical dimension by a
factor of 2.
Actually, make it smaller by a factor of 2.
And this needs to be done in order to keep the total area
under the PDF equal to 1.
Let us now go through a mathematical argument with the
purpose of also finding a formula that represents what
we just did in our previous example.
Let Y be equal to aX plus b.
Here, X is a random variable with a given PDF.
a and b are given constants.
Now, if a is equal to 0, then Y is identically equal to b.
So it is a constant random variable and
does not have a PDF.
So let us exclude this case and start by assuming that a
is a positive number.
We can try to work, as in the discrete case, and try
something like the following.
The probability that Y takes on a specific value is the
same as the probability that aX plus b takes on a specific
value, which is the same as the probability that X takes
on the specific value, y minus b divided by a.
This equality was useful in the discrete case.
Is it useful here?
Unfortunately not.
When we're dealing with continuous random variables,
the probability that the continuous random variable is
exactly equal to a given number, this probability is
going to be equal to 0.
And the same applies to this side as well.
So we have that 0 is equal to 0.
And this is uninformative, and we have not made any progress.
So instead of working with probabilities of individual
points which will always be 0, we will work with
probabilities of intervals that generally have non-zero
probability.
The trick is to work with CDFs.
So let us try to find the CDF of Y. The CDF of the random
variable Y is defined as the probability that the random
variable is less than or equal to a certain number.
Now, in our case, Y is aX plus b.
We move b to the other side of the inequality and then divide
both sides of the inequality by a.
And we get that this is the same as the probability that X
is less than or equal to y minus b divided by a, which is
the same as the CDF of X evaluated at y minus b over a.
So we have a formula for the CDF of Y in terms of the CDF
of X.
How can we find the PDF?
Simply by differentiating.
We differentiate both sides of this equation.
The derivative of a CDF is a PDF.
And therefore, the PDF of Y is going to be equal to the
derivative of this side.
Here we need to use the chain rule.
First, we take the derivative of this function.
And the derivative of the CDF is a PDF, so the PDF of X
evaluated at this particular number.
But then we also need to take the derivative of the argument
inside with respect to y.
And that derivative is equal to 1/a.
And this gives us a formula for the PDF of Y in terms of
the PDF of X.
How about the case where a is less than 0?
What is going to change?
The first step up to here remains valid.
But now when we divide both sides of the inequality by a,
the direction of the inequality gets reversed.
So we obtain instead the probability that X is larger
than or equal to y minus b divided by a.
And this is 1 minus the probability that X is less
than y minus b over a.
Now, X is a continuous random variable, so the probability
is not going to change if here we make the inequality to be a
less than or equal sign.
And what we have here is 1 minus the CDF of X evaluated
at y minus b over a.
We use the chain rule once more, and we obtain that the
PDF of Y, in this case, is equal to minus the PDF of X
evaluated at y minus b over a times 1/a.
Now, when a is positive, a is the same as the
absolute value of a.
When a is negative and we have this formula, we have here a
minus a, which is the same as the absolute value of a.
So we can unify these two formulas by replacing the
occurrences of a and that minus sign by just using the
absolute value.
And this gives us this formula for the PDF of Y in terms of
the PDF of X. And it is a formula that's valid whether a
is positive or negative.
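Collecting the two cases, the result of this derivation can be written as a single formula:

$$
F_Y(y) =
\begin{cases}
F_X\!\left(\dfrac{y-b}{a}\right), & a > 0, \\[6pt]
1 - F_X\!\left(\dfrac{y-b}{a}\right), & a < 0,
\end{cases}
\qquad\Longrightarrow\qquad
f_Y(y) = \frac{1}{|a|}\, f_X\!\left(\frac{y-b}{a}\right), \quad a \neq 0.
$$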
What this formula represents is the following.
Because of the factor of a that we have here, we take the
PDF of X and scale it horizontally by a factor of a.
Because of the term b that we have here, the PDF also gets
shifted horizontally by b.
And finally, this term here corresponds to a vertical
scaling of the plot that we have.
And the reason that this term is present is so that the PDF
of Y integrates to 1.
It is interesting to also compare with the corresponding
discrete formula that we derived earlier.
The discrete formula has exactly the same appearance
except that the scaling factor is not present.
So for the case of continuous random variables, we need to
scale vertically the PDF.
But in the discrete case, such a scaling is not present.
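The stretching-and-scaling story can be checked numerically. The sketch below (Python, with an assumed seed and sample size) samples the example's piecewise-constant X by inverting its CDF and verifies that Y = 2X + 3 puts mass 1/3 on (1, 3) and 2/3 on (3, 5), as the heights 1/6 and 1/3 predict:

```python
import random

random.seed(0)  # assumed seed, for reproducibility only

def sample_x():
    """Inverse-CDF sample from the example's piecewise-constant PDF:
    f_X = 1/3 on (-1, 0) and f_X = 2/3 on (0, 1)."""
    u = random.random()
    if u < 1/3:
        return 3 * u - 1            # maps (0, 1/3) onto (-1, 0)
    return (u - 1/3) * 3 / 2        # maps (1/3, 1) onto (0, 1)

a, b = 2, 3
n = 200_000
ys = [a * sample_x() + b for _ in range(n)]

# Y = 2X + 3 lives on (1, 5).  The heights 1/6 on (1, 3) and 1/3 on (3, 5)
# imply masses 2 * 1/6 = 1/3 and 2 * 1/3 = 2/3 on those intervals.
frac_low = sum(1 for y in ys if y < 3) / n
frac_high = sum(1 for y in ys if y > 3) / n
print(frac_low, frac_high)          # near 1/3 and 2/3
```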
# 5. Exercise: Linear functions of continuous r.v.'s




# 6. A linear function of a normal r.v.





Let us now consider an application of what we have
done so far.
Let X be a normal random variable with
given mean and variance.
This means that the PDF of X takes the familiar form.
We consider random variable Y, which is a linear function of
X. And to avoid trivialities, we assume that a is
different from zero.
We will just use the formula that we
have already developed.
So we have that the density of Y is equal to 1 over the
absolute value of a.
And then we have the density of X, but evaluated at x equal
to this expression.
So this expression will go in the place
of x in this formula.
And we have y minus b over a minus mu squared divided by 2
sigma squared.
And now we collect these constant terms here.
And then in the exponent, we multiply by a squared the
numerator and the denominator, which gives us this form here.
We recognize that this is again, a normal PDF.
It's a function of y.
We have a random variable Y. This is
the mean of the normal.
And this is the variance of that normal.
So the conclusion is that the random variable Y is normal
with mean equal to b plus a mu.
And with variance a squared, sigma squared.
The fact that this is the mean and this is the variance of Y
is not surprising.
This is how means and variances behave when you form
linear functions.
The interesting part is that the random variable Y is
actually normal.
Intuitively, what happened here is that we started with a
normal bell shaped curve.
A bell shaped PDF for X. We scale it vertically and
horizontally, and then shift it horizontally by b.
As we do these operations, the PDF still remains bell shaped.
And so the final PDF is again a bell shaped normal PDF.
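This conclusion is easy to test by simulation; the values of mu, sigma, a, and b below are arbitrary illustrative choices, not from the lecture:

```python
import random

random.seed(1)  # assumed seed
mu, sigma = 2.0, 3.0
a, b = -0.5, 4.0       # arbitrary illustrative constants, a != 0

n = 200_000
ys = [a * random.gauss(mu, sigma) + b for _ in range(n)]

# Y should be normal with mean b + a*mu = 3.0 and variance a**2 * sigma**2 = 2.25.
mean_y = sum(ys) / n
var_y = sum((y - mean_y) ** 2 for y in ys) / n
print(mean_y, var_y)   # near 3.0 and 2.25
```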
# 7. The PDF of a general function






















In this important segment, we will develop a method for
finding the PDF of a general function of a continuous
random variable, a function g of X, which, in general, could
be nonlinear.
The method is very general and involves two steps.
The first step is to find the CDF of Y. And then the second
step is to take the derivative of the CDF and
then find the PDF.
Most of the work lies here in finding the CDF of Y. And how
do we do that?
Well, since Y is a function of the random variable X, we
replace Y by g of X. And now we're dealing with a
probability problem that involves a random variable, X,
with a known PDF.
And we somehow calculate this probability.
So let us illustrate this procedure
through some examples.
In our first example, we let X be a random variable which is
uniform on the range from 0 to 2.
And so the height of the PDF is 1/2.
And we wish to find the PDF of the random variable Y which is
defined as X cubed.
So since X goes all the way up to 2, Y goes all
the way up to 8.
The first step is to find the CDF of Y. And since Y is a
specific function of X, we replace that functional form.
And we write it this way.
So we want to calculate the probability that X cubed is
less than or equal to a certain number y.
Let us take cubic roots of both sides of this inequality.
This is the same as the probability that X is less
than or equal to y to the 1/3.
Now, we only care about values of y that are between 0 and 8.
So this calculation is going to be for those values of y.
For other values of y, we know that the PDF is equal to 0.
And there's no work that needs to be done there.
OK.
Now, y is less than or equal to 8, so the cubic root of y
is less than or equal to 2.
So y to the 1/3 is going to be a number
somewhere in this range.
Let's say this number.
We want the probability that X is less than or
equal to that value.
So that probability is equal to this area under the PDF of
X. And since it is uniform, this area is easy to find.
It's the height, which is 1/2 times the base,
which is y to the 1/3.
So we continue this calculation, and we get 1/2
times y to the 1/3.
So this is the formula for the CDF of Y for values of little
y between 0 and 8.
This completes step one.
The second step is simple calculus.
We just need to take the derivative of the CDF.
And the derivative is 1/2 times 1/3, this exponent, y to
the power of minus 2/3.
Or in a cleaner form, 1/6 times 1 over y
to the power 2/3.
So the form of this PDF is not a constant anymore.
Y is not a uniform random variable.
The PDF becomes larger and larger as y approaches 0.
And in fact, in this example, it even blows up when y
becomes closer and closer to 0.
So this is the shape of the PDF of Y.
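The two-step calculation for Y equal to X cubed can be mirrored in a short simulation; the sketch below (assumed seed and sample size) compares the empirical CDF of simulated values of Y against the formula just derived:

```python
import random

random.seed(2)  # assumed seed

# CDF method for Y = X**3 with X uniform on (0, 2):
# F_Y(y) = P(X <= y**(1/3)) = (1/2) * y**(1/3) for 0 <= y <= 8,
# and differentiating gives f_Y(y) = 1 / (6 * y**(2/3)).
def cdf_y(y):
    if y < 0:
        return 0.0
    if y > 8:
        return 1.0
    return y ** (1/3) / 2

# Monte Carlo check: the empirical CDF of simulated Y values tracks the formula.
n = 200_000
ys = [random.uniform(0, 2) ** 3 for _ in range(n)]
emp = {y0: sum(1 for y in ys if y <= y0) / n for y0 in (1.0, 4.0)}
print(emp[1.0], cdf_y(1.0))   # cdf_y(1.0) = 0.5
```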
Our second example is as follows.
You go to the gym, you jump on the treadmill, and you set the
speed on the treadmill to some random value which we call X.
And that random value is somewhere between 5 and 10
kilometers per hour.
And the way that you set it is chosen at random and uniformly
over this interval.
So X is uniformly distributed on the interval
between 5 and 10.
You want to run a total of 10 kilometers.
How long is it going to take you?
Let the time it takes you be denoted by Y. And the time
it's going to take you is the distance you want to travel,
which is 10 divided by the speed with
which you will be going.
So the random variable Y is defined in terms of X through
this particular expression.
We want to find the PDF of Y.
First let us look at the range of the random variable Y.
Since X takes values between 5 and 10, Y takes values
between 1 and 2.
Therefore, the PDF of Y is going to be 0
outside that range.
And let us now focus on values of Y that belong to this
interesting range.
So 1 less than y less than or equal to 2.
And now we start with our two-step program.
We want to find the CDF of Y, namely, the probability that
capital Y takes a value less than or equal to a certain
little y in this range.
We recall the definition of capital Y. So now we're
dealing with a probability problem that involves the
random variable capital X, whose
distribution is given to us.
Now, we rewrite this event as follows.
We move X to the other side.
After we also move the little y to the left-hand side,
this is the probability that X is larger
than or equal to 10 over little y.
Now, y is between 1 and 2.
10/y is going to be a number between 5 and 10.
So 10/y is going to be somewhere in this range.
We're interested in the probability that X is larger
than or equal to that number.
And this probability is going to be the area of this
rectangle here.
And the area of that rectangle is equal to the height of the
rectangle--
now, the height of this rectangle is going to be 1/5.
This is the choice that makes the total area under this
curve be equal to 1--
times the base.
And the length of the base is this number
10 minus that number.
It's 10 minus 10/y.
So this is the form of the CDF of Y for y's in this range.
To find the PDF of Y, we just take the derivative.
And we get 1/5 times the derivative of this term, which
is minus 10, divided by y squared.
But when we take the derivative of 1/y, that gives
us another minus sign.
The two minus signs cancel, and we
obtain 2 over y squared.
And if you wish to plot this, it starts at 2.
And then as y increases, the PDF actually decreases.
And this is the form of the PDF of the random variable Y.
This is the form which is true when y lies between 1 and 2.
And of course, the PDF is going to be 0 for other
choices of little y.
So what we have seen here is a pretty systematic approach
towards finding the PDF of the random variable Y. Again, the
first step is to look at the CDF, write the CDF in terms of
the random variable X, whose distribution is known, and
then solve a probability problem that involves this
particular random variable.
And then in the last step, we just need to differentiate the
CDF in order to obtain the PDF.
# 8. Exercise: PDF of a general function




# 9. The monotonic case





















We have already worked through some examples in which X was a
random variable with a given PDF, and we considered the
problem of finding the PDF of Y for the case where Y was the
function x cubed or the function of the form a/X. What
both of these examples have in common is that Y is a
monotonic function of X.
In this case, Y is increasing with X. In this case, Y was
decreasing with X. It turns out that there is a general
formula that gives us the PDF of Y in terms of the PDF of X
in the special case where we're dealing
with a monotonic function.
So, let us assume that g is a strictly increasing function.
And what that means is that, if x is a number smaller
than some other number x prime, the value of g of x is
going to be smaller than the value of g of x prime.
So, when you increase the argument of the function, the
function increases.
To keep things simple, we will also assume that the function
g is smooth, in particular that it is differentiable.
Then we have a diagram such as this one.
Here is x, and y is given by a function of x.
It's a smooth function, and that function keeps
increasing.
Now, because of the assumptions we have made on g,
we have an interesting situation.
Given a value of x, a corresponding value of y will
be determined according to the function g.
But we can also go the other way.
If I tell you a value of y, then you can specify for me
one and only one value of x that gives rise to this
particular y.
So, the function g takes us from x's to y's, but you can
also go back the opposite way from y's to values of x.
And the mapping that takes us from y's to x's, this is the
inverse of the function g.
And we give a name to that inverse function,
and we call it h.
So, h of y is the value of x that produces a
specific value y.
Let us now move on with the program of finding the PDF of
Y. We will follow the usual two step procedure.
And the first step is to find the CDF of Y.
So we fix some little y, and we want to find the
probability that the random variable Y takes a value in
this range.
When does this happen?
For Y to take a value in this range, it must be the case
that X takes a value in this range here.
Values of X smaller than this particular number result in
values of Y that are less than or equal to
this particular number.
So, we can rewrite the event of interest in terms of the
random variable X and write it as follows.
We need to have x less than or equal to h of little y.
But this is just the CDF of X evaluated at h of y.
We now carry out the second step of our program.
We take derivatives of both sides and we find that the PDF
of Y is equal to the derivative of the right hand
side, the derivative of the CDF is a PDF.
And then the chain rule tells us that we also need to take
the derivative of the term inside here with respect to
its argument.
And this is a general formula for the PDF of a strictly
increasing function of a random variable X. How about
the case of a decreasing function?
So, let us assume that g now is a strictly decreasing
function of X.
So, we might have a plot for g that looks
something like this.
What happens in this case?
We can start doing a calculation of this kind.
But now, how can we rewrite this event?
The random variable Y will take a value less than or
equal to this number little y.
When does this happen?
When the value of g of x is less than y.
And that happens for x's in this range.
So, this is the set of x's for which the value of g of x
is less than or equal to this particular number y.
So the event of interest in that case is the event that X
is larger than or equal to h of y, which is 1 minus the
probability that X is less than h of y.
Because X is a continuous random variable, we can change
this inequality to one that allows the
possibility of equality.
And so this is 1 minus the CDF of X evaluated at h of y.
Now we take the derivatives of both sides and we find the PDF
of Y being equal to, there's a minus sign here, then the
derivative of the CDF, which is the PDF.
And finally, the derivative of the function h.
Now in this case, g is a decreasing function of x.
So when x goes down, y goes up.
When x goes up, y goes down.
This means that when y goes up, x goes down.
So it means that the inverse function h is going to be also
monotonically decreasing.
Since it is decreasing, it means that the slope, the
derivative of the function h is going to
be either 0 or negative.
And so minus a negative value gives us the absolute value of
that number.
So we can rewrite this by removing this minus sign here,
and putting an absolute value in this place.
Of course, in the case where g is an increasing function,
when x goes up, y goes up.
This means that when y goes up, x goes up.
So h in that case would have been an increasing function,
so this number here would have been a non-negative number,
and so it would be the same as the absolute value.
So using these absolute values, we obtain formulas
that are exactly the same in both cases of increasing and
decreasing functions, and so our final conclusion is that
in either case, the PDF of Y is given in terms of the PDF
of X times the derivative of this inverse function.
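In summary, for $g$ strictly monotonic (increasing or decreasing), with $h = g^{-1}$:

$$f_Y(y) = f_X(h(y))\, \left| \frac{dh}{dy}(y) \right|.$$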
Let us now apply the formula that we have in our hands for
the monotonic case to a particular example, where y is
the square of X, and where X is uniform on the
interval 0 to 1.
So the function g, in our case, the function g is the
square function.
Now, you could argue here that this function is not
monotonic, so how can we apply our results?
On the other hand, the random variable X takes values on the
interval from 0 to 1, and therefore the form of the
function g outside that range does not concern us.
Over the range of values of interest, the function g is a
monotonic function.
So, what is the correspondence?
y is going to be equal to x squared.
That's the g of x function.
And when that happens, we have the relation that x is going
to be the square root of y.
This tells us that the inverse function, h of y, which tells
us what is the particular x associated with a given y, the
inverse function takes the form square root of y.
So now we can go ahead and use the formula.
The density at some particular little y, where that little y
belongs to the range of values of interest: X takes values
between 0 and 1, so Y also takes values between 0 and 1.
So over that range, the density of Y is the density of
X, which is uniform, therefore it is equal to 1, times the
derivative of the square root function.
And the derivative of the square root function is 1 over
2 times the square root of y.
As you can see, the amount of calculation involved here is
rather small compared to what we would have to do if we
were to go through our two-step program
and work with CDFs.
All that you need to do is essentially to identify the
inverse function that given a y produces x's, and write down
the corresponding derivative.
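As a quick sanity check (not part of the lecture), a short Monte Carlo simulation can compare an empirical density estimate for Y = X^2, with X uniform on (0, 1), against the derived formula $f_Y(y) = 1/(2\sqrt{y})$. The sample size and bin width below are arbitrary illustrative choices.

```python
# Monte Carlo check that for X ~ Uniform(0, 1) and Y = X^2,
# the density f_Y(y) = 1 / (2 * sqrt(y)) matches an empirical estimate.
import math
import random

random.seed(0)
n = 200_000
ys = [random.random() ** 2 for _ in range(n)]  # samples of Y = X^2

def empirical_density(samples, lo, hi):
    """Fraction of samples falling in [lo, hi), divided by the bin width."""
    count = sum(lo <= s < hi for s in samples)
    return count / (len(samples) * (hi - lo))

for y in (0.1, 0.25, 0.5, 0.9):
    est = empirical_density(ys, y - 0.01, y + 0.01)
    theory = 1.0 / (2.0 * math.sqrt(y))
    print(f"y={y:.2f}  empirical={est:.3f}  formula={theory:.3f}")
```

The two columns should agree up to sampling noise, confirming the formula without any CDF manipulation.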
# 10. Exercise: Using the formula for the monotonic case






# 11. The intuition for the monotonic case










The formula that we just derived for the monotonic case
has a nice intuitive explanation that we will
develop now.
Suppose that g is a monotonic function of x and that it's
monotonically increasing.
Let us fix a particular x and a corresponding y so that the
two of them are related as follows-- y is equal to g of
x, or we could argue in terms of the inverse function so
that x is equal to h of y.
Recall that h is the inverse function, that given a value
of y, tells us which one is the corresponding value of x.
Now let us consider a small interval in the
vicinity of this x.
Whenever x falls somewhere in this range, then y is going to
fall inside another small interval.
The event that x belongs here is the same as the event that
y belongs there.
So these two events have the same probability.
And we can, therefore, write that the probability that Y
falls in this interval is the same as the probability that X
falls in the corresponding little interval on the x-axis.
This interval has a certain length delta 1.
This interval has a certain length delta 2.
Now remember our interpretation of
probabilities of small intervals in terms of PDFs so
this probability here is approximately equal to the PDF
of Y evaluated at the point y times the length of the
corresponding interval.
Similarly, on the other side, the probability that X falls
on the interval is the PDF of X times the
length of that interval.
So this gives us already a relation between the PDF of Y
and the PDF of X, but it involves those two numbers
delta 1 and delta 2.
How are these two numbers related?
If x moves up by the amount of delta 1, how much is y going
to move up?
It's going to move up by an amount which is delta 1 times
the slope of the function g at that particular point.
So that gives us one relation that delta 2 is approximately
equal to delta 1 times the derivative of the function of
g at that particular x.
However, it's more useful to work the other way, thinking
in terms of the inverse function.
The inverse function maps y to x, and it maps y plus delta
2 to x plus delta 1.
When y advances by delta 2, x is going to advance by an
amount which is how much y advanced times the slope, or
the derivative, of the function that
maps y's into x's.
And this function is the inverse function.
So this is the relation that we're going to use.
And so we replace delta 1 by this expression that we have
here in terms of delta 2.
And now we cancel the delta 2 from both sides of this
equality, and we obtain the final formula that the PDF of
Y evaluated at a certain point is equal to the PDF of X
evaluated at the corresponding point. Or we could write this
as the PDF of X evaluated at the value x that is associated
to that y by the inverse function, times the
derivative of the function h, the inverse function.
And this is just the same formula as the one that we had
derived earlier using CDFs.
This derivation is quite intuitive.
It associates probabilities of small intervals on the x-axis
to probabilities of corresponding small intervals
on the y-axis.
These two probabilities have to be equal, and this implies
a certain relation between the two PDFs.
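The interval argument can be written compactly: with $x = h(y)$,

$$f_Y(y)\,\delta_2 \approx \mathbf{P}(y \le Y \le y + \delta_2) = \mathbf{P}(x \le X \le x + \delta_1) \approx f_X(x)\,\delta_1, \qquad \delta_1 \approx \frac{dh}{dy}(y)\,\delta_2,$$

and canceling $\delta_2$ recovers $f_Y(y) = f_X(h(y))\, \dfrac{dh}{dy}(y)$.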
# 12. A nonmonotonic example










All of our examples so far have involved functions, g of
x, that are monotonic in X, at least over the
range of x's of interest.
Let us now look at an example that involves a