---
title: "MITx 6.431x -- Probability - The Science of Uncertainty and Data + Unit_4.Rmd"
author: "John HHU"
date: "2022-11-05"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


## Course  /  Unit 4: Discrete random variables  /  Lec. 5: Probability mass functions and expectations

# 1. Lecture 5 overview and slides

This lecture introduces random variables, the description of discrete random variables through probability mass functions, and the concept of expectation. The concepts are illustrated in the context of the most common discrete random variables: Bernoulli, uniform, binomial, and geometric. 


![](C:/Users/qp/Pictures/Screenshots/1. Lecture 5 overview and slides.png)
In this lecture, we introduce the notion of a random variable.  [][A random variable is, loosely speaking, a numerical quantity whose value is determined by the outcome of a probabilistic experiment].  

The weight of a randomly selected
student is one example.
After giving a general definition, we will focus
exclusively on discrete random variables.
These are random variables that take values in finite or
countably infinite sets.
For example, random variables that take
integer values are discrete.
To any discrete random variable we will associate a
probability mass function, which tells us the likelihood
of each possible value of the random variable.
Then we will go over a few examples and introduce some
common types of random variables.
And finally we will introduce a new concept: the expected
value of the random variable, also called the
expectation or mean.
It is a weighted average of the values of the random
variable, weighted according to their respective
probabilities, and has an intuitive interpretation as
the average value we expect to see if we repeat the same
probabilistic experiment independently a
large number of times.
Expected values play a central role in probability theory.
We will look into some of their properties.
And we will also calculate the expected values of the example
random variables that we will have introduced.


Printable transcript available [here](https://courses.edx.org/assets/courseware/v1/ad8812366180058d90973ef50396e587/asset-v1:MITx+6.431x+2T2022+type@asset+block/transcripts_L05-Overview.pdf).

Lecture slides: [clean](https://courses.edx.org/assets/courseware/v1/ca3078fab429bcc9db6f2b562e12f959/asset-v1:MITx+6.431x+2T2022+type@asset+block/lectureslides_L05-clean-slides.pdf) [annotated](https://courses.edx.org/assets/courseware/v1/fe59e64db945dbc0bee1468169284f8e/asset-v1:MITx+6.431x+2T2022+type@asset+block/lectureslides_L05-annotated-slides.pdf)

More information is given in Sections 2.1-2.4 of [the text](https://courses.edx.org/courses/course-v1:MITx+6.431x+2T2022/pdfbook/0/chapter/1/9).


# 2. Definition of random variables

![](C:/Users/qp/Pictures/Screenshots/2. Definition of random variables - 1.png)
![](C:/Users/qp/Pictures/Screenshots/2. Definition of random variables - 2.png)
We will now define the notion of a random variable.  [][Very loosely speaking, a random variable is a numerical quantity that takes random values].  But what does this mean?  We want to be a little more precise and I'm going to introduce the idea through an example.  Suppose that our sample space is a set of students labeled according to their names.  Or for simplicity, let's just label them as a, b, c, and d.  Our probabilistic experiment is to pick a student at random according to some probability law and then record their weight in kilograms.  

So for example, suppose that the outcome of the experiment was this particular student, and the weight of that student is 62.  Or it could be that the outcome of the experiment is this particular student, and that particular student has a weight of 75 kilograms.  *The weight of a particular student is a number, little w*.  But let us think of the abstract concept of weight, something that we will denote by capital W. Weight is an object whose value is determined once you tell me the outcome of the experiment, once you tell me which student was picked.  In this sense, weight is really a function of the outcome of the experiment.  So think of weight as an abstract box that takes as input a student and produces a number, little w, which is the weight of that particular student.  
Or more concretely, think of weight with a capital W as a
procedure that takes a student, puts him or her on a
scale, and reports the result.
In this sense, weight is an object of the same kind as the
square root function that's sitting inside your computer.
The square root function is a function.
It's a subroutine, perhaps it is a piece of code, that takes
as input a number, let's say the number 9, and produces
another number.
In this case, it would be the number 3, which is the
square root of 9.
Notice here the distinction that we will keep emphasizing
over and over.
Square root of 9 is a number.
It is the number 3.
The box square root is a function.
Now, let us go back to our probabilistic experiment.
Note that a probabilistic experiment such as the one in
our example can have several associated random variables.
For example, we could have another random variable
denoted by capital H, which is the height of a student
recorded in meters.
So if the outcome of the experiment, for example, was
student a, then this random variable would take a value
which is the height of that student, let's say it was 1.7.
Or if the outcome of the experiment was student c, then
we would record the height of that student.
And let's say it turns out to be 1.8.
Once again, height with a capital H is an abstract
object, a function whose value is determined once you tell me
the outcome of the experiment.
Now, given some random variables, we can create new
random variables as functions of the
original random variables.
For example, consider the quantity defined as weight
divided by height squared.
This quantity is the so-called body mass index, and it is
also a function on the sample space.
Why is it a function on the sample space?
Well, because once an outcome of the experiment is
determined, suppose that the outcome of the experiment was
the blue student, then these two numbers, 62 and 1.7, are
also determined.
And using those numbers, we can carry out this calculation
and find the body mass index of that particular student,
which in this case would be 21.5.
Or if it happened that this student was selected, then the
body mass index would turn out to be some other number.
In this case, it would be 23.
So again, we see that the body mass index can be viewed as an
abstract concept defined by this formula.
But once an outcome is determined, then the body mass
index is also determined.
And so the body mass index is really a function of which
particular outcome was selected.
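
The idea that W, H, and the body mass index are all functions of the outcome can be sketched in R. This is only an illustration: the values for students a and c follow the numbers quoted above (assuming the 62 kg student is a and the 75 kg student is c), while the values for b and d and all object names are made up.

```{r rv-as-function}
# Sample space: four students, labeled as in the lecture.
omega <- c("a", "b", "c", "d")

# Lookup tables play the role of the "scale" and the "ruler".
# Values for a and c follow the lecture; values for b and d are hypothetical.
weight_kg <- c(a = 62, b = 70, c = 75, d = 58)
height_m  <- c(a = 1.7, b = 1.75, c = 1.8, d = 1.6)

# Random variables: functions from an outcome to a number.
W <- function(outcome) unname(weight_kg[outcome])
H <- function(outcome) unname(height_m[outcome])

# A function of random variables is again a function of the outcome.
BMI <- function(outcome) W(outcome) / H(outcome)^2

W("a")              # a number (little w), not the function W itself
round(BMI("a"), 1)  # 21.5
round(BMI("c"), 1)  # 23.1
```

Note that `W` is the function, while `W("a")` is a number. This is the same distinction as the one between the square root function and the number 3.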
Let us now abstract from the previous discussion.
We have seen that random variables are abstract objects
that associate a specific value, a particular number, to
any particular outcome of a probabilistic experiment.
So in that sense, random variables are functions from
the sample space to the real numbers.
They are numerical functions, but as numerical functions
they can either take discrete values, for example the
integers, or they can take continuous values, let's say
on the real line.
For example, if your random variable is the number of
heads in 10 consecutive coin tosses, this is a discrete
random variable that takes values in the
set from 0 to 10.
If your random variable is a measurement of the time at
which something happened, and if your timer has infinite
accuracy, then the timer reports a real number and we
would have a continuous random variable.
In this lecture sequence and in the next few ones, we will
concentrate on discrete random variables because they are
easier to handle.
And then later on, we will move to a discussion of
continuous random variables.
Throughout, we want to keep noting this very important
distinction that I already brought up in the discussion of
a particular example, but it needs to be emphasized and
re-emphasized.
We make a distinction between random variables,
which are abstract objects, and their numerical values.
Random variables are functions on the sample space and they are
denoted by uppercase letters.
In contrast, we will use lowercase letters to indicate
numerical values of the random variables.
So little x is always a real number, as opposed to the
random variable, which is a function.
One point that we made earlier is that for the same
probabilistic experiment we can have several random
variables associated with that experiment.
And we can also combine random variables to
form new random variables.
In general, a function of random variables has numerical
values that are determined by the numerical values of the
original random variables.
And so, ultimately, they are determined by the outcome of
the experiment.
So a function of random variables has a numerical
value which is completely determined by the outcome of
the experiment.
And so a function of random variables is
also a random variable.
As an example, we could think of two random variables, X and
Y, associated with the same probabilistic experiment.
And then define a random variable, let's say X plus Y.
What does that mean?
X plus Y is a random variable that takes the value little x
plus little y when the random variable capital X takes the
value little x and capital Y takes the value little y.
So X and Y are random variables.
X plus Y is another random variable.
X and Y will take numerical values once the outcome of the
experiment has been obtained.
And if the numerical values that they take are little x
and little y, then the random variable X plus Y will take
the numerical value little x plus little y.
So we can now move on and start doing some interesting
things about random variables.
Characterize them, describe them, give some examples, and
introduce some new concepts associated with them.


# 3. Exercise: Random variables

![](C:/Users/qp/Pictures/Screenshots/3. Exercise Random variables.png)


# 4. Probability mass functions

![](C:/Users/qp/Pictures/Screenshots/4. Probability mass functions - 1.png)
![](C:/Users/qp/Pictures/Screenshots/4. Probability mass functions - 2.png)
[][A random variable can take different numerical values depending on the outcome of the experiment].  Some outcomes are more likely than others, and similarly some of the possible numerical values of a random variable will be more likely than others.  We restrict ourselves to discrete random variables, and we will describe these relative likelihoods in terms of the so-called probability mass function, or PMF for short, which gives the probability of the different possible numerical values.  

The PMF is also sometimes called the probability law or
the probability distribution of a discrete random variable.
Let me illustrate the idea in terms of a simple example.
We have a probabilistic experiment with
four possible outcomes.
We also have a probability law on the sample space.
And to keep things simple, we assume that all four outcomes
in our sample space are equally likely.
We then introduce a random variable that associates a
number with each possible outcome as
shown in this diagram.
The random variable, X, can take one of
three possible values--
namely 3, 4, or 5.
Let us focus on one of those numbers--
let's say the number 5.
So let us focus on X being equal to 5.
We can think of the event that X is equal to 5.
Which event is this?
This is the event that the outcome of the experiment led
to the random variable taking a value of 5.
So it is this particular event which consists of two
elements, namely a and b.
More formally, the event that we're talking about is the set
of all outcomes for which the value, the numerical value of
our random variable, which is a function of the outcome,
that numerical value happens to be equal to 5.
And in this example it is a set
consisting of two elements.
It's a subset of the sample space.
So it is an event.
And it has a probability.
And that probability we will be
denoting by $p_X(5)$.
And in our case this probability is equal to 1/2.
Because we have two outcomes, each with probability 1/4,
the probability of this event is equal to 1/2.
More generally, we will be using the notation $p_X(x)$ to denote
the probability of the event that the random variable, X,
takes on a particular value, x.
This is just a piece of notation, not a new concept.
We're dealing with a probability, and we indicate
it using this particular notation.
More formally, the probability that we're dealing with is the
probability, the total probability, of all outcomes
for which the numerical value of our random variable is this
particular number, x:
$$p_X(x) = P(X = x) = P(\{\omega \in \Omega : X(\omega) = x\}).$$
A few things to notice.
We use a subscript, X, to indicate which random variable
we're talking about.
This will be useful if we have several
random variables involved.
For example, if we have another random variable on the
same sample space, Y, then it would have its own probability
mass function, which would be denoted by $p_Y(y)$.
The argument of the PMF, which is x, ranges over the possible
values of the random variable, X. So in this sense, here
we're really dealing with a function.
A function that we could denote just by p with a
subscript X, that is, $p_X$.
This is a function as opposed to the specific
values of this function.
And we can produce plots of this function.
In this particular example that we're dealing with, the
interesting values of x are 3, 4, and 5.
And the associated probabilities are the value of
5 is obtained with probability 1/2, the value of 4--
this is the event that the outcome is c, which has
probability 1/4.
And the value of 3 is also obtained with probability 1/4
because the value of 3 is obtained when the outcome is
d, and that outcome has probability 1/4.
So the probability mass function is a function of an
argument x.
And for any x, it specifies the probability that the
random variable takes on this particular value.
A few more things to notice.
The probability mass function is always non-negative, since
we're talking about probabilities and
probabilities are always non-negative.
In addition, since the total probability of all outcomes is
equal to 1, the probabilities of the different possible
values of the random variable should also add to 1.
So when you add over all possible values of x, the sum
of the associated probabilities
should be equal to 1.
In terms of our picture, the event that X is equal to 3
is this subset of the sample space, the event that X
is equal to 4 is this subset of the sample space,
and the event that X is equal to 5 is this subset of
the sample space.
These three events--
the red, green, and blue--
they are disjoint, and together they cover the entire
sample space.
So their probabilities should add to 1.
And the probabilities of these events are the probabilities
of the different values of the random variable, X. So the
probabilities of these different values
should also add to 1.
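
The four-outcome example can be reproduced with a short R sketch: the PMF at each value x is the total probability of the outcomes that X maps to x. The object names are illustrative, not part of the lecture.

```{r pmf-four-outcomes}
# Probability law on the sample space: four equally likely outcomes.
prob <- c(a = 1/4, b = 1/4, c = 1/4, d = 1/4)

# The random variable X, stored as the value it assigns to each outcome.
X <- c(a = 5, b = 5, c = 4, d = 3)

# p_X(x): add the probabilities of all outcomes with X(outcome) == x.
p_X <- tapply(prob, X, sum)
p_X        # values 3, 4, 5 with probabilities 0.25, 0.25, 0.5

# The two properties noted above: non-negative, and summing to 1.
all(p_X >= 0)
sum(p_X)   # 1
```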
Let us now go through a simple example to illustrate the
general method for calculating the PMF of a
discrete random variable.
We will revisit our familiar example involving two rolls of
the tetrahedral die.
And let X be the result of the first roll, Y be the result of
the second roll.
And notice that we're using uppercase letters.
And this is because X and Y are random variables.
In order to do any probability calculations, we also need the
probability law.
So to keep things simple, let us assume that every possible
outcome, there's 16 of them, has the same probability which
is therefore 1 over 16 for each one of the outcomes.
We will concentrate on a particular random variable, Z,
defined to be the sum of the random variables X and Y, that is, Z = X + Y. So
if X and Y both happen to be 1, then Z will take
the value of 2.
If X is 2 and Y is 1, our random variable will take the
value of 3.
And similarly, for each one of these outcomes
here, the random variable takes the value of 4.
And we can continue this way by marking, for each
particular outcome, the corresponding value of the
random variable of interest.
What we want to do now is to calculate the PMF of this
random variable.
What does it mean to calculate the PMF?
We need to find the value $p_Z(z)$ for all choices of z, that is for
all possible values in the range of our random variable.
The way we're going to do it is to consider each possible
value of z, one at a time, and for any particular value find
out what are the outcomes--
the elements of the sample space--
for which our random variable takes on the specific value,
and add the probabilities of those outcomes.
So to illustrate this process, let us calculate the value of
the PMF for z equal to 2.
This is by definition the probability that our random
variable takes the value of 2.
And this is an event that can only happen here.
It corresponds to only one element of the sample space,
which has probability 1 over 16.
We can continue the same way for other values of z.
So for example, the value of PMF at z equal to 3, this is
the probability that our random variable takes the
value of 3.
This is an event that can happen in two ways--
it corresponds to two outcomes--
and so it has probability 2 over 16.
Continuing similarly, the probability that our random
variable takes the value of 4 is equal to 3 over 16.
And we can continue this way and calculate the remaining
entries of our PMF.
After you are done, you end up with a table--
or rather a graph--
a plot that has this form.
And these are the values of the different probabilities
that we have computed.
And you can continue with the other values.
It's a reasonable guess that the value at z equal to 5 is going to be 4 over
16, and that the values at 6, 7, and 8 are going to be 3 over 16, 2 over 16, and 1 over 16.
So we have completely determined the PMF of our
random variable.
We have given the form of the answers.
And it's always convenient to also provide a plot with the
answers that we have.
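
The general method (for each value z, collect the outcomes that give that value and add their probabilities) can be sketched in R for the two rolls of the tetrahedral die. The data frame and column names are illustrative.

```{r pmf-sum-two-rolls}
# All 16 outcomes of two rolls of a four-sided die, each with probability 1/16.
rolls <- expand.grid(X = 1:4, Y = 1:4)
rolls$prob <- 1 / 16

# The random variable of interest: Z = X + Y, evaluated at every outcome.
rolls$Z <- rolls$X + rolls$Y

# p_Z(z): total probability of the outcomes for which Z equals z.
p_Z <- tapply(rolls$prob, rolls$Z, sum)

p_Z * 16   # 1 2 3 4 3 2 1 for z = 2, ..., 8
sum(p_Z)   # 1
```

Calling `barplot(p_Z)` on the result would draw a plot of the same triangular shape as the one described above.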


# 5. Exercise: PMF calculation

![](C:/Users/qp/Pictures/Screenshots/5. Exercise PMF calculation - 1.png)
![](C:/Users/qp/Pictures/Screenshots/5. Exercise PMF calculation - 2.png)


# 6. Exercise: Random variables versus numbers

![](C:/Users/qp/Pictures/Screenshots/6. Exercise Random variables versus numbers - 1.png)
![](C:/Users/qp/Pictures/Screenshots/6. Exercise Random variables versus numbers - 2.png)


[][I am so confused on this, please help]


"Let X be a random variable that takes integer values, with PMF p_X(x). "

Here is my thinking,

This statement means that the outcome of the experiment is an integer. In a Bingo game, say, the outcome is the integer on the ball.
X is a function that maps each outcome of the experiment (a particular ball being drawn) to the value on that ball.
So x ranges over the possible integers on the balls, and since y is an integer, p_X(y) gives us a certain probability.

Then p_X(x) is another function, mapping each possible value to its probability of occurring among all the balls.

Am I thinking it correctly? Please help me


# 7. Bernoulli and indicator random variables

![](C:/Users/qp/Pictures/Screenshots/7. Bernoulli and indicator random variables.png)
We now want to introduce some examples of random variables, and we will start with the simplest conceivable random variable--a random variable that takes the values of 0 or 1, with certain given probabilities.  Such a random variable is called a Bernoulli random variable.  And the distribution of this random variable is determined by a parameter p, which is a given number that lies in the interval between 0 and 1.  Using PMF notation, we have $p_X(0) = 1 - p$ and $p_X(1) = p$: the probability of taking the value 0 is 1 minus p, and the probability of taking the value 1 is p.  If you wish to plot this particular PMF, the plot is rather simple.  It consists of two bars, one at 0 and one at 1.  The bar at 1 has a height of p and the bar at 0 has a height of 1 minus p.  

Bernoulli random variables show up whenever you're trying to model a situation where you run a trial.  And that trial can result in two alternative outcomes, either success or failure, or heads versus tails, and so on.  *Another situation where Bernoulli random variables show up is when we're making a connection between events and random variables*.  Here's how this connection is made.  We have our sample space, omega.  And within that sample space, we have a certain event, A.  And outside of the event A, of course, we have the complement of A.  **Our random variable is defined so that it takes a value of 1, whenever the outcome of the experiment lies in A**.  And it takes a value of 0 whenever the outcome of the experiment lies outside the event A, so that it lies in the complement.  [][This random variable is called the indicator random variable of the event A].  It is equal to 1 if and only if event A occurs.  

And the PMF of that random variable
can be found as follows.
In PMF notation, we want the value of the PMF at 1.
In the equivalent probabilistic notation,
this is the probability that the random variable
takes a value of 1.
Now the random variable takes the value of 1 if and only if
event A occurs, so this probability is equal to $P(A)$.
And so what we have is that our random variable, the
indicator random variable, is a Bernoulli random variable
with a parameter p equal to the probability of the event
of interest.
Indicator random variables are very useful because they allow
us to translate a manipulation of events to a manipulation of
random variables.
And sometimes the algebra of working with random variables
is easier than working with events, as we will see in some
later examples.
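
As a small illustration of the connection between events and random variables, the R sketch below builds the indicator of an event and checks that its PMF is Bernoulli with parameter P(A). The sample space (one roll of a fair four-sided die) and the event A are assumptions chosen for the example, not taken from the lecture.

```{r indicator-bernoulli}
# Hypothetical setting: one roll of a fair four-sided die.
omega <- 1:4
prob  <- rep(1 / 4, 4)

# Hypothetical event A: the roll is a 4.
A <- c(4)
P_A <- sum(prob[omega %in% A])

# Indicator random variable: 1 if the outcome lies in A, 0 otherwise.
I_A <- as.integer(omega %in% A)

# PMF of the indicator: add the probabilities of the outcomes for each value.
p_I <- tapply(prob, I_A, sum)

p_I[["1"]]   # 0.25, equal to P(A)
p_I[["0"]]   # 0.75, equal to 1 - P(A)
```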


# 8. Exercise: Indicator variables

![](C:/Users/qp/Pictures/Screenshots/8. Exercise Indicator variables - 1.png)
![](C:/Users/qp/Pictures/Screenshots/8. Exercise Indicator variables - 2.png)
![](C:/Users/qp/Pictures/Screenshots/8. Exercise Indicator variables - 3.png)


# 9. Uniform random variables

![](C:/Users/qp/Pictures/Screenshots/9. Uniform random variables.png)
In this segment and the next two, we will introduce a few
useful random variables that show up in many applications--
discrete uniform random variables, binomial random
variables, and geometric random variables. So let's
start with a discrete uniform.
A discrete uniform random variable is one that has a PMF
of this form.
It takes values in a certain range, and each one of the
values in that range has the same probability.
To be more precise, a discrete uniform is completely
determined by two parameters that are two integers, a and
b, which are the beginning and the end of the range of that
random variable.
We're thinking of an experiment where we're going
to pick an integer at random among the values that are
between a and b with the end points a and b included.
And all of these values are equally likely.
To be more formal, our sample space is the set of integers
from a to b.
And the number of points that we have in our sample space is
b minus a plus 1.
What is the random variable that we're talking about?
If this is our sample space, the outcome of the experiment
is already a number.
And the numerical value of the random variable is just the
number that we happen to pick in that range.
So in this context, there isn't really a distinction
between the outcome of the experiment and the numerical
value of the random variable.
They are one and the same.
Now since each one of the values is equally likely, and
since we have b minus a plus 1 possible values, this means that the
probability of any particular value is going to be 1 over (b
minus a plus 1).
This is the choice for the probability that would make
all the probabilities in the PMF sum to one.
What does this random variable model in the real world?
It models a case where we have a range of possible values,
and we have complete ignorance, no reason to
believe that one value is more likely than the other.
As an example, suppose that you look at your digital
clock, and you look at the time.
And the time that it tells you is 11:52 and 26 seconds.
And suppose that you just look at the seconds.
The seconds reading is something that takes values in
the set from 0 to 59.
So there are 60 possible values.
And if you just choose to look at your clock at a completely
random time, there's no reason to expect that one reading
would be more likely than the other.
All readings should be equally likely, and each one of them
should have a probability of 1 over 60.
One final comment--
let us look at the special case where the beginning and
the endpoint of the range of possible values is the same,
which means that our random variable can only take one
value, namely that particular number a.
In that case, the random variable that we're dealing
with is really a constant.
It doesn't have any randomness.
It is a deterministic random variable that takes a
particular value of a with probability equal to 1.
It is not random in the common sense of the word, but
mathematically we can still consider it a random variable
that just happens to be the same no matter what the
outcome of the experiment is.
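
The discrete uniform PMF can be written as a small R function. The helper name `discrete_uniform_pmf` is hypothetical (base R has no built-in discrete uniform density); the sketch simply encodes the formula 1 over (b minus a plus 1) on the integers from a to b.

```{r discrete-uniform-pmf}
# PMF of a discrete uniform on the integers a, a + 1, ..., b (hypothetical helper).
discrete_uniform_pmf <- function(x, a, b) {
  ifelse(x %in% a:b, 1 / (b - a + 1), 0)
}

# Seconds reading on the clock: 60 equally likely values, 0 through 59.
discrete_uniform_pmf(26, a = 0, b = 59)          # 1/60
sum(discrete_uniform_pmf(0:59, a = 0, b = 59))   # 1

# Special case a == b: a constant, taking the value a with probability 1.
discrete_uniform_pmf(7, a = 7, b = 7)            # 1
```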


# 10. Binomial random variables

![](C:/Users/qp/Pictures/Screenshots/10. Binomial random variables - 1.png)
![](C:/Users/qp/Pictures/Screenshots/10. Binomial random variables - 2.png)
The next random variable that we will discuss is the
binomial random variable.
It is one that is already familiar
to us in most respects.
It is associated with the experiment of taking a coin
and tossing it n times independently.
And at each toss, there is a probability, p,
of obtaining heads.
So the experiment is completely specified in terms
of two parameters--
n, the number of tosses, and p, the probability of heads at
each one of the tosses.
We can represent this experiment by the usual
sequential tree diagram.
And the leaves of the tree are the possible outcomes of the
experiment.
So these are the elements of the sample space.
And a typical outcome is a particular sequence of heads
and tails that has length n.
In this diagram here, we took n to be equal to 3.
We can now define a random variable associated with this
experiment.
Our random variable that we denote by capital X is the
number of heads that are observed.
So for example, if the outcome happens to be this one--
tails, heads, heads-- we have 2 heads that are observed.
And the numerical value of our random variable is equal to 2.
In general, this random variable, a binomial random
variable, can be used to model any kind of situation in which
we have a fixed number of independent and
identical trials, each trial can result in success or
failure, and we have a probability of success equal
to some given number, p.
The number of successes obtained in these trials is,
of course, random and it is modeled by a
binomial random variable.
We can now proceed and calculate the PMF of this
random variable.
Instead of calculating the whole PMF, let us look at just
one typical entry of the PMF.
Let's look at the entry $p_X(2)$, which, by definition, is the
probability that our random variable takes the value of 2.
Now, the random variable taking the numerical value of
2, this is an event that can happen in three possible ways
that we can identify in the sample space.
We can have 2 heads followed by a tail.
We can have heads, tails, heads.
Or we can have tails, heads, heads.
The probability of this outcome is p times p
times (1 minus p).
So it's p squared times (1 minus p).
And the other two outcomes also have the same
probability, so the overall probability is 3 times this.
This can also be written as $\binom{3}{2} p^2 (1-p)$, since 3 is the same as
3-choose-2.
It's the number of ways that you can choose where the 2 heads
will be placed in a sequence of
3 slots or 3 trials.
More generally, we have the familiar binomial formula:
$$p_X(k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n.$$
So this is a formula that you have already seen.
It's the probability of obtaining k successes in a
sequence of n independent trials.
The only thing that is new is that instead of using the
traditional probability notation, now
we're using PMF notation.
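
The calculation of the entry at 2 for n equal to 3 can be checked in R in three ways: by enumerating the leaves of the tree, by the 3-choose-2 formula, and with the built-in `dbinom`. The value of p below is an arbitrary choice for illustration.

```{r binomial-pmf-check}
p <- 0.3  # assumed probability of heads, chosen only for illustration

# Enumerate all 8 sequences of 3 tosses (1 = heads, 0 = tails).
tosses <- expand.grid(t1 = 0:1, t2 = 0:1, t3 = 0:1)
heads  <- rowSums(tosses)

# Independence: each sequence has probability p^(#heads) * (1 - p)^(#tails).
prob <- p^heads * (1 - p)^(3 - heads)

by_enumeration <- sum(prob[heads == 2])          # HHT, HTH, THH
by_formula     <- choose(3, 2) * p^2 * (1 - p)   # binomial formula
by_dbinom      <- dbinom(2, size = 3, prob = p)  # built-in binomial PMF

c(by_enumeration, by_formula, by_dbinom)         # all three agree
```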
To get a feel for the binomial PMF, it's instructive to look
at some plots.
So suppose that we toss the coin three times and that the
coin tosses are fair, so that the probability of heads is
equal to 1/2.
Then we see that 1 head or 2 heads are equally likely, and
they are more likely than the outcome of 0 or 3 heads.
Now, if we change the number of tosses and toss the coin 10
times, then we see that the most likely result
is to have 5 heads.
And then as we move away from 5 in either direction, the
probability of that particular result
becomes smaller and smaller.
Now, if we toss the coin many times, let's say 100 times,
the coin is still fair, then we see that the number of
heads that we're going to get is most likely to be somewhere
in this range between, let's say, 35 and 65.
These are values of the random variable that have some
noticeable or high probabilities.
But anything below 30 or anything above 70 is extremely
unlikely to occur.
We can generate similar plots for unfair coins.
So suppose now that our coin is biased and the probability
of heads is quite low, equal to 0.2.
In that case, the most likely result is that we're going to
see 0 heads.
And then, there's smaller and smaller probability of
obtaining more heads.
On the other hand, if we toss the coin 10 times, we expect
to see a few heads, not a very large number, but some number
of heads between, let's say, 0 and 4.
Finally, if we toss the coin 100 times and we take the coin
to be an extremely unfair one, with probability of heads equal to 0.1,
what do we expect to see?
If we think of probabilities as frequencies, we expect to
see heads roughly 10% of the time.
So, given that n is 100, we expect to see about 10 heads.
But when we say about 10 heads, we do not
mean exactly 10 heads.
About 10 heads, in this instance, as this plot tells
us, is any number more or less in the range from 0 to 20.
But anything above 20 is extremely unlikely.
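
The qualitative statements read off the plots can be checked numerically with R's built-in binomial functions. This is a rough sketch; the probability of heads of 0.1 in the last case is inferred from the "roughly 10% of the time" remark above.

```{r binomial-shapes}
# Fair coin, 10 tosses: the most likely number of heads.
k <- 0:10
mode_fair_10 <- k[which.max(dbinom(k, size = 10, prob = 0.5))]
mode_fair_10   # 5

# Fair coin, 100 tosses: probability of 35..65 heads, and of fewer than 30 or more than 70.
p_35_65         <- sum(dbinom(35:65, size = 100, prob = 0.5))
p_outside_30_70 <- 1 - sum(dbinom(30:70, size = 100, prob = 0.5))

# Unfair coin (P(heads) = 0.1), 100 tosses: probability of more than 20 heads.
p_above_20 <- pbinom(20, size = 100, prob = 0.1, lower.tail = FALSE)

c(p_35_65, p_outside_30_70, p_above_20)
```

The first probability is close to 1 and the other two are very small, which matches the "extremely unlikely" wording above.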


# 11. Exercise: The binomial PMF

![](C:/Users/qp/Pictures/Screenshots/11. Exercise The binomial PMF - 1.png)
![](C:/Users/qp/Pictures/Screenshots/11. Exercise The binomial PMF - 2.png)


# 12. Geometric random variables

![](C:/Users/qp/Pictures/Screenshots/12. Geometric random variables.png)
The last discrete random variable that we will discuss
is the so-called geometric random variable.
It shows up in the context of the following experiment.
We have a coin and we toss it infinitely many times and
independently.
And at each coin toss we have a fixed probability of heads,
which is some given number, p.
This is a parameter that specifies the experiment.
When we say that the infinitely many tosses are
independent, what we mean in a mathematical and formal sense
is that the tosses in any finite subset are independent
of each other.
I'm only making this comment because we introduced a
definition of independence of finitely many events, but had
never defined the notion of independence of infinitely
many events.
The sample space for this experiment is the set of
infinite sequences of heads and tails.
So a typical outcome of this experiment
might look like this.
It's a sequence of heads and tails in some arbitrary order.
And of course, it's an infinite sequence, so it
continues forever.
But I'm only showing you here the
beginning of that sequence.
We're interested in the following random variable, X,
which is the number of tosses until the first heads.
So if our sequence looked like this, our random variable
would be taking a value of 5.
A random variable of this kind appears in many applications
and many real world contexts.
In general, it models situations where we're waiting
for something to happen.
Suppose that we keep performing trials, one at each time, and each
trial can result either in success or failure.
And we're counting the number of trials it takes until a
success is observed for the first time.
Now, these trials could be experiments of some kind,
could be processes of some kind, or they could be whether
a customer shows up in a store in a particular second or not.
So there are many diverse interpretations of the words
trial and of the word success that would allow us to apply
this particular model to a given situation.
Now, let us move to the calculation of the PMF of this
random variable.
By definition, what we need to calculate is the probability
that the random variable takes on a
particular numerical value.
What does it mean for X to be equal to k?
What it means is that the first heads was observed in
the k-th trial, which means that the first k minus 1
trials were tails, and then were followed by heads in the
k-th trial.
This is an event that only concerns the first k trials,
and the probability of this event can be calculated using
the fact that different coin tosses or different trials are
independent.
It is the probability of tails in the first coin toss times
the probability of tails in the second coin toss, and so
on, k minus 1 times.
So we get an exponent here of k minus 1 times the
probability of heads in the k-th coin toss.
So this is the form of the PMF of this particular random
variable, and that formula applies for the possible
values of k, which are the positive integers.
Because the time of the first head can only
be a positive integer.
And any positive integer is possible, so our random
variable takes values in a discrete but infinite set.
The geometric PMF has a shape of this type.
Here we see the plot for the case where p equals to 1/3.
The probability that the first head shows up in the first
trial is equal to p, that's the probability of heads.
The probability that it shows up in the next trial, that the
first head appears in the second trial, this is the
probability that we had heads following a tail.
So we have the probability of a tail and then times the
probability of a head.
And then each time that we move to a further entry, we
multiply by a further factor of 1 minus p.
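
As a rough illustration of the formula just derived, the chunk below evaluates the PMF, (1 - p)^(k - 1) * p, for p = 1/3 and compares it with R's built-in `dgeom()`. The cutoff at k = 10 is an arbitrary choice for display. Note that `dgeom()` counts the number of failures before the first success, so it has to be called with k - 1.

```{r}
p <- 1/3     # value used for the plot in the transcript
k <- 1:10    # first ten possible values of X (arbitrary display cutoff)

# PMF from the derivation: k - 1 tails followed by one head
geom_pmf <- (1 - p)^(k - 1) * p

# dgeom() is parameterized by the number of failures, hence the shift by 1
builtin_pmf <- dgeom(k - 1, prob = p)
all.equal(geom_pmf, builtin_pmf)

# Each entry equals the previous one multiplied by (1 - p)
ratios <- geom_pmf[-1] / geom_pmf[-length(geom_pmf)]
ratios
```
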
Finally, one little technical remark.
There's a possible and rather annoying outcome of this
experiment, which would be that we observe a sequence of
tails forever and no heads.
In that case, our random variable is not well-defined,
because there is no first heads to consider.
You might say that in this case our random variable takes
a value of infinity, but we would rather not have to deal
with random variables that could be infinite.
Fortunately, it turns out that this particular event has 0
probability of occurring, which I will now try to show.
So this is the event that we always see tails.
Let us compare it with the event where we see tails in
the first k trials.
How do these two events relate?
If we have always tails, then we will have tails in the
first k trials.
So this event implies that event.
This event is smaller than that event.
So the probability of this event is less than or equal to
the probability of that second event.
And the probability of that second event is 1
minus p to the k.
Now, this is true no matter what k we choose.
And by taking k arbitrarily large, this number here
becomes arbitrarily small.
Why does it become arbitrarily small?
Well, we're assuming that p is positive, so 1 minus p is a
number less than 1.
And when we multiply a number strictly less than 1 by itself
over and over, we get arbitrarily small numbers.
So the probability of never seeing a head is less than or
equal to an arbitrarily small positive number.
So the only possibility for this is that it is equal to 0.
So the probability of not ever seeing any heads is equal to
0, and this means that we can ignore
this particular outcome.
And as a side consequence of this, the sum of the
probabilities of the different possible values of k is going
to be equal to 1, because we're certain that the random
variable is going to take a finite value.
And so when we sum probabilities of all the
possible finite values, that sum will have
to be equal to 1.
And indeed, you can use the formula for the geometric
series to verify that, indeed, the sum of these numbers here,
when you add over all values of k, is, indeed, equal to 1.
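
Written out, the two steps of this argument look as follows. This is a sketch in the usual PMF notation $p_X(k)$, assuming $p > 0$ as in the transcript.

$$
P(\text{no heads ever}) \le P(\text{tails in the first } k \text{ tosses}) = (1-p)^k \to 0 \quad \text{as } k \to \infty ,
$$

$$
\sum_{k=1}^{\infty} p_X(k) = \sum_{k=1}^{\infty} (1-p)^{k-1} p = p \sum_{j=0}^{\infty} (1-p)^{j} = p \cdot \frac{1}{1-(1-p)} = 1 .
$$
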


# 13. Exercise: Geometric random variables

![](C:/Users/qp/Pictures/Screenshots/13. Exercise Geometric random variables - 1.png)
![](C:/Users/qp/Pictures/Screenshots/13. Exercise Geometric random variables - 2.png)
![](C:/Users/qp/Pictures/Screenshots/13. Exercise Geometric random variables - 3.png)


# 14. Expectation

![](C:/Users/qp/Pictures/Screenshots/14. Expectation - 1.png)
![](C:/Users/qp/Pictures/Screenshots/14. Expectation - 2.png)
![](C:/Users/qp/Pictures/Screenshots/14. Expectation - 3.png)
![](C:/Users/qp/Pictures/Screenshots/14. Expectation - 4.png)
Our discussion of random variables so far has involved nothing but standard probability calculations.  Other than using the PMF notation, we have done nothing new.  It is now time to introduce a truly new concept that plays a central role in probability theory.  This is the concept of the expected value or expectation or mean of a random variable.  It is a single number that provides some kind of summary of a random variable by telling us what it is on the average.  Let us motivate with an example.  You play a game of chance over and over, let us say 1,000 times.  *Each time that you play, you win an amount of money, which is a random variable, and that random variable takes the value 1, with probability 2/10, the value of 2, with probability 5/10, the value of 4, with probability 3/10*.  You can plot the PMF of this random variable.  It takes values 1, 2, and 4.  And the associated probabilities are 2/10, 5/10, and 3/10.  [][How much do you expect to have at the end of the day]?  

Well, if you interpret probabilities as frequencies, in a thousand plays, you expect to have about 200 times this outcome to occur and this outcome about 500 times and this outcome about 300 times.  So your average gain is expected to be your total gain, which is 1, 200 times, plus 2, 500 times, plus 4, 300 times.  This is your total gain.  And to get to the average gain, you divide by 1,000.  And the expression that you get can also be written in a simpler form as 1 times 2/10 plus 2 times 5/10 plus 4 times 3/10.  So this is what you expect to get, on the average, if you keep playing that game.  
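
The arithmetic in this paragraph can be reproduced in a few lines. The chunk below is a minimal sketch with illustrative variable names; it computes the average gain both ways, from the expected counts in 1,000 plays and as a probability-weighted sum.

```{r}
x <- c(1, 2, 4)         # possible winnings
p <- c(2, 5, 3) / 10    # their probabilities

# Frequency view: about 200, 500 and 300 occurrences in 1,000 plays
plays <- 1000
total_gain <- sum(x * p * plays)
average_gain <- total_gain / plays

# Simpler form: each value times its probability
weighted_sum <- sum(x * p)

average_gain
weighted_sum
```

Both quantities come out to 2.4, which is the average gain per play in this example.
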

What have we done?  We have calculated a certain quantity which is a sort of average of the random variable of interest.  And what we did in this summation here, we took each one of the possible values of the random variable.  Each possible value corresponds to one term in the summation.  And what we're adding is the numerical value of the random variable times the probability that this particular value is obtained.  So when x is equal to 1, we get 1 here and then the probability of 1. When we add the term corresponding to x equals 2,
we get little x equals to 2 and next to it the probability
that x is equal to 2, and so on.
So this is what we call the expected value of the random
variable x.
This is the formula that defines it, but it's also
important to always keep in mind the interpretation of
that formula.
The expected value of a random variable is to be interpreted
as the average that you expect to see in a large number of
independent repetitions of the experiment.
One small technical caveat, if we're dealing with a random
variable that takes values in a discrete but infinite set,
this sum here is going to be an infinite sum
or an infinite series.
And there's always a question whether an infinite series has
a well-defined limit or not.
In order for it to have a well-defined limit, we will be
making the assumption that this infinite series is, as
it's called, absolutely convergent, namely that if we
replace the x's by their absolute values--
so we're adding positive numbers,
or nonnegative numbers--
the sum of those numbers is going to be finite.
So this is a technical condition that we need in
order to make sure that this expected value is a
well-defined and finite quantity.
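
In symbols, and assuming the usual notation $p_X(x)$ for the PMF, the definition and the technical condition just described can be written as:

$$
E[X] = \sum_{x} x \, p_X(x), \qquad \text{assuming} \quad \sum_{x} |x| \, p_X(x) < \infty .
$$
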
Let us now calculate the expected value of a very
simple random variable, the Bernoulli random variable that
takes the value 1 with probability p and the value 0
with probability 1 minus p.
The expected value consists of two terms.
X can take the value 1.
This happens with probability p.
Or it can take the value of zero.
This happens with probability 1 minus p.
And therefore, the expected value is just equal to p.
As a special case, we may consider the situation where X
is the indicator random variable of a certain event,
A, so that X is equal to 1 if and only if event A occurs.
In this case, the probability that X equals to 1, which is
our parameter p, is the same as the probability
that event A occurs.
And we have this relation.
And so with this correspondence, we readily
conclude that the expected value of an indicator random
variable is equal to the probability of that event.
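
The Bernoulli calculation and the indicator relation can be summarized as follows, where $I_A$ is used here as a name for the indicator random variable of the event $A$ (the symbol is an assumption; the transcript does not name it):

$$
E[X] = 1 \cdot p + 0 \cdot (1-p) = p, \qquad E[I_A] = P(I_A = 1) = P(A) .
$$
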
Let us move now to the calculation of the expected
value of a uniform random variable.
Let us consider, to keep things simple, a random
variable which is uniform on the set from 0 to n.
It's uniform, so the probability of the values that
it can take are all equal to each other.
It can take one of n plus 1 possible values, and so the
probability of each one of the values is 1 over n plus 1.
We want to calculate the expected value
of this random variable.
How do we proceed?
We just recall the definition of the expectation.
It's a sum where we add over all of the possible values.
And for each one of the values, we multiply by its
corresponding probability.
So we obtain a summation of this form.
We can factor out a factor of 1 over n plus 1, and we're
left with 0 plus 1 plus all the way up to n.
And perhaps you remember the formula for summing those
numbers, and it is n times n plus 1 over 2.
And after doing the cancellations, we obtain a
final answer, which is n over 2.
Incidentally, notice that n over 2 is just the midpoint of
this picture that we have here in this diagram.
This is always the case.
Whenever we have a PMF which is symmetric around a certain
point, then the expected value will be
the center of symmetry.
More generally, if you do not have symmetry, the expected
value turns out to be the center of gravity of the PMF.
If you think of these bars as having weight, where the
weight is proportional to their height, the center of
gravity is the point at which you should put your finger if
you want to balance that diagram so that it doesn't
fall in one direction or the other.
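
A quick numerical check of the uniform case and of the center-of-gravity picture is sketched below. The value n = 10 is a hypothetical choice; any nonnegative integer would do.

```{r}
n <- 10                         # hypothetical choice of n
x <- 0:n                        # the n + 1 possible values
p <- rep(1 / (n + 1), n + 1)    # uniform PMF: each value has probability 1 / (n + 1)

# Definition of the expectation: sum of value times probability
expected_value <- sum(x * p)
expected_value

# Center of gravity: the signed, probability-weighted distances balance out
balance <- sum((x - expected_value) * p)
balance
```

The first result equals n / 2, and the second is zero up to floating-point rounding, which is the balancing condition described above.
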
And we now close this segment by providing one more
interpretation of expectations.
Suppose that we have a class consisting of n students and
that the ith student has a weight which
is some number xi.
We have a probabilistic experiment where we pick one
of the students at random, and each student is equally likely
to be picked as any other student.
And we're interested in the random variable X, which is
the weight of the student that was selected.
To keep things simple, we will assume that the
xi's are all distinct.
And we first find the PMF of this random variable.
Any particular xi that is possible is associated with
exactly one student, because we assumed that
the xi's are distinct.
So this probability would be the probability of selecting
the ith student, and that probability is 1 over n.
And now we can proceed and calculate the expected value
of the random variable X. This random variable X takes
values, and the values that it takes are the xi's.
A particular xi would be associated with a probability
1 over n, and we're adding over all the i's or over all
of the students.
And so this is the expected value.
What we have here is just the average of the weights of the
students in this class.
So the expected value in this particular experiment can be
interpreted as the true average over the entire
population of the students.
Of course, here we're talking about two
different kinds of averages.
In some sense, we're thinking of expected values as the
average in a large number of repetitions of experiments.
But here we have another interpretation as the average
over a particular population.
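
The two kinds of averages can be compared in a small sketch. The weights below are made-up numbers for a hypothetical class of five students, and the simulation size is an arbitrary choice.

```{r}
weights <- c(52, 61, 68, 74, 80)   # hypothetical, distinct weights x_i
n <- length(weights)
pmf <- rep(1 / n, n)               # each student is picked with probability 1 / n

# Expected value from the definition, and the plain population average
expected_weight <- sum(weights * pmf)
population_mean <- mean(weights)

# Average over many independent repetitions of the experiment
set.seed(1)
draws <- sample(weights, size = 10000, replace = TRUE)
simulated_mean <- mean(draws)

expected_weight
population_mean
simulated_mean
```

The first two numbers agree exactly; the simulated average is only close to them, since it depends on the random draws.
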


# 15. Exercise: Expectation calculation

![](C:/Users/qp/Pictures/Screenshots/15. Exercise Expectation calculation.png)


# 16. Elementary properties of expectation

![](C:/Users/qp/Pictures/Screenshots/16. Elementary properties of expectation.png)


# 17. Exercise: Random variables with bounded range

![](C:/Users/qp/Pictures/Screenshots/17. Exercise Random variables with bounded range - 1.png)
![](C:/Users/qp/Pictures/Screenshots/17. Exercise Random variables with bounded range - 2.png)


# 18. The expected value rule

![Read and think](C:/Users/qp/Pictures/Screenshots/18. The expected value rule.png)
In this segment, we discuss the expected value rule for calculating the expected value of a function of a random variable.
It corresponds to a nice formula that we will see
shortly, but it also involves a much more general idea that
we will encounter many times in this course, in somewhat
different forms.
Here's what it is all about.
We start with a certain random variable that has a known PMF.
However, we're ultimately interested in another random
variable Y, which is defined as a function of the original
random variable.
We're interested in calculating the expected value
of this new random variable, Y. How should we do it?
We will illustrate the ideas involved through a simple
numerical example.
In this example, we have a random variable, X, that takes
values 2, 3, 4, or 5, according to some given
probabilities.
We are also given a function that maps
x-values into y-values.
And this function, g, then defines a new random variable.
So if the outcome of the experiment leads to an X equal
to 4, then the random variable, Y, will also take a
value equal to 4.
How do we calculate the expected value of Y?
The only tool that we have available in our hands at this
point is the definition of the expected value, which tells us
that we should run a summation over the y-axis, consider
different values of y one at a time.
And for each value for y, multiply that value by its
corresponding probability.
So in this case, we start with Y equal to 3, which needs to
be multiplied by the probability that
Y is equal to 3.
What is that probability?
Well, Y is equal to three, if and only if X is 2 or 3, which
happens with probability, 0.1 plus 0.2.
Then we continue with the summation by considering the
next value of little y.
The next possible value is 4.
And this gives us a contribution of 4, weighted by
the probability of obtaining a 4.
The probability that Y is equal to 4 is the probability
that X is either equal to 4 or to 5, which happens with
probability 0.3 plus 0.4.
So this way, we obtain an arithmetic expression which we
can evaluate.
And it's going to give us the expected value of Y. But
here's an alternative way of calculating
the expected value.
And this corresponds to the following type of thinking.
10% of the time, X is going to be equal to 2.
And when that happens, Y takes on a value of 3.
So this should give us a contribution to the average
value of Y, which is 3 times 0.1.
Then, 20% of the time, X is 3 and Y is also 3.
So 20% of the time, we also get 3's in Y.
Then 30% of the time, X is 4, which results in a Y that's
equal to 4.
So we obtain a 4 30% of the time.
And finally, 40% of the time, X equals to [5], which results
in a Y equal to 4.
And we obtain this arithmetic expression.
Now you can compare the two arithmetic expressions, the
red and the blue one, and you will notice that they're
equal, except that the terms are arranged in a slightly
different way.
Conceptually, however, there's a very big difference.
In the first summation, we run over the values of
Y one at a time.
In the second summation, we run over the different values
of X one at a time, and took into account their individual
contributions.
This second way of calculating the expected value of Y is
called the expected value rule.
And it corresponds to the following formula.
We carry out a summation over the x-axis.
For each x-value that we consider, we calculate what is
the corresponding y-value, that's g of x, and also weigh
this term according to the probability of
this particular x.
So for instance, a typical term here would be when x is
equal to 2, g of x would be equal to 3.
And the corresponding probability, that's the
probability of a 2, would be 0.1.
The advantage of using the expected value rule instead of
the definition of the expectation is that the
expected value rule only involves the PMF of the
original random variable, so we do not need to do any
additional work to find the PMF of
the new random variable.
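
Both calculations from the numerical example can be sketched in R. The function `g` below encodes the mapping described in the transcript (2 and 3 go to 3, 4 and 5 go to 4); writing it with `ifelse()` is just one convenient way to express it.

```{r}
x <- c(2, 3, 4, 5)
p_x <- c(0.1, 0.2, 0.3, 0.4)

# g maps 2, 3 -> 3 and 4, 5 -> 4
g <- function(x) ifelse(x <= 3, 3, 4)
y <- g(x)

# Definition: first build the PMF of Y, then sum over the y values
p_y <- tapply(p_x, y, sum)
y_values <- as.numeric(names(p_y))
ev_definition <- sum(y_values * p_y)

# Expected value rule: sum over the x values, no PMF of Y needed
ev_rule <- sum(g(x) * p_x)

ev_definition
ev_rule
```

Both approaches give 3.7; the second one skips the `tapply()` step that constructs the PMF of Y.
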
Now we argued in favor of the expected value rule by
considering this numerical example, and by checking that
it gives the right result.
But now let us verify.
Let us argue more generally that it's going to give us the
right answer.
So what we're going to do is to take this summation and
argue that it's equal to the expected value of Y, which is
defined by that summation.
So let us start with this.
It's a sum over all x's.
Let us first fix a particular value of y, and add over all
those x's that correspond to that particular y.
So we're fixing a particular y.
And so we're adding only over those x's that lead to that
particular y.
And we carry out the summation.
So this is the part of this sum associated with one
particular choice of y.
And it's a sum, really, over this set of x's.
But in order to exhaust all x's, we need to consider all
possible values of y.
And this gives rise to an outer summation over the
different y's.
So for any fixed y, we add over the associated x's.
But we want to consider all the possible y's.
Now at this point, we make the following observation.
Here, we have a summation over y's.
And let's look at the inner summation.
The inner summation involves x's, all of which are
associated with a specific value of y.
Having fixed y, all the terms inside this sum have the
property that g of x is equal to y.
So g of x is equal to that particular y.
And we can make this substitution here.
Now when we look at this summation, we now realize that
it's a summation over x's while y is being fixed.
And so we can take this term of y and pull
it outside the summation.
What this leaves us with is a sum over all y's of y, and
then a further sum over all x's that lead to that
particular y, of the probabilities of those x's.
Now what can we say about this inner summation?
We have fixed a y.
For that particular y, we're adding the probabilities of
all the x's that lead to that particular y.
Fixing y, consider all the x's that lead to it.
This is just the probability of that particular y.
But what we have now is just the definition of the expected
value of Y. And this concludes the proof that this
expression, as given by the expected value rule, gives us
the same answer as the original definition of the
expected value of Y.
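
The chain of equalities in this argument can be written out as follows, with $Y = g(X)$ and the usual PMF notation $p_X$, $p_Y$ (the notation is assumed, since the slide itself is only available as an image):

$$
\begin{aligned}
\sum_{x} g(x) \, p_X(x)
&= \sum_{y} \; \sum_{x \,:\, g(x) = y} g(x) \, p_X(x) \\
&= \sum_{y} \; \sum_{x \,:\, g(x) = y} y \, p_X(x) \\
&= \sum_{y} y \sum_{x \,:\, g(x) = y} p_X(x) \\
&= \sum_{y} y \, p_Y(y) = E[Y] .
\end{aligned}
$$
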
Now before closing, a few observations.
The expected value rule is really simple to use.
For example, if you want to calculate the expected value
of the square of a random variable, then you're dealing
with a situation where the g function
is the square function.
And so, the expected value of X-squared will be the sum over
x's of x squared weighted according to the probability
of a particular x.
And finally, one important word of caution, that in
general, the expected value of the function--
so for example, the expected value of X-squared.
In general, it's not going to be the same as taking the
expected value of X and squaring it.
So this is a word [of] caution, that in general, you
cannot interchange the order with which you apply a
function, and then you calculate expectation.
There are exceptions, however, in which we happen to have
equality here.
And this is going to be our next topic.
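
The word of caution can be illustrated with the PMF from the numerical example earlier in this segment (values 2, 3, 4, 5 with probabilities 0.1, 0.2, 0.3, 0.4). This is only one example; it shows that the two quantities can differ, not by how much they differ in general.

```{r}
x <- c(2, 3, 4, 5)
p_x <- c(0.1, 0.2, 0.3, 0.4)

# Expected value rule with g(x) = x^2
ev_of_square <- sum(x^2 * p_x)

# Square of the expected value
square_of_ev <- sum(x * p_x)^2

ev_of_square
square_of_ev
```

Here the expected value of X squared is 17, while the square of the expected value of X is 16.
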


[][*Note: At 3:32, the "average over x" should be $3 \cdot 0.1 + 3 \cdot 0.2 + 4 \cdot 0.3 + 4 \cdot 0.4$ (instead of $4 \cdot 0.5$ in the last term).*]


# 19. Exercise: The expected value rule

![](C:/Users/qp/Pictures/Screenshots/19. Exercise The expected value rule.png)
![](C:/Users/qp/Pictures/20220925_210833.jpg)


# 20. Linearity of expectations

![](C:/Users/qp/Pictures/Screenshots/20. Linearity of expectations.png)
![](C:/Users/qp/Pictures/20220925_211252.jpg)
We end this lecture sequence with the most important
property of expectations, namely linearity.
The idea is pretty simple.
Suppose that our random variable, X, is the salary of
a random person out of some population.
So that we can think of the expected value of X as the
average salary within that population.
And now suppose that everyone gets a raise, and
Y is the new salary.
And generously, the new salary is twice the old salary plus a
bonus of $100.
What happens to the expected value of the salary, or the
average salary?
Well the new average salary, which is the expected value of
2X plus 100, is twice the old average plus 100.
So doubling everyone's salary and giving to everyone an
additional $100, what it does to the average is that it
doubles the average and adds 100 to it.
This is the linearity property of expectation in one
particular example.
It's a most intuitive property, but it's worth also
deriving it in a formal way.
And the derivation proceeds through the
expected value rule.
We're dealing here with a particular function, g, which
is a linear function.
So we're dealing with a linear function, ax plus b.
And we're dealing with a random variable, Y, which is g
applied to an original random variable, X.
So the expected value of Y can be calculated according to the
expected value rule.
It's the sum over all x's of g of x times the probability of
that particular x.
And we plug-in the specific form of the function, g, which
is ax plus b.
And then we separate the sum into two sums.
The first sum, after pulling out a constant of
a, takes this form.
And the second sum, after pulling out the constant, b,
takes this form.
Now, the first sum is a times the expected value of X. This
is just the definition of the expected value.
As for the second sum, we realize that this quantity is
equal to 1 because it is the sum of the probabilities of
all the different values of X. And this concludes the proof
of the linearity of expected values.
Notice that for expected values, what we have is that
the expected value of Y, which is expected value of g of X,
is the same as g of the expected value of X. The
expected value of a linear function is the same linear
function applied to the expected value.
But this is an exceptional case.
This does not happen in general.
It's an exceptional function g that makes this happen.
This property is true for linear functions.
But for non-linear functions, it is generally false.
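
For reference, the derivation described in this segment can be written out as follows, with $g(x) = ax + b$ and the usual PMF notation $p_X$:

$$
\begin{aligned}
E[aX + b]
&= \sum_{x} (a x + b) \, p_X(x) \\
&= a \sum_{x} x \, p_X(x) + b \sum_{x} p_X(x) \\
&= a \, E[X] + b .
\end{aligned}
$$

In the salary example, $a = 2$ and $b = 100$, so $E[2X + 100] = 2 E[X] + 100$.
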


# 21. Exercise: Linearity of expectations

![](C:/Users/qp/Pictures/Screenshots/21. Exercise Linearity of expectations - 1.png)
![](C:/Users/qp/Pictures/Screenshots/21. Exercise Linearity of expectations - 2.png)
![](C:/Users/qp/Pictures/20220925_211627.jpg)


## Course  /  Unit 4: Discrete random variables  /  Lec. 6: Variance; Conditioning on an event; Multiple r.v.'s

# 1. Lecture 6 overview and slides

In this lecture, we introduce the variance of a random variable and some of its properties. We then discuss conditional PMFs, given an event. Finally, we introduce the joint PMF, as a way of describing the distribution of multiple random variables, and develop the linearity property of expectations. 

![](C:/Users/qp/Pictures/Screenshots/1. Lecture 6 overview and slides.png)
In the previous lecture we introduced random variables,
probability mass functions and expectations.
In this lecture we continue with the development of
various concepts associated with random variables.
There will be three main parts.
In the first part we define the variance of a random
variable, and calculate it for some of our
familiar random variables.
Basically the variance is a quantity that measures the
amount of spread, or the dispersion of a probability
mass function.
In some sense, it quantifies the amount of randomness that
is present.
Together with the expected value, the variance summarizes
crisply some of the qualitative properties of the
probability mass function.
In the second part we discuss conditioning.
Every probabilistic concept or result has a conditional
counterpart.
And this is true for probability mass functions,
expectations and variances.
We define these conditional counterparts and then develop
the total expectation theorem.
This is a powerful tool that extends our familiar total
probability theorem and allows us to divide and conquer when
we calculate expectations.
We then take the opportunity to dive deeper into the
properties of geometric random variables, and use a trick
based on the total expectation theorem to
calculate their mean.
In the last part we show how to describe probabilistically
the relation between multiple random variables.
This is done through a so-called joint probability
mass function.
We take the occasion to generalize the expected value
rule, and establish a further linearity property of
expectations.
We finally illustrate the power of these tools through
the calculation of the expected value of a binomial
random variable.


Printable transcript available here.
https://courses.edx.org/assets/courseware/v1/144b8700a028ae456f9289fee7d4cb4a/asset-v1:MITx+6.431x+2T2022+type@asset+block/transcripts_L06-Overview.pdf

Lecture slides: [clean] [annotated]
https://courses.edx.org/assets/courseware/v1/55a6be8b73b1a9060f8003a1cffb0e6b/asset-v1:MITx+6.431x+2T2022+type@asset+block/lectureslides_L06-clean-slides.pdf
https://courses.edx.org/assets/courseware/v1/364fef85d8e4340712492e316cf2a5e1/asset-v1:MITx+6.431x+2T2022+type@asset+block/lectureslides_L06-annotated-slides.pdf

More information is given in Sections 2.4-2.6 of the text.
https://courses.edx.org/courses/course-v1:MITx+6.431x+2T2022/pdfbook/0/chapter/1/10
![OCW Video Lecture](C:/Users/qp/Pictures/20220925_211816.jpg)
![OCW Video Lecture](C:/Users/qp/Pictures/20220925_212020.jpg)
![OCW Video Lecture](C:/Users/qp/Pictures/20220925_212144.jpg)


# 2. Variance

![](C:/Users/qp/Pictures/Screenshots/2. Variance - 2.png)
![How did we derive the formula?](C:/Users/qp/Pictures/Screenshots/2. Variance - 3.png)
We have introduced the concept of expected value or mean,
which tells us the average value of a random variable.
We will now introduce another quantity, the variance, which
quantifies the spread of the
distribution of a random variable.
So consider a random variable with a given PMF, for example
like the PMF shown in this diagram.
And consider another random variable that happens to have
the same mean, but its distribution
is more spread out.
So both random variables have the same mean, which we denote
by mu, and which in this picture would be somewhere
around here.
However, the second PMF, the blue PMF, has typical outcomes
that tend to have a larger distance from the mean.
By distance from the mean what we mean is that if the result
of the random variable, its numerical value, happens to
be, let's say for example, this one, then this quantity
here, X minus mu, is the distance from the mean, how
far away the outcome of the random variable happens to be
from the mean of that random variable.
Of course, the distance from the mean is a random quantity.
It is a random variable.
Its value is determined once we know the outcome of the
experiment and the value of the random variable.
What can we say about the distance from the mean?
Let us calculate its average or expected value.
The expected value of the distance from the mean, which
is this quantity, using the linearity of expectations, is
equal to the expected value of X minus the constant mu.
But the expected value is by definition equal to mu.
And so we obtain zero.
So we see that the average value of the distance from the
mean is always zero.
And so it is uninformative.
What we really want is the average absolute value of the
distance from the mean, or something with this flavor.
Mathematically, it turns out that the average of the
squared distance from the mean is a better behaved
mathematical object.
And this is the quantity that we will consider.
It has a name.
It is called the variance.
And it is defined as the expected value of the squared
distance from the mean.
The first thing to note is that the variance is always
non-negative.
This is because it is the expected value of non-negative
quantities.
How exactly do we compute the variance?
The squared distance from the mean is really a function of
the random variable X. So it is a function of the form g of
X, where g is a particular function defined this way.
So we can use the expected value rule applied to this
particular function g.
And we obtain the following.
So what we have to do is to go over all numerical values of
the random variable X. For each one, calculate its
squared distance from the mean and weigh that quantity
according to the corresponding probability of that particular
numerical value.
One final comment, the variance is a bit hard to
interpret, because it is in the wrong units.
If capital X corresponds to meters, then the variance has
units of meters squared.
A more intuitive quantity is the square root of the
variance, which is called the standard deviation.
It has the same units as the random variable and captures
the width of the distribution.
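
As a small illustration of these definitions, the chunk below computes the variance and standard deviation through the expected value rule. It reuses the PMF of the game-of-chance example from the expectation segment (values 1, 2, 4 with probabilities 0.2, 0.5, 0.3); that choice is only for illustration and is not the PMF shown in this segment's diagram.

```{r}
x <- c(1, 2, 4)
p <- c(0.2, 0.5, 0.3)

# Mean of X
mu <- sum(x * p)

# Expected value rule with g(x) = (x - mu)^2
variance <- sum((x - mu)^2 * p)

# Standard deviation: same units as X
std_dev <- sqrt(variance)

mu
variance
std_dev
```

For this PMF the mean is 2.4 and the variance is 1.24; the standard deviation is the square root of the latter.
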
Let us now take a quick look at some of the properties of
the variance.
We know that expectations have a linearity property.
Is this the case for the variance as well?
Not quite.
Instead we have this relation for the variance of a linear
function of a random variable.
Let us see why it is true.
We use the shorthand notation mu for the expected value of
X. We will proceed one step at a time and first consider what
happens to the variance if we add a
constant to a random variable.
So let Y be X plus some constant b.
And let us just define nu to be the expected value of Y,
which, using linearity of expectations, is the expected
value of X plus b.
Let us now calculate the variance.
By definition the variance of Y is the expected value of the
distance squared of Y from its mean.
Now we substitute, because in this case Y is
equal to X plus b.
Whereas the mean, nu, is mu plus b.
And now we notice that this b cancels with that b.
And we are left with the expected value of X minus mu
squared, which is just the variance of X. So this proves
this relation for the case where a is equal to 1.
The variance of X plus b is equal to the variance of X. So
we see that when we add a constant to a random variable,
the variance remains unchanged.
Intuitively, adding a constant just moves the entire PMF
right or left by some amount, but without
changing its shape.
And so the spread of this PMF remains unchanged.
Let us now see what happens if we multiply a random variable
by a constant.
Let again nu be the expected value of Y. And so in this
case by linearity this is equal to a times the expected
value of X. So it is a times mu.
We calculate the variance once more using the definition and
substituting in the place of Y what Y is in this case--
it's aX--
and subtracting the mean of Y, which is a mu, squared.
We take out a factor of a squared.
And then we use linearity of expectations to note that this
is a squared times the expected value of X minus mu
squared, which is a squared times the variance of X.
So this establishes this formula for the case where b
equals zero.
Putting together these two facts, if we multiply a random
variable by a, the variance gets multiplied by a squared.
And if we add a constant, the variance doesn't change.
And this establishes this particular fact.
As an example, the variance of, let's say, 3 minus 4X is
going to be equal to (minus 4) squared times the variance of
X, which is 16 times the variance of X.
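
The rule for linear functions can be checked numerically. The chunk below is a minimal sketch: the PMF (`x`, `p`) and the helper `pmf_var` are made up for illustration and do not come from the lecture.

```{r}
# Hypothetical PMF used only for illustration (not from the lecture).
x <- c(0, 1, 2, 5)
p <- c(0.1, 0.4, 0.3, 0.2)

# Variance from the definition E[(X - mu)^2], computed with the expected value rule.
pmf_var <- function(values, probs) {
  mu <- sum(values * probs)
  sum((values - mu)^2 * probs)
}

var_x <- pmf_var(x, p)
# Y = 3 - 4X takes the value 3 - 4x with the same probability that X takes x.
var_y <- pmf_var(3 - 4 * x, p)

all.equal(var_y, 16 * var_x)         # a = -4, so the factor is (-4)^2 = 16
all.equal(pmf_var(x + 7, p), var_x)  # adding a constant leaves the variance unchanged
```

Both comparisons should come out `TRUE` for any valid PMF, not just this one.
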
Finally, let me mention an alternative way of computing
variances, which is often a bit quicker.
We have this useful formula here.
We will see later a few examples of how it is used,
but for now let me just show why it is true.
We have by definition that the variance of X is the expected
value of X minus mu squared.
Now let us rewrite what is inside the expectation by just
expanding this square, which is [X squared minus]
2 mu X plus mu squared.
Using linearity of expectations, this is broken
down into expected value of X squared minus the expected
value of two times mu X. But mu is a constant.
So we can take it outside the expected value.
And we're left with 2mu expected value
of X plus mu squared.
But remember that mu is just the same as the expected value
of X. So what we have here is minus twice the expected value of X,
squared, plus the expected value of X, squared, and that
leaves us with just minus the expected value of X, squared.
So the variance is the expected value of X squared minus the
square of the expected value of X.
So we will now move in the next segment into a few
examples of variance calculations.
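
The formulas on the slide are not reproduced in the transcript. Assuming they are the standard ones, the two facts of this segment can be written as follows, with $\mu = E[X]$:

$$
\operatorname{var}(aX + b) = a^2 \operatorname{var}(X)
$$

$$
\begin{aligned}
\operatorname{var}(X) &= E\big[(X - \mu)^2\big] = E\big[X^2 - 2\mu X + \mu^2\big] \\
&= E[X^2] - 2\mu E[X] + \mu^2 \\
&= E[X^2] - 2\,(E[X])^2 + (E[X])^2 \\
&= E[X^2] - (E[X])^2 .
\end{aligned}
$$
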

![](C:/Users/qp/Pictures/Screenshots/2. Variance - 1.png)


# 3. Exercise: Variance calculation

![](C:/Users/qp/Pictures/Screenshots/3. Exercise Variance calculation.png)
![](C:/Users/qp/Pictures/20220925_205426.jpg)


# 4. Exercise: Variance properties

![](C:/Users/qp/Pictures/Screenshots/4. Exercise Variance properties.png)
![Think; do not just do, but think](C:/Users/qp/Pictures/20220925_205101.jpg)
[][I actually proved this before, but somehow lost that experience, and was just guessing and writing ()]


# 5. Variance of the Bernoulli and the uniform

![Re-study it, really study it](C:/Users/qp/Pictures/Screenshots/5. Variance of the Bernoulli and the uniform - 1.png)
![](C:/Users/qp/Pictures/Screenshots/5. Variance of the Bernoulli and the uniform - 2.png)
In this segment, we will go through the calculation of the
variances of some familiar random variables, starting
with the simplest one that we know, which is the Bernoulli
random variable.
So let X take values 0 or 1, and it takes a value of 1 with
probability p.
We have already calculated the expected value of X, and we
know that it is equal to p.
Let us now compute its variance.
One way of proceeding is to use the definition and then
the expected value rule.
So if we now apply the expected value rule, we need
the summation over all possible values of X. There
are two values--
x equal to 1 or x equal to 0.
The contribution when X is equal to 1 is 1 minus the
expected value, which is p, all of this squared.
And the value of 1 is taken with probability p.
There is another contribution to this sum when little x is
equal to 0.
And that contribution is going to be 0 minus p, all of this
squared, times the probability of 0, which is 1 minus p.
And now we carry out some algebra.
We expand the square here, 1 minus 2p plus p squared.
And after we multiply with this factor of p, we obtain p
minus 2p squared plus p to the third power.
And then from here we have a factor of p squared times 1, p
squared times minus p.
That gives us a minus p cubed.
Then we notice that the p cubed term cancels out with the minus p cubed term.
And p minus 2p squared plus p squared leaves us
with p minus p squared.
And we factor this as p times 1 minus p.
An alternative calculation uses the formula that we
provided a little earlier.
Let's see how this will go.
We have the following observation.
The random variable X squared and the random variable X--
they are one and the same.
When X is 0, X squared is also 0.
When X is 1, X squared is also 1.
So as random variables, these two random variables are equal
in the case where X is a Bernoulli.
So the expected value of X squared is just the expected value of X.
And what we have here is the expected value of X minus
the square of the expected value of X.
And this is p minus p squared, which is the same answer as we
got before--
p times 1 minus p.
And we see that the calculations and the algebra
involved using this formula were a little simpler than
they were before.
Now the form of the variance of the Bernoulli random
variable has an interesting dependence on p.
It's instructive to plot it as a function of p.
So this is a plot of the variance of the Bernoulli as a
function of p, as p ranges between 0 and 1.
p times 1 minus p is a parabola.
And it's a parabola that is 0 when p is either 0 or 1.
And it has this particular shape, and the peak of this
parabola occurs when p is equal to 1/2, in which case
the variance is 1/4.
In some sense, the variance is a measure of the amount of
uncertainty in a random variable, a measure of the
amount of randomness.
A coin is most random if it is fair, that is, when
p is equal to 1/2.
And in this case, the variance confirms this intuition.
The variance of a coin flip is biggest if that coin is fair.
On the other hand, in the extreme cases
where p equals 0--
so the coin always results in tails, or if p equals to 1 so
that the coin always results in heads-- in those cases, we
do not have any randomness.
And the variance,
correspondingly, is equal to 0.
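
As a quick check of the Bernoulli calculation, the sketch below evaluates the definition on a grid of values of p and compares it with p(1 - p). The function name `bernoulli_var` and the grid spacing are choices made here, not part of the lecture.

```{r}
# Variance of a Bernoulli(p) from the definition, via the expected value rule:
# (1 - p)^2 with probability p, plus (0 - p)^2 with probability 1 - p.
bernoulli_var <- function(p) {
  (1 - p)^2 * p + (0 - p)^2 * (1 - p)
}

p_grid <- seq(0, 1, by = 0.01)
v <- bernoulli_var(p_grid)

all.equal(v, p_grid * (1 - p_grid))  # agrees with the closed form p(1 - p)
p_grid[which.max(v)]                  # location of the peak on this grid
max(v)                                # height of the peak

# Plot of the parabola described in the lecture.
plot(p_grid, v, type = "l", xlab = "p", ylab = "var(X)")
```
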
Let us now calculate the variance of a
uniform random variable.
Let us start with a simple case where the range of the
uniform random variable starts at 0 and extends up to some n.
So there is a total of n plus 1 possible values, each one of
them having the same probability--
1 over n plus 1.
We calculate the variance using the alternative formula.
And let us start with the first term.
What is it?
We use the expected value rule, and we argue that with
probability 1 over n plus 1, the random variable X squared
takes the value 0 squared, with the same probability,
takes the value 1 squared.
With the same probability, it takes the value 2 squared, and
so on, all of the way up to n squared.
And then there's the next term.
The expected value of the uniform is the midpoint of the
distribution by symmetry.
So it's n over 2, and we take the square of that.
Now to make progress here, we need to evaluate this sum.
Fortunately, this has been done by others.
And it turns out to be equal to 1 over 6, times n, times n plus 1,
times 2n plus 1.
This formula can be proved by induction, but we will just
take it for granted.
Using this formula, and after a little bit of simple algebra
and after we simplify, we obtain a final answer, which
is of the form 1 over 12, times n, times n plus 2.
How about the variance of a more general
uniform random variable?
So suppose we have a uniform random variable whose range is
from a to b.
How is this PMF related to the one that we already studied?
First, let us assume that n is chosen so that it is
equal to b minus a.
So in that case, the difference between the last
and the first value of the random variable is the same as
the difference between the last and the first possible
value in this PMF.
So both PMFs have the same number of terms.
They have exactly the same shape.
The only difference is that the second PMF is shifted away
from 0, and it starts at a instead of starting at 0.
Now what does shifting a PMF correspond to?
It essentially amounts to taking a random variable--
let's say, with this PMF--
and adding a constant to that random variable.
So if the original random variable takes the value of 0,
the new random variable takes the value of a.
If the original takes the value of 1, this new random
variable takes the value of a plus 1, and so on.
So this shifted PMF is the PMF associated to a random
variable equal to the original random
variable plus a constant.
But we know that adding a constant does
not change the variance.
Therefore, the variance of this PMF is going to be the
same as the variance of the original PMF, as long as we
make the correspondence that n is equal to b minus a.
So doing this substitution in the formula that we derived
earlier, we obtain 1 over 12, times b minus a, times b
minus a plus 2.
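
The closed form can be compared against a direct computation. This is a sketch only: the helper names and the endpoints 0, 10, 3, and 13 are arbitrary choices made for illustration.

```{r}
# Variance of a discrete uniform on {a, a+1, ..., b}, using var(X) = E[X^2] - (E[X])^2.
uniform_var <- function(a, b) {
  values <- a:b
  prob <- 1 / length(values)      # each value has probability 1 / (b - a + 1)
  sum(values^2 * prob) - sum(values * prob)^2
}

# Closed form from the lecture: (1/12) (b - a) (b - a + 2).
uniform_var_formula <- function(a, b) (b - a) * (b - a + 2) / 12

all.equal(uniform_var(0, 10), uniform_var_formula(0, 10))  # range starting at 0, n = 10
all.equal(uniform_var(3, 13), uniform_var_formula(3, 13))  # shifted range, same n
all.equal(uniform_var(3, 13), uniform_var(0, 10))          # shifting does not change the variance
```
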


# 6. Exercise: Variance of the uniform

![](C:/Users/qp/Pictures/Screenshots/6. Exercise Variance of the uniform - 1.png)
![](C:/Users/qp/Pictures/Screenshots/6. Exercise Variance of the uniform - 2.png)
![](C:/Users/qp/Pictures/Screenshots/6. Exercise Variance of the uniform - 3.png)
![](C:/Users/qp/Pictures/20220925_220642.jpg)
![](C:/Users/qp/Pictures/20220925_220718.jpg)


# 7. Conditional PMFs and expectations given an event

![](C:/Users/qp/Pictures/Screenshots/7. Conditional PMFs and expectations given an event - 1.png)
![](C:/Users/qp/Pictures/Screenshots/7. Conditional PMFs and expectations given an event - 2.png)
We now move to a new topic--
conditioning.
[][Every probabilistic concept or probabilistic fact has a conditional counterpart.]
As we have seen before, we can start with a probabilistic
model and some initial probabilities.
But then if we are told that a certain event has
occurred, we can revise our model and consider conditional
probabilities that take into account the available
information.
But as a consequence, the probabilities associated with
any given random variable will also have to be revised.
So a PMF will have to be changed to a conditional PMF.
Let us see what is involved.
Consider a random variable X with some given PMF, whose
values, of course, sum to 1, as must be true
for any valid PMF.
We are then told that a certain
event, A, has occurred.
In that case, the event that X is equal to--
little x--
will now have a conditional probability of this form.
We will use this notation here to denote the conditional
probability that the random variable takes the
value little x.
Notice that the subscripts are used to indicate what we're
talking about.
In this case, we are talking about the random variable X in
a model where event A is known to have occurred.
Of course, for this conditional probability to be
well defined, we will have to assume that the probability of
A is positive.
This conditional PMF is like an ordinary PMF, except that
it applies to a new or revised conditional model.
As such, its entries must also sum to 1.
Now the random variable X has a certain mean, expected
value, which is defined the usual way.
In the conditional model, the random variable X will also
have a mean.
It is called the conditional mean or the conditional
expectation.
And it is defined the same way as in the original case,
except that now the calculation involves the
conditional probabilities, or the conditional PMF.
Finally, as we discussed some time ago, a conditional
probability model is just another probability model,
except that it applies to a new situation.
So any fact about probability models--
any theorem that we derive--
must remain true in the conditional model as well.
As an example, the expected value rule will have to remain
true in the conditional model, except that, of course, in the
conditional model, we will have to use the conditional
probabilities instead of the original probabilities.
So to summarize, conditional models and conditional PMFs
are just like ordinary models and ordinary PMFs, except that
probabilities are replaced throughout by conditional
probabilities.
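
The notation on the slide is not reproduced in the transcript. Assuming it is the usual one, the definitions described so far read as follows, for an event $A$ with $P(A) > 0$:

$$
\begin{aligned}
p_{X \mid A}(x) &= P(X = x \mid A), \qquad \sum_x p_{X \mid A}(x) = 1, \\
E[X \mid A] &= \sum_x x \, p_{X \mid A}(x), \\
E[g(X) \mid A] &= \sum_x g(x) \, p_{X \mid A}(x).
\end{aligned}
$$

The last line is the expected value rule applied in the conditional model.
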
Let us now look at an example.
Consider a random variable, which in this case, is
uniform, takes values from 1 up to 4.
So each one of the possible values has
probability 1 over 4.
For this random variable, we can calculate the expected
value, which by symmetry is going to be the midpoint.
So it is equal to 2 and 1/2.
We can also calculate the variance.
And here we can apply the formula that we
have derived earlier--
1 over 12 times b minus a times b minus a plus 2.
And in this case, it's 1 over 12 times b minus a is 4 minus
1, which is 3.
And the next term is 5.
And after we simplify, this is 5 over 4.
Suppose that now somebody tells us that event A has
occurred, where event A is that the random variable X
takes values in the range 2, 3, 4.
What happens now?
What is the conditional PMF?
In the conditional model, we are told that the value of 1
did not occur, so this probability is going to be 0.
The other three values are still possible.
What are their conditional probabilities?
Well, these three values had equal probabilities in the
original model, so they should have equal probabilities in
the conditional model as well.
And in order for probabilities to sum to 1, of course, these
probabilities will have to be 1/3.
So this is the conditional PMF of our random variable, given
this new or additional information about the outcome.
The expected value of the random variable X in the
conditional universe--
that is, the conditional expectation--
is just the ordinary expectation but applied to the
conditional model.
In this conditional model, by symmetry, the expected value
will have to be 3, the midpoint of the distribution.
And we can also calculate the conditional variance.
This is a notation that we have not yet defined, but at
this point, it should be self-explanatory.
It is just the variance of X but calculated in the
conditional model using conditional probabilities.
We can calculate this variance using once more the formula
for the variance of a uniform distribution, but we can also
do it directly.
So the variance is the expected value of the squared
distance from the mean.
So with probability 1/3, our random variable will take a
value of 4, which is one unit apart from the mean, or more
explicitly, we have this term.
With probability 1/3, our random variable
takes a value of 3.
And with probability 1/3, our random variable takes the
value of 2.
This term is 0.
This is 1 times 1/3.
From here we get another 1 times 1/3.
So adding up, we obtain that the variance is 2/3.
Notice that the variance in the conditional model is
smaller than the variance that we had in the original model.
And this makes sense.
In the conditional model, there is less uncertainty than
there used to be in the original model.
And this translates into a reduction in the variance.
To conclude, there is nothing really different when we deal
with conditional PMFs, conditional expectations, and
conditional variances.
They are just like the ordinary PMFs, expectations,
and variances, except that we have to use the conditional
probabilities throughout instead of the original
probabilities.
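
The numbers in this example can be reproduced with a few lines of R. This is a sketch; the variable and helper names are chosen here for illustration.

```{r}
x  <- 1:4
px <- rep(1 / 4, 4)                 # uniform PMF on {1, 2, 3, 4}

in_A <- x %in% c(2, 3, 4)           # event A = {X takes a value in {2, 3, 4}}
p_A  <- sum(px[in_A])               # P(A)

# Conditional PMF: zero outside A, original probabilities rescaled by P(A) inside A.
px_given_A <- ifelse(in_A, px / p_A, 0)

pmf_mean <- function(values, probs) sum(values * probs)
pmf_var  <- function(values, probs) sum((values - pmf_mean(values, probs))^2 * probs)

sum(px_given_A)                     # a valid PMF sums to 1
pmf_mean(x, px)                     # unconditional mean
pmf_var(x, px)                      # unconditional variance
pmf_mean(x, px_given_A)             # conditional mean given A
pmf_var(x, px_given_A)              # conditional variance given A
```

The conditional variance should come out smaller than the unconditional one, matching the remark about reduced uncertainty.
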


# 8. Exercise: Conditional variance

![](C:/Users/qp/Pictures/Screenshots/8. Exercise Conditional variance - 1.png)
![](C:/Users/qp/Pictures/Screenshots/8. Exercise Conditional variance - 2.png)
![](C:/Users/qp/Pictures/20220925_222858.jpg)


# 9. Total expectation theorem

![](C:/Users/qp/Pictures/Screenshots/9. Total expectation theorem - 1.png)
![](C:/Users/qp/Pictures/Screenshots/9. Total expectation theorem - 2.png)
An important reason why conditional probabilities are
very useful is that they allow us to divide and conquer.
They allow us to split complicated probability models
into simpler submodels that we can then
analyze one at a time.
Let me remind you of the Total Probability Theorem that has
this particular flavor.
We divide our sample space into three disjoint events--
A1, A2, and A3.
And these events form a partition of the sample space,
that is, they exhaust all possibilities.
They correspond to three alternative scenarios, one of
which is going to occur.
And then we may be interested in a certain event B. That
event B may occur under either scenario.
And the Total Probability Theorem tells us that we can
calculate the probability of event B by considering the
probability that it occurs under any given scenario and
weigh those probabilities according to the probabilities
of the different scenarios.
Now, let us bring random variables into the picture.
Let us fix a particular value--
little x--
and let the event B be the event that the random variable
takes on this particular value.
Let us now translate the Total Probability
Theorem to this situation.
First, the picture will look slightly different.
Our event B has been replaced by the particular event that
we're now considering.
Now, what is this probability?
The probability that event B occurs, having fixed the
particular choice of little x, is the value of the PMF at that
particular x.
How about this probability here?
This is the probability that the random variable, capital
X, takes on the value little x--
that's what a PMF is--
but in the conditional universe.
So we're dealing with a conditional PMF.
And so on with the other terms.
So this equation here is just the usual Total Probability
Theorem but translated into PMF notation.
Now this version of the Total Probability Theorem, of
course, is true for all values of little x.
This means that we can now multiply both sides of this
equation by x and then sum over all
possible choices of x.
We recognize that here we have the expected value of the
random variable X.
Now, we do the same thing to the right hand side.
We multiply by x.
And then we sum over all possible values of x.
This is going to be the first term.
And then we will have similar terms.
Now, what do we have here?
This expression is just the conditional expectation of the
random variable X under the scenario that
event A1 has occurred.
So what we have established is this particular formula, which
is called the Total Expectation Theorem.
It tells us that the expected value of a random variable can
be calculated by considering different scenarios,
finding the expected value under each of the possible
scenarios, and weighing these expected values
according to the respective probabilities of the scenarios.
The picture is like this.
Under each scenario, the random variable X has a
certain conditional expectation.
We take all these into account.
We weigh them according to their corresponding
probabilities.
And we add them up to find the expected value of X.
So we can divide and conquer.
We can replace a possibly complicated calculation of an
expected value by hopefully simpler calculations under
each one of the possible scenarios.
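
The equations referred to as "this equation" and "this formula" are on the slide. Assuming the standard notation and the three-event partition $A_1, A_2, A_3$ used above, they would read:

$$
\begin{aligned}
p_X(x) &= P(A_1)\, p_{X \mid A_1}(x) + P(A_2)\, p_{X \mid A_2}(x) + P(A_3)\, p_{X \mid A_3}(x), \\
\sum_x x \, p_X(x) &= \sum_{i=1}^{3} P(A_i) \sum_x x \, p_{X \mid A_i}(x), \\
E[X] &= P(A_1)\, E[X \mid A_1] + P(A_2)\, E[X \mid A_2] + P(A_3)\, E[X \mid A_3].
\end{aligned}
$$

The second line is the first one multiplied by $x$ and summed over all $x$. The third line is the Total Expectation Theorem.
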
Let me illustrate the idea by a simple example.
Let us consider this PMF, and let us try to calculate the
expected value of the associated random variable.
One way to divide and conquer is to define an event, A1,
which is that our random variable takes values in this
set, and another event, A2, which is that the random
variable takes values in that set.
Let us now apply the Total Expectation Theorem.
Let us calculate all the terms that are required.
First, we find the probabilities of
the different scenarios.
The probability of event A1 is 1/9 plus 1/9 plus
1/9 which is 1/3.
And the probability of event A2 is 2/9 plus 2/9 plus 2/9
which adds up to 2/3.
How about conditional expectations?
In a universe where event A1 has occurred, only these
three values are possible.
They had equal probabilities, so in the conditional model,
they will also have equal probabilities.
So we will have a uniform distribution over
the set {0, 1, 2}.
By symmetry, the expected value is going
to be in the middle.
So this expected value is equal to 1.
And by a similar argument, the expected value of X under the
second scenario is going to be the midpoint of this range,
which is equal to 7.
And now we can apply the Total Expectation Theorem and write
that the expected value of X is equal to the probability of
the first scenario times the expected value under the first
scenario plus the probability of the second scenario times
the expected value under the second scenario.
That is, 1/3 times 1 plus 2/3 times 7, which is equal to 5.
In this case, by breaking down the problem in these two
subcases, the calculations that were required were
somewhat simpler than if you were to proceed directly.
Of course, this is a rather simple example.
But as we go on with this course, we will apply the
Total Expectation Theorem in much more interesting and
complicated situations.
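
The example can be replayed in R. The PMF itself is only on the slide, so the support below is an assumption: the values {0, 1, 2} with probability 1/9 each are stated in the transcript, while {6, 7, 8} with probability 2/9 each is inferred from the stated midpoint of 7.

```{r}
# Assumed support: {0, 1, 2} with probability 1/9 each, {6, 7, 8} with probability 2/9 each.
x  <- c(0, 1, 2, 6, 7, 8)
px <- c(1, 1, 1, 2, 2, 2) / 9

in_A1 <- x %in% c(0, 1, 2)   # scenario A1
in_A2 <- !in_A1              # scenario A2

p_A1 <- sum(px[in_A1])
p_A2 <- sum(px[in_A2])

# Conditional expectations: weight each value by its conditional probability.
e_given_A1 <- sum(x[in_A1] * px[in_A1] / p_A1)
e_given_A2 <- sum(x[in_A2] * px[in_A2] / p_A2)

total  <- p_A1 * e_given_A1 + p_A2 * e_given_A2   # Total Expectation Theorem
direct <- sum(x * px)                              # definition of E[X]
all.equal(total, direct)
```
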


# 10. Geometric PMF, memorylessness, and expectation

![](C:/Users/qp/Pictures/Screenshots/10. Geometric PMF, memorylessness, and expectation - 1.png)
![](C:/Users/qp/Pictures/Screenshots/10. Geometric PMF, memorylessness, and expectation - 2.png)
![](C:/Users/qp/Pictures/Screenshots/10. Geometric PMF, memorylessness, and expectation - 3.png)
![](C:/Users/qp/Pictures/Screenshots/10. Geometric PMF, memorylessness, and expectation - 4.png)
![](C:/Users/qp/Pictures/Screenshots/10. Geometric PMF, memorylessness, and expectation - 5.png)
![](C:/Users/qp/Pictures/Screenshots/10. Geometric PMF, memorylessness, and expectation - 6.png)
![](C:/Users/qp/Pictures/Screenshots/10. Geometric PMF, memorylessness, and expectation - 7.png)
![](C:/Users/qp/Pictures/Screenshots/10. Geometric PMF, memorylessness, and expectation - 8.png)
[][*  ==============================================================================================================  *]
Thank you @dennisp94. I have thought about your idea for a day. With your approach, the equation should be written as:

    E[X] = First toss H_1 + More than 1 toss Given First toss T_1
    = p*1 + More than 1 toss Given First toss T_1
    = p*1 + (1-p)*(1 + E[X-1|X>1]) ---E[X-1|X>1] is the expected number of remaining tosses until the first H, given that we have a T_1
    = p*1 + (1-p)*1 + (1-p)*E[X]
    = 1 + E[X] - p*E[X]

thus E[X] = 1/p

posted less than a minute ago by john_hhu2020 
[][*  ==============================================================================================================  *]
I had the same doubt when watching the video, but I think I sort of understand now.

There are two steps here:

Step 1: E[X] = 1 + E[X] - 1 = 1 + E[X -1]

Here, it is a simple transformation using the linearity of expectation, i.e. E[aX+b] = aE[X] +b, transforming the problem of solving E[X] to solving E[X-1]. The two scenarios have not been used yet.

Step 2:

E[X -1] is broken down into two scenarios, whether the first toss is Heads or Tails. This is where the probabilities p and (1-p) come in, using total expectation theorem.

E[X -1] = p * E[X-1 when first toss is Heads] + (1-p) * E[X-1 when first toss is Tails] = p * E[X-1 | X=1] + (1-p) * E[X-1 | X>1]

I am no expert on programming, but this derivation reminds me of recursion.
[][*  ==============================================================================================================  *]

We will now work with a geometric random variable and
put to use our understanding of conditional PMFs and
conditional expectations.
Remember that a geometric random variable corresponds to
the number of independent coin tosses until
the first head occurs.
And here p is a parameter that describes the coin.
It is the probability of heads at each coin toss.
We have already seen the formula for the geometric PMF
and the corresponding plot.
We will now add one very important property which is
usually called Memorylessness.
Ultimately, this property has to do with the fact that
independent coin tosses do not have any memory.
Past coin tosses do not affect future coin tosses.
So consider a coin-tossing experiment with independent
tosses and let X be the number of tosses
until the first heads.
And X is a geometric with parameter p.
Suppose that you show up a little after the experiment
has started.
And you're told that there was so far just one coin toss.
And that this coin toss resulted in tails.
Now you have to take over and carry out the remaining tosses
until heads are observed.
What should your model be about the future?
Well, you will be making independent coin tosses until
the first heads.
So the number of such tosses will be a random variable,
which is geometric with parameter p.
So this duration--
as far as you are concerned--
is geometric with parameter p.
Therefore, the number of remaining coin tosses starting
from here--
given that the first toss was tails--
has the same geometric distribution as the original
random variable X.
This is the Memorylessness property.
Now, since X is the total number of coin tosses and
since your coin tosses were all of them except for the
first one, the random variable that you are concerned
with is X minus 1.
And so the geometric distribution that you are
seeing here is the conditional distribution of X minus 1
given that the first toss resulted in tails, which is
the same as the event that X is strictly larger than 1.
So the statement that we have been making is the following
in more mathematical language--
that conditioned on X being larger than 1, the random
variable X minus 1, which is the remaining number of coin
tosses, has a geometric distribution with parameter p.
Let us now give a more precise,
mathematical argument.
But first, for a special case.
Let us look at the conditional probabilities for
the random variable X minus 1.
And calculate, for example, the conditional probability
that X minus 1 is equal to 3, given that X is larger than 1.
Which is the same as saying that the first
toss resulted in tails.
Now, given that the first toss resulted in tails,
this is the probability that you will need three more
tosses until you observe heads.
Needing three more tosses until you observe heads is the
event that you had tails in the second toss, tails in the
third toss, and heads in the fourth toss.
And all that is conditioned on the first toss
having resulted in tails.
However, the different coin tosses are independent.
So the conditional probabilities, given the event
that the first toss was tails should be the same as the
unconditional probabilities.
The first toss does not change our beliefs about the
probabilities associated with the remaining tosses.
Now, this unconditional
probability is easy to calculate.
It is 1 minus p, all squared--
because we have two tails in a row--
times p.
Now, we observe that this quantity here is the
probability that a geometric random variable takes the
value of three.
Here what have we calculated?
We have calculated the PMF of the random variable X minus 1
in a conditional universe where X is larger than 1.
And we evaluated it for a value of 3.
The probability that our random variable X minus 1
takes the value of 3.
So what we have shown is that this conditional PMF is the
same as the unconditional PMF.
Now, there is nothing special about the number 3.
You can generalize this argument and establish that
the conditional probability of X minus 1 given that X is
strictly larger than one, for any particular k, is the same
as the corresponding probability for the random
variable X, which is given by the geometric PMF.
Finally, there is nothing special about the value of 1
that we're using here.
In fact, we can generalize and argue as follows--
suppose that I tell you that X is strictly larger than n.
That is, the first n tosses resulted in tails.
Once more, these past tosses were wasted but have no effect
on the future.
So the conditional PMF of the remaining number of tosses
should be, again, the same.
Therefore, the statement we're making is that this geometric
PMF will also be the PMF of X minus n, given that X is
strictly larger than n, and this will be true no matter
what argument we plug into the PMF.
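
Memorylessness can be checked numerically. In the sketch below, the values of `p`, `n`, and the range of `k` are arbitrary choices, and `geom_pmf` is a helper written here in the lecture's convention (X counts tosses up to and including the first heads).

```{r}
p <- 0.3   # hypothetical probability of heads
n <- 4     # condition on the first n tosses being tails, i.e. X > n
k <- 1:20  # values of the remaining number of tosses to check

# Geometric PMF in the lecture's convention: P(X = k) = (1 - p)^(k - 1) * p, k = 1, 2, ...
geom_pmf <- function(k, p) (1 - p)^(k - 1) * p

# P(X - n = k | X > n) = P(X = n + k) / P(X > n), where P(X > n) = (1 - p)^n.
cond_pmf <- geom_pmf(n + k, p) / (1 - p)^n

all.equal(cond_pmf, geom_pmf(k, p))

# Base R's dgeom() counts failures before the first success, so the argument is shifted by one.
all.equal(geom_pmf(k, p), dgeom(k - 1, prob = p))
```
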
We will now exploit the Memorylessness property of the
geometric PMF and use it together with the total
expectation theorem to calculate the mean or
expectation of the geometric PMF.
If we wanted to calculate the expected value of the
geometric using the definition of the expectation, we would
have to calculate this infinite sum here, which is
quite difficult.
Instead, we're going to use a certain trick.
The trick is the following--
to break down the expected value calculation into two
different scenarios.
Under one scenario we obtain heads in the first toss.
And in that case the random variable X--
the number of tosses until the first heads--
is equal to 1.
And this scenario occurs with probability p.
And we have another scenario with probability 1 minus p
where we obtain tails in the first toss.
And in that case, our random variable is
strictly larger than 1.
Now, the expected value of X consists of two pieces.
We have a first toss no matter what.
And then we have the number of remaining tosses,
which is X minus 1.
So this is true by linearity of expectations.
Now, the expected value of X minus 1 consists of two pieces
using the total expectation theorem.
The probability of the first scenario times the expected
value of X minus 1 given that X is equal to 1,
plus 1 minus p--
the probability of the second scenario--
times the expected value of X minus 1 given that X
is bigger than 1.
Now, this term here is 0.
Why?
If I tell you that X is equal to 1, then you're certain
that X minus 1 is equal to 0.
So this term gives a 0 contribution.
How about the next term?
We have a 1 minus p here times this expected value.
Now this random variable, conditioned on this event, has
the same distribution as an ordinary, unconditioned
geometric random variable.
So this expectation here must be the same as the expectation
of an ordinary, unconditioned, geometric random variable.
And this gives us an equality.
Both sides involve the expected value of X. But we
can solve this equation for the expected value.
And we obtain the end result that the expected
value is 1 over p.
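
Written out, and assuming the slide uses the standard notation, the steps just described are:

$$
\begin{aligned}
E[X] &= 1 + E[X - 1] \\
&= 1 + p \, E[X - 1 \mid X = 1] + (1 - p) \, E[X - 1 \mid X > 1] \\
&= 1 + p \cdot 0 + (1 - p) \, E[X],
\end{aligned}
$$

so that $p \, E[X] = 1$ and $E[X] = 1/p$.
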
By the way, this answer makes intuitive sense.
If p is small, this means that the odds of
seeing heads are small.
Then in that case, we need to wait longer and longer until
we see heads for the first time.
Setting aside the specific form of the answer that we
found, what we have just done actually illustrates that
fairly difficult calculations can become very simple if one
breaks down a model or a problem in a clever way.
This is going to be a recurring theme throughout
this class.
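
For comparison, the infinite sum from the definition can be approximated by truncation. This is a rough numerical sketch: the value of `p` and the truncation point are arbitrary choices.

```{r}
p <- 0.2     # hypothetical probability of heads
k <- 1:2000  # truncation point; the neglected tail is negligible for this p

# Definition of the expectation: sum over k of k * p_X(k), with p_X(k) = (1 - p)^(k - 1) * p.
e_series <- sum(k * (1 - p)^(k - 1) * p)

all.equal(e_series, 1 / p)

# 1/p also satisfies the equation from the lecture: E[X] = 1 + (1 - p) * E[X].
all.equal(1 / p, 1 + (1 - p) * (1 / p))
```
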


# 11. Exercise: Total expectation calculation

![](C:/Users/qp/Pictures/Screenshots/11. Exercise Total expectation calculation - 1.png)
![](C:/Users/qp/Pictures/Screenshots/11. Exercise Total expectation calculation - 2.png)


# 12. Exercise: Memorylessness of the geometric

![](C:/Users/qp/Pictures/Screenshots/12. Exercise Memorylessness of the geometric - 1.png)
![](C:/Users/qp/Pictures/Screenshots/12. Exercise Memorylessness of the geometric - 2.png)
![](C:/Users/qp/Pictures/Screenshots/12. Exercise Memorylessness of the geometric - 3.png)


# 13. Joint PMFs and the expected value rule

![](C:/Users/qp/Pictures/Screenshots/13. Joint PMFs and the expected value rule - 1.png)
![](C:/Users/qp/Pictures/Screenshots/13. Joint PMFs and the expected value rule - 2.png)
![](C:/Users/qp/Pictures/Screenshots/13. Joint PMFs and the expected value rule - 3.png)
By this point, we have discussed pretty much
everything that is to be said about individual discrete
random variables.
Now let us move to the case where we're dealing with
multiple discrete random variables simultaneously, and
talk about their distribution.
As we will see, their distribution is characterized
by a so-called joint PMF.
So suppose that we have a probabilistic model, and on
that model we have defined two random variables--
X and Y. And that we have available
their individual PMFs.
These PMFs tell us about one random variable at a time.
This tells us about X, this tells us about Y. But they do
not give us any information about how the two random
variables are related to each other.
For example, if you wish to answer the question of whether
the numerical values of the two random variables happen to
be equal, and what the probability of that event is, you
will not be able to answer it if you only know
the two individual PMFs.
In order to be able to answer a question of this type, we
will need information that tells us what values of X tend
to occur together with what values of Y. And this
information is captured in the so-called joint PMF.
So the joint PMF is nothing but a piece of notation for an
object that's familiar.
This is the probability that when we carry out the
experiment we happen to see random variable X take on a
value, little x.
And simultaneously see that random variable Y takes on a
value, little y.
We indicate this quantity with the notation $p_{X,Y}(x,y)$.
The letter little p stands for a PMF.
The subscripts tell us which random variables
we're talking about.
And finally, this is a function of two arguments.
Depending on what pair (x,y) we're interested in, we're
going to get a different numerical value for this
probability.
As an example of a joint PMF in which the two random
variables take values in a finite set, we might be given
a table of this form.
Using this table, we can answer questions such as the
following--
what is the probability that the random variables X and Y
simultaneously take the values, let us say, 1 and 3?
Then we look up in this table, and we identify that it's this
probability, X takes the value of 1, and Y takes
the value of 3.
And according to this table, the answer would be 2/20.
Now, something to notice about joint PMFs.
When you add over all possible pairs, x and y, this exhausts
all the possibilities.
And therefore, these probabilities should add to 1.
In terms of this table, all of the entries that we have here
should add to 1.
Now, once we have in our hands the joint PMF, we can use it
to find the individual PMFs of the random variables X and Y.
And these individual PMFs are called the marginal PMFs.
How do we find them?
Well, the joint PMF tells us everything there is to be
known about the two random variables, so it should
contain enough information for us to
answer any kind of question.
So for example, if we wish to find the probability that the
random variable X takes the value of 4, we look at all
possible outcomes in which X is equal to 4, and add the
probabilities of these outcomes.
So in this case, it would be 1/20 plus 2/20.
So what we're doing is that if we're interested in a specific
value of X, the probability that X takes on a specific
value, we consider all possible pairs associated with
that fixed x.
That is, we're considering one column of the PMF, and we're
adding the corresponding probabilities.
So to find this entry here, let's say $p_X(3)$, what we need
is to add these terms on that column.
Similarly, we can find the PMF of the random variable Y.
So for example, the probability that the random
variable Y takes on a value of, let's say, 2, can be found
as follows.
You look at the probabilities of all pairs associated with
this specific y, and you add over the x's.
So we fix Y to have a value of 2, and we add over all pairs
in this row.
So in this example, it would be 1/20 plus 3/20, plus 1/20.
Finally, notice that we are able to answer the question
that got us motivated in the first place.
To find the probability that the two random variables take
equal values, we look at all the outcomes for which the two
random variables indeed take the same numerical values.
And we see that it is this event in this diagram, and the
probability of that event is going to be 2/20.
So in general, once we have available the joint PMF of two
random variables, we will be able to answer any questions
regarding probabilities of events that have to do with
these two random variables.
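
The table lookups described above can be reproduced in a few lines of R. The slide's table is not included in these notes, so the matrix below is an assumption: a 4-by-4 layout chosen so that every number quoted in the transcript comes out right, with the remaining entries purely illustrative.

```{r}
# Joint PMF of (X, Y); rows are y = 1..4, columns are x = 1..4.
# ASSUMPTION: entries agree with the numbers quoted in the transcript;
# entries the transcript does not mention are illustrative guesses.
joint <- matrix(c(0, 1, 0, 0,
                  0, 1, 3, 1,
                  2, 4, 1, 2,
                  1, 2, 2, 0) / 20,
                nrow = 4, byrow = TRUE,
                dimnames = list(y = as.character(1:4), x = as.character(1:4)))

sum(joint)              # all entries of a joint PMF add to 1
joint["3", "1"]         # P(X = 1, Y = 3), the 2/20 entry

p_X <- colSums(joint)   # marginal PMF of X: add down each column
p_Y <- rowSums(joint)   # marginal PMF of Y: add along each row
p_X["4"]                # 1/20 + 2/20
p_Y["2"]                # 1/20 + 3/20 + 1/20

sum(diag(joint))        # P(X = Y): add the entries with x equal to y
```
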
How about more than two random variables?
It's just a matter of notation.
For example, we can define the joint PMF of three random
variables, and you can use the same idea for the joint PMF,
let's say, of five, or 10, or n random variables.
Let's just look at the notation for three.
There is a well-defined probability that when we carry
out the experiment X, Y and Z as random variables take on
certain specific values.
So we look at the probability of that particular triple, and
we indicate that probability with this notation.
Once more, the subscripts tell us which random variables
we're talking about.
And the PMF, of course, is going to be a function of this
triple, little x, little y, little z, because each triple
in general should have a different probability.
Of course, probabilities must always add to 1.
So when we consider all triples and we add their
corresponding probabilities, we should get 1.
And finally, once we have the joint PMF, we can again
recover the marginal PMF.
For example, to find the probability that the random
variable X takes on a specific value, little x, we consider
all possible triples in which the random variable indeed
takes that value, little x.
And then we sum over all the possible y's and z's that
could go together with this particular x.
In the same spirit, to find the probability that these two
random variables take on two specific values, we consider
all the possible z's that could go together with this
(x,y) pair.
So this way we're ranging over all outcomes in which X and Y
take on these specific values.
But Z is free to take any value, and so we consider all
those possible values of Z and sum the corresponding
probabilities.
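
The slide notation is not reproduced in these notes. Written out, the three-variable formulas described above would presumably read as follows (the symbols are the standard ones and are assumed to match the slide):

$$
p_{X,Y,Z}(x,y,z) = P(X=x,\ Y=y,\ Z=z), \qquad \sum_x \sum_y \sum_z p_{X,Y,Z}(x,y,z) = 1
$$

$$
p_X(x) = \sum_y \sum_z p_{X,Y,Z}(x,y,z), \qquad p_{X,Y}(x,y) = \sum_z p_{X,Y,Z}(x,y,z)
$$
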
Finally, we can talk about functions of
multiple random variables.
Suppose that we have two random variables, X and Y, and
that we define a function of them.
So this function is, of course, a random variable.
And we can find the PMF of this random variable if we
know the joint PMF of X and Y.
So the PMF, which is the probability that the random
variable takes on a specific numerical value, that's the
probability that the function of X and Y takes on a specific
numerical value.
And we can find this probability by adding the
probabilities of all (x,y) pairs.
Which (x,y) pairs?
Those (x,y) pairs for which the value of Z would be equal
to this particular number, little z, that we care about.
So we collect essentially all possible outcomes that make
this event happen, and we add the probabilities of all
those outcomes.
Finally, similarly to the case where we have a single random
variable and function of it, we now can talk about expected
values of functions of two random variables, and there is
an expected value rule that parallels the expected value
rule that we had developed for the case of a
function of this form.
The form that the expected value rule takes is similar,
and it's quite natural.
The interpretation is as follows.
With this probability, a specific
(x,y) pair will occur.
And when that occurs, the value of our random variable
is a certain number.
And the combination of these two terms gives us a
contribution to the expected value.
Now, we consider all possible (x,y) pairs that may occur,
and we sum over all these (x,y) pairs.
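
As a rough numerical sketch of the two ideas above, the PMF of $Z = g(X,Y)$ and the expected value rule $E[g(X,Y)] = \sum_x \sum_y g(x,y)\, p_{X,Y}(x,y)$, the chunk below uses an assumed joint table (consistent with the numbers quoted earlier in the transcript, otherwise illustrative) and a hypothetical choice $g(x,y) = x + y$.

```{r}
# ASSUMPTION: illustrative joint PMF; rows are y = 1..4, columns are x = 1..4.
joint <- matrix(c(0, 1, 0, 0,
                  0, 1, 3, 1,
                  2, 4, 1, 2,
                  1, 2, 2, 0) / 20,
                nrow = 4, byrow = TRUE)
x_vals <- 1:4
y_vals <- 1:4

g <- function(x, y) x + y   # hypothetical choice of g, for illustration only

# g_vals[i, j] = g(x_j, y_i), laid out like `joint` (rows y, columns x)
g_vals <- outer(y_vals, x_vals, function(y, x) g(x, y))

# PMF of Z = g(X, Y): add the probabilities of all (x, y) pairs with g(x, y) = z
p_Z <- tapply(as.vector(joint), as.vector(g_vals), sum)
p_Z

# Expected value rule: weight each g(x, y) by the joint PMF and add
E_g <- sum(g_vals * joint)

# The same expectation computed from the PMF of Z
E_Z <- sum(as.numeric(names(p_Z)) * p_Z)
c(E_g, E_Z)
```

The expected value rule gives the answer without first constructing the PMF of $Z$; the second computation is only there as a cross-check.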


# 14. Exercise: Joint PMF calculation

![](C:/Users/qp/Pictures/Screenshots/14. Exercise Joint PMF calculation - 1.png)
![](C:/Users/qp/Pictures/Screenshots/14. Exercise Joint PMF calculation - 2.png)


# 15. Exercise: Expected value rule

![](C:/Users/qp/Pictures/Screenshots/15. Exercise Expected value rule - 1.png)
![](C:/Users/qp/Pictures/Screenshots/15. Exercise Expected value rule - 2.png)
![](C:/Users/qp/Pictures/Screenshots/15. Exercise Expected value rule - 3.png)


# 16. Linearity of expectations and the mean of the binomial

![](C:/Users/qp/Pictures/Screenshots/16. Linearity of expectations and the mean of the binomial - 1.png)
![](C:/Users/qp/Pictures/Screenshots/16. Linearity of expectations and the mean of the binomial - 2.png)
![](C:/Users/qp/Pictures/Screenshots/16. Linearity of expectations and the mean of the binomial - 3.png)
Let us now revisit the subject of expectations and develop an
important linearity property for the case where we're
dealing with multiple random variables.
We already have a linearity property.
If we have a linear function of a single random variable,
then expectations behave in a linear fashion.
But now, if we have multiple random variables, we have this
additional property.
The expected value of the sum of two random variables is
equal to the sum of their expectations.
Let us go through the derivation of this very
important fact because it is a nice exercise in applying the
expected value rule and also manipulating
PMFs and joint PMFs.
We're dealing with the expected value of a function
of two random variables.
Which function?
If we write it this way, we are dealing with the function
g, which is just the sum of its two entries.
So now we can continue with the application of the
expected value rule.
And we obtain the sum over all possible x, y pairs.
Here, we need to write g(x,y).
But in our case, the function we're dealing with
is just x plus y.
And then we weigh, according to the entries
of the joint PMF.
So this is just an application of the expected value rule to
this particular function.
Now let us take this sum and break it into two pieces--
one involving only the x-term, and another piece involving
only the y-term.
Now, if we look at this double summation, look
at the inner sum.
It's a sum over y's.
While we're adding over y's, the value of x remains fixed.
So x is a constant, as far as the sum is concerned.
So x can be pulled outside this summation.
Let us just continue with this term, the first one, and see
that a simplification happens.
This quantity here is the sum of the probabilities of the
different y's that can go together with a particular x.
So it is just equal to the probability of
that particular x.
It's the marginal PMF.
If we carry out a similar step for the second term, we will
obtain the sum over y's.
It's just a symmetrical argument.
And at this point we recognize that what we have in front of
us is just the expected value of X, this is the first term,
plus the expected value of Y. So this completes the
derivation of this important linearity property.
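
The slide with the derivation is not part of these notes; the steps narrated above can be written out as follows (a reconstruction from the spoken argument, using $g(x,y) = x + y$):

$$
\begin{aligned}
E[X+Y] &= \sum_x \sum_y (x+y)\, p_{X,Y}(x,y) \\
       &= \sum_x \sum_y x\, p_{X,Y}(x,y) + \sum_x \sum_y y\, p_{X,Y}(x,y) \\
       &= \sum_x x \sum_y p_{X,Y}(x,y) + \sum_y y \sum_x p_{X,Y}(x,y) \\
       &= \sum_x x\, p_X(x) + \sum_y y\, p_Y(y) \\
       &= E[X] + E[Y]
\end{aligned}
$$
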
Of course, we proved the linearity property for the
case of the sum of two random variables.
But you can proceed in a similar way, or maybe use
induction, and one can easily establish, by following the
same kind of argument, that we have a linearity property when
we add any finite number of random variables.
The expected value of a sum is the sum of
the expected values.
Just for a little bit of practice, if, for example,
we're dealing with this expression, the expected value
of that expression would be the expected value of 2X plus
the expected value of 3Y minus the expected value of Z. And
then, using the linearity property for linear functions
of a single random variable, we can pull the constants out
of the expectations.
And this would be twice the expected value of X plus 3
times the expected value of Y minus the expected value of Z.
What we will do next is to use the linearity property of
expectations to solve a problem that would otherwise
be quite difficult.
We will use the linearity property to find the mean of a
binomial random variable.
Let X be a binomial random variable with
parameters n and p.
And we can interpret X as the number of successes in n
independent trials where each one of the trials has a
probability p of resulting in a success.
Well, we know the PMF of a binomial.
And we can use the definition of expectation to obtain this
expression.
This is just the PMF of the binomial.
And therefore, what we have here is the usual definition
of the expected value.
Now, if you look at this sum, it appears quite formidable.
And it would be quite hard to evaluate it.
Instead, we're going to use a very useful trick.
We will employ what we have called indicator variables.
So let's define a random variable Xi, which is a one if
the ith trial is a success, and zero otherwise.
Now if we want to count successes, what we want to
count is how many of the Xi's are equal to 1.
So if we add the Xi's, this will have a contribution of 1
from each one of the successes.
So when you add them up, you obtain the
total number of successes.
So we have expressed a random variable as a sum of much
simpler random variables.
So at this point, we can now use linearity of expectations
to write that the expected value of X will be the
expected value of X1 plus all the way to the
expected value of Xn.
Now what is the expected value of X1?
It is a Bernoulli random variable that takes the value
1 with probability p and takes the value of 0 with
probability 1 minus p.
The expected value of this random variable is p.
And similarly, for each one of these terms in the summation.
And so the final end result is equal to n times p.
This answer, of course, also makes intuitive sense.
If we have p equal to 1/2, and we toss a coin 100 times,
the expected number, or the average number, of heads we
expect to see will be 1/2 times 100, which is 50.
The higher p is, the more successes we expect to see.
And of course, if we double n, we expect to see
twice as many successes.
So this is an illustration of the power of breaking up
problems into simpler pieces that are easier to analyze.
And the linearity of expectations is one more tool
that we have in our hands for breaking up perhaps
complicated random variables into simpler ones and then
analyzing them separately.
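
A quick numerical check of the result $E[X] = np$, using the coin example from the transcript ($n = 100$, $p = 1/2$). The simulation size and seed below are arbitrary choices made for this sketch.

```{r}
n <- 100
p <- 0.5

# Definition of expectation: sum of k * p_X(k) over k = 0..n
k <- 0:n
mean_from_pmf <- sum(k * dbinom(k, size = n, prob = p))

# Linearity: X = X_1 + ... + X_n with E[X_i] = p for every indicator
mean_from_linearity <- n * p

# Simulation with indicator variables (n_sim and the seed are arbitrary)
set.seed(1)
n_sim <- 10000
indicators <- matrix(rbinom(n * n_sim, size = 1, prob = p), ncol = n)
X <- rowSums(indicators)   # number of successes in each simulated experiment

c(mean_from_pmf, mean_from_linearity, mean(X))
```

The first two numbers agree up to rounding error; the simulated average should land close to them but will vary with the seed.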


# 17. Exercise: Linearity of expectations drill

![](C:/Users/qp/Pictures/Screenshots/17. Exercise Linearity of expectations drill - 1.png)
![](C:/Users/qp/Pictures/Screenshots/17. Exercise Linearity of expectations drill - 2.png)


# 18. Exercise: Using linearity of expectations

![re-think how this 1/p comes from](C:/Users/qp/Pictures/Screenshots/18. Exercise Using linearity of expectations - 1.png)
![](C:/Users/qp/Pictures/Screenshots/18. Exercise Using linearity of expectations - 2.png)


## Course  /  Unit 4: Discrete random variables  /  Lec. 7: Conditioning on a random variable; Independence of r.v.'s

# 1. Lecture 7 overview and slides

In this lecture, we introduce conditional PMFs, for describing the conditional distribution of a random variable given another. We also introduce the concept of independence of random variables, and present some of the consequences of independence. 

In this last lecture of this unit, we continue with some of our earlier themes, and then introduce one new notion, the notion of independence of random variables.  We will start by elaborating a bit more on the subject of conditional probability mass functions.  We have already discussed the case where we condition a random variable on an event.  Here we will talk about conditioning a random variable on another random variable, and we will develop yet another version of the total probability and total expectation theorems.  There are no new concepts here, just new notation.  I should say, however, that [][notation is important, because it guides you on how to think about problems in the most economical way].  
The one new concept that we will introduce is the notion
of independence of random variables.
It is actually not an entirely new concept.
It is defined more or less the same way as independence of
events, and has a similar intuitive interpretation.
Two random variables are independent if information
about the value of one of them does not change your model or
beliefs about the other.
On the mathematical side, we will see that independence
leads to some additional nice properties
of means and variances.
We will conclude this lecture and this unit on discrete
random variables by considering a rather difficult
problem, the hat problem.
We will see that by being systematic and using some of
the tricks that we have learned, we can calculate the
mean and variance of a rather complicated random variable.


Printable transcript available here.
https://courses.edx.org/assets/courseware/v1/4a4d85ab3543deaa17963ea9aed77ccb/asset-v1:MITx+6.431x+2T2022+type@asset+block/transcripts_L07-Overview.pdf

Lecture slides: [clean] [annotated]
https://courses.edx.org/assets/courseware/v1/dfe25fabda68cc9c6fdb1ac352a9660c/asset-v1:MITx+6.431x+2T2022+type@asset+block/lectureslides_L07-clean-slides.pdf
https://courses.edx.org/assets/courseware/v1/ed3e666b7201ed057bc53c61b930a15d/asset-v1:MITx+6.431x+2T2022+type@asset+block/lectureslides_L07-annotated-slides.pdf

More information is given in Sections 2.6-2.7 of the text.
https://courses.edx.org/courses/course-v1:MITx+6.431x+2T2022/pdfbook/0/chapter/1/13


# 2. Conditional PMFs

![](C:/Users/qp/Pictures/Screenshots/2. Conditional PMFs - 1.png)
![](C:/Users/qp/Pictures/Screenshots/2. Conditional PMFs - 2.png)
![](C:/Users/qp/Pictures/Screenshots/2. Conditional PMFs - 3.png)
We have already introduced the concept of the conditional PMF
of a random variable, X, given an event A. We will now
consider the case where we condition on the value of
another random variable Y. That is, we let A be the event
that some other random variable, Y, takes on a
specific value, little y.
In this case, we're talking about a conditional
probability of the form shown here.
The conditional probability--
that X takes on a specific value, given that the random
variable Y takes on another specific value.
And we use this notation to indicate those conditional
probabilities.
As usual, the subscripts indicate the situation that
we're dealing with.
That is, we're dealing with the distribution of the random
variable X and we're conditioning on values of the
other random variable, Y.
Using the definition now of conditional probabilities this
can be written as the probability that both events
happen divided by the probability of the
conditioning event.
We can turn this expression into PMF notation.
And this leads us to this definition
of conditional PMFs.
The conditional PMF is defined to be the ratio
of the joint PMF--
this is the probability that we have here--
by the corresponding marginal PMF.
And this is the probability that we have here.
Now, remember that conditional probabilities are only defined
when the conditioning event has a positive probability,
when this denominator is positive.
Similarly, the conditional PMF will only be defined for those
little y that have positive probability of occurring.
Now, the conditional PMF is a function of two arguments,
little x and little y.
But the best way of thinking about the conditional PMF is
that we fix the value, little y, and then view this
expression here as a function of x.
As a function of x, it gives us the probabilities of the
different x's that may occur in the conditional universe.
And these probabilities must, of course, sum to 1.
Again, we're keeping y fixed.
We live in a conditional universe where y takes on a
specific value.
And here we have the probabilities of the different
x's in that universe.
And these sum to 1.
Note that if we change the value of little y, we will, of
course, get a different conditional PMF for the random
variable X. So what we're really dealing with in this
instance is that we have a family of conditional PMFs,
one conditional PMF for every possible value of little y.
And for every possible value of little y, we have a
legitimate PMF whose values add to 1.
Let's look at an example.
Consider the joint PMF given in this table.
Let us condition on the event that Y is equal to 2, which
corresponds to this row in the diagram.
We need to know the value of the marginal at this point, so
we start by calculating the probability of Y at value 2.
And this is found by adding the entries in
this row of the table.
And we find that this is 5 over 20.
Then we can start calculating entries of
the conditional PMF.
So for example, the probability that X takes on
the value of 1 given that Y takes the value of 2, it is
going to be this entry, which is 0, divided by 5/20, which
gives us 0.
We can find the next entry, the probability of X taking
the value of 2, given that Y takes the value of 2 will be
this entry, 1/20 divided by 5/20.
So it's going to be 1/5.
And we can continue with the other two entries.
And we can actually even plot the result once we're done.
And what we have is that at 1, we have a probability of 0.
At 2, we have a probability of 1/5.
At 3, we have a probability of 3/20 divided by
5/20, which is 3/5.
And at 4, we have, again, a probability of 1/5.
So what we have plotted here is the conditional PMF.
It's a PMF in the variable x, where x ranges over the
possible values, but where we have fixed the value of y to
be equal to 2.
Now, we could have found this conditional PMF even faster
without doing any divisions by following the intuitive
argument that we have used before.
We live in this conditional universe.
We have conditioned on Y being equal to 2.
The conditional probabilities will have the same proportions
as the original probabilities, except that they need to be
scaled so that they add to 1.
So they should be in the proportions of 0, 1, 3, 1.
And for these to add to 1, we need to put everywhere a
denominator of 5.
So the proportions are indeed 0, 1, 3, and 1.
Pictorially, the conditional PMF has the same form as the
corresponding slice of the joint PMF, except, again, that
the entries of that slice are renormalized so that the
entries add to 1.
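
The example above can be replayed in R. The joint table is again an assumption (it reproduces the row $Y = 2$ quoted in the transcript, namely 0, 1/20, 3/20, 1/20; the other rows are illustrative).

```{r}
# ASSUMPTION: illustrative joint PMF; rows are y = 1..4, columns are x = 1..4.
joint <- matrix(c(0, 1, 0, 0,
                  0, 1, 3, 1,
                  2, 4, 1, 2,
                  1, 2, 2, 0) / 20,
                nrow = 4, byrow = TRUE,
                dimnames = list(y = as.character(1:4), x = as.character(1:4)))
p_Y <- rowSums(joint)

# Definition: p_{X|Y}(x | 2) = p_{X,Y}(x, 2) / p_Y(2)
cond_def <- joint["2", ] / p_Y["2"]

# Shortcut: keep the proportions of the slice and rescale them to add to 1
slice <- joint["2", ]
cond_rescaled <- slice / sum(slice)

cond_def
cond_rescaled

# The whole family of conditional PMFs, one per value of y (one per row)
cond_all <- sweep(joint, 1, p_Y, "/")
rowSums(cond_all)   # every conditional PMF adds to 1
```

Dividing by `p_Y` only makes sense for values of y with positive probability, which holds for every row of this particular table.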
And finally, an observation--
we can take the definition of the conditional PMF and turn
it around by moving the denominator to the other side
and obtain a formula, which is a version of the
multiplication rule.
The probability that X takes a value little x and Y takes a
value little y is the product of the probability that Y
takes this particular value times the conditional
probability that X takes on the particular value little x,
given that Y takes on the particular value little y.
We also have a symmetrical relationship if we interchange
the roles of X and Y. As we discussed earlier in this
course, the multiplication rule can be used to specify
probability models.
One way of modeling two random variables is by
specifying the joint PMF.
But we now have an alternative, indirect, way
using the multiplication rule.
We can first specify the distribution of Y and then
specify the conditional PMF of X for any given
value of little y.
And this completely determines the joint PMF, and so we have
a full probability model.
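
For reference, the definition and the multiplication rule described above can be written out as follows (standard notation, assumed to match the slide, which is not reproduced here):

$$
p_{X|Y}(x \mid y) = P(X = x \mid Y = y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}, \quad \text{defined when } p_Y(y) > 0, \qquad \sum_x p_{X|Y}(x \mid y) = 1
$$

$$
p_{X,Y}(x,y) = p_Y(y)\, p_{X|Y}(x \mid y) = p_X(x)\, p_{Y|X}(y \mid x)
$$
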
We can also provide similar definitions of conditional
PMFs for the case where we're dealing with more than two
random variables.
In this case, notation is pretty self-explanatory.
By looking at this expression here, you can probably guess
that this stands for the probability that random
variable X takes on a specific value, conditional on the
random variables Y and Z taking on some
other specific values.
Using the definition of conditional probabilities,
this is the probability that all events happen divided by
the probability of the conditioning event, which, in
our case, is the event that Y takes on a specific value and
simultaneously, Z takes another specific value.
In PMF notation, this is the ratio of the joint PMF of the
three random variables together, divided by the joint
PMF of the two random variables Y and Z. As another
example, we could have an expression like this, which,
again, stands for the probability that these two
random variables take on specific values, conditional
on this random variable taking on another value.
Finally, we can have versions of the multiplication rule for
the case where we're dealing with more
than two random variables.
Recall the usual multiplication rule.
For three events happening simultaneously, let's apply
this multiplication rule for the case where the event, A,
stands for the event that the random variable X takes on a
specific value.
Let B be the event that Y takes on a specific value, and
C be the event that the random variable Z takes
on a specific value.
Then we can take this relation, the multiplication
rule, and translate it into PMF notation.
The probability that all three events happen is equal to the
product of the probability that the first event happens.
Then we have the conditional probability that the second
event happens given that the first happened, times the
conditional probability that the third event happens--
this one-- given that the first
two events have happened.
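
In symbols, the expressions for more than two random variables described above would presumably read as follows (a reconstruction from the narration, with $A = \{X = x\}$, $B = \{Y = y\}$, $C = \{Z = z\}$):

$$
p_{X|Y,Z}(x \mid y, z) = \frac{p_{X,Y,Z}(x,y,z)}{p_{Y,Z}(y,z)}, \qquad
p_{X,Y|Z}(x, y \mid z) = \frac{p_{X,Y,Z}(x,y,z)}{p_Z(z)}
$$

$$
P(A \cap B \cap C) = P(A)\, P(B \mid A)\, P(C \mid A \cap B)
\quad\Longrightarrow\quad
p_{X,Y,Z}(x,y,z) = p_X(x)\, p_{Y|X}(y \mid x)\, p_{Z|X,Y}(z \mid x, y)
$$
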


# 3. Exercise: Conditional PMFs

![](C:/Users/qp/Pictures/Screenshots/3. Exercise Conditional PMFs - 1.png)
![](C:/Users/qp/Pictures/Screenshots/3. Exercise Conditional PMFs - 2.png)
![](C:/Users/qp/Pictures/Screenshots/3. Exercise Conditional PMFs - 3.png)
![](C:/Users/qp/Pictures/Screenshots/3. Exercise Conditional PMFs - 4.png)


# 4. Conditional expectation and the total expectation theorem

![](C:/Users/qp/Pictures/Screenshots/4. Conditional expectation and the total expectation theorem - 1.png)
![](C:/Users/qp/Pictures/Screenshots/4. Conditional expectation and the total expectation theorem - 2.png)
We will now talk about conditional expectations of
one random variable given another.
As we will see, there will be nothing new here, just
older results given in new notation.
Any PMF has an associated expectation.
And so conditional PMFs also have associated expectations,
which we call conditional expectations.
We have already seen them for the case where we condition on
an event, A.
The case where we condition on random variables
is exactly the same.
We let the event, A, be the event that Y takes on a
specific value.
And then we calculate the expectation using the relevant
conditional probabilities, those that are given by the
conditional PMF.
So the conditional expectation of X given that Y takes on a
certain value is defined as the usual expectation, except
that we use the conditional probabilities that apply given
that Y takes on a specific value little y.
Recall now the expected value rule for ordinary
expectations.
And also the expected value rule for conditional
expectations given an event, something that we
have already seen.
Now, in PMF notation, the expected value rule takes a
similar form.
The event A is replaced by the specific event that Y
takes on a specific value.
And in that case, the conditional PMF given the
event A is just the conditional PMF given that
random variable Y takes on a specific value, little y.
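
Written out, the conditional expectation and the corresponding expected value rule described above take the following form (standard notation, assumed to match the slide):

$$
E[X \mid Y = y] = \sum_x x\, p_{X|Y}(x \mid y), \qquad
E[g(X) \mid Y = y] = \sum_x g(x)\, p_{X|Y}(x \mid y)
$$
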
For the case where we condition on events, we also
developed a version of the total probability theorem and
the total expectation theorem.
We can do the same when we condition on random variables.
So suppose that the sample space has been partitioned
into n disjoint scenarios.
The total probability theorem tells us that the probability
of the event that random variable X takes on a value
little x, can be found by taking the probabilities of
this event under each one of the possible scenarios.
And then weighing those probabilities according to the
probabilities of the different scenarios.
Now, suppose that we are dealing with a random variable
that takes values in a set consisting of n elements.
And let us consider scenarios Ai, the i-th scenario is the
event that the random variable Y takes on the
i-th possible value.
We can apply the total probability
theorem to this situation.
We can find the probability that the random variable X
takes on a certain value, little x, by considering the
probability of this event happening under each possible
scenario, where a scenario is that Y took on a specific
value, and then weigh those probabilities according to the
probabilities of the different scenarios.
The story with the total
expectation theorem is similar.
We know that an expectation can be found by taking the
conditional expectations under each one of the scenarios and
weighing them according to the probabilities of
the different scenarios.
Again, let the event that Y takes on a specific value be a
different scenario.
And with this correspondence we obtain the following
version of the total expectation theorem.
We have a sum of different terms.
And each term in the sum is the probability of a given
scenario times the expected value of X under this
particular scenario.
At this point, I have to add a comment of a more
mathematical flavor.
We have been talking about a partition of the sample space
into finitely many scenarios.
But if Y takes on values in a discrete but infinite set, for
example, if Y can take on any integer value, the argument
that we have given is not quite complete.
Fortunately, the total probability theorem and the
total expectation theorem, they both remain true, even
for the case where Y ranges over an infinite set as long
as the random variable X has a well-defined expectation.
For the total probability theorem, the proof for the
general case can be carried out without a lot of
difficulty, just using the countable additivity axiom.
However, for the total expectation theorem, it takes
some harder mathematical work.
And this is beyond our scope.
But we will just take this fact for granted, that the
total expectation theorem carries over to the case where
we're adding over an infinite sequence of possible values of
Y.
In the rest of the course we will often use the total
expectation theorem, including in cases where Y ranges over
an infinite discrete set.
In fact, we will see that this theorem is an extremely useful
tool that can be used to divide and
conquer complicated models.
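
A small numerical sketch of the two theorems in this form, $p_X(x) = \sum_y p_Y(y)\, p_{X|Y}(x \mid y)$ and $E[X] = \sum_y p_Y(y)\, E[X \mid Y = y]$. The joint table is an assumption carried over from the earlier examples (consistent with the numbers quoted in the transcript, otherwise illustrative).

```{r}
# ASSUMPTION: illustrative joint PMF; rows are y = 1..4, columns are x = 1..4.
joint <- matrix(c(0, 1, 0, 0,
                  0, 1, 3, 1,
                  2, 4, 1, 2,
                  1, 2, 2, 0) / 20,
                nrow = 4, byrow = TRUE)
x_vals <- 1:4
p_X <- colSums(joint)
p_Y <- rowSums(joint)

# Row y of `cond` holds the conditional PMF p_{X|Y}(. | y)
cond <- sweep(joint, 1, p_Y, "/")

# Total probability theorem: weight each conditional PMF by p_Y(y) and add
p_X_total <- as.vector(p_Y %*% cond)

# Conditional expectations E[X | Y = y], one per scenario y
E_X_given_Y <- as.vector(cond %*% x_vals)

# Total expectation theorem versus the direct definition of E[X]
E_X_total  <- sum(p_Y * E_X_given_Y)
E_X_direct <- sum(x_vals * p_X)
c(E_X_total, E_X_direct)
```
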


# 5. Exercise: The expected value rule with conditioning

![](C:/Users/qp/Pictures/Screenshots/5. Exercise The expected value rule with conditioning - 1.png)
![](C:/Users/qp/Pictures/Screenshots/5. Exercise The expected value rule with conditioning - 2.png)
![](C:/Users/qp/Pictures/Screenshots/5. Exercise The expected value rule with conditioning - 3.png)
![](C:/Users/qp/Pictures/Screenshots/5. Exercise The expected value rule with conditioning - 4.png)
![](C:/Users/qp/Pictures/Screenshots/5. Exercise The expected value rule with conditioning - 5.png)
![](C:/Users/qp/Pictures/Screenshots/5. Exercise The expected value rule with conditioning - 6.png)


# 6. Independence of random variables

![](C:/Users/qp/Pictures/Screenshots/6. Independence of random variables.png)
We now come to a very important concept, the concept
of independence of random variables.
We are already familiar with the notion of independence of
two events.
We have the mathematical definition, and the
interpretation is that conditional probabilities are
the same as unconditional ones.
Intuitively, when you are told that B occurred, this does not
change your beliefs about A, and so the conditional
probability of A is the same as the unconditional
probability.
We have a similar definition of independence of a random
variable and an event A. The mathematical definition is
that event A and the event that X takes on a specific
value, that these two events are independent in the
ordinary sense.
So the probability of both of these events happening is the
product of their individual probabilities.
But we require this to be true for all values of little x.
Intuitively, if I tell you that A occurred, this is not
going to change the distribution of the random
variable X.
This is one interpretation of what independence means in
this context.
And this has to be true for all values of little x, that
is, when the event A occurs, the probabilities of any
particular little x are going to be the same as the
original unconditional probabilities.
We also have a symmetrical interpretation.
If I tell you the value of X, then the conditional
probability of event A is not going to change.
It's going to be the same as the unconditional probability.
And again, this is going to be the case for all values of X.
So, no matter what they tell you about X, your beliefs
about A are not going to change.
We can now move and define the notion of independence of two
random variables.
The mathematical definition is that the event that X takes on
a value little x and the event that Y takes on a value little
y, these two events are independent, and this is true
for all possible values of little x and little y.
In PMF notation, this relation here can be
written in this form.
And basically, the joint PMF factors out as a product of
the marginal PMFs of the two random variables.
Again, this relation has to be true for all possible little x
and little y.
What does independence mean?
When I tell you the value of Y, and no matter what value I
tell you, your beliefs about X will not change.
So that the conditional PMF of X given Y is going to be the
same as the unconditional PMF of X. And this has to be true
for any values of the arguments of these PMFs.
There is also a symmetric interpretation, which is that
the conditional PMF of Y given X is going to be the same as
the unconditional PMF of Y. We have the symmetric
interpretation because, as we can see from this definition,
X and Y have symmetric roles.
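
In PMF notation, the definition and the two interpretations just described can be written as below. This is reconstructed from the spoken description rather than copied from the slide.

$$
\begin{aligned}
p_{X,Y}(x, y) &= p_X(x)\,p_Y(y) && \text{for all } x, y,\\
p_{X \mid Y}(x \mid y) &= p_X(x) && \text{for all } x \text{ and all } y \text{ with } p_Y(y) > 0,\\
p_{Y \mid X}(y \mid x) &= p_Y(y) && \text{for all } y \text{ and all } x \text{ with } p_X(x) > 0.
\end{aligned}
$$
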
Finally, we can define the notion of independence of
multiple random variables by a similar relation.
Here, the definition is for the case of three random
variables, but you can imagine how the definition for any
finite number of random variables will go.
Namely, the joint PMF of all the random variables can be
expressed as the product of the
corresponding marginal PMFs.
What is the intuitive interpretation of
independence here?
It means that information about some of the random
variables will not change your beliefs, the probabilities,
about the remaining random variables.
Any conditional probabilities and any conditional PMFs will
be the same as the unconditional ones.
In the real world, independence models situations
where each of the random variables is generated in a
decoupled manner, in a separate probabilistic
experiment.
And these probabilistic experiments do not interact
with each other and have no common sources of uncertainty.
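
For three random variables, the relation described above would read as follows. The name $Z$ for the third random variable is an assumption here, since the slide is not reproduced; the second line is one example of a conditional PMF that matches the unconditional one.

$$
\begin{aligned}
p_{X,Y,Z}(x, y, z) &= p_X(x)\,p_Y(y)\,p_Z(z) && \text{for all } x, y, z,\\
p_{X \mid Y,Z}(x \mid y, z) &= p_X(x) && \text{whenever } p_{Y,Z}(y, z) > 0.
\end{aligned}
$$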


# 7. Exercise: Independence

![Use your brain, not trying to use other's, they are silly](C:/Users/qp/Pictures/Screenshots/7. Exercise Independence - 1.png)
![](C:/Users/qp/Pictures/Screenshots/7. Exercise Independence - 2.png)
![](C:/Users/qp/Pictures/Screenshots/7. Exercise Independence - 3.png)


# 8. Exercise: A criterion for independence

![](C:/Users/qp/Pictures/Screenshots/8. Exercise A criterion for independence - 1.png)
![](C:/Users/qp/Pictures/Screenshots/8. Exercise A criterion for independence - 2.png)


# 9. Example

![](C:/Users/qp/Pictures/Screenshots/9. Example.png)
Let us now consider a simple example.
Let random variables X and Y be described by a joint PMF
which is the one shown in this table.
Question--
are X and Y independent?
We can try to answer this question by using the
definition of independence.
But it is actually more instructive to proceed in a
somewhat more intuitive way.
We look at this table, and we observe that the value of one
is possible for X. In particular, the probability
that X takes the value of one, this is the marginal
probability, this can be found by adding the probabilities of
all of the outcomes in this column, which, in this case,
is 3 over 20.
Suppose now that somebody tells you the value of Y. For
example, I tell you that Y happens to be equal to one, in
which case you are transported into this universe.
In this universe, the conditional probability that X
takes a value of one, given that Y takes a value of one,
what is it?
In this universe, there's zero probability
associated to this outcome.
So this probability is zero, which is
different than 3 over 20.
And since these two numbers are different, this means that
information from Y changes our beliefs about what's going to
happen to X. And so, we do not have independence.
So again, intuitively, in the beginning, we thought that X
equal to one was possible.
But information given by Y, namely that Y is equal to one,
tells us that, actually, X equal to one is impossible.
Information about Y changed our beliefs about X, so X and
Y are dependent.
Now, when we first introduced the notion of independence
some time ago, we also introduced the notion of
conditional independence.
And we said that conditional independence is the same as
ordinary independence, except that it would be applied to a
conditional universe.
Something similar can be done for the case of random
variables as well.
So suppose, for example, that someone tells us that the
outcome of the experiment was such that it belongs
to this blue set.
This is the set where X is less than or equal to 2, and Y
is larger than or equal to three.
So we're given this information, and this is now
our new conditional model.
The question is, within this new conditional model are
random variables X and Y independent?
Let's just write down the conditional model, where I'm
only showing the four possible outcomes that are allowed in
the conditional model.
All the others, of course, will have zero probability in
the conditional model.
So in the conditional model, probabilities will keep the
same proportions as in the unconditional model--
and the proportions are 1, 2, 2, 4--
but then they need to be scaled, or normalized, so that
they add to 1.
And to make them add to 1, we need to divide them all by 9.
In this conditional model, this is the joint PMF of the
two random variables X and Y. Let us find the marginal PMFs.
To find the marginal PMF of X, we add the
entries in this column.
And we get here 1/3, and here 2/3.
And to find the marginal PMF of Y, we add the
entries in this row to find 2/3.
And we add the entries in that row
to find 1/3.
So this is the marginal PMF of X.
That's the marginal PMF of Y. And now we notice that this
entry of the joint PMF is 1/3 times 1/3, the
product of the marginals.
This entry is the product of 1/3 times 2/3, the product of
the marginals, and so on for the remaining entries.
So each entry of the joint PMF is equal to the product of the
corresponding entries of the marginal PMFs.
And this is the definition of independence.
So in this conditional blue universe, we do have
independence.
And the way that this was established was to check that
the joint PMF factors as a product of marginal PMFs.
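
The factorization check can be sketched in R. The weights 1, 2, 2, 4 and the normalization by 9 come from the transcript; the assignment of the two rows to y = 3 and y = 4 is an assumption (the table itself is only in the screenshot), and swapping the rows would not change the outcome of the check. The variable names are illustrative.

```{r}
# Conditional joint PMF on the blue set {X <= 2, Y >= 3}.
# Weights 1, 2, 2, 4 are from the transcript; which row is y = 3
# and which is y = 4 is an assumption made for this sketch.
weights <- matrix(c(2, 4,
                    1, 2),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(y = c("3", "4"), x = c("1", "2")))
joint <- weights / sum(weights)   # normalize so the entries add to 1

p_y <- rowSums(joint)             # marginal PMF of Y
p_x <- colSums(joint)             # marginal PMF of X

# Product of the marginals, laid out like the joint table.
product <- outer(p_y, p_x)

# Independence holds exactly when every entry matches.
independent <- isTRUE(all.equal(joint, product, check.attributes = FALSE))
independent
```

Using `all.equal` rather than `==` avoids spurious mismatches from floating-point rounding.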


# 10. Independence and expectations

![](C:/Users/qp/Pictures/Screenshots/10. Independence and expectations.png)
When we have independence, does anything interesting
happen to expectations?
We know that, in general, the expected value of a function
of random variables is not the same as applying the function
to the expected values.
And we also know that there are some exceptions where we
do get equality.
This is the case where we are dealing with linear functions
of one or more random variables.
Note that this last property is always true and does not
require any independence assumptions.
When we have independence, there is one additional
property that turns out to be true.
The expected value of the product of two independent
random variables is the product of
their expected values.
Let us verify this relation.
We are dealing here with the expected value of a function
of random variables, where the function is defined to be the
product function.
So to calculate this expected value, you can use the
expected value rule.
And we are going to get the sum over all x, the sum over
all y, of g of xy, but in this case, g of xy is x times y.
And then we weigh all those values according to the
probabilities as given by the joint PMF.
Now, using independence, this sum can be changed into the
following form--
the joint PMF is the product of the marginal PMFs.
And now when we look at the inner sum over all values of
y, we can take outside the summation those terms that do
not depend on y, and so this term and that term.
And this is going to yield a summation over x of x times
the marginal PMF of X, and then the summation over all y
of y times the marginal PMF of Y. But now we recognize that
here we have just the expected value of Y. And then we will
be left with another expression, which is the
expected value of X. And this completes the argument.
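
The steps of this argument, written out (a reconstruction of the derivation described above):

$$
\begin{aligned}
E[XY] &= \sum_x \sum_y x\,y\;p_{X,Y}(x, y) \\
      &= \sum_x \sum_y x\,y\;p_X(x)\,p_Y(y) && \text{(independence)} \\
      &= \sum_x x\,p_X(x) \sum_y y\,p_Y(y) \\
      &= E[X]\,E[Y].
\end{aligned}
$$
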
Now, consider a function of X and another function of Y. X
and Y are independent.
Intuitively, the value of X does not give you any new
information about Y, so the value of g of X does not
give you any new information about h of Y. So on the basis
of this intuitive argument, the functions g of X and h of
Y are also independent of each other.
Therefore, we can apply the fact that we have already
proved, but with g of X in the place of X and h of Y in the
place of Y. And this gives us this more general fact that
the expected value of the product of two functions of
independent random variables is equal to the product of the
expectations of these functions.
We could also prove this property directly without
relying on the intuitive argument.
We could just follow the same steps as in this derivation.
Wherever there is an X, we would write g of X, and
wherever there is a Y, we would write h of Y. And the
same algebra would go through, and we would end up with the
expected value of g of X times the expected value of h of Y.
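
As a quick numerical check of this more general fact, here is a small sketch in R. The PMFs and the functions `g` and `h` are hypothetical, chosen only for illustration; they do not come from the lecture.

```{r}
# Hypothetical PMFs, chosen only for illustration (not from the lecture).
x_vals <- c(1, 2, 3); p_x <- c(0.2, 0.5, 0.3)
y_vals <- c(0, 4);    p_y <- c(0.6, 0.4)

# Under independence the joint PMF is the product of the marginals.
joint <- outer(p_x, p_y)

g <- function(x) x^2      # any function of X
h <- function(y) y + 1    # any function of Y

# Expected value rule applied to g(X) * h(Y) over the joint PMF.
lhs <- sum(outer(g(x_vals), h(y_vals)) * joint)
# Product of the two separate expectations.
rhs <- sum(g(x_vals) * p_x) * sum(h(y_vals) * p_y)

isTRUE(all.equal(lhs, rhs))
```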


# 11. Exercise: Independence and expectations

![](C:/Users/qp/Pictures/Screenshots/11. Exercise Independence and expectations - 1.png)
![](C:/Users/qp/Pictures/Screenshots/11. Exercise Independence and expectations - 2.png)


# 12. Independence, variances, and the binomial variance

![Think independent or not independent](C:/Users/qp/Pictures/Screenshots/12. Independence, variances, and the binomial variance - 1.png)
![](C:/Users/qp/Pictures/Screenshots/12. Independence, variances, and the binomial variance - 2.png)
Let us now revisit the variance and see what happens
in the case of independence.
Variances have some general properties that we have
already seen.
However, since we often add random variables, we would
like to be able to say something about the variance
of the sum of two random variables.
Unfortunately, the situation is not so simple, and in
general, the variance of the sum is not the same as the sum
of the variances.
We will see an example shortly.
On the other hand, when X and Y are independent, the
variance of the sum is equal to the sum of the variances,
and this is a very useful fact.
Let us go through the derivation of this property.
But to keep things simple, let us assume just for the sake of
the derivation, that the two random variables have 0 mean.
So in that case, the variance of the sum is just the
expected value of the square of the sum.
And we can expand the quadratic and write this as
the expectation of X squared plus 2 X Y plus Y squared.
Then we use linearity of expectations to write this as
the expected value of X squared plus twice the
expected value of X times Y and then plus the expected
value of Y squared.
Now, the first term is just the variance of X because we
have assumed that we have 0 mean.
The last term is similarly the variance of Y. How about the
middle term?
Because of independence, the expected value of the product
is the same as the product of the expected values, and the
expected values are 0 in our case.
So this term, because of independence, is going to be
equal to 0.
In particular, what we have is that the expected value of XY
equals the expected value of X times the expected value of Y,
equal to 0.
And so we have verified that indeed the variance of the sum
is equal to the sum of the variances.
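
In symbols, under the zero-mean assumption used for the derivation:

$$
\begin{aligned}
\operatorname{var}(X + Y) &= E\big[(X + Y)^2\big] \\
 &= E[X^2] + 2\,E[XY] + E[Y^2] \\
 &= \operatorname{var}(X) + 2\,E[X]\,E[Y] + \operatorname{var}(Y) && \text{(independence)} \\
 &= \operatorname{var}(X) + \operatorname{var}(Y) && (E[X] = E[Y] = 0).
\end{aligned}
$$
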
Let us now look at some examples.
Suppose that X is the same random variable as Y. Clearly,
this is a case where independence fails to hold.
If I tell you the value of X, then you know the value of Y.
So in this case, the variance of the sum is the same as the
variance of twice X. Since X is the same as Y, X plus Y is
2 times X. And then using this property for the variance,
what happens when we multiply by a constant?
This is going to be 4 times the variance of X.
In another example, suppose that X is the negative of Y.
In that case, X plus Y is identically equal to 0.
So we're dealing with a random variable that
takes a constant value.
In particular, it is always equal to its mean, and so the
difference from the mean is always equal to 0, and so the
variance will also evaluate to 0.
So we see that the variance of the sum can take quite
different values depending on the sort of interrelation that
we have between the two random variables.
So these two examples indicate that knowing the variance of
each one of the random variables is not enough to say
much about the variance of the sum.
The answer will generally depend on how the two random
variables are related to each other and what kind of
dependencies they have.
As a last example, suppose now that X and Y are independent.
X is independent from Y, and therefore X is also
independent from minus 3Y.
Therefore, this variance is equal to the sum of the
variances of X and of minus 3Y.
And using the facts that we already know, this is going to
be equal to the variance of X plus 9 times the variance of
Y.
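
The three examples side by side, as a summary of what was just said (the last line assumes, as the wording suggests, that the variance in question is that of $X - 3Y$):

$$
\begin{aligned}
X = Y: &\quad \operatorname{var}(X + Y) = \operatorname{var}(2X) = 4\,\operatorname{var}(X),\\
X = -Y: &\quad \operatorname{var}(X + Y) = \operatorname{var}(0) = 0,\\
X, Y \text{ independent}: &\quad \operatorname{var}(X - 3Y) = \operatorname{var}(X) + \operatorname{var}(-3Y) = \operatorname{var}(X) + 9\,\operatorname{var}(Y).
\end{aligned}
$$
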
As an illustration of the usefulness of the property of
the variance that we have just established, we will now use
it to calculate the variance of a binomial random variable.
Remember that a binomial with parameters n and p corresponds
to the number of successes in n independent trials.
We use indicator variables.
This is the same trick that we used to calculate the expected
value of the binomial.
So the random variable X sub i is equal to 1 if the i-th
trial is a success and is 0 otherwise.
And as we did before, we note that X, the total number of
successes, is the sum of those indicator variables.
Each success makes one of those variables equal to 1, so
by adding those indicator variables, we're just counting
the number of successes.
The key point to note is that the assumption of independence
that we're making is essentially the assumption
that these random variables Xi are independent of each other.
So we're dealing with a situation where we have a sum
of independent random variables, and according to
what we have shown, the variance of X is going to be
the sum of the variances of the Xi's.
Now, the Xi's all have the same distribution so all these
variances will be the same.
It suffices to consider one of them.
Now, X1 is a Bernoulli random variable with parameter p.
We know what its variance is--
it is p times 1 minus p.
And therefore, this is the formula for the variance of a
binomial random variable.
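
Adding n equal variances of p times 1 minus p gives $\operatorname{var}(X) = n\,p\,(1 - p)$. The sketch below checks this against the variance computed directly from the binomial PMF, using base R's `dbinom`; the values of `n` and `p` are arbitrary choices for illustration.

```{r}
n <- 10    # number of trials (arbitrary illustrative value)
p <- 0.3   # success probability (arbitrary illustrative value)

k <- 0:n
pmf <- dbinom(k, size = n, prob = p)   # binomial PMF at k = 0, ..., n

mean_x <- sum(k * pmf)                 # E[X] from the definition
var_x <- sum((k - mean_x)^2 * pmf)     # var(X) from the definition

# Compare with the formula n * p * (1 - p).
isTRUE(all.equal(var_x, n * p * (1 - p)))
```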


# 13. Exercise: Independence and variances

![](C:/Users/qp/Pictures/Screenshots/13. Exercise Independence and variances - 1.png)
![make sure your brain is running, then solving the question](C:/Users/qp/Pictures/Screenshots/13. Exercise Independence and variances - 2.png)
![](C:/Users/qp/Pictures/Screenshots/13. Exercise Independence and variances - 3.png)


# 14. The hat problem

![](C:/Users/qp/Pictures/Screenshots/14. The hat problem - 1.png)
![](C:/Users/qp/Pictures/Screenshots/14. The hat problem - 2.png)
We will now study a problem which is quite difficult to
approach in a direct brute force manner but becomes
tractable once we break it down into simpler pieces using
several of the tricks that we have learned so far.
And this problem will also be a good opportunity for
reviewing some of the tricks and techniques
that we have developed.
The problem is the following.
There are n people.
And let's say for the purpose of illustration that we have 3
people, persons 1, 2, and 3.
And each person has a hat.
They throw their hats inside a box.
And then each person picks a hat at random out of that box.
So here are the three hats.
And one possible outcome of this experiment is that person
1 ends up with hat number 2, person 2 ends up with hat
number 1, person 3 ends up with hat number 3.
We could indicate the hats that each person got by noting
here the numbers associated with each
person, the hat numbers.
And notice that this sequence of numbers, which is a
description of the outcome of the experiment, is just a
permutation of the numbers 1, 2, 3 of the hats.
So we permute the hat numbers so that we can place them next
to the person that got each one of the hats.
In particular, we have n factorial possible outcomes.
This is the number of possible permutations.
What does it mean to pick hats at random?
One interpretation is that every
permutation is equally likely.
And since we have n factorial permutations, each permutation
would have a probability of 1 over n factorial.
But there's another way of describing our model, which is
the following.
Person 1 gets a hat at random out of the three available.
Then person 2 gets a hat at random out of
the remaining hats.
Then person 3 gets the remaining hat.
Each time that there is a choice, each one of the
available hats is equally likely to be picked
as any other hat.
Let us calculate the probability, let's say, that
this particular permutation gets materialized.
The probability that person 1 gets hat number 2 is 1/3.
Then we're left with two hats.
Person 2 has 2 hats to choose from.
The probability that it picks this particular hat
is going to be 1/2.
And finally, person 3 has only 1 hat available, so it will be
picked with probability 1.
So the probability of this particular permutation is one
over 3 factorial.
But you can repeat this argument and consider any
other permutation, and you will always be getting the
same answer.
Any particular permutation has the same probability, one over
3 factorial.
The same argument goes through for the case of general n, n
people and n hats.
And we will find that any permutation will have the same
probability, 1/n factorial.
Therefore, the process of picking one hat at a time is
probabilistically identical to a model in which we simply
state that all permutations are equally likely.
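
The sequential calculation, written out for the three-person example and then for general n:

$$
\frac{1}{3} \cdot \frac{1}{2} \cdot 1 = \frac{1}{3!},
\qquad
\frac{1}{n} \cdot \frac{1}{n-1} \cdots \frac{1}{2} \cdot 1 = \frac{1}{n!}.
$$
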
Now that we have described our model and our process and the
associated probabilities, let us consider the question we
want to answer.
Let X be the number of people who get their own hat back.
For example, for the outcome that we have drawn here, the
only person who gets their own hat back is person 3.
And so in this case X happens to take the value of 1.
What we want to do is to calculate the expected value
of the random variable X. The problem is difficult because
if you try to calculate the PMF of the random variable X
and then use the definition of the expectation to calculate
this sum, you will run into big difficulties.
Calculating this quantity, the PMF of X, is difficult.
And it is difficult because there is no simple expression
that describes it.
So we need to do something more intelligent, find some
other way of approaching the problem.
The trick that we will use is to employ indicator variables.
Let Xi be equal to 1 if person i selects their own hat
and 0 otherwise.
So then, each one of the Xi's is 1 whenever a person has
selected their own hat.
And by adding all the 1's that we may get, we obtain the
total number of people who have selected their own hats.
This makes things easier, because now to calculate the
expected value of X it's sufficient to calculate the
expected value of each one of those terms and add the
expected values, which we're allowed to
do because of linearity.
So let's look at the typical term here.
What is the expected value of Xi?
If you consider the first description of our model, all
permutations are equally likely, this description is
symmetric with respect to all of the persons.
So the expected value of Xi should be the same as the
expected value of X1.
Now, to calculate the expected value of X1, we will consider
the sequential description of the process in which 1 is the
first person to pick a hat.
Now, since X1 is a Bernoulli random variable that takes
values 0 or 1, the expected value of X1 is just the
probability that X1 is equal to 1.
And if person 1 is the first one to choose a hat, that
person has probability 1/n of obtaining the correct hat.
So each one of these random variables has an expected
value of 1/n.
The expected value of X by linearity is going to be the
sum of the expected values.
There are n of them.
Each expected value is 1/n.
And so the final answer is 1.
This is the expected value of the random variable X.
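
In symbols, the calculation of the mean is:

$$
X = X_1 + X_2 + \cdots + X_n,
\qquad
E[X_i] = P(X_i = 1) = \frac{1}{n},
\qquad
E[X] = \sum_{i=1}^{n} E[X_i] = n \cdot \frac{1}{n} = 1.
$$
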
Let us now move and try to calculate a more difficult
quantity, namely, the variance of X. How shall we proceed?
Things would be easiest if the random variables Xi were
independent.
Because in that case, the variance of X would be the sum
of the variances of the Xi's.
But are the Xi's independent?
Let us consider a special case.
Suppose that we only have two persons and that I tell you
that the first person got their own hat back.
In that case, the second person must have also gotten
their own hat back.
If, on the other hand, person 1 did not get their own hat
back, then person 2 will not get their own hat back either.
Because in this scenario, person 1 gets hat 2, and that
means that person 2 gets hat 1.
So we see that knowing the value of the random variable
X1 tells us a lot about the value of the
random variable X2.
And that means that the random variables
X1 and X2 are dependent.
More generally, if I were to tell you that the first n
minus 1 people got their own hats back, then the last
remaining person will have his or her own hat
available to be picked.
That's going to be the only available hat.
And then person n will also get their hat back.
So we see that the information about some of the Xi's gives
us information about the remaining Xn.
And again, this means that the random
variables are dependent.
Since we do not have independence, we cannot find
the variance by just adding the variances of the different
random variables.
But we need to do a lot more work in that direction.
In general, whenever we need to calculate variances, it is
usually simpler to carry out the calculation using this
alternative form for the variance.
So let us start towards a calculation of the expected
value of X squared.
Now the random variable X squared, by simple algebra, is
this expression times itself.
And by expanding the product we get all
sorts of cross terms.
Some of these terms will be of the type X1 times X1 or
X2 times X2.
These will be terms of this form, and there are n of them.
And then we get cross terms, such as X1 times X2, X1 times
X3, X2 times X1, and so on.
How many terms do we have here?
Well, if we have n terms multiplying n other terms we
have a total of n squared terms.
n are already here, so the remaining terms, which are the
cross terms, will be n squared minus n.
Or, in a simpler form, it's n times n minus 1.
So now how are we going to calculate the expected value
of X squared?
Well, we will use linearity of expectations.
So we need to calculate the expected value of Xi squared,
and we also need to calculate the expected value of Xi Xj
when i is different from j.
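
The alternative form of the variance and the expansion of the square, written out (a reconstruction of what the slide presumably shows):

$$
\operatorname{var}(X) = E[X^2] - \big(E[X]\big)^2,
\qquad
X^2 = \Big(\sum_{i=1}^{n} X_i\Big)^2 = \underbrace{\sum_{i=1}^{n} X_i^2}_{n \text{ terms}} + \underbrace{\sum_{i \neq j} X_i X_j}_{n(n-1) \text{ terms}}.
$$
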
Let us start with Xi squared.
First, if we use the symmetric description of our model, all
permutations are equally likely, then all persons play
the same role.
There's symmetry in the problem.
So Xi squared has the same distribution as X1 squared.
Then, X1 is a 0-1 random variable, a
Bernoulli random variable.
So X1 squared will always take the same numerical value as
the random variable X1.
This is a very special case which happens only because a
random variable takes values in {0, 1}.
And 0 squared is the same as 0.
1 squared is the same as 1.
This expected value is something that we have already
calculated, and it is 1/n.
Let us now move to the calculation of the expectation
of a typical term inside the sum.
So let i be different than j, and look at the
expected value of Xi Xj.
Once more, because of the symmetry of the probabilistic
model, it doesn't matter which i and j we are considering.
So we might as well consider the product of X1 with X2.
Now, X1 and X2 take values 0 and 1.
And the product of the two also takes values 0 and 1.
So this is a Bernoulli random variable, and so the
expectation of that random variable is just the
probability that this random variable is equal to 1.
But for the product to be equal to 1, the only way that
this can happen is if both of these random variables happen
to be equal to 1.
Let us now turn to the sequential
description of the model.
The probability that the first person gets their own hat back
and the second person gets their own hat back is the
probability that the first one gets their own hat back, and
then multiplied by the conditional probability that
the second person gets their own hat back, given that the
first person got their own hat back.
What are these probabilities?
The probability that a person gets their
own hat back is 1/n.
Given that person 1 got their own hat back, person 2 is
faced with a situation where there are n
minus 1 available hats.
And one of those is that person's hat.
So the probability that person 2 will also pick his or her
own hat is 1 over n minus 1.
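
Putting the two factors together, for n at least 2:

$$
E[X_1 X_2] = P(X_1 = 1 \text{ and } X_2 = 1) = P(X_1 = 1)\,P(X_2 = 1 \mid X_1 = 1) = \frac{1}{n} \cdot \frac{1}{n - 1}.
$$
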
Now we are in a position to calculate the expected value
of X squared.
The expected value of X squared consists of the sum of
n expected values, each one of which is equal to 1/n, plus n times n minus 1
expected values, because we have that many cross terms, each
one of which, by this calculation, is 1/n times 1
over n minus 1.
And we see that we get cancellations here.
And we obtain 1 plus 1, which is equal to 2.
On the other hand we have this term that we need to subtract.
We found previously that the expected value of
X is equal to 1.
So we need to subtract 1.
And the final answer to our problem is that the variance
of X is also equal to 1.
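
The arithmetic is $E[X^2] = n \cdot \frac{1}{n} + n(n-1) \cdot \frac{1}{n(n-1)} = 2$, so $\operatorname{var}(X) = 2 - 1^2 = 1$. As a sanity check, the sketch below enumerates every permutation for a small case and computes the mean and variance of X exactly. The choice n = 4 and all variable names are illustrative; the brute-force filter over `expand.grid` is only practical for small n.

```{r}
n <- 4   # small illustrative case; the grid below has n^n rows

# All assignments of hats to persons, then keep only the permutations
# (rows in which every hat appears exactly once).
grid <- as.matrix(expand.grid(rep(list(seq_len(n)), n)))
perms <- grid[apply(grid, 1, function(r) length(unique(r)) == n), , drop = FALSE]

# Column i of a row is the hat received by person i.
# X counts the persons whose hat number equals their own number.
own <- matrix(seq_len(n), nrow = nrow(perms), ncol = n, byrow = TRUE)
x <- rowSums(perms == own)

# All permutations are equally likely, so plain averages give expectations.
mean_x <- mean(x)
var_x <- mean(x^2) - mean_x^2

c(mean = mean_x, variance = var_x)
```
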
So what we saw in this problem is that we can deal with quite
complicated models by breaking them down into more
manageable pieces: first breaking down the random variable X as
a sum of different random variables, then taking the
square of this and breaking it down into a number of
different terms.
And then, by considering one term at a
time, we can often end up with the solutions or the answers
to problems that would otherwise have been
quite difficult.


# 15. Exercise: The hat problem

![](C:/Users/qp/Pictures/Screenshots/15. Exercise The hat problem - 1.png)
![](C:/Users/qp/Pictures/Screenshots/15. Exercise The hat problem - 2.png)


## Course  /  Unit 4: Discrete random variables  /  Solved problems

# 1. PMF of a function of a random variable

![](C:/Users/qp/Pictures/Screenshots/1. PMF of a function of a random variable - 1.png)
![](C:/Users/qp/Pictures/Screenshots/1. PMF of a function of a random variable - 2.png)
![](C:/Users/qp/Pictures/Screenshots/1. PMF of a function of a random variable - 3.png)
![](C:/Users/qp/Pictures/Screenshots/1. PMF of a function of a random variable - 4.png)
Hey, guys.
Welcome back.
Today we're going to be working on a problem that asks
you to find the PMF of a function of a random variable.
So let's just jump right in.
The problem statement gives you the PMF for a random
variable called X. So we're told that there's this random
variable X that takes on values minus 3, minus 2, minus
1, 1, 2, and 3.
And for each of those values, the probability mass lying
over that value is given by this formula--
X squared over a.
Now, I didn't write it here to save room.
But we're also told that a is a real number that is
greater than zero.
And we're told that the probability of X taking on any
value outside of the set is 0.
Now, we're asked to do two things in the problem.
First is to find the value of the parameter a.
And that's sort of a natural question to ask, because if
you think about it, the PMF isn't fully specified.
And in fact, if you plug in the wrong number for a, you
actually won't get a valid PMF.
So we'll explore that idea in the first part.
And then in the second part, you're given a new random
variable called Z. And Z happens to be a function of X.
In fact, it's equal to X squared.
And then you're asked to compute that PMF.
So this problem is a good practice problem.
I think at this point, you guys are sort of newly
acquainted with the idea of a PMF, or a
probability mass function.
So this problem will hopefully help you get more familiar
with that concept and how to manipulate PMFs.
And by the way, just to make sure we're on the same page,
what does a PMF really tell you?
So p sub X--
this is a capital X because the convention in this class
is to use capital letters for random variables.
So pX of k--
this is defined to be the probability that your random
variable X takes on a value of k.
So essentially this says--
and this is just some numbers.
So in our particular case, this would be equal to k
squared over a.
And how you can interpret this is this pX guy is
sort of like a machine.
He takes in some value that your random variable could
take on and then he spits out the amount of probability mass
lying over that value.
OK.
So now that we've done that quick recap, let's get back to
the first part of the problem.
So we have this formula for pX of x and we need
to solve for a.
So in order to do that, we're going to use one of our axioms
of probability to set up an equation.
And we can solve precisely for a.
So namely, we know that every PMF must sum to 1.
And so essentially, if you sum this guy over all possible
values of x, you should get a 1.
And that equation will let us solve for a.
So let's do that.
Summation over x of pX of x--
so here essentially, you're only summing
over these six values.
So this is equal to pX of minus 3 plus pX of minus 2
plus pX of minus 1, et cetera--
oops-- pX of 2 plus pX of 3.
OK.
And again, like the interpretation, as we said,
this number here should be interpreted as the amount of
probability mass lying over minus 3.
And to help you visualize this, actually, before we go
further with the computation, let's actually plot this PMF.
So the amount of probability mass lying over minus 3-- the
way we figure that out is we take minus 3 and we plug it
into this formula up here.
So you get 9 over a.
Now, you can do this for minus 2.
You get 4 over a, looking at the formula.
For minus 1, you get 1 over a.
And of course, this graph-- you know it's the mirror image
over 0 because of the symmetry.
So hopefully this little visualization helps you
understand what I'm talking about.
And now we can just read these values off of the
plot we just made.
So we know pX of minus 3 is equal to pX of 3.
So we can go ahead and just take 2 times 9 over a.
Similarly, we get 2 times 4 over a.
And then plus 2 times 1 over a.
So now it's just a question of algebra.
So simplifying this, you're going to get 18 plus 8 plus 2
divided by a.
And this gives you 28 over a.
And as I argued before, you know that if you sum a PMF
over all possible values, you must get 1.
So this is equal to 1, which of course implies that a is
equal to 28.
So what we've shown here is that you actually don't have a
choice for what value a can take on.
It must take on 28.
And in fact, if you plug in any other value than 28 in
here, you actually are not going to have a valid PMF
because it's not going to sum to 1.
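
The normalization step can be reproduced in a few lines of R. This is only a sketch of the calculation above; the variable names `x_vals` and `p_x` are mine, not from the problem statement.

```{r}
x_vals <- c(-3, -2, -1, 1, 2, 3)   # values that X can take

# The PMF is x^2 / a, so it sums to 1 exactly when a is the sum of the squares.
a <- sum(x_vals^2)                 # 9 + 4 + 1 + 1 + 4 + 9 = 28

p_x <- x_vals^2 / a                # the resulting PMF of X
sum(p_x)                           # should be 1 for a valid PMF
```
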
OK.
So I'm going to write my answer here and then erase to
give myself more room for part (b).
So part (b) is a little bit more involved than part (a)
because we have a function of a random variable.
OK.
So we are told Z is a new random variable.
And he's equal to X squared.
And as you probably know already, this is sort of a
valid thing to do.
X is a random variable.
And any function of a random variable is
itself a random variable.
So therefore, Z is a random variable.
And because Z is just another random variable, it makes
sense to talk about its PMF, or its
probability mass function.
So we are asked to figure out what pZ of z is.
And again, this sort of notation can take a while to
get used to.
But always just come back to the basic definition.
And let's switch this to a k so we don't get confused.
So this is Z is a capital Z because
it's a random variable.
And this k is a little k because it's just denoting one
possible value that your random variable
Z could take on.
So coming to the definition of PMF, pZ of k is just the
probability that our new random variable Z takes on a
value of k.
OK.
And actually, you could sort of progress from here and do
this algebraically.
But I would prefer to look at this problem pictorially.
So the first step that I suggest we take is we figure
out, what are the possible values that Z could take on?
So for instance, for our PMF of X, we were told that X can
take on values in this set.
So I think a good place to start is figuring out, what is
the set of values that Z can take on?
So let's think about this a little bit.
I'm going to go up here.
So Z takes on values in what set?
Well, when X is equal to minus 3, Z takes on
a value of 9, right?
Because minus 3 squared is 9.
So that's one possible value of Z. Because since Z is a
function of X, the possible values of X influence what
values Z can take on.
Now when X takes on a value of minus 2, Z takes
on a value of 4.
Similarly, when X takes on a value of minus 1, Z takes on a
value of 1.
Now, when X takes on a value of 1, Z takes on a value of 1.
And we actually already have 1 in our set.
Similarly, 2 squared is 4 and 3 squared is 9.
But we already have those guys in our set.
So we're actually done.
Z can only take on one of these three values--
9, 4, or 1.
So now let's come down and sort of plot a new picture.
So before on this axis, I was plotting the value that the
random variable X takes on.
Now I'm going to plot the values that the random
variable Z can take on.
So Z can take on--
let's see.
1, 2, 3, 4, 5, 6, 7, 8, 9--
let me just number these really quickly.
OK.
So Z can take on a value of 1.
It can take on a value of 4.
And it can take on the value of 9.
And as we argued before, these three points are the only ones
that should have any probability mass over [them].
OK.
So now let's think to ourselves, when or how
frequently does Z take on each of these three values?
Well, Z takes on a value of 1 precisely when X takes on a
value of minus 1 or 1.
Take a moment to sort of convince yourself of that.
There's no other scenario under which Z can take on a
value of 1.
Now X takes on a value of minus 1 with
probability 1 over a.
And X takes on a value of 1 with probability 1 over a.
And so Z therefore would take on a value of 1 with
probability 2 over a.
Now, I have sort of been appealing to your intuition.
But if you wanted to sort of say in math what I've been
saying, I was claiming that the event that Z takes on a
value of 1 is equal to the event that X takes on a value
of minus 1 or X takes on a value of 1.
And now clearly, these two events are disjoint.
X can't both simultaneously be 1 and minus 1.
So from this, if you were to take the probability, you
would get the probability that Z is equal to 1 is just the
sum of these two probabilities.
And they both have value 1 over a.
So we get 2 over a.
So let's come over here and do 2 over a.
So now let's take a look at the next value.
So what's the probability that Z takes on a value of 4?
Well, it's very similar logic.
Z can only take on the value of 4 if X takes on a value of
minus 2 or if X takes on a value of 2.
And X takes on a value of minus 2 or 2 with probability
4 over a plus 4 over a, which is 8 over a.
And I'll actually just sort of do this logic
out one more time.
So we have the event that Z is equal to 4 is the event that X
is equal to minus 2 or X is equal to 2.
These two events are disjoint.
So to compute the probability, it's just the sum of the
probabilities of those two events.
And we know that the probability that X takes on a
value of minus 2 is 4 over a.
And similarly, the probability that X takes on a value of
plus 2 is 4 over a.
So you get 8 over a.
And now I won't repeat out the math for our last candidate.
But sort of same argument--
Z takes on a value of 9 when X takes on a value
of minus 3 or 3.
So you add 9 over a and 9 over a.
And you get 18 over a.
Mind you, my picture is not to scale.
So we're essentially done, because we already computed
the value of the parameter a in part (a).
So now I just want to write this out in a clean form,
because we've mainly been drawing the PMF pictorially.
But you should be comfortable with both representations.
So to actually represent the PMF of Z algebraically, you
would say that pZ of k is equal to 2 over a.
So a was 28, right?
So 2 over 28 for k equal to 1.
8 over 28 for k is equal to 4.
And 18 over 28 for k is equal to 9.
And let's just do a quick--
oh.
And sorry, this is not complete.
You should always specify 0 otherwise, right?
Because if I didn't say that, then there would be a chance
that this didn't sum to 1.
And you wouldn't actually have a valid PMF.
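
As a compact restatement of the algebraic form just described, and taking $a = 28$ from part (a), the PMF of $Z = X^2$ can be written as below. The first expression is the general rule for a function of a random variable (add up $p_X$ over every $x$ that maps to the same value $k$); the cases are the three values worked out above.

$$
p_Z(k) \;=\; \sum_{x:\, x^2 = k} p_X(x) \;=\;
\begin{cases}
2/28, & k = 1,\\
8/28, & k = 4,\\
18/28, & k = 9,\\
0, & \text{otherwise.}
\end{cases}
$$
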
So let's just do a quick sanity check.
Before we had argued that Z can only take on three values.
So here, we have Z can only take on three values--
1, 4, or 9.
So that's consistent.
And then another sanity check you should make is that these
in fact do sum to 1.
And you see pretty quickly that they do.
2 plus 8 plus 18 is 28.
So it looks like we did it right.
And then of course, you can simplify these.
But I'm not going to bother to do that now.
So the hope is that after having done this problem you
feel more comfortable with what it means to be a PMF and
how to manipulate them.
So if you have a random variable, which is sort of a
simple function of another random variable, you should be
comfortable computing the PMF of the new random variable.
So that's it for today.
See you next time.
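
The pictorial argument above can also be checked numerically. The following is a minimal sketch, assuming the PMF of X is $p_X(x) = x^2/a$ on $x \in \{-3, \dots, 3\}$; that form is inferred from the probabilities quoted in the walkthrough ($1/a$, $4/a$, $9/a$) and is not stated in this part of the transcript. The variable names are illustrative.

```{r}
# Assumption: p_X(x) = x^2 / a on x = -3, ..., 3, inferred from the
# probabilities quoted in the walkthrough (1/a, 4/a, 9/a).
x   <- -3:3
a   <- sum(x^2)          # normalising constant from part (a)
p_x <- x^2 / a

# Z = X^2: add up p_X over every x that maps to the same value of Z.
z   <- x^2
p_z <- tapply(p_x, z, sum)
p_z <- p_z[p_z > 0]      # drop z = 0, which carries no probability mass

print(a)
print(p_z)
print(sum(p_z))          # sanity check: a valid PMF sums to 1
```

Grouping with `tapply` mirrors the step "Z takes the value 1 precisely when X is minus 1 or 1": the probabilities of disjoint events that lead to the same value of Z are added.
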


# 2. Sampling people on buses

![](C:/Users/qp/Pictures/Screenshots/2. Sampling people on buses - 1.png)
![](C:/Users/qp/Pictures/Screenshots/2. Sampling people on buses - 2.png)
![](C:/Users/qp/Pictures/Screenshots/2. Sampling people on buses - 3.png)
![](C:/Users/qp/Pictures/Screenshots/2. Sampling people on buses - 4.png)
![](C:/Users/qp/Pictures/Screenshots/2. Sampling people on buses - 5.png)
Hi.
In this problem, we're dealing with buses of students going
to a job convention.
And in the problem, we'll be exercising our
knowledge of PMFs--
probability mass functions.
So we'll get a couple of opportunities to write out
some PMFs, and also calculating expectations or
expected values.
And also, importantly, we'll actually be exercising our
intuition to help us not just rely on numbers, but also to
just have a sense of what the answers to some probability
questions should be.
So the problem specifically deals with
four buses of students.
So we have buses, and each one carries a
different number of students.
So the first one carries 40 students, the second one 33,
the third one has 25, and the last one has 50 students for a
total of 148 students.
And because these students are smart, and they like
probability, they are
interested in a couple questions.
So suppose that one of these 148 students is chosen
randomly, and so we'll assume that what that means is that
each one has the same probability of being chosen.
So they're chosen uniformly at random.
And let's assign a couple of random variables.
So we'll say X corresponds to the number of students in the
bus of the selected student.
OK, so one of these 148 students is selected uniformly
at random, and we'll let X correspond to the number of
students in that student's bus.
So if a student from this bus was chosen, then X would be
25, for example.
OK, and then let's come up with another random variable,
Y, which is almost the same thing.
Except instead of now selecting a random student,
we'll select a random bus.
Or equivalently, we'll select a random bus driver.
So each bus has one driver, and instead of selecting one
of the 148 students at random, we'll select one of the four
bus drivers also uniformly at random.
And we'll say the number of students in that driver's bus
will be Y. So for example, if this bus driver was selected,
then Y would be 33.
OK, so the main problem that we're trying to answer is what
do you expect the expectation--
which one of these random variables do you expect to
have the higher expectation or the higher expected value?
So, would you expect X to be higher on
average, or Y to be higher?
And what would be the intuition for this?
So obviously, we can actually write out the PMFs for X and
Y. These are just discrete random variables.
And we can actually calculate out what the expectation is.
But it's also useful to exercise your intuition, and
your sense of what the answer should be.
So it might not be immediately clear which one would be
higher, or you might even say that maybe it doesn't make a
difference.
They're actually the same.
But a useful way to approach some of these questions is to
try to take things to the extreme and see
how that plays out.
So let's take a simpler example and take it to the
extreme and say, suppose instead of a set of four buses carrying these
numbers of students,
we have only two buses--
one bus that has only 1 student, and another
bus that has 1,000 students.
OK.
And suppose we ask the same question.
Well, now if you look at it, there's a total of 1,001
students now.
If you select one of the students at random, it's
overwhelmingly more likely that that student will be one
of the 1,000 students on this huge bus.
It's very unlikely that you'll get lucky and select the one
student who is by himself.
And so because of that, you have a very high chance of
selecting the bus with the high number of students.
And so you would expect X, the number of
students, to be high--
to be almost 1,000 in the expectation.
But on the other hand, if you selected the driver at random,
then you have a 50/50 chance of selecting
this one or that one.
And so you would expect the expectation there to be
roughly 500 or so.
And so you can see that if you take this to the extreme, then
it becomes more clear what the answer would be.
And the argument is that the expectation of X should be
higher than the expectation of Y, and the reason here is that
because you select the student at random, you're more likely
to select a student who is in a large bus, because that bus
just has more students to select from.
And because of that, you're more biased in favor of
selecting large buses, and therefore, that makes X higher
in expectation.
OK, so that's the intuition behind this problem.
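
The two-bus thought experiment is easy to put into numbers. This is a small sketch with illustrative variable names; it only uses the bus sizes 1 and 1,000 from the extreme example.

```{r}
# Bus sizes from the two-bus thought experiment.
sizes <- c(1, 1000)

# X: pick a student uniformly at random, so a bus of size n is chosen
# with probability n / (total number of students).
e_x <- sum(sizes * sizes / sum(sizes))

# Y: pick a driver (that is, a bus) uniformly at random.
e_y <- mean(sizes)

print(c(E_X = e_x, E_Y = e_y))
```

The first value comes out just under 1,000 and the second is 500.5, matching the "almost 1,000" versus "roughly 500" estimates in the argument.
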
And now, let's actually go through some of the
mechanics and write out what the PMFs and the calculation
for the expectation would be to verify that our intuition
is actually correct.
OK, so we have two random variables that are defined.
Now let's just write out what their PMFs are.
So the PMF--
we write it as little p of capital X and little x.
So the random variable-- what we do is we say the
probability that it will take on a certain value, right?
So what is the probability that X will be 40?
Well, X will be 40 if a student from
this bus was selected.
And what's the probability that a student from this bus
is selected?
That probability is 40/148, because there's 148 students,
40 of whom are sitting in this bus.
And similarly, X will be 33 with probability 33/148, and X
will be 25 with probability 25/148.
And X will be 50 with probability 50/148.
And it will be 0 otherwise.
OK, so there is our PMF for X, and we can do the same thing
for Y. The PMF of Y--
again, we say what is the probability that Y will take
on certain values?
Well, Y can take on the same values as X can, because we're
still dealing with the number of students in each bus.
So Y can be 40.
But the probability that Y is 40, because we're selecting
the driver at random now, is 1/4, right?
Because there's a 1/4 chance that we'll pick this driver.
And the probability that Y will be 33 will also be 1/4,
and the same thing for 25 and 50.
And it's 0 otherwise.
OK, so those are the PMFs for our two random variables, X
and Y. And we can also draw out what the PMFs look like.
So if this is 25, 30, 35, 40, 45, and 50, then the
probability that it's 25 is 25/148.
So we can draw a mass right there.
For 33, it's a little higher, because it's
33/148 instead of 25.
For 40, it's even higher still.
It's 40/148.
And for 50, it is still higher, because it is 50/148.
And so you can see that the PMF is more heavily favored
towards the larger values.
We can do the same thing for Y, and we'll notice that
there's a difference in how these distributions look.
So if we do the same thing, the difference now is that all
four of these masses will have the same height.
Each one will have height 1/4, whereas this one for X, it's
more heavily biased in favor of the larger ones.
And so because of that, we can actually now calculate what
the expectations are and figure out whether or not our
intuition was correct.
OK, so now let's actually calculate out what these
expectations are.
So as you recall, the expectation is calculated out
as a weighted sum.
So for each possible value of X, you take that value and you
weight it by the probability of the random variable taking
on that value.
So in this case, it would be 40 times 40/148 plus 33 times
33/148 plus 25 times 25/148 plus 50 times 50/148.
And if you do out this calculation, what you'll get
is that it is around 39.
Roughly 39.
And now we can do the same thing for Y. But for Y, it's
different, because now instead of weighting it by these
probabilities, we'll weight it by these probabilities.
So each one has the same weight of 1/4.
So now we get 40 times 1/4 plus 33 times 1/4
plus 25 times 1/4 plus 50 times 1/4.
And if you do out this arithmetic, what you get is
that this expectation is 37.
And so what we get is that, in fact, after we do out the
calculations, the expected value of X is indeed greater
than the expected value of Y, which confirms our intuition.
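
The two weighted sums can be reproduced in a few lines. This is a sketch using only the four bus sizes given in the problem; the variable names are illustrative.

```{r}
sizes <- c(40, 33, 25, 50)                       # students per bus

p_x <- sizes / sum(sizes)                        # PMF of X: student chosen uniformly
p_y <- rep(1 / length(sizes), length(sizes))     # PMF of Y: driver chosen uniformly

e_x <- sum(sizes * p_x)                          # 5814 / 148
e_y <- sum(sizes * p_y)                          # 148 / 4

print(round(c(E_X = e_x, E_Y = e_y), 2))
```

The only difference between the two lines that compute the expectations is the weight vector, which is exactly the point of the problem: the same values, weighted by bus size in one case and equally in the other.
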
OK, so this problem, to summarize-- we've reviewed how
to write out a PMF and also how to calculate expectations.
But also, we've got a chance to figure out some intuition
behind some of these problems.
And so sometimes it's helpful to take simpler things and
take things to the extreme and figure out intuitively whether
or not the answer makes sense.
It's useful just to verify whether the numerical answer
that you get in the end is correct.
Does this actually make sense?
It's a useful guide for when you're solving these problems.
OK, so we'll see you next time.


# 3. From tail probabilities to expectations

![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 1.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 2.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 3.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 4.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 5.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 6.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 7.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 8.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 9.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 10.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 11.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 12.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 13.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 14.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 15.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 16.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 17.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 18.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 19.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 20.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 21.png)
![](C:/Users/qp/Pictures/Screenshots/3. From tail probabilities to expectations - 22.png)
Hi.
In this problem, we're going to explore an alternative way
of calculating expected values for a special
class of random variables.
And the class of random variables we're going to
consider in this problem are those that take on only
non-negative integer values.
So the random variable X can only take on values 0, 1, 2,
3, 4, and so on.
And so first, let's recall what the standard definition
of expected value is.
The expected value of a discrete random variable X is
just a weighted sum.
So you take all possible values that the random
variable can take on.
So we'll sum k from negative infinity to infinity.
And we'll weight the value that it can take on by the
probability that the random variable actually takes on
that particular value.
So this is the PMF evaluated at some value k.
And that's the weight applied to that value.
And we can simplify this formula a little bit if we
consider only random variables that can take on only
non-negative integer values.
So we can actually wipe out everything from negative
infinity to negative 1 because the variable can't take on
those values.
And we can also discard the case where k equals 0, because
when k equals 0, this expression is zero.
So it doesn't add anything.
OK.
What we want to do now is to show in this problem that
there's an alternative way of calculating the expected value
when you have a non-negative integer
valued random variable.
And that is given by this alternative summation.
So you're also summing k from 1 to infinity.
But now you're actually summing some probabilities.
And we call these things tail probabilities because it's the
probability that the random variable is
at least some value.
So imagine that you have a PMF and you take a k value.
And this will take everything to the right of that k value.
So you can imagine that this is some sort of tail of the
distribution.
And now you sum this from k equals 1 to infinity.
And as it turns out, as we'll show very soon, these two
actually give you the same value for this
class of random variables.
All right.
So let's actually get started and do that.
So part (a) is to show that this is true.
OK.
So let's write out again what it is that
we're trying to show.
So we're trying to show that this somehow will give you the
same expression as this summation.
All right.
So well, the only thing we can really do here is try to
substitute in what this tail probability is.
Well, what is this tail probability?
It's just adding up probabilities in the tail.
And so we can actually write this probability as a sum of i
from k to infinity of the PMF of X evaluated at i.
So this probability is equivalent to this.
All we're doing is just writing it a different way.
This is just the same as adding up
probabilities from k onwards.
And this is the same as adding up
probabilities from k onwards.
All right.
So now we have something that is starting to
look more like this.
We at least have a PMF of X involved.
But it still has a double summation.
And it's missing some ks.
So let's see how we can resolve that.
Well, the problem gives you a hint: try exchanging the
order of these summations.
So right now, we're summing k first, and then i.
All right.
But we have to be a little bit careful,
because we can't just swap these and call it a day.
The limits of the summations have to change as well.
So let's actually go here now to this plot and see how we
actually interchange these summations.
So what we're really doing is as we currently have it set
up, first the outer summation,
increments k from 1 to infinity.
So I have k here.
So it goes like this.
And then the inner summation goes from k to infinity.
So let's just see exactly what values of k and i we're
summing here.
So for k equals 1, we sum from i equals 1 to infinity.
So it's really summing up everything like this way,
going up vertically.
And then once that's done, you come out to the outer
summation again.
You increment k to 2.
And then you sum i from 2 to infinity.
So k goes to 2.
And then you sum i from 2 to infinity.
And then k goes to 3.
And you sum from i equals 3 to infinity, and so on.
So what you see is that really, we're just covering
all these points in this kind of upper triangular
portion of this grid.
Now, in order to change summations, what we're
effectively doing is instead of summing things vertically
like this, we're going to sum things horizontally.
So i now becomes the outer summation.
And k becomes the inner summation.
So how do we do this?
Well, for each i--
let's say for i equals 1, we'll take--
well there's only one value.
So k is 1.
For i equals 2, we'll take k from 1 to 2 to this diagonal.
For i equals 3, we'll take k from 1 to 3 to this
diagonal, and so on.
So now instead of summing vertically, we'll sum
horizontally across.
And what that amounts to, using symbols, is we sum i
from 1 to infinity now.
And the inside-- we sum k from 1 to i.
And the thing that we're summing still stays the same.
So again, let's just try to verify that this is correct.
So for i equals 1, you sum from k equals 1 to 1.
So you pick up this one point.
For i equals 2, you sum from k equals 1 to 2.
So you pick up these two.
For i equals 3, you sum from k equals 1 to 3.
And you pick up these, and then so on.
So it looks like we have it correct.
Now we've interchanged the summation.
And it turns out that this helps.
Because we can now actually simplify what this inner
summation is.
So if you notice, the thing that you're summing over is k.
But what you're summing doesn't involve k at all.
And so really what you're doing is you're just summing
lots of copies of the same thing, of the same value.
And in particular, you're summing from k equals 1 to i.
So you have i copies of the same thing.
So this inner summation actually simplifies to just i
copies of what you're summing, which is this.
And now we are basically done.
Because if we compare these two, we see that they are
effectively the same.
The only difference is that here we're using k as the
variable that we're summing over, and here we're using i.
But really, it's the same thing.
So what we've done here in part (a) is to show that in
fact, this is correct.
This is actually an alternative way of calculating
expectations for this class of random variables.
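
Collected in one line, the steps of part (a) read as follows. This is only a restatement of the argument above in symbols; the interchange in the middle is legitimate because every term $p_X(i)$ is non-negative.

$$
\sum_{k=1}^{\infty} P(X \ge k)
= \sum_{k=1}^{\infty} \sum_{i=k}^{\infty} p_X(i)
= \sum_{i=1}^{\infty} \sum_{k=1}^{i} p_X(i)
= \sum_{i=1}^{\infty} i \, p_X(i)
= E[X].
$$
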
All right.
And the trick here really was to write things out and then
do a clever change of summation.
All right.
So why is it useful?
Well, it could be the case that, depending on how the
random variable is defined, maybe this way of calculating
the expectation is easier than this way of calculating the
expectation.
And so it's useful to have multiple ways of attacking the
same problem.
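
To see the two formulas agree on a concrete case, here is a small numerical check. The binomial distribution with parameters 10 and 0.3 is an illustrative choice of a non-negative integer valued random variable, not something taken from the problem.

```{r}
# Illustrative choice (not from the problem): X ~ Binomial(10, 0.3).
n <- 10
p <- 0.3
k <- 0:n

# Standard definition: sum of k * p_X(k).
e_standard <- sum(k * dbinom(k, size = n, prob = p))

# Tail-sum formula: sum over k >= 1 of P(X >= k).
# pbinom(q, lower.tail = FALSE) returns P(X > q), so P(X >= k) = P(X > k - 1).
tail_probs <- pbinom((1:n) - 1, size = n, prob = p, lower.tail = FALSE)
e_tail <- sum(tail_probs)

print(c(standard = e_standard, tail_sum = e_tail))
```

Both numbers should equal the binomial mean, 10 times 0.3.
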
All right.
So now let's move on to part (b), where we actually will
try to exercise this new formula on a
particular random variable.
So let me erase this part.
And the random variable that we're dealing with is random
variable Y. And we're told that the PMF is 1 over b minus
a plus 1 when y is a through b.
All right.
And our a and b are integers.
OK.
So what we have is it takes on some value for some range of
y's from a to b.
And a and b are integers.
And they're non-negative.
So what we see is that Y is in fact one of these types of
random variables that we're talking about in this problem.
All right.
So the question is, what is the expected value of this
random variable?
And now instead of using the kind of standard way of
calculating it, let's use this new formula
just to get some exercise.
All right.
So the thing that we need to do is calculate what these
tail probabilities are.
So let's actually calculate that.
So what is the probability that Y is at least some k?
Well, it turns out that this depends on what k is.
And in fact, there will be three different regimes.
And it might actually be helpful if we draw out what
this PMF looks like.
So it goes from a to b.
And the PMF is given by this 1 over b minus a plus 1.
But notice that this is just a constant with respect to y.
It doesn't actually depend on what y is.
And so what we have really is actually just a uniform
distribution.
So everything is height 1 over b minus a plus 1.
All right.
So let's think about the three different regimes.
Well, if k is, say, less than or equal to a, then the
probability that the random variable is greater than or
equal to k, well, that covers the entire PMF, right?
And so by definition, that has to sum up to 1.
So we know that when k is less than or equal to a, this tail
probability is just 1.
And the other simple case is when k is greater than b.
So when it's out here, well, there's no more PMF to the
right of k.
And so when it's all the way out here, the tail probability is
just going to be 0.
So when k is at least b plus 1, it's zero.
And the interesting regime is really
when it's in the middle.
So when k is between a and b--
so let's just plop down a k somewhere.
Well, in order to calculate this tail probability, we just
have to calculate how many of these kind of sticks are there
that go from k to b.
Well, let's just pretend that, say, this is k.
Well, how many of these sticks are there?
There are b minus k plus 1 of them.
And each one contributes 1 over b minus a plus 1.
So it's b minus k plus 1 over b minus a plus 1.
All right.
And so now we have the tail probabilities in all the three
different regimes.
And what we have to do now is just sum over all
the possible ks.
And that will give us the answer.
And so let's finish up that last step.
So the expectation of Y is just the sum of all these.
So let's break things up into the different regimes.
So for k, when k is between 1 and a, this tail probability
is 1, right?
So we just have k equals-- from 1 to a, you get--
just sum up 1s.
And then when k is between a plus 1 and b,
you get this thing.
So we'll plug that in. b minus k plus 1 over b
minus a plus 1.
And then when k is greater than b, you get 0.
So we don't even have to include that anymore.
All right.
And so now the rest of it is just algebra.
The first term is easy.
There's just a copies of 1.
So you just get an a.
And this, we can just do some more algebra and simplify
things out.
And what we get is that this sum actually ends up being b
minus a over 2, which when we combine, we get that this is
just b plus a over 2 becomes our final answer.
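
The algebra that was skipped can be filled in as follows. The substitution $j = b - k + 1$ (so $j$ runs from $b - a$ down to $1$ as $k$ runs from $a + 1$ to $b$) is one convenient way to do it, not necessarily the one used on the board.

$$
E[Y]
= \sum_{k=1}^{a} 1 \;+\; \sum_{k=a+1}^{b} \frac{b - k + 1}{b - a + 1}
= a + \frac{1}{b - a + 1} \sum_{j=1}^{b-a} j
= a + \frac{1}{b - a + 1} \cdot \frac{(b - a)(b - a + 1)}{2}
= a + \frac{b - a}{2}
= \frac{a + b}{2}.
$$
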
And do we trust that this is actually correct?
Well, we can pretty easily verify it in this case,
because the random variable, as we already noted, was
actually uniform between a and b.
And so we know that for a uniform random variable, the
expected value is just the midpoint, which is exactly b
plus a over 2.
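
The same check can be done numerically. In this sketch the endpoints 3 and 8 are arbitrary illustrative values; any integers with 0 ≤ a ≤ b would serve.

```{r}
# Illustrative endpoints (not from the problem).
a <- 3
b <- 8
y <- a:b
p_y <- rep(1 / (b - a + 1), length(y))

# Tail probability P(Y >= k), computed directly from the PMF.
tail_prob <- function(k) sum(p_y[y >= k])

# Tail-sum formula; terms with k > b are zero, so stopping at b is enough.
e_tail <- sum(sapply(seq_len(b), tail_prob))

print(c(tail_sum = e_tail, midpoint = (a + b) / 2))
```
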
OK.
So to summarize this problem, we've shown that for a certain
type of random variables, you have an alternative way of
calculating expectations which may in some cases be easier
than using the standard definition.
And then we've applied that to a specific
case of a random variable.
And this also gave us a good opportunity to get some more
practice calculating out probabilities and working with PMFs
and expectations.
All right.
So see you next time.


# 4. Coupon collector problem

![](C:/Users/qp/Pictures/Screenshots/4. Coupon collector problem - 0.png)
![](C:/Users/qp/Pictures/Screenshots/4. Coupon collector problem - 1.png)
![](C:/Users/qp/Pictures/Screenshots/4. Coupon collector problem - 2.png)
![](C:/Users/qp/Pictures/Screenshots/4. Coupon collector problem - 2.5.png)
![](C:/Users/qp/Pictures/Screenshots/4. Coupon collector problem - 3.png)
![](C:/Users/qp/Pictures/Screenshots/4. Coupon collector problem - 4.png)
![](C:/Users/qp/Pictures/Screenshots/4. Coupon collector problem - 5.png)
![](C:/Users/qp/Pictures/Screenshots/4. Coupon collector problem - 6.png)
![](C:/Users/qp/Pictures/Screenshots/4. Coupon collector problem - 7.png)
![](C:/Users/qp/Pictures/20221002_202011.jpg)
![](C:/Users/qp/Pictures/20221002_202052.jpg)
In this exercise, we'll be looking at a problem also known as the [][coupon collector's problem].  We have a set of k coupons, or grades in our case.  And in each time slot, one random grade is revealed to us.  And we'd like to know how long it would take for us to collect all k grades.  In our case, k is equal to 6.  Now the key to solving the problem is essentially twofold.  First, we'll have to find a way to intelligently define [a] sequence [of] random variables that capture, essentially, [the] stopping time of this process.  And then we'll employ the idea of linearity of expectations in breaking down this value into simpler terms.  So let's get started.  

We'll define Yi as the number of papers till we see the i-th new grade.  What does that mean?
Well, let's take a look at an example.
Suppose, here we have a timeline from no paper yet,
first paper, second paper, third paper,
so on, and so forth.
Now, if we got grade A on the first slot, grade A minus on
second slot, A again on the third slot, let's say there's
a fourth slot, we got B.
According to this process, we see that Y1 is always 1,
because whatever we got on the first slot
will be a new grade.
Now, Y2 is 2, because the second paper is,
again, a new grade.
On the third paper we got a grade, which is the same as
the first grade.
So that would not count as any Yi.
And the third time we saw a new grade would
now be paper four.
According to this notation, we're interested in knowing
what is the expected value of Y6, which is the time it takes
to receive all six grades.
So so far this notation isn't really helping us in solving
the problem, but kind of just stating a different way.
It turns out, it's much easier to look at the following
variable derived from the Yis.
We'll define Xi as the difference between Y(i+1) and Yi.
And in words, it says, Xi is the number of papers you need
until you see the (i+1)-th new grade, after you have
received i new grades so far.
So in this case, if we call time 0 Y0, then X0 will be the
difference between Y1 and Y0, which is always 1.
And the difference between Y2 and Y1 will be X1.
And the difference between Y3 and Y2 will be X2,
and so on.
OK?
Through this notation we see that Y6 now can be written as
the summation of i equal to 0 to 5, Xi.
So all I did was to break down Y6 into a sequence of
summations of the differences, like Y6 minus Y5, Y5
minus Y4, and so on.
It turns out, this expression will be very useful.
OK.
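
The definitions of the Y's and X's can be made concrete on the four-paper example from the timeline. This is a small sketch; the variable names are illustrative.

```{r}
# The four-paper example from the walkthrough.
grades <- c("A", "A-", "A", "B")

# Y_i: index of the paper on which the i-th new grade first appears.
y <- which(!duplicated(grades))

# X_i = Y_(i+1) - Y_i with Y_0 = 0, so the X's are indexed from 0.
x <- diff(c(0, y))
names(x) <- paste0("X", seq_along(x) - 1)

print(y)                         # 1 2 4
print(x)
print(sum(x) == y[length(y)])    # the differences telescope back to the last Y
```

The last line is the telescoping identity in miniature: adding up the gaps between consecutive new grades recovers the time of the latest new grade.
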
So now that we have the two variables Y and X, let's see
if it will be easier to look at the distribution of X in
studying this process.
Let's say, we have seen a new grade so far--
one.
How many trials would it take for us to see
the second new grade?
It turns out it's not that hard.
In this case, we know there is a total of six grades, and we
have seen one of them.
So that leaves us five more grades that
we'll potentially see.
And therefore, on any random trial after that, there is a
probability of 5 over 6 that we'll see a new grade.
And hence, we know that X1 has a geometric distribution with
a success probability, or a parameter, 5/6.
Now, more generally, if we extend this idea further, we
see that Xi will have a geometric distribution of
parameter 6 minus i over 6.
And this is due to the fact that so far we have already
seen i new grades.
And that will be the success probability of seeing a
further new grade.
So from this expression, we know that the expected value
of Xi will simply be the inverse of the parameter of
the geometric distribution, which is 6 over 6 minus i or 6
times 1 over 6 minus i.
And now we're ready to compute a final answer.
So from this expression we know expected value of Y6 is
equal to the expected value of sum of i equal to 0 to 5 Xi.
And by the linearity of expectation, we can pull out
the sum and write it as the sum from i equal to 0 to
5 of the expected value of Xi.
Now, since we know that the expected value of Xi is given by the
preceding expression,
we see that this term is equal to 6 times the sum from i
equals 0 to
5 of 1 over 6 minus i.
Or written in another way, this is equal to 6 times the sum from i
equal to 1 to 6 of 1 over i.
And all I did here was to, essentially, change the
variable, so that these two summations contain exactly the
same terms.
And this will give us the answer, which is 14.7.
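
The value 14.7 can be reproduced exactly and also sanity-checked by simulation. The sketch below assumes that each paper's grade is equally likely to be any of the six grades and independent of the others, which is what the 5/6 success probability in the walkthrough implies. The helper name `papers_until_all_seen` and the seed are illustrative.

```{r}
k <- 6

# Exact value from linearity of expectation: E[Y_k] = k * (1 + 1/2 + ... + 1/k).
exact <- k * sum(1 / seq_len(k))

# Monte Carlo check under the uniform, independent grades assumption.
papers_until_all_seen <- function(k) {
  seen <- logical(k)
  n <- 0
  while (!all(seen)) {
    seen[sample.int(k, 1)] <- TRUE   # reveal one random grade
    n <- n + 1
  }
  n
}

set.seed(1)  # arbitrary seed, only for reproducibility
simulated <- mean(replicate(10000, papers_until_all_seen(k)))

print(c(exact = exact, simulated = simulated))
```

The simulated average should land close to the exact value, with the usual sampling noise.
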
Now, more generally, we can see that there's nothing
special about number 6 here.
We could have substituted 6 with a number, let's say, k.
And then we'll get E of Yk-- let's say, there's more than
six labels.
And this will give us k times the summation from i equal to 1 to k
of 1 over i.
Interestingly, it turns out this quantity has an
asymptotic expression that, essentially, is roughly equal
to k times the natural logarithm of k.
And this is known as the scaling law for the coupon
collector's problem, which says that it essentially takes about k
times log(k) many trials until we collect all k coupons.
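
To get a feel for how good the k log(k) approximation is, the exact sum can be compared with it for a few values of k. The values of k below are arbitrary illustrative choices.

```{r}
k <- c(6, 100, 10000)

exact  <- k * sapply(k, function(n) sum(1 / seq_len(n)))   # k * (1 + 1/2 + ... + 1/k)
approx <- k * log(k)                                        # k * natural log of k
ratio  <- exact / approx

print(data.frame(k = k, exact = exact, approx = approx, ratio = ratio))
```

The ratio moves toward 1 as k grows, but slowly: the scaling law describes the growth rate, and for small k such as 6 the approximation noticeably undershoots the exact value of 14.7.
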
And that'll be the end of the problem.
See you next time.


# 5. Conditioning example

![](C:/Users/qp/Pictures/Screenshots/5. Conditioning example - 1.png)
![](C:/Users/qp/Pictures/Screenshots/5. Conditioning example - 2.png)
![](C:/Users/qp/Pictures/Screenshots/5. Conditioning example - 3.png)
![](C:/Users/qp/Pictures/Screenshots/5. Conditioning example - 4.png)
![](C:/Users/qp/Pictures/Screenshots/5. Conditioning example - 5.png)
In this problem, we're asked to show that a certain
statement is true.
And in doing so, we'll exercise conditional
probability, independence, the law of total probability, and
the geometric distribution.
And as a quick review, conditional probability is
basically shrinking our universe down into a smaller
universe and doing probability within only
that smaller universe.
So what we know is that if A and B are two events, then the
conditional probability of A, given that event B has
happened, is equal to the probability of the
intersection that A and B both happened divided by the
probability that event B happens.
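
Written out, the definition just described is (assuming, as usual, that the conditioning event has positive probability):

$$
\mathbf{P}(A \mid B) = \frac{\mathbf{P}(A \cap B)}{\mathbf{P}(B)}, \qquad \mathbf{P}(B) > 0.
$$
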
And also, the geometric distribution, you could think
of it as the number of coin flips you need until you get
your first heads, where p is the probability of getting
heads on any given coin flip.
And so this is a plot of what the PMF looks like for a
generic geometric distribution.
You can see that it falls off geometrically as the number of
trials increases.
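
The geometric fall-off can be seen numerically with a small sketch. The value p = 0.3 is an arbitrary choice for illustration, not one taken from the problem. Each PMF value is the previous one multiplied by 1 minus p.

```python
def geometric_pmf(k: int, p: float) -> float:
    """P(first heads occurs on flip k) = (1 - p)^(k - 1) * p for k = 1, 2, ..."""
    if k < 1:
        return 0.0
    return (1 - p) ** (k - 1) * p


p = 0.3  # hypothetical heads probability, chosen only for illustration
pmf = [geometric_pmf(k, p) for k in range(1, 8)]
# Ratio of consecutive PMF values; each should be 1 - p.
ratios = [pmf[j + 1] / pmf[j] for j in range(len(pmf) - 1)]
print([round(v, 4) for v in pmf])
print([round(r, 4) for r in ratios])
```
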
So let's now go back to the problem that we
have to deal with.
What we're given is that X and Y are two i.i.d., which means
independent and identically distributed.
So what does that actually mean?
It means that X and Y have the same distribution and they are
independent of one another.
And what distribution did they have?
They are both geometrically distributed, with the same
parameter, p.
So one way to think about that is that you have two friends
who are independently flipping coins with the same
probability of heads, both equal to p.
And X is the number of flips until the first friend gets
his first heads, and Y is the number of flips until the
second friend gets his first heads.
And what are we asked to show?
We're asked to show this fact, that the probability that X
equals i for i equals 1, 2, all the way through n minus 1,
given that X plus Y equals n is equal to 1 over n minus 1.
So let's actually try to parse this and see what it's saying.
The conditioning part is saying that the sum of X plus
Y is equal to n.
So going back to our example, what it's saying is that
your two friends require a total of n flips
between the two of them in order for each person to get
their first heads.
If that is the case, then the probability that your first
friend requires i flips to get his first heads is equal to 1
over n minus 1.
And the important thing here to note is that this
probability no longer depends on i.
So in fact, it turns out that the probability of requiring
one flip versus the probability of requiring n
minus 1 flips is all the same.
It's all 1 over n minus 1.
So it's kind of surprising.
And at the end, we'll come back and actually try to
figure out why this is the case.
But first, let's try to show that this is true.
So you can see that this fits into the general form of
conditional probability.
So X equals i, you could think of that as just event A. And X
plus Y equals n is the event B.
So let's just apply this definition
and see what we get.
So in that case, what we get is the probability of X equals
i and X plus Y equals n divided by the probability
that X plus Y equals n.
Now, what we can do is work with each of these two
separately.
So the numerator--
let's deal with that first.
What does it mean that X equals i and X
plus Y equals n?
Well, that's the same as saying that, well, the first
friend took i flips to get heads.
And in total, they took n flips.
Well, what does that imply about the second friend?
The second friend must have taken n minus i flips to get heads.
And so we can equivalently write the numerator as the
probability that X equals i and Y equals n minus i,
because if this is true, then this is true.
And if that is true, then this is true.
And now, what are we going to do?
Well, what we know is that X and Y are independent, because
we're given that as an assumption.
And now we apply independence.
And because they're independent, this probability
involving an intersection, we can just write that now as a
product of two different probabilities.
So this is just the probability that X equals i
times the probability that Y equals n minus i.
And now we actually know what each of these two are as well,
because we know that X and Y are both geometric random
variables, with parameter, p.
What is probability that X equals i?
Well, we can apply that.
We know what the PMF is.
So this is just 1 minus p to the power of i
minus 1 times p.
And what is the probability that Y equals n minus i?
Well, that means that the second friend needed n minus i
flips to get heads.
So there were n minus i minus 1 tails and then
heads at the end.
So we can simplify this, and this is just 1 minus p to the
n minus 2, because we can combine these
exponents, times p squared.
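
Collecting the steps for the numerator in one line, using the geometric PMF for both X and Y:

$$
\mathbf{P}(X = i,\, X + Y = n) = \mathbf{P}(X = i)\,\mathbf{P}(Y = n - i)
= (1-p)^{i-1}p \cdot (1-p)^{n-i-1}p = (1-p)^{n-2}p^2 .
$$
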
And now, let's deal with the denominator.
And to do that, what we'll do is apply or invoke the law of
total probability.
We'll split this up into lots of different cases.
So we're going to essentially do the same thing as this but
for more cases.
So the probability that X plus Y equals n, we can write that
as a sum of lots of different combinations.
So for X plus Y equal to n, we can think
of this as a partition.
So first, let's say we want X to be equal to some k.
And then we want X plus Y to equal n given that
X is equal to k.
And what do we sum k from?
Well, this is basically the partition.
So we know that X is a geometric random variable.
So it has to be at least 1.
You need at least one flip to get heads.
And in order for X plus Y to equal n, the most that X can
be is n minus 1, because we know that Y is also geometric.
So Y also has to be at least 1.
So in order for them to sum to n, X can be, at
most, n minus 1.
So that's why we sum to this.
And this forms our partition and how we applied total
probability.
And now we can further simplify this.
And we know that this probability, the probability
that X equals k is just a geometric.
So it's just 1 minus p to the k minus 1 times p.
And then what do we do here?
Well, given that X equals k and X plus Y equals n, what we
need is that in order for X plus Y equal to n, given that
X equals k, we need Y to equal n minus k.
But now, we know also that Y and X are independent.
So this conditioning actually doesn't matter.
So we can actually just ignore it.
And we can just plug in what is the probability of Y
equaling n minus k.
Well, the probability that Y equals n minus k is just
another geometric, where we have n minus k being the
number of flips needed for success.
So what we have is 1 minus p to the n minus k
minus 1 times p.
And we can simplify this and get that this is the sum from
k equals 1 to n minus 1 of 1 minus p to the--
if we combine these exponents, we get n
minus 2 times p squared.
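
In symbols, the total probability computation for the denominator reads:

$$
\mathbf{P}(X + Y = n) = \sum_{k=1}^{n-1} \mathbf{P}(X = k)\,\mathbf{P}(Y = n - k)
= \sum_{k=1}^{n-1} (1-p)^{k-1}p \cdot (1-p)^{n-k-1}p
= \sum_{k=1}^{n-1} (1-p)^{n-2}p^2 .
$$
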
So what we have here, let's highlight the
pieces that we want.
This is what we calculated for the numerator.
And this is what we have calculated for the
denominator.
So now let's put everything together, and we'll see that
the numerator is 1 minus p to the n minus 2 times p squared.
And the denominator--
well, what we're doing is, we're summing this over k.
But notice that k doesn't actually appear in the
summation anymore.
And so every single term within the
summation is the same.
We just have n minus 1 copies of them.
And so what we can really do is just write this as n minus
1 copies of what we're summing over, which is 1 minus p to
the n minus 2 times p squared.
And now we notice that everything cancels except for
the n minus 1.
And so what we get is what we wanted as our final answer,
that this does, in fact, equal 1 over n minus 1, which is
what we wanted.
And the last thing that we should verify is that this
actually is true for the range of i that we wanted.
And if we think about it, i is--
as we went through this, this was the case, that i has to be
somewhere between 1 and n minus 1.
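
The result can also be checked numerically. The sketch below is one possible verification, not part of the original solution: it builds the conditional PMF directly from the definition with exact fractions. The values n = 10 and p = 1/3 and the helper names are arbitrary choices.

```python
from fractions import Fraction


def geometric_pmf(k: int, p: Fraction) -> Fraction:
    """P(first heads on flip k) for a geometric random variable."""
    if k < 1:
        return Fraction(0)
    return (1 - p) ** (k - 1) * p


def conditional_pmf(n: int, p: Fraction) -> dict:
    """P(X = i | X + Y = n) for i = 1, ..., n - 1, with X, Y i.i.d. geometric(p)."""
    # Numerators: P(X = i) * P(Y = n - i), by independence.
    joint = {i: geometric_pmf(i, p) * geometric_pmf(n - i, p) for i in range(1, n)}
    # Denominator: total probability of the event X + Y = n.
    total = sum(joint.values())
    return {i: v / total for i, v in joint.items()}


cond = conditional_pmf(10, Fraction(1, 3))
print(cond)  # every value is 1/9
```

Changing p changes the numerators and the denominator by the same factor, so the conditional probabilities stay at 1 over n minus 1.
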
So now, let's take a step back and think about why it is the
case that when you condition on the sum, a geometric
somehow became more like a uniform, because in general,
the geometric falls off as the random variable takes on
higher values.
So it's less and less likely that you'll need more flips in
order to get heads.
So without this conditioning on the sum of X and Y, the
probability of X equaling i should
drop off as i increases.
But it just so happens that when you add this
conditioning, that X plus Y equals n, no matter what i is,
the probability is still the same.
So it looks more like a uniform.
So the question is, why is that?
One way to think about it is that because you're
conditioning on the sum being equal to n, some fixed number,
the universe that you're dealing with has changed.
And that is basically what's changing everything.
So let's make this a little bit more concrete.
Instead of dealing with i's and n's, let's
just pick some numbers.
Let's say n is 10.
So what it's saying is that your two friends combined,
they needed 10 flips between the two of them to get their
first heads, each one of them.
So now, let's say what's the probability that the first
friend needed only one flip?
So on the first flip, he got heads.
And compare that to the probability that he needed
nine flips--
so when i equals 9.
Without this conditioning, you know that it's more likely
that he'll get heads on the first flip than needing nine
flips to get heads, because that requires eight tails
first, followed by one heads.
But so why is it that when you condition on the sum, these
two probabilities now become the same?
It's as likely that he needed one flip versus nine flips.
Well the reason is that because you know that the sum
is equal to 10, then in order for the first friend to need
only one flip, that means the other friend, second friend,
needed nine flips.
And that is a less likely event.
And the combination of these, the first friend needing only
one is more likely.
But the second friend needing nine is less likely.
And the combination of those two, you need both of them to
happen, makes it so that it kind of decreases the
probability of the first friend needing only one flip.
And conversely, if you think about the first friend needing
nine flips, that is not a very likely outcome.
But paired with that, that also means that the second
friend needed only one flip.
And that is a more likely outcome.
And so again, the combination, they somehow cancel each other
out so that no matter what, you still get the same
probability.
And that's kind of the intuition behind why it is
that this actually becomes more like a uniform.
So the problem itself was more like a mechanical exercise,
where we used the ideas of conditional probability and
the geometric distribution along with some algebra.
But it's also useful to think more about the bigger picture
and see why it is that we're even interested in this
statement, why we want to show it, and also think about the
intuition behind why it's true.
So we'll see you next time.


# 6. Joint PMF drill 1

![](C:/Users/qp/Pictures/Screenshots/6. Joint PMF drill 1 - 1.png)
![](C:/Users/qp/Pictures/Screenshots/6. Joint PMF drill 1 - 2.png)
![](C:/Users/qp/Pictures/Screenshots/6. Joint PMF drill 1 - 3.png)
![](C:/Users/qp/Pictures/Screenshots/6. Joint PMF drill 1 - 4.png)
![](C:/Users/qp/Pictures/Screenshots/6. Joint PMF drill 1 - 5.png)
![](C:/Users/qp/Pictures/Screenshots/6. Joint PMF drill 1 - 6.png)
![](C:/Users/qp/Pictures/Screenshots/6. Joint PMF drill 1 - 7.png)
![](C:/Users/qp/Pictures/Screenshots/6. Joint PMF drill 1 - 8.png)
![](C:/Users/qp/Pictures/Screenshots/6. Joint PMF drill 1 - 9.png)
![](C:/Users/qp/Pictures/Screenshots/6. Joint PMF drill 1 - 10.png)
![](C:/Users/qp/Pictures/Screenshots/6. Joint PMF drill 1 - 11.png)
![](C:/Users/qp/Pictures/Screenshots/6. Joint PMF drill 1 - 12.png)
Welcome back guys.
Today we're going to work on a problem that tests your
knowledge of joint PMFs.
And we're also going to get some practice computing
conditional expectations and conditional variances.
So in this problem, we are given a set of
points in the xy plane.
And we're told that these points are equally likely.
So there's eight of them.
And each point has a
probability of 1/8 of occurring.
And we're also given this list of questions.
And we're going to work through them together.
So in part (a), we are asked to find the values of X that
maximize the conditional expectation of Y given X. So
jumping right in, this is the quantity we're interested in.
And so this quantity is a function of x.
You plug in various values of x.
And then this will spit out a scalar value.
And that value will correspond to the conditional expectation
of Y conditioned on the value of x that you put in.
So let's see, when x is equal to 0, for instance, let's
figure out what this value is.
Well, when x is equal to 0 we're living in a world,
essentially, on this line.
So that means that only these two
points could have occurred.
And in particular, Y can only take on the values of 1 and 3.
Now, since all these points in the unconditional universe
were equally likely, in the conditional universe they will
still be equally likely.
So this happens with probability 1/2.
And this happens with probability 1/2.
And therefore, the expectation would just be 3/2 plus 1/2
which is 4/2, or 2.
But a much faster way of seeing this-- and it's the
strategy that I'm going to use for the rest of the problem--
is to remember that expectation acts
like center of mass.
So the center of mass, when these two points are equally
likely, is just the midpoint, which of course is 2.
So we're going to use that intuition on the other ones.
So I'm skipping to x is equal to 2 because 1
and 3 are not possible.
So when x is equal to 2, Y can only take on the
values of 1 or 2.
Again, they're equally likely.
So the center of mass is in the middle which happens at
1.5 or 3/2.
Similarly, x is equal to 4.
We're living in this conditional universe, where Y
can take on any of these four values with
probability 1/4 each.
And so again, we expect the center of mass to
be at 1.5 or 3/2.
And this quantity is undefined otherwise.
OK, so we're almost done.
Now we just need to find which value of x maximizes this.
Well, let's see, 2 is the biggest quantity out of all of
these numbers.
So the maximum is 2.
And it occurs when x is equal to 0.
So we come over here.
And we found our answer.
x is equal to 0 is the value, which maximizes the
conditional expectation of Y given x.
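
Part (a) can be reproduced with a short sketch. The list of eight points below is reconstructed from the narration (the figure itself is not in the text), so treat it as an assumption. The function name is also just illustrative.

```python
from collections import defaultdict
from fractions import Fraction

# Eight equally likely (x, y) points, reconstructed from the narration.
POINTS = [(0, 1), (0, 3), (2, 1), (2, 2), (4, 0), (4, 1), (4, 2), (4, 3)]


def cond_expectation_y_given_x(points):
    """E[Y | X = x] for each x, when all points are equally likely."""
    ys_by_x = defaultdict(list)
    for x, y in points:
        ys_by_x[x].append(y)
    # Within each vertical line the points stay equally likely,
    # so the conditional expectation is the plain average of the y values.
    return {x: Fraction(sum(ys), len(ys)) for x, ys in ys_by_x.items()}


e_y_given_x = cond_expectation_y_given_x(POINTS)
best_x = max(e_y_given_x, key=e_y_given_x.get)
print(e_y_given_x, best_x)
```
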
So part (b) is very similar to part (a).
But there is slightly more computation involved.
Because now we're dealing with the variance and not an
expectation.
And variance is usually a little bit tougher to compute.
So we're going to start in the same manner.
But I want you guys to see if you can figure out intuitively
what the right value is.
I'm going to do the entire computation now.
And then you can compare whether your intuition matches
with the real results.
So variance of X conditioned on a particular value of y,
this is now a function of y.
For each value of y you plug in you're going to get out a
scalar number.
And that number represents the conditional variance of X when
you condition on the value of y that you plugged in.
So let's see, when y is equal to 0 we have a nice case.
If y is equal to 0 we have no freedom about what X is.
This is the only point that could have occurred.
Therefore, X definitely takes on a value of 4.
And there's no uncertainty left.
So in other words, the variance is 0.
Now, if y is equal to 1, X can take on a value of 0, a value
of 2 or a value of 4.
And these all have the same probability of occurring, of
1/3.
And again, the reasoning behind that is that all eight
points were equally likely in the unconditional universe.
If you condition on Y being equal to 1 these outcomes
still have the same relative frequency.
Namely, they're still equally likely.
And since there are three of them they now have a
probability of 1/3 each.
So we're going to go ahead and use a formula that hopefully,
you guys remember.
So in particular, variance is the expectation of X squared
minus the expectation of X all squared,
the whole thing squared.
So let's start by computing this number first.
So conditioned on Y is equal to 1--
so we're in this line--
the expectation of X is just 2, right?
The same center-of-mass argument.
So this, we have a minus 2 squared over here.
Now, X squared is only slightly more difficult.
With probability 1/3, X squared will take
on a value of 0.
With probability 1/3, X squared will take
on a value of 4.
I'm just doing 2 squared.
And with probability 1/3, X squared takes on a value of 4
squared or 16.
So writing down what I just said, we have 0 times 1/3
which is 0.
We have 2 squared, which is 4 times 1/3.
And then we have 4 squared, which is 16 times 1/3.
And then we have our minus 4 from before.
So doing this math out, we get, let's see, 20/3 minus
12/3, which is equal to 8/3.
So we'll come back up here and put 8/3.
So I realize I'm going through this pretty quickly.
Hopefully this step didn't confuse you.
Essentially, what I was doing is, if you think of X squared
as a new random variable, X squared, the possible values
that it can take on are 0, 4, and 16 when you're
conditioning on Y is equal to 1.
And so I was simply saying that that random variable
takes on those values with equal probability.
So let's move on to the next one.
So if we condition on Y is equal to 2 we're going to do a
very similar computation.
Oops, I shouldn't have erased that.
OK, so we're going to use the same formula that we just
used, which is the expectation of X squared given Y is equal to 2,
minus the expectation of X conditioned
on Y is equal to 2, all squared.
So conditioned on Y is equal to 2, the
expectation of X is 3.
Same center of mass argument.
So 3 squared is 9.
And then X squared can take on a value of 4.
Or it can take on a value of 16.
And it does so with equal probability.
So we get 4/2 plus 16/2.
So this is 2 plus 8, which is 10, minus 9.
That'll give us 1.
So we get a 1 when Y is equal to 2.
And last computation and then we're done.
I'm still recycling the same formula.
But now we're conditioning on Y is equal to 3.
And then we'll be done with this problem, I promise.
OK, so when Y is equal to 3, X can take on the value of 0.
Or it can take on the value of 4.
Those two points happen with probability 1/2, 1/2.
So the expectation is right in the middle which is 2.
So we get a minus 4.
And similarly, X squared can take on the value of 0.
When X takes on the value of 0-- and that happens with
probability 1/2--
similarly, X squared can take on the value of 16 when X
takes on the value of 4.
And that happens with probability 1/2.
So we just have 0/2 plus 16/2 minus 4.
And this gives us 8 minus 4, which is simply 4.
So finally, after all that computation, we are done.
We have the conditional variance of X given Y.
Again, we're interested in when this value is largest.
And we see that 4 is the biggest value in this column.
And this value occurs when Y takes on a value of 3.
So our answer, over here, is y is equal to 3.
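
The four conditional variances can be checked with the same formula used above, the conditional expectation of X squared minus the square of the conditional expectation of X. The point list is again reconstructed from the narration and should be read as an assumption.

```python
from collections import defaultdict
from fractions import Fraction

# Eight equally likely (x, y) points, reconstructed from the narration.
POINTS = [(0, 1), (0, 3), (2, 1), (2, 2), (4, 0), (4, 1), (4, 2), (4, 3)]


def cond_variance_x_given_y(points):
    """var(X | Y = y) = E[X^2 | Y = y] - (E[X | Y = y])^2 for each y."""
    xs_by_y = defaultdict(list)
    for x, y in points:
        xs_by_y[y].append(x)
    result = {}
    for y, xs in xs_by_y.items():
        mean = Fraction(sum(xs), len(xs))
        mean_of_squares = Fraction(sum(x * x for x in xs), len(xs))
        result[y] = mean_of_squares - mean ** 2
    return result


var_x_given_y = cond_variance_x_given_y(POINTS)
best_y = max(var_x_given_y, key=var_x_given_y.get)
print(var_x_given_y, best_y)
```
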
All right, so now we're going to switch gears in part (c)
and (d) a little bit.
And we're going to be more concerned
with PMFs, et cetera.
So in part (c), we're given a random variable called R which
is defined as the minimum of X and Y.
So for instance, this is the point (0,1).
The minimum of 0 and 1 is 0.
So R would have a value of 0 here.
Now, we can be a little bit smarter about this.
If we plot the line, y is equal to x.
So that looks something like this.
We see that all of the points below this line satisfy y
being less than or equal to x.
And all the points above this line have y greater than or
equal to x.
So if y is less than or equal to x, you hopefully agree that
here the min, or r, is equal to y.
But over here, the min, r, is actually equal to x, since x
is always smaller.
So now we can go ahead quickly.
And I'm going to write the value of r next to each point
using this rule.
So here, r is the value of y, which is 1.
Here, r is equal to 0.
Here r is 1.
Here r is 2.
Here r is 3.
Over here, r is the value of x.
So r is equal to 0.
And r is equal to 0 here.
And so the only point we didn't handle is the one that
lies on the line.
But in that case it's easy.
Because x is equal to 2.
And y is equal to 2.
So the min is simply 2.
So with this information I claim we're now done.
We can just write down what the PMF of R is.
So in particular, R takes on a value of 0.
When this point happens, this point happens,
or this point happens.
And those collectively have a
probability of 3/8 of occurring.
R can take on a value of 1 when either of these two
points happen.
So that happens with probability 2/8.
R is equal to 2.
This can happen in two ways.
So we get 2/8.
And R equal to 3 can happen in only one way.
So we get 1/8.
Quick sanity check, 3 plus 2 is 5, plus 2 is
7, plus 1 is 8.
So our PMF sums to 1.
And to be complete, we should sketch it.
Because the problem asks us to sketch it.
So we're plotting pR of r, 0, 1, 2, 3.
So here we get, let's see, 1, 2, 3.
For 0 we have 3/8.
For 1 we have 2/8.
For 2 we have 2/8.
And for 3 we have 1/8.
So this is our fully labeled sketch of pR of r.
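
The PMF of R can be tabulated mechanically by applying the minimum to every point and counting. As before, the point list is reconstructed from the narration and is an assumption.

```python
from collections import Counter
from fractions import Fraction

# Eight equally likely (x, y) points, reconstructed from the narration.
POINTS = [(0, 1), (0, 3), (2, 1), (2, 2), (4, 0), (4, 1), (4, 2), (4, 3)]

# Count how many points map to each value of R = min(X, Y).
counts = Counter(min(x, y) for x, y in POINTS)
# Each point has probability 1/8, so the PMF is the count divided by 8.
pmf_r = {r: Fraction(c, len(POINTS)) for r, c in sorted(counts.items())}
print(pmf_r)
```
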
And forgive me for erasing so quickly, but you guys can
pause the video, presumably, if you need more time.
Let's move on to part (d).
So in part (d) we're given an event named A, which is the
event that X squared is greater than or equal to Y.
And then we're asked to find the expectation of XY in the
unconditional universe.
And then the expectation of X times Y conditioned on A.
So let's not worry about the conditioning for now.
Let's just focus on the unconditional expectation of X
times Y. So I'm just going to erase all these r's so I don't
get confused.
But we're going to follow a very similar strategy, which
is at each point I'm going to label what the value of W is.
And we'll find the expectation of W that way.
So let's see, here, we have 4 times 0.
So W is equal to 0.
Here we have 4 times 1.
W is equal to 4.
4 times 2, W is equal to 8.
4 times 3, W is equal to 12.
W is equal to 2.
W is equal to 4.
W is equal to 0.
W is equal to 0.
OK, so that was just algebra.
And now, I claim again, we can just write down what the
expectation of X times Y is.
And I'm sorry, I didn't announce my notation.
I should mention that now.
I was defining W to be the random variable X times Y. And
that's why I labeled the product of X
times Y as W over here.
My apologies about not defining that random variable.
So the expectation of W, well, W takes on a value of 0.
When this happens, this happens or that happens.
And we know that those three points occur
with probability 3/8.
So we have 0 times 3/8.
I'm just using the normal formula for expectation.
W takes on a value of 2 with probability 1/8.
Because this is the only point in which it
happens, 2 times 1/8.
Plus it can take on the value of 4 with probability
2/8, 4 times 2/8.
And 8, with 1/8 probability.
And similarly, 12 with 1/8 probability.
So this is just algebra.
The numerator sums up to 30.
Yes, that's correct.
So we have 30/8, which is equal to 15/4.
So this is our first answer for part (d).
And now we have to do this slightly trickier one, which
is the conditional expectation of X times Y, or W,
conditioned on A. So similar to what I did in part (c), I'm
going to draw the line y equals x squared.
So y equals x squared is 0 here, 1 here.
And at 2, it should take on a value of 4.
So the curve should look something like this.
This is the curve y is equal to x squared.
So we know all the points below this curve satisfy y less
than or equal to x squared.
And all the points above this curve have y greater than or
equal to x squared.
And A is y less than or equal to x squared.
So we are in the conditional universe where only points
below this curve can happen.
So that one, that one, that one, that one, that
one and that one.
So there are six of them.
And again, in the unconditional world, all of
the points were equally likely.
So in the conditional world these six points are still
equally likely.
So they each happen with probability 1/6.
So in this case, the expectation of W is simply 2
times 1/6 plus 0 times 1/6.
But that's 0.
So I'm not going to write it.
Plus 4 times 2/6 plus 8 times 1/6 plus 12
times 1/6.
And again, the numerator sums to 30.
But this time our denominator is 6.
So this is simply 5.
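
Both expectations in part (d) can be verified with a short sketch. The point list is reconstructed from the narration and is an assumption. The helper name `expectation_of_product` is illustrative.

```python
from fractions import Fraction

# Eight equally likely (x, y) points, reconstructed from the narration.
POINTS = [(0, 1), (0, 3), (2, 1), (2, 2), (4, 0), (4, 1), (4, 2), (4, 3)]


def expectation_of_product(points):
    """E[XY] when every point in `points` is equally likely."""
    return Fraction(sum(x * y for x, y in points), len(points))


# Unconditional expectation of W = XY.
e_w = expectation_of_product(POINTS)

# Event A: X squared is greater than or equal to Y.
# Conditioning keeps the surviving points equally likely.
points_in_a = [(x, y) for x, y in POINTS if x * x >= y]
e_w_given_a = expectation_of_product(points_in_a)

print(e_w, len(points_in_a), e_w_given_a)
```
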
So we have, actually, finished the problem.
Because we've computed this value and this value.
And so the important takeaways of this problem are,
essentially, honestly, just to get you comfortable with
computing things involving joint PMFs.
We talked a lot about finding expectations quickly by
thinking about center of mass and the
geometry of the problem.
We've got practice computing conditional variances.
And we did some derived distributions.
And we'll do a lot more of those later.


# 7. Joint PMF drill 2

![](C:/Users/qp/Pictures/Screenshots/7. Joint PMF drill 2 - 1.png)
![](C:/Users/qp/Pictures/Screenshots/7. Joint PMF drill 2 - 2.png)
![](C:/Users/qp/Pictures/Screenshots/7. Joint PMF drill 2 - 3.png)
![](C:/Users/qp/Pictures/Screenshots/7. Joint PMF drill 2 - 4.png)
![](C:/Users/qp/Pictures/Screenshots/7. Joint PMF drill 2 - 5.png)
![](C:/Users/qp/Pictures/Screenshots/7. Joint PMF drill 2 - 6.png)
![](C:/Users/qp/Pictures/Screenshots/7. Joint PMF drill 2 - 7.png)
![](C:/Users/qp/Pictures/Screenshots/7. Joint PMF drill 2 - 8.png)
![](C:/Users/qp/Pictures/Screenshots/7. Joint PMF drill 2 - 9.png)
![](C:/Users/qp/Pictures/Screenshots/7. Joint PMF drill 2 - 10.png)
Hey, guys.
Welcome back.
Today, we're going to do another fun problem, which is
a drill problem on joint PMFs.
And the goal is that you will feel more comfortable by the
end of this problem, manipulating joint PMFs.
And we'll also review some ideas about independence in
the process.
So just to go over what I've drawn here, we
are given an xy plane.
And we're told what the PMF is.
And it's plotted for you here.
What these stars indicate is simply that
there is a value there.
But we don't know what it is.
It could be anything between 0 and 1.
And so we're given this list of questions.
And we're just going to work through
them linearly together.
So we start off pretty simply.
We want to compute, in part (a), the probability that X
takes on a value of 1.
So for those of you who like formulas, I'm going to use the
formula, which is usually referred to as
marginalization.
So the marginal over X is given by
summing over the joint.
So here we are interested in the probability that X is 1.
So I'm just going to freeze the value of 1 here.
And we sum over y.
And in particular, 1, 2, and 3.
So carrying this out, this is pXY of 1, 1, plus pXY of
1, 2, plus pXY of 1, 3.
And this, of course, reading from the graph, is 1/12 plus
2/12 plus 1/12, which is equal to 4/12, or 1/3.
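
In symbols, the marginalization step just carried out is:

$$
p_X(1) = \sum_{y} p_{X,Y}(1, y) = p_{X,Y}(1,1) + p_{X,Y}(1,2) + p_{X,Y}(1,3)
= \frac{1}{12} + \frac{2}{12} + \frac{1}{12} = \frac{1}{3}.
$$
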
So now you guys know the formula.
Hopefully you'll remember the term marginalization.
But I want to point out that intuitively you can come up
with the answer much faster.
So the probability that X is equal to 1 is the probability
that this dot happens or this dot
happens or this dot happens.
Now, these dots, or outcomes, they're disjoint.
So you can just sum the probability to get the
probability of one of these things happening.
So it's the same computation.
And you'll probably get there a little bit faster.
So we're done with (a) already, which is great.
So for part (b), conditioning on X is equal to 1, we want to
sketch the PMF of Y. So if X is equal to 1 we are suddenly
living in this universe.
Y can take values of 1, 2, or 3 with these relative
frequencies.
So let's draw this here.
So this is Y. I said, already, Y can take on a value of 1.
Y can take on a value of 2.
Or it can take on a value of 3.
And we're plotting here pY given X of y, conditioned on
X is equal to 1.
OK, so what I mean by preserving the relative
frequencies is that in the unconditional world this
dot is twice as likely to happen as either
this dot or this dot.
And that relative likelihood remains the same after
conditioning.
And the reason why we have to change these values is because
they have to sum to 1.
So in other words, we have to scale them up.
So you can use a formula.
But again, I'm here to show you faster ways of
thinking about it.
So my little algorithm for figuring out conditional PMFs
is to take the numerators--
so 1, 2, and 1--
and sum them.
So here that gives us 4.
And then to preserve the relative frequency, you
actually keep the same numerators but divide it by
the sum, which you just computed.
So I'm going fast.
I'll review in a second.
But this is what you will end up getting.
So to recap, I did 1 plus 2 plus 1, which is 4, to get
these denominators.
And so I skipped a step here.
This is really 2/4, which is 1/2, obviously.
So you add these guys to get 4.
And then you keep the numerators and just
divide them by 4.
So 1/4, 2/4, which is 1/2 and 1/4.
And that's what we mean by preserving
the relative frequency.
Except this thing now sums to 1, which is what we want.
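
The "keep the numerators, divide by their sum" recipe is just a renormalization of one column of the joint PMF. A minimal sketch, using the three joint values read off the graph for X equal to 1 (the function name `condition` is illustrative):

```python
from fractions import Fraction

# Joint PMF values p_{X,Y}(1, y) for y = 1, 2, 3, as read from the graph.
column_x1 = {1: Fraction(1, 12), 2: Fraction(2, 12), 3: Fraction(1, 12)}


def condition(column):
    """Rescale one column of a joint PMF so that it sums to 1."""
    total = sum(column.values())  # this is the marginal p_X(1)
    return {y: mass / total for y, mass in column.items()}


p_y_given_x1 = condition(column_x1)
print(p_y_given_x1)
```

Dividing by the column total is the same as dividing by the marginal probability that X equals 1, which is what the formula for conditional PMFs prescribes.
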
OK, so we're done with part (b).
Part (c) actually follows almost
immediately from part (b).
In part (c) we're interested in computing the conditional
expectation of Y given that X is equal to 1.
So we've already done most of the legwork because we have
the conditional PMF that we need.
And so expectation, you guys have calculated a bunch of
these by now.
So I'm just going to appeal to your
intuition and to symmetry.
Expectation acts like center of mass.
This is a symmetrical distribution of mass.
And so the center is right here at 2.
So this is simply 2.
And if that went too fast, just convince yourselves.
Use the normal formula for expectations.
And your answer will agree with ours.
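
For reference, the normal formula applied to the conditional PMF found in part (b) gives:

$$
\mathbf{E}[Y \mid X = 1] = 1 \cdot \frac{1}{4} + 2 \cdot \frac{1}{2} + 3 \cdot \frac{1}{4} = 2.
$$
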
OK, so (d) is a really cool question.
Because you can do a lot of math.
Or you can think and ask yourself, at the most
fundamental level, what is independence?
And if you think that way you'll come to
the answer very easily.
So essentially, I rephrased this to truncate it from the
problem statement that you guys are reading.
But the idea is that these stars are
unknown probability masses.
And this question is asking can you figure out a way of
assigning numbers between 0 and 1 to these values such
that you end up with a valid probability mass function, so
everything sums to 1 and such that X and Y are independent?
So it seems hard a priori.
But let's think about it a bit.
And in the meantime I'm going to erase this
so I have more room.
What does it mean for X and Y to be independent?
Well, it means that they don't, essentially, have
information about each other.
So if I tell you something about X and if X and Y are
independent, your belief about Y shouldn't change.
In other words, if you're a rational person, X shouldn't
change your belief about Y.
So let's look more closely at this diagram.
Now, the number 0 should be popping out to you.
Because this essentially means that the
point (3,1) can't happen.
Or it happens with 0 probability.
So let's fix X equal to 3.
If you condition on X is equal to 3, as I just said, this
outcome can't happen.
So Y could only take on values of 2 or 3.
However, if you condition on X is equal to 1, Y could take on
a value of 1 with probability 1/4 as we computed in the
other problem.
It could take on a value of 2 with probability of 1/2.
Or it could take on a value of 3 with probability 1/4.
So these are actually very different cases, right?
Because if you observe X is equal to 3, Y can
only be 2 or 3.
But if you observe X is equal to 1, Y can be 1, 2, or 3.
So actually, no matter what values these stars take on, X
always tells you something about Y. Therefore, the answer
to this, part (d), is no.
So let's put a no with an exclamation point.
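
The same conclusion can be stated as a short calculation. It assumes that X equals 3 has positive probability, which the argument above already presupposes when it conditions on X being equal to 3. Independence would require the joint PMF to factor at every point, but at the point (3,1):

$$
p_{X,Y}(3,1) = 0, \qquad \text{while} \qquad p_X(3)\,p_Y(1) \ge p_X(3)\cdot\frac{1}{12} > 0,
$$

since $p_Y(1) \ge p_{X,Y}(1,1) = 1/12$. So $p_{X,Y}(3,1) \ne p_X(3)\,p_Y(1)$ for every choice of the stars.
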
So I like that problem a lot.
And hopefully it clarifies independence for you guys.
So parts (e) and (f), we're going to be thinking about
independence again.
To go over what the problem statement gives you, we
defined this event, B, which is the event that X is less
than or equal to 2 and Y is less than or equal to 2.
So let's get some colors.
So let's do bright pink.
So that means we're essentially
living in this world.
There's only those four dots.
And we're also told a very important piece of information
that conditioned on B, X and Y are conditionally independent.
OK, so part (e), now that we have this.
And by the way, these two assumptions apply to both
parts (e) and part (f).
So in part (e), we want to find out pXY of 2, 2.
Or in English, what is the probability that X takes on a
value of 2 and Y takes on a value of 2?
So determine the value of this star.
And the whole trick here is that the possible values that
this star could take on are constrained by the fact that
we need to make sure that X and Y are conditionally
independent given B.
So my claim is that if two random variables are
independent and you condition on one of them, say
X, then as you condition on different values
of X, the relative frequencies of Y should be the same.
So here, conditioned on X is
equal to 1,
the relative frequencies of Y are 2 to 1.
This outcome is twice as likely to happen as this one.
If we condition on X is equal to 2, this outcome needs to be twice as
likely to happen as this outcome.
If they weren't, X would tell you information about Y.
Because you would know that the distribution over 2 and 1
would be different.
OK?
So because the relative frequencies have to be the
same and 2/12 is 2 times 1/12 this guy must
also be 2 times 2/12.
So that gives us our answer for part (e).
Let me write up here.
Part (e), we need pXY 2, 2 to be equal to 4/12.
And again, the way we got this is simply that we need X and Y to
be conditionally independent given B. And if this were
anything other than 4/12, the relative frequency of Y equal to 2
versus Y equal to 1 would be different from over here.
So here, conditioned on X is equal to 1,
the outcome Y is equal to 2 is twice as likely as Y is
equal to 1.
Here, if we put a value of 4/12 and you condition on X is
equal to 2, the outcome Y is equal to 2 is still twice as
likely as the outcome Y is equal to 1.
And if you put any other number there the relative
frequencies would be different.
So X would be telling you something about Y.
So they would not be independent conditioned on B.
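
The proportionality argument above can be sketched in a few lines of Python. The three known probabilities are the values read out in the transcript; the dictionary layout and the names `p`, `ratio`, and `star` are illustrative choices, not part of the original problem.

```python
from fractions import Fraction

# Known joint probabilities inside event B = {X <= 2, Y <= 2},
# as read off the transcript; keys are (x, y).
p = {
    (1, 1): Fraction(1, 12),
    (1, 2): Fraction(2, 12),
    (2, 1): Fraction(2, 12),
}

# Conditional independence given B forces the columns to be proportional:
# p(2,2) / p(2,1) must equal p(1,2) / p(1,1).
ratio = p[(1, 2)] / p[(1, 1)]   # Y=2 versus Y=1 when X=1
star = ratio * p[(2, 1)]        # the unknown p_{X,Y}(2, 2)

print(star)  # 1/3, i.e. 4/12
```

`Fraction` keeps the arithmetic exact, so the result can be compared directly with 4/12.
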
OK, that was a mouthful.
But hopefully you guys have it now.
And lastly, we have part (f), which follows pretty directly
from part (e).
So far we have been in the unconditional universe.
In part (e), we were figuring out the value of star
in the whole unconditional universe.
Now, in part (f), we want the value of star in the
conditional universe where B occurred.
So let's come over here and plot a new graph so we don't
confuse ourselves.
So we have xy.
x can be 1 or 2. y could be 1 or 2.
So we have a plot that looks something like this.
And so again, same argument as before.
Let me just fill this in.
From part (e), we have that this is 4/12.
And we're going to use my algorithm again.
So in the conditional world, the relative frequencies of
these four dots should be the same.
But you need to scale them up so that if you sum over all of
them the probability sums to 1.
So you have a valid PMF.
So my algorithm from before was to add up all the
numerators.
So 1 plus 2 plus 4 plus 2 gives you 9.
And then to preserve the relative frequency you keep
the same numerator.
So here we had a numerator of 1.
That becomes 1/9.
Here we had a numerator of 2.
This becomes 2/9.
Here we had a numerator of 4.
That becomes 4/9.
Here we had a numerator of 2, so 2/9.
And indeed, the relative frequencies are preserved.
And they all sum to 1.
So our answer for part (f)--
let's box it here--
is that pXY 2, 2 conditioned on B is equal to 4/9,
which is just that guy.
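
The "sum the numerators" shortcut is just renormalization, and a minimal sketch of it looks like this. The helper name `condition_on_event` is hypothetical; the numerators 1, 2, 2, 4 (over 12) are the ones used in the transcript.

```python
from fractions import Fraction

def condition_on_event(numerators):
    """Rescale numerators (over a common denominator) so they sum to 1."""
    total = sum(numerators.values())
    return {point: Fraction(n, total) for point, n in numerators.items()}

# Numerators (over 12) of the four points in B, keyed by (x, y).
in_b = {(1, 1): 1, (1, 2): 2, (2, 1): 2, (2, 2): 4}
cond = condition_on_event(in_b)

print(cond[(2, 2)])  # 4/9
```

The common denominator 12 never enters the computation, which is why the shortcut works: it cancels when dividing by the probability of B, here 9/12.
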
So we're done.
Hopefully that wasn't too painful.
And this is a good drill problem, because we got more
comfortable working with PMFs, joint PMFs.
We went over marginalization.
We went over conditioning.
We went over independence.
And I also gave you this quick algorithm for figuring out
what conditional PMFs are if you don't
want to use the formulas.
Namely, you sum all of the numerators to get a new
denominator and then divide all the old numerators by the
new denominator you computed.
So I hope that was helpful.
I'll see you next time.


# 8. Joint PMF drill 3

![](C:/Users/qp/Pictures/Screenshots/8. Joint PMF drill 3 - 1.png)
![](C:/Users/qp/Pictures/Screenshots/8. Joint PMF drill 3 - 2.png)
![](C:/Users/qp/Pictures/Screenshots/8. Joint PMF drill 3 - 3.png)
![](C:/Users/qp/Pictures/Screenshots/8. Joint PMF drill 3 - 4.png)
![](C:/Users/qp/Pictures/Screenshots/8. Joint PMF drill 3 - 5.png)
![](C:/Users/qp/Pictures/Screenshots/8. Joint PMF drill 3 - 6.png)
![](C:/Users/qp/Pictures/Screenshots/8. Joint PMF drill 3 - 7.png)
![](C:/Users/qp/Pictures/Screenshots/8. Joint PMF drill 3 - 8.png)
![](C:/Users/qp/Pictures/Screenshots/8. Joint PMF drill 3 - 9.png)
![](C:/Users/qp/Pictures/Screenshots/8. Joint PMF drill 3 - 10.png)
![](C:/Users/qp/Pictures/Screenshots/8. Joint PMF drill 3 - 11.png)
![](C:/Users/qp/Pictures/Screenshots/8. Joint PMF drill 3 - 12.png)
![](C:/Users/qp/Pictures/Screenshots/8. Joint PMF drill 3 - 13.png)
![](C:/Users/qp/Pictures/Screenshots/8. Joint PMF drill 3 - 14.png)
![](C:/Users/qp/Pictures/Screenshots/8. Joint PMF drill 3 - 15.png)
Hi.
In this problem, we'll get some more practice working
with joint PMFs and calculating conditional
expectations, conditional variances, thinking about
independence between two jointly distributed random
variables, and more practice along those lines.
So here is the joint PMF between two random variables X
and Y that were presented in the problem.
And in the beginning, we're given them in terms of some
constant c.
In the first part of this problem, part (a), asks us to
find out what exactly this constant is.
So how do we calculate this?
Well, we know that the sum of all these probabilities
has to equal 1.
So when you sum over all the x's and all the y's of the
joint PMF, you know that this has to equal 1.
And well, let's just add these up then.
We have 1, 2, 4, 6, 10, 13, 14, 20 c's.
So the sum is 20c, and we know that it has to equal 1.
So that tells us that c is 1/20.
So that wasn't so bad, just using this identity to
calculate what the c is.
So now we know exactly what all these
numerical values are.
And now the second part of the question asks
us to find a marginal.
So what is the marginal PMF of Y evaluated at 2?
What does this actually mean?
Remember, to calculate this what we do is we take the
joint and we add up all the x's, holding y fixed at 2.
And what this really amounts to is just taking this row
where y equals 2 and adding along this
row, across this row.
And so what we have is we have 2c plus 0 plus 4c, so that
gives us 6c.
We know what c is, so the answer is going to be 6/20.
So that's just a simple exercise in calculating what these
marginals are.
And if you wanted to calculate the marginal of X, you would
sum along these columns instead.
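
Parts (a) and (b) can be checked with a short Python sketch. The table of weights below is reconstructed from the numbers spoken in the transcript (the running count up to 20, the row for y equal to 2, and the column ratios mentioned later); it is an assumption, since the original table is only available as an image.

```python
from fractions import Fraction

# Weights in units of c, keyed by (x, y); reconstructed from the values
# spoken in the transcript (an assumption, since the figure is an image).
weights = {
    (1, 1): 1, (2, 1): 1, (3, 1): 2,
    (1, 2): 2, (2, 2): 0, (3, 2): 4,
    (1, 3): 3, (2, 3): 1, (3, 3): 6,
}

c = Fraction(1, sum(weights.values()))          # normalization: 20c = 1
joint = {pt: w * c for pt, w in weights.items()}

# Marginal of Y at 2: sum the row y = 2 over all x.
p_y2 = sum(p for (x, y), p in joint.items() if y == 2)

print(c)     # 1/20
print(p_y2)  # 3/10, i.e. 6/20
```
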
OK, so now we define a new random variable.
We call it Z. And Z is defined to be Y times X squared.
And what we're asked to find is the conditional expectation
of Z given that Y is equal to 2.
Well, the first thing we can do is we can substitute in--
we know that Z is Y times X squared.
And the PMF that we have is in terms of X and Y, so let's
substitute those in.
And now the next step is--
well, we're given that Y equals 2.
So this Y is no longer really random.
We know that it's equal to 2.
So we can substitute that in.
So we did that.
But notice that we still keep this
conditioning on Y equals 2.
The reason is that this conditioning still affects X
because Y and X are not independent, or at least we
haven't verified that yet.
So we need to, in general, keep this conditioning because
it may affect the expectation of X.
So this is just a constant.
So we can also just pull that out.
So what we really need to do now is find this.
We need to find the expectation of X squared
conditioned on Y equals 2.
How do we do this?
Well, remember that to calculate these conditional
expectations, what we need are conditional PMFs.
And so what we really need to do here is figure out--
this is actually going to require calculating what the
conditional PMF is.
So let's figure out what this conditional PMF is.
Let's go up here.
What is the conditional PMF of X given that Y equals 2?
So we know that the conditional is the joint
divided by the marginal probability that
Y equals 2.
So now, we know what this denominator is.
The probability that Y equals 2 we calculated
already in part (b).
So what now is this part?
The numerator we also know because it's in the PMF, the
joint PMF that we're given.
So now, let's figure out what this is.
So given that Y equals 2, we know that we can focus on this
row right here.
And in that row, X can be either 1 or 3.
Because given that Y equals 2, X has 0
probability of being 2.
So we know that X can take on two values.
So X can be 1 or X can be 3.
So in the denominator, we know that this is equal to 6/20.
And the numerator is the probability that X is equal to
1 and Y is equal to 2, which corresponds to
here, which is 2/20.
So we know that this is equal to 1/3.
And similarly, for X equals 3, it's 4/20 over 6/20, which is
equal to 2/3.
So we figured out what the conditional PMF is.
A simple way of eyeballing this is also just we know that
it focuses us on this row.
And you know that when conditioning, it basically
rescales the probabilities so that they add up to 1.
So we know that--
well, the total is 6c, and 2 of those are for X equals 1, 4 of
those are for X equals 3.
So it should be in the ratio of 1 to 2, which
gives us 1/3 and 2/3.
So now we know this and we can actually calculate out what
this conditional expectation is.
Remember for conditional expectations it is equal to--
it's going to be the value squared times the conditional
PMF at that value.
So X can take on two values, 1 or 3.
So for 1, it's 1 squared times the conditional PMF at 1,
which is 1/3.
And for 3, it's 3 squared times conditional PMF at 3,
which is 2/3.
So the final answer that we get is 2 times the quantity
1/3 plus 6.
So it's 2 times 19/3, which gives us a
final answer of 38/3.
And that gives us our conditional expectation of Z
given Y equals 2, which is what we wanted.
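
Written out as one chain, using the conditional PMF values 1/3 and 2/3 found above (the notation $\mathbf{E}[\cdot \mid \cdot]$ is an assumed rendering of what is on the board):

$$
\begin{aligned}
\mathbf{E}[Z \mid Y=2] &= \mathbf{E}[Y X^2 \mid Y=2] = 2\,\mathbf{E}[X^2 \mid Y=2] \\
&= 2\left(1^2\cdot\tfrac{1}{3} + 3^2\cdot\tfrac{2}{3}\right)
 = 2\cdot\tfrac{19}{3} = \tfrac{38}{3}.
\end{aligned}
$$
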
All right, so that is part (c).
And part (d) asks us a
less mathematical question, one that relies more on intuition.
So the question asks us, suppose that we know that X is
not equal to 2.
So we condition on the fact that X is not equal to 2.
And we're asked in that conditional universe, are X
and Y independent?
Well, because we know that X does not equal 2, let's forget
about this column.
And so we only have these two columns remaining.
And in that universe, how do we know if things are
independent?
Well, one way to tell they're independent is that knowing
what X is does not affect the probability distribution of Y.
And so is that the case in this conditional universe?
Well, if I told you that X equals 1, then the probability
distribution of Y would be in the ratio of 1 to 2 to 3.
So when you renormalize that, the conditional would be--
PMF of Y would be 1/6, 2/6, 3/6.
And if I told you now that X equals 3, we again have a
ratio of 2 to 4 to 6, which is the same as 1 to 2 to 3.
And so when we renormalize this, we would get
1/6, 2/6, and 3/6.
So what we've discovered is that if I tell you that X is
not equal to 2, so X is either 1 or 3, then any further
information about what X is doesn't really tell you
anything more about what Y can be.
So if I tell you that X is 1, then you know that Y has a 1/6
chance of being 1.
If I tell you that X is 3, Y still has a 1/6
chance of being 1.
So because that is the case, then X and Y are actually
conditionally independent conditioned on X
not equal to 2.
So the shortcut for looking at this is to see whether or not
these columns are multiples of one another.
And if they are, then they will be conditionally
independent.
You can also do the same thing and ask yourself, if I tell
you what Y is, does that change the distribution of X?
Well, X can't be 2, so forget about that.
Now, look at these ratios.
It's 1 to 2, 2 to 4, and 3 to 6, which are
all the same ratio.
And so again, knowing what Y is wouldn't change the
distribution of X, so they are conditionally independent.
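
The "columns are multiples of one another" shortcut can be sketched as a direct comparison of conditional PMFs. The weight table is again a reconstruction from the transcript (an assumption), and `cond_pmf_of_y` is a hypothetical helper name.

```python
from fractions import Fraction

# Weights in units of c = 1/20, keyed by (x, y); reconstructed from the
# transcript (an assumption, since the original table is an image).
weights = {
    (1, 1): 1, (2, 1): 1, (3, 1): 2,
    (1, 2): 2, (2, 2): 0, (3, 2): 4,
    (1, 3): 3, (2, 3): 1, (3, 3): 6,
}

def cond_pmf_of_y(x_value):
    """Conditional PMF of Y given X = x_value, obtained by rescaling a column."""
    column = {y: w for (x, y), w in weights.items() if x == x_value}
    total = sum(column.values())
    return {y: Fraction(w, total) for y, w in column.items()}

given_x1 = cond_pmf_of_y(1)
given_x3 = cond_pmf_of_y(3)
print(given_x1 == given_x3)  # True
```

Under the same assumed table, `cond_pmf_of_y(2)` comes out different (1/2, 0, 1/2), which is why the independence only holds once X equal to 2 has been excluded.
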
OK, so the last part of this problem is a question about
conditional variance.
And specifically, we want to find the conditional variance
of Y given that X is 2.
So we want the variance of Y given that X is 2.
So in order to do this, how do we calculate this variance?
Well, calculating this variance is kind of the same
as calculating normal variance without the conditioning.
It's expectation of Y squared conditioned on X equals 2
minus expectation of Y conditioned
on X equal 2 squared.
So recall that a standard variance, variance of Y, would
just be expectation of Y squared minus expectation of Y
quantity squared.
The difference with the conditioning is just that you
carry over this conditioning in both these terms.
So now, what we need to do is just calculate what
these two terms are.
And in order to do that, we need the conditional PMF of Y
given that X is equal to 2.
So what we need is the PMF of Y given that X is equal to 2.
So we're in this world, the world of this middle column
where x is equal to 2.
And so let's take a shortcut now.
We can go through the old steps, but we can see that
when X is equal to 2, Y can either be 3 or 1 and they have
equal probability.
And so this conditional PMF must be probability 1/2
of being 1 and 1/2 of being 3.
Because we're just taking this and rescaling it so that it
adds up to 1 and becomes a valid PMF.
All right, so we have this conditional PMF of Y now.
We can use this to calculate these two terms.
So the first term, remember what we do is we take this Y
squared and we apply it to this expectation.
So for Y equals 1, it's going to be 1 squared times the
probability that Y equals 1 given X
equals 2, which is 1/2.
Plus Y can also be 3.
You'd square that and multiply the probability that it's
equal to 3.
So that's that first term.
Now, for the second term, Y can be 1
with probability 1/2 and it can be 3 with probability 1/2.
So that gives you the conditional expectation and
then you square it.
So what do we end up with in the end?
We get 1/2 plus 9/2, which gives us 5.
And you subtract out 2
squared, which is 4.
So it turns out the conditional variance
in the end is 1.
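
Collected in one place, with the conditional PMF of Y given X equal to 2 putting probability 1/2 on each of the values 1 and 3 (the notation is an assumed rendering of the board work):

$$
\begin{aligned}
\mathbf{E}[Y^2 \mid X=2] &= 1^2\cdot\tfrac12 + 3^2\cdot\tfrac12 = 5,\\
\mathbf{E}[Y \mid X=2] &= 1\cdot\tfrac12 + 3\cdot\tfrac12 = 2,\\
\mathrm{var}(Y \mid X=2) &= \mathbf{E}[Y^2 \mid X=2] - \big(\mathbf{E}[Y \mid X=2]\big)^2 = 5 - 2^2 = 1.
\end{aligned}
$$
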
All right, so we've gotten some exercise here with all
kinds of calculations related to a joint PMF.
We've used normalization to find out what the
constant should be.
We've calculated some marginals.
We calculated conditional PMFs, conditional
expectations, and conditional variances.
And we've also thought about independence between two
jointly distributed random variables.
And so for these concepts, it's useful to do a lot of
these kinds of drill problems just to really cement these
ideas in your head.
So I hope that was helpful, and we'll see you next time.


# 9. Indicator variables: the number of inversions

![](C:/Users/qp/Pictures/Screenshots/9. Indicator variables the number of inversions - 1.png)
![](C:/Users/qp/Pictures/Screenshots/9. Indicator variables the number of inversions - 2.png)
![](C:/Users/qp/Pictures/Screenshots/9. Indicator variables the number of inversions - 3.png)
![](C:/Users/qp/Pictures/Screenshots/9. Indicator variables the number of inversions - 4.png)
![](C:/Users/qp/Pictures/Screenshots/9. Indicator variables the number of inversions - 5.png)
![](C:/Users/qp/Pictures/Screenshots/9. Indicator variables the number of inversions - 6.png)
![](C:/Users/qp/Pictures/Screenshots/9. Indicator variables the number of inversions - 7.png)
![](C:/Users/qp/Pictures/Screenshots/9. Indicator variables the number of inversions - 8.png)
![](C:/Users/qp/Pictures/Screenshots/9. Indicator variables the number of inversions - 9.png)
![](C:/Users/qp/Pictures/Screenshots/9. Indicator variables the number of inversions - 10.png)
************************************************************************************************************************
Now I have some level of understanding of it. I think the process is: we first describe the question with math, then apply probability knowledge such as symmetry and permutations to help with the expected value calculation. During the process we need a smart way to handle the problem, which is setting an indicator to 1 or 0 for whether each individual event happens or not. To properly understand it, think about an individual pair first: it has a 1/2 chance of being an inversion, times outcome 1, which is the expected contribution of that pair. Then think about all the ordered pairs of i and j among the n people, which is n(n - 1) of them; since only half of those meet the condition i < j, you multiply those values: 1/2 times n(n - 1)/2. The TA did a bad job.
************************************************************************************************************************
In this problem, we'll be using a technique called
the indicator random variable to make a problem that seems
very difficult actually quite easy.
Now, we're given a group of n people going into a room,
where upon arrival, each person will receive a random
seat number, and we call Xi the seat number
received by person i.
For example, if we have three people, then X1, X2, X3, the
sequence could look like 1, 3, 2.
In this case, person one goes into seat one, person two into
seat three, and person three into seat two.
Now, we'll assume that such an arrangement is uniformly
random in the sense that any seating
arrangement is equally likely to occur.
Now, for each arrangement, we'll define N as the number of
inversions, where an inversion is defined as follows.
For every two numbers, i and j, we say there's an inversion
occurring to i and j if i is less than j but Xj
is less than Xi.
In other words, the person with a lower index actually
receives a higher seating number.
If this is true, then we say there's an inversion occurring
to the pair i and j.
Now, the number N simply counts the total number of
inverted pairs in such a seating arrangement.
For example, in the previous example, X1, X2, X3, being
equal to 1, 3, 2, we see that the only inversion that occurs
is between the pair 2 and 3.
So in this case, N is equal to 1.
Now, the question we want to answer is, what is the
expected value of N, given that there are little n people
in the room?
Now, let's try a direct
approach, which is by counting.
We know that there are in total n factorial
permutations of the set 1, 2, 3, up to n.
In other words, this is the total number of ways we can
arrange n people.
And 1/n factorial is simply the probability that any given
permutation is used.
Now, for each permutation, we can count the number of
inversions for, let's say, permutation sigma.
Now, if we were to add up this number across all sigma, each weighted by 1/n factorial, we
will get the expected value of N. That is, we iterate through
the space of all permutations, and for each permutation,
count the number of inversions and weight it by
the probability that this permutation will emerge.
Now, the issue, of course, is that this is a huge summation.
We have to sum over n factorially many terms.
And for each term, we have to count exactly how many
inversions there are in this permutation.
So we wonder if there's an easier way to go about this without
going through the counting procedure.
Now, let's introduce the notion of an indicator random variable.
Here we'll define for each ij the variable Vij as the
indicator variable of the event Xi greater than Xj.
That is, the random variable is equal to 1 if Xi is indeed
greater than Xj and 0 if otherwise.
With this definition in mind, we now see that the random
variable N can be expressed as a summation [over]
all pairs where i is less than j [of] variables Vij.
That is, we'll go over the pair of all ij where i is less
than j and count one if there's an
inversion, and 0 otherwise.
Now, if we were to take the expected value of N now, which
is the expected value of the sum over i less than j of Vij, then by the
linearity of expectation, this gives us the summation over all
i less than j of the expected value of Vij.
What is the expected value of Vij?
Well, this term is equal to the probability that Xi is
greater than Xj.
We can verify this through the definition of Vij.
What is the probability that Xi is greater than Xj?
Well, since we're drawing a uniformly random permutation,
then the chance Xi is greater than Xj is exactly the same as
the chance that Xj is greater than Xi.
And therefore, we know this number must be equal to 1/2.
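
In symbols, the steps so far read as follows (a sketch using the transcript's names $N$, $V_{ij}$, $X_i$; the bold $\mathbf{E}$ and $\mathbf{P}$ are assumed notation):

$$
N = \sum_{i<j} V_{ij},
\qquad
\mathbf{E}[N] = \sum_{i<j} \mathbf{E}[V_{ij}]
= \sum_{i<j} \mathbf{P}(X_i > X_j)
= \sum_{i<j} \tfrac12 .
$$

Linearity of expectation does not require the indicators $V_{ij}$ to be independent, which matters here because they are not.
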
With this in mind, we put this back into [the]
summation.
We know now that the summation is equal to 1/2 times the
total number of ways that we can find a pair ij, where i is
less than j.
And it remains to just evaluate the
quantity right here.
Now, to compute the count here, we can actually do it with a
simpler set first.
In particular, we'll count the pair of all ij, where i is not
equal to j, removing that constraint, that i
be less than j.
In this case, we know the size of this set is given by n
times n minus 1.
Well, n is the total number of ways to pick i.
And once i is chosen, there remain n minus 1
ways to pick j.
Now, in this set, we know that if the pair, for example, (1, 3)
appears, then (3, 1) also appears.
In fact, for every pair ij, there's both ij and ji.
However, originally we only care about the case where i is
less than j.
And hence we know the set we care about right here is
exactly half the size of the set here.
And therefore, we know the original expression evaluates
to 1/2 times 1/2 of n, n minus 1.
And this gives us the final answer.
Expected value of N is equal to 1/4, n, n minus 1.
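
For small n, the formula can be compared against the direct counting approach described earlier, which is slow but easy to write down. This is only a brute-force sanity check under the uniform-permutation assumption; the function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

def count_inversions(seats):
    """Number of pairs i < j with seats[i] > seats[j]."""
    n = len(seats)
    return sum(1 for i in range(n) for j in range(i + 1, n) if seats[i] > seats[j])

def expected_inversions(n):
    """Average inversion count over all n! equally likely arrangements."""
    perms = list(permutations(range(1, n + 1)))
    return Fraction(sum(count_inversions(p) for p in perms), len(perms))

# Compare the brute-force average with n(n - 1)/4 for small n.
for n in range(1, 7):
    assert expected_inversions(n) == Fraction(n * (n - 1), 4)

print(count_inversions((1, 3, 2)))  # 1
print(expected_inversions(4))       # 3
```

The loop already touches 720 permutations at n equal to 6, which illustrates why the indicator argument is preferable to enumeration.
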
So in this problem, we were able to compute the expected
value of a fairly complex random variable by breaking it
down into the summation of much simpler expected values
using the linearity of expectation plus the technique
of indicator random variables.


# 10. Indicator variables: the problem of joint lives

![](C:/Users/qp/Pictures/Screenshots/10. Indicator variables the problem of joint lives - 1.png)
![](C:/Users/qp/Pictures/Screenshots/10. Indicator variables the problem of joint lives - 2.png)
![](C:/Users/qp/Pictures/Screenshots/10. Indicator variables the problem of joint lives - 3.png)
![](C:/Users/qp/Pictures/Screenshots/10. Indicator variables the problem of joint lives - 4.png)
![](C:/Users/qp/Pictures/Screenshots/10. Indicator variables the problem of joint lives - 5.png)
![](C:/Users/qp/Pictures/Screenshots/10. Indicator variables the problem of joint lives - 6.png)
![](C:/Users/qp/Pictures/Screenshots/10. Indicator variables the problem of joint lives - 7.png)
![](C:/Users/qp/Pictures/Screenshots/10. Indicator variables the problem of joint lives - 8.png)
![](C:/Users/qp/Pictures/Screenshots/10. Indicator variables the problem of joint lives - 9.png)
![](C:/Users/qp/Pictures/Screenshots/10. Indicator variables the problem of joint lives - 10.png)
![](C:/Users/qp/Pictures/Screenshots/10. Indicator variables the problem of joint lives - 11.png)
![](C:/Users/qp/Pictures/Screenshots/10. Indicator variables the problem of joint lives - 12.png)
![](C:/Users/qp/Pictures/Screenshots/10. Indicator variables the problem of joint lives - 13.png)
![](C:/Users/qp/Pictures/Screenshots/10. Indicator variables the problem of joint lives - 14.png)
Hi.
In this session, we'll be talking about the problem of
joint lives.
And through it, we'll be introducing a useful tool
called indicator variables.
The problem of joint lives is the following.
We have 2m people, who are paired into m couples.
Each couple consists of a first partner
and a second partner.
And what we're told is that at some later point in time, each
individual, each person has a probability, p,
of being still alive.
And if it helps, you could think of this later point in
time as something more concrete, say 50
years down the road.
And we're also told that whether any person is alive is
independent of whether any other person is alive.
So this person being alive has no probabilistic effect on the
probability that his or her partner is still alive.
So let's define a couple of random variables, as the
problem says.
Let's let A be the number of people still alive 50 years
down the road.
This is random because each person has some probability of
being still alive.
And let S be the number of couples still alive.
And a couple being still alive is defined as both
partners in that same couple being still alive.
So if at least one person is no longer alive in that
couple, then we consider that couple as no longer alive.
So with these definitions, what the problem asks us to
calculate is this conditional expectation.
What is the expectation of S conditioned on A
being some little a?
Now, what this actually means in words is the following.
Suppose I tell you that there are exactly little a
people who are still alive 50 years from now.
Then given that, what is the expected number of couples who
are still alive?
And of course, this is going to be a function of what
little a is.
So it depends on exactly how many people are still alive.
Now, you might think and say, well, the
answer is pretty obvious.
It's probably a/2.
If there's little a people left and each couple has two
people, then if there's a people left, then it must be
a/2 couples left.
But in fact, that's actually not right, because in order
for a couple to be alive, both people have to be still alive
in that same couple.
And that may not be the case for all the a people who are
still alive.
They may not all happen to fall into couples.
So in fact, the answer should be less than a/2.
And exactly what it is is less obvious.
And we'll have to do some calculations
in order to do this.
And it seems, actually, that it might be difficult to do,
because S is not some clear random variable that we've
dealt with so far.
Now, the trick in this problem is to try to think about, can
we split S up into its component parts?
So what is S?
S is the number of couples who are still alive.
Let's just split this up into each individual couple.
And if we can somehow calculate whether or not any
given couple is still alive, maybe if we can then aggregate
that up, we can figure out what this conditional
expectation is.
And that really is the idea behind indicator variables.
So what exactly is an indicator variable?
Well, it's actually really just a
Bernoulli random variable.
It's either 0 or 1.
And we call it an indicator variable, just because in this
case, it's supposed to indicate whether or not a
certain event has occurred.
So let's define a couple indicator variables.
Let's let Xi be 1 if the first partner in couple i is still
alive 50 years from now.
Now, notice that we've actually specified that this
is the first partner.
So within any couple, let's let one of them be the first
partner and the second one be the second partner.
So this is just a Bernoulli random variable, but we call it an
indicator variable, because it's supposed to indicate
whether or not the first partner in couple i is still
alive: it is 1 if so, and 0 otherwise.
And similarly, let's define Yi to be 1 if the second partner
in couple i is still alive, and 0 otherwise.
So why do we bother with doing this?
Well, now we can define another random variable, Zi to
be the product of Xi times Yi.
Now, what is this?
This actually is also an indicator variable.
It's 1 if couple i is alive.
And it's 0 otherwise.
Now, why is this?
Well, this is the case because Zi is 1 if and only if Xi is 1
and Yi is 1.
If at least one of these two indicator variables is 0, then
their product is going to be 0.
So Zi is 1 if and only if the first partner in couple i is
alive and the second partner in couple i is alive, which is
exactly the event that that couple is still alive.
And so what we've done now is, we've kind of broken up things
into individual couples.
And we've defined a random variable that indicates whether
or not that couple is still alive. And now, what we do is,
we aggregate up, because it turns out that S now is just
the sum of all these Zi's, because S is a count of the
number of couples who are alive.
Zi tells you whether any given couple is still alive.
It's 1 if that couple is alive and 0 otherwise.
So if you just add up all the Zi's for i from 1 to m--
remember, there's m couples total--
then this will give you exactly S. And of course, this
is just Xi times Yi.
Now, we've rewritten S as this summation.
And now, we can plug this back into the original conditional
expectation that we want to calculate.
So the conditional expectation of S given that A is a is just
the expectation of the sum of Xi times Yi.
And now, we can use linearity of expectations, because this
is just a sum of m terms.
Expectation of a sum is just the sum of expectations.
And even in the conditional world, that's still true.
And so this turns out to be just a sum from i equals 1 to
m of the conditional expectation of each of these
Xi times Yi's.
And remember, we still need to keep the conditioning on A
equals a. And now, we can further simplify this, because
we can observe that it doesn't actually really matter, which
i we're talking about because of the symmetry--
all the people have the same probability
of being still alive.
And they're all independent.
And so because of that, each of these expectations, this
expectation is for a couple i, and then you
have another couple.
But all these are going to be the same, no matter what i is.
And so what we actually have is just m copies of the same
conditional expectation.
And so let's actually just be specific and just focus on,
say, we could focus on any of them.
Let's just focus on the first couple.
So now, the quantity that we want to find is this.
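
The reduction so far can be summarized in symbols (a sketch using the transcript's names $S$, $A$, $X_i$, $Y_i$, $Z_i$; the bold $\mathbf{E}$ is assumed notation):

$$
\begin{aligned}
S &= \sum_{i=1}^{m} Z_i = \sum_{i=1}^{m} X_i Y_i,\\
\mathbf{E}[S \mid A=a] &= \sum_{i=1}^{m} \mathbf{E}[X_i Y_i \mid A=a]
= m\,\mathbf{E}[X_1 Y_1 \mid A=a].
\end{aligned}
$$

The first equality in the second line is linearity of expectation in the conditional universe; the second uses the symmetry among couples.
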
Now, what's left to do is, we just need to figure out what
this conditional expectation is.
So let's do that.
What is the conditional expectation of X1, Y1 given
that A is a?
Well, remember what an expectation is.
An expectation is really just a weighted sum.
It's just, you take the value of the random variable, and
you weight it by the probability that that random
variable takes on that value.
Well, we know that X1 times Y1 is just Z1, which is a
Bernoulli random variable, so it can only take on two
values, either 1 or 0.
And the 0 case doesn't actually factor into the
expectation, because multiplying by 0 cancels out
its probability.
And so it's really just equal to the probability that X1, Y1
equals 1 given that A is a.
What is this equal to?
Well, X1 times Y1 can only equal 1 if X1 equals 1
and Y1 equals 1.
And now we can separate this out some more
using the multiplication rule.
First, let's focus on X1.
And then we'll focus on Y1.
So first, we'll say, what's the probability
that X1 equals 1.
And remember, we still need to keep the conditioning on A.
And then once we have X1 equals 1, then we add that to
the conditioning, and then we calculate what's the
probability that Y1 equals 1.
So this is really the same rule, except in the
conditional universe, where A equals a.
So now, we have to figure out what these two things are.
So the first one is, what is the probability that X1 equals
1, given that A equals a.
What does that mean in words?
That means, conditional on there being exactly little
a people who are still alive,
what's the probability that the first partner in the first
couple is still alive?
Well, because of the symmetry and all that, if we know that
there are a people left, then any set of a people out of the
original 2m is equally likely to be the a
who are still alive.
How many ways can we choose that?
Well, what we have is, we have 2m people.
And we need to choose a of them to be the ones who are
still alive.
So that's the total number of ways we can pick the a people
who are still alive out of the original 2m people.
Now, we want the first partner in the first couple to be
still alive.
So his place is already fixed.
He needs to be one of the a people who are left, which
leaves a minus 1 people left.
And those a minus 1 people who are still alive have to be
filled by 2m minus 1 people, because besides this first
partner in the first couple, there are now 2m minus 1
people left.
And those are the people from which we need to choose the a
minus 1 people who are also still alive.
So there's 2m minus 1 choose a minus 1.
And if you actually calculate this out, turns out that this
is actually just a/2m.
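
The claim that the ratio of the two counts simplifies to a/2m can be checked numerically. The sketch below uses arbitrary values of m and a chosen only for illustration; they are not from the lecture.

```{r}
# Hypothetical values for illustration: m couples, a people still alive.
m <- 5
a <- 4
# Sets of a survivors that include the first partner, over all sets of a survivors.
ratio <- choose(2 * m - 1, a - 1) / choose(2 * m, a)
c(ratio = ratio, a_over_2m = a / (2 * m))
```

The two entries should agree for any 1 <= a <= 2m, since choose(2m - 1, a - 1) / choose(2m, a) reduces algebraically to a / (2m).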
So another way to think about this, if the counting argument
didn't quite make sense is that there are basically a
slots left.
These are the a lucky people who are still alive.
And there are a slots left to be filled.
And each person, because each person has the same
probability p of being still alive, they have an equal
chance of being selected for one of these a lucky slots.
So consider the first slot of people who are still alive,
the a people who are still alive.
Every single person out of the original 2m is
vying for that spot.
And so the first partner in the first couple has 1/2m
chance of being selected for that first slot.
But he also has the same probability of being selected
for the second slot, and all the way
through the a-th slot.
So essentially, he has a chances to be selected to be
one of the people who are still alive.
And so that's why you get the probability of a/2m.
Now, let's think about the second one.
Now, we're conditioning on there being a people who are
still alive.
And we know also that one of those a people is the first
partner in the first couple.
Now, given that, what's the probability that the second
partner in the first couple is still alive?
Well, if you think about it, this probability is going to
be lower than this one, because we know now also that
one of the a slots has already been taken by the first
partner in the first couple.
So the second partner kind of has less chance of being
selected, because one of the slots has already been taken.
So it turns out that now, through a similar argument,
you get that it's going to be 2m minus 2 choose a minus 2,
over 2m minus 1 choose a minus 1.
And if you calculate this, you will
get a minus 1 over 2m minus 1.
And the only difference here you notice is that the fact
that this person, the first partner in the first couple is
known to be alive removes one of the slots available.
And so we get that the product of these two is this conditional
expectation.
But remember, we also have a factor of m that
we need to multiply by.
And so the final answer that we get is going to be a/2
times a minus 1 over 2m minus 1.
So that is our final answer.
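
As a rough check of this answer, one can enumerate every equally likely set of survivors in a small case and average the number of intact couples. The values of m and a and the labeling of couples below are assumptions made for this sketch only.

```{r}
# Hypothetical small instance: m = 3 couples, a = 4 people still alive.
m <- 3
a <- 4
# People are labeled 1..2m; persons 2k-1 and 2k form couple k (assumed labeling).
survivor_sets <- combn(2 * m, a)   # each column is one equally likely set of survivors
count_couples <- function(alive) {
  sum(sapply(1:m, function(k) all(c(2 * k - 1, 2 * k) %in% alive)))
}
mean_couples <- mean(apply(survivor_sets, 2, count_couples))
formula_value <- (a / 2) * (a - 1) / (2 * m - 1)
c(enumeration = mean_couples, formula = formula_value)
```

In this instance both numbers come out to 1.2, which is below a/2 = 2, in line with the sanity check that follows.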
So I have one more word about this.
You can also think of it as the same thing, because now we
know that one of the a slots has been taken.
So there's only a minus 1 slots to be filled.
And there's 2m minus 1 people vying for those slots, because
we know that the first partner in the first couple has been
fixed to be still alive.
And so we get a similar argument.
You get this expression.
Now, let's just make sure that this actually makes sense.
Well, we know that a is no more than 2m, because there
were 2m people to begin with, so the number of people who
are still alive can be no more than 2m.
And so because of that, you see that this thing here is
going to be no greater than 1, which means that this
conditional expectation, the answer that we're looking for
is going to be no more than a/2.
And so this is just a sanity check to make sure that our
answer is actually correct.
And that makes sense, because if you
have a people, there's no way that you can have more than
a/2 couples.
And you notice that for a less than 2m, this expression is
actually going to be less than a/2, because this part is
going to be less than 1.
And this just verifies our original intuition, that
because not every person who was left alive will just
happen to fall into the original couples, original
partners, then the number of couples who are still alive
will actually be less than a/2.
And only when a is exactly equal to 2m--
so if a is exactly equal to 2m, then this factor is going to be
equal to 1 and the answer is going to be equal to m.
So in that case, if all the original people are still
alive, then we'll have exactly all the original m couples who
are still alive.
But otherwise, it's going to be less than a/2.
So this answer, at least on the surface, seems to be
something that's reasonable.
So in this problem, we started out with something that seemed
fairly complicated.
And we've broken it down into constituent parts.
So we've taken one random variable and we've broken down
into all these indicator variables, each one
representing an individual couple.
And through that, the use of indicator variables and
linearity of expectations, we were able to calculate the
final answer.


## Course  /  Unit 4: Discrete random variables  /  Additional theoretical material

# 1. Functions

![](C:/Users/qp/Pictures/Screenshots/1. Functions - 1.png)
![](C:/Users/qp/Pictures/Screenshots/1. Functions - 2.png)
![](C:/Users/qp/Pictures/Screenshots/1. Functions - 3.png)
![](C:/Users/qp/Pictures/Screenshots/1. Functions - 4.png)
![](C:/Users/qp/Pictures/Screenshots/1. Functions - 5.png)
![](C:/Users/qp/Pictures/Screenshots/1. Functions - 6.png)
![](C:/Users/qp/Pictures/Screenshots/1. Functions - 7.png)
![](C:/Users/qp/Pictures/Screenshots/1. Functions - 8.png)
![](C:/Users/qp/Pictures/Screenshots/1. Functions - 9.png)
![](C:/Users/qp/Pictures/Screenshots/1. Functions - 10.png)
![](C:/Users/qp/Pictures/Screenshots/1. Functions - 11.png)
![](C:/Users/qp/Pictures/Screenshots/1. Functions - 12.png)
![](C:/Users/qp/Pictures/Screenshots/1. Functions - 13.png)
We are defining a random variable as a real valued
function on the sample space.
So this is a good occasion to make sure that we understand
what a function is.
To define a function, we start with two sets.
One set--
call it A--
is the domain of the function.
And we have a second set, call it B.
Then a function is a rule that for any element of A
associates an element of B. And we use a notation of this
kind to indicate that we are dealing with a function f that
maps elements of A into elements of B.
Now, two elements of A may be mapped to the same element of
B. This is allowed.
What is important, however, is that every element of A is
mapped to exactly one element of B, not more.
But it is also possible that we have some elements of B
that do not correspond to any of the elements of A.
Now, I said that a function is a rule that assigns points of
A to points in B. But what exactly do we mean by a rule?
If we want to be more precise, a function would
be defined as follows.
It would be defined as a set of pairs of values.
It would be a set of pairs of the form x, y such that x is
always an element of A, y is always an element of B, and
also-- most important--
each x in A appears in exactly one pair.
So this would be a formal definition of
what a function is.
It is a collection of ordered pairs of this kind.
As a concrete example, let us start with the set consisting
of these elements here.
And let B be the set of real numbers.
And consider the function that corresponds to what we usually
call the square.
So it's a function that squares its argument.
Then this function would be represented by the following
collection of pairs.
So this is the value of x.
And this is the corresponding value of y.
Any particular x shows up just once in this
collection of pairs.
But a certain y--
for example, y equal to 1--
shows up twice, because minus 1 and plus 1 both map to the
same element of B.
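
The ordered-pair view can be sketched in R. The transcript does not show which elements the slide's set contains, so the domain below is an assumption chosen to match the remark about minus 1 and plus 1.

```{r}
# Assumed domain for illustration; the slide's actual set is not shown in the transcript.
A <- c(-1, 0, 1)
sq_pairs <- data.frame(x = A, y = A^2)   # the ordered pairs (x, x^2)
sq_pairs
# Each x appears in exactly one pair, which is what makes this a function.
each_x_once <- all(table(sq_pairs$x) == 1)
# The value y = 1 appears twice, because -1 and 1 both map to 1.
times_y_is_1 <- sum(sq_pairs$y == 1)
c(each_x_once = each_x_once, times_y_is_1 = times_y_is_1)
```

The data frame also plays the role of the table representation discussed next, which works only because this A is finite.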
Now, this is a representation in terms of ordered pairs.
But we could also think of the function as being
described by a table.
We could, for instance, put this information here in a
form of a table of this kind and say that this table
describes the function.
For any element x, it tells us what the
corresponding element y is.
However, when the set A is an infinite set it is not clear
what we might mean by saying a table, an infinite table,
whereas this definition in terms of
ordered pairs still applies.
For example, if you're interested in the function
which is, again, the square function from the real
numbers, the way you would specify that function
abstractly would be as follows.
You could write, it's the set of all pairs of this form such
that x is a real number.
And now such pairs, of course, belong to the two dimensional
plane because it's a pair of numbers.
So this set here can be viewed as a formal definition or a
specification of the squaring function.
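
In set-builder notation, the set being described would presumably be written as follows (the slide itself is not reproduced in this transcript):

$$
f = \{(x, x^2) : x \in \mathbb{R}\} \subset \mathbb{R}^2.
$$
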
Now, what this set is is something that we
can actually plot.
If we go in the two dimensional plane, the points
of this form are exactly the points that belong to the
graph of the square function.
So this abstract definition, really all that it says is
that a function is the same thing as the
plot of that function.
But it's important here to make a distinction.
The function is the entire plot--
so this set here is the function f--
whereas if I tell you a specific number x, the
corresponding value here would be f of x.
So here x is a number and f of x is also a number.
And those two values, x and f of x, define this particular
point on this plot.
But the function itself is the entire plot.
Let us also take this occasion to talk a little bit about the
notation and the proper way of talking about functions.
Here is the most common way that one
would describe a function.
And this is an appropriate way.
We've described the domain.
We've described the set on which the
function takes values.
And I'm telling you for any x in that set what the value of
the function is.
On the other hand, sometimes people use looser
language. For example, they would say,
"the function x squared."
What does that mean?
Well, what this means is exactly this statement.
Now let us consider this function.
The function f--
again, from the reals to the reals--
that's defined by f of z equal to z squared.
Is this a different function or is it the same function?
It's actually the same function, because these two
involve the same sets.
And they produce their outputs, the values of f,
using exactly the same rule.
They take an argument and they square that argument.
Now, if you were to use informal notation, you would
be referring to that second function as
the function z squared.
And now, if you use informal language, it's less clear that
the function x squared and the function z squared are one and
the same thing, whereas with this terminology here, it
would be pretty clear that we're talking
about the same function.
Finally, suppose that we have already defined a function.
How should we refer to it in general?
Should we call it the function f, or should we say the
function f of x?
Well, when x is a number, f of x is also a number.
So f of x is not really a function.
The appropriate language is this one.
We talk about the function f, although quite often, people
will abuse language and they will use this terminology.
But it's important to keep in mind what we really mean.
The idea is that we need to think of a function as some
kind of box or even a computer program, if you wish, that
takes inputs and produces outputs.
And there's a distinction between f, which is the box,
and the value f of x that the function takes if we feed it
a specific argument.
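
The "box" picture maps naturally onto functions in a programming language. The following is only an illustrative sketch; the names `f` and `g` and the test inputs are choices made here.

```{r}
f <- function(x) x^2   # "f of x equals x squared"
g <- function(z) z^2   # same rule, different name for the argument
inputs <- c(-2, -1, 0, 0.5, 3)          # arbitrary test inputs
same_outputs <- all(f(inputs) == g(inputs))
same_outputs       # the two agree on every input tried
is.function(f)     # f itself is the box
is.numeric(f(3))   # f(3) is a number, not a function
```

Agreement on a handful of inputs does not prove that two functions are equal, but here the two definitions differ only in the name of the dummy argument, which is the point made in the lecture.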


# 2. The variance of the geometric PMF

![](C:/Users/qp/Pictures/Screenshots/2. The variance of the geometric PMF - 1.png)
![](C:/Users/qp/Pictures/Screenshots/2. The variance of the geometric PMF - 2.png)
![](C:/Users/qp/Pictures/Screenshots/2. The variance of the geometric PMF - 3.png)
![](C:/Users/qp/Pictures/Screenshots/2. The variance of the geometric PMF - 4.png)
![](C:/Users/qp/Pictures/Screenshots/2. The variance of the geometric PMF - 5.png)
![](C:/Users/qp/Pictures/Screenshots/2. The variance of the geometric PMF - 6.png)
![](C:/Users/qp/Pictures/Screenshots/2. The variance of the geometric PMF - 7.png)
![](C:/Users/qp/Pictures/Screenshots/2. The variance of the geometric PMF - 8.png)
![](C:/Users/qp/Pictures/20221005_153039.jpg)
![](C:/Users/qp/Pictures/20221005_153113.jpg)
In this segment, we will derive the formula for the
variance of the geometric PMF.
The argument will be very similar to the argument that
we used to derive the expected value of the geometric PMF.
And it relies on the memorylessness property of
geometric random variables.
So let X be a geometric random variable with
some parameter p.
The way to think about X is like the number of coin flips
that it takes until we obtain heads for the first time,
where p is the probability of heads at each toss.
Recall now the memorylessness property.
If I tell you that X is bigger than 1--
which means that the first trial was a failure--
we obtained tails.
Given that event, the remaining number of tosses has
the same geometric PMF as if we were just
starting at this time.
So it has the same geometric PMF as the unconditional PMF
of X. And this is the property that we exploited in order to
find the expected value of X.
Now let us take this observation and add one to the
random variables involved and turn this statement to the
following version.
The conditional PMF of X--
which is X minus 1, plus 1--
is the same as the unconditional PMF of
X plus 1.
So it's the same statement as before except that we added 1.
One consequence of the memorylessness that we have
already seen and exploited is that the expected value of X
in the conditional universe where the first coin flip was
wasted is equal to 1--
that's the wasted coin flip--
plus how long you expect to have to flip the coin until
you obtain heads for the first time, starting
from the second flip.
Since the conditional distribution of X in this
universe is the same as the unconditional distribution of
this random variable, it means that the corresponding
expected value in this universe is going to be equal
to the expected value of this random variable, which is 1
plus the expected value of X. And by exactly the same
argument, the random variable X squared has the same
distribution in the conditional universe as the
random variable X plus 1 squared in the
unconditional universe.
So since X in the conditional universe has the same
distribution as X plus 1, it means that X squared in the
conditional universe has the same distribution as X plus 1
squared in the unconditional universe.
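
In symbols, the two consequences of memorylessness described above can be summarized as follows (notation assumed here, since the slide is not reproduced):

$$
\mathbf{E}[X \mid X > 1] = \mathbf{E}[X + 1] = 1 + \mathbf{E}[X],
\qquad
\mathbf{E}[X^2 \mid X > 1] = \mathbf{E}[(X + 1)^2].
$$
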
So now let us take those facts and use a divide and conquer
method to calculate the expected value of X squared.
We will use exactly the same method that we used in order
to calculate the expected value.
We separate into two scenarios.
In one scenario, X is equal to 1.
And then we have the expected value of X squared given that
X is equal to 1.
And then we have another scenario--
the scenario that X is bigger than 1.
And then we have the expected value of X squared given that
X is bigger than 1.
So this is just the total expectation theorem.
Now let us calculate terms.
The probability that the first toss results in success, that
X is equal to 1-- this is p.
And if X is equal to 1, then the value of X squared is also
equal to 1.
And then there is probability 1 minus p that the first trial
was not a success.
So we get to continue.
We have this conditional expectation here.
But it is equal to this unconditional
expectation up there.
And now let us expand the terms in this quadratic and
write this as expected value of X squared plus twice the
expected value of X plus 1.
Now we know what this expected value here is.
The expected value of a geometric is just 1/p.
And what we're left with is an equation that involves a
single unknown.
Namely, this quantity is the unknown.
And we can solve this linear equation for this unknown.
We carry out some algebra, which is not so
interesting by itself.
And after we carry out the algebra, what we obtain is
that the expected value of X squared is equal to 2 over p
squared minus 1 over p.
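
The algebra that the lecture skips can be sketched as follows, using the total expectation theorem and $\mathbf{E}[X] = 1/p$ as stated above:

$$
\begin{aligned}
\mathbf{E}[X^2] &= p \cdot 1 + (1-p)\,\mathbf{E}[(X+1)^2] \\
&= p + (1-p)\left(\mathbf{E}[X^2] + \frac{2}{p} + 1\right), \\
p\,\mathbf{E}[X^2] &= p + (1-p)\left(\frac{2}{p} + 1\right) = \frac{2}{p} - 1, \\
\mathbf{E}[X^2] &= \frac{2}{p^2} - \frac{1}{p}.
\end{aligned}
$$
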
And then we use the formula that the variance of a random
variable is equal to the expected value of the square
of that random variable minus the square of
the expected value.
We already know what that expected value is.
We found the expected value of the square.
And putting all that together, we obtain a final answer.
And this is the expression for the variance of a geometric
random variable.
It goes without saying that for this calculation to make
sense, we need to assume that the parameter that we're
dealing with is positive.
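
Subtracting the square of the mean, $1/p^2$, from the second moment gives a variance of $(1-p)/p^2$. A quick numerical check is sketched below; the value of p and the truncation point of the sum are arbitrary choices made for this example.

```{r}
p <- 0.3                       # arbitrary illustrative parameter
k <- 1:2000                    # truncate the infinite sum; the tail is negligible for this p
pmf <- p * (1 - p)^(k - 1)     # geometric PMF on k = 1, 2, ...
ex  <- sum(k * pmf)            # numerical E[X]
ex2 <- sum(k^2 * pmf)          # numerical E[X^2]
rbind(numerical = c(mean = ex, second_moment = ex2, variance = ex2 - ex^2),
      formula   = c(1 / p, 2 / p^2 - 1 / p, (1 - p) / p^2))
```

The PMF is written out by hand because R's built-in `dgeom` counts failures before the first success, so its support starts at 0 rather than 1.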


# 3. The inclusion-exclusion formula

![](C:/Users/qp/Pictures/Screenshots/3. The inclusion-exclusion formula - 1.png)
![](C:/Users/qp/Pictures/Screenshots/3. The inclusion-exclusion formula - 2.png)
![](C:/Users/qp/Pictures/Screenshots/3. The inclusion-exclusion formula - 3.png)
![](C:/Users/qp/Pictures/Screenshots/3. The inclusion-exclusion formula - 4.png)
![](C:/Users/qp/Pictures/Screenshots/3. The inclusion-exclusion formula - 5.png)
![](C:/Users/qp/Pictures/Screenshots/3. The inclusion-exclusion formula - 6.png)
![](C:/Users/qp/Pictures/Screenshots/3. The inclusion-exclusion formula - 7.png)
![](C:/Users/qp/Pictures/Screenshots/3. The inclusion-exclusion formula - 8.png)
![](C:/Users/qp/Pictures/Screenshots/3. The inclusion-exclusion formula - 9.png)
![](C:/Users/qp/Pictures/Screenshots/3. The inclusion-exclusion formula - 10.png)
![](C:/Users/qp/Pictures/Screenshots/3. The inclusion-exclusion formula - 11.png)
![](C:/Users/qp/Pictures/Screenshots/3. The inclusion-exclusion formula - 12.png)
![](C:/Users/qp/Pictures/Screenshots/3. The inclusion-exclusion formula - 13.png)
![](C:/Users/qp/Pictures/Screenshots/3. The inclusion-exclusion formula - 14.png)
![](C:/Users/qp/Pictures/Screenshots/3. The inclusion-exclusion formula - 15.png)
![](C:/Users/qp/Pictures/Screenshots/3. The inclusion-exclusion formula - 16.png)
In this segment, we develop the inclusion-exclusion
formula, which is a beautiful generalization of a formula
that we have seen before.
Let us look at this formula and remind
ourselves what it says.
If we have two sets, A1 and A2, and we're interested in
the probability of their union, how can we find it?
We take the probability of the first set, we add to it the
probability of the second set, but then we realize that by
doing so we have double counted
this part of the diagram.
And so we need to correct for that and we need to subtract
the probability of this intersection.
And that's how this formula comes about.
Can we generalize this thinking, let's say, to the
case of three events?
Suppose that we have three events, A1, A2, and A3.
And we want to calculate the probability of their union.
We first start by adding the probabilities of
the different sets.
But then we realize that, for example, this part of the
diagram has been counted twice.
It shows up once inside the probability of A1 and once
inside the probability of A2.
So, for this reason, we need to make a correction and we
need to subtract the probability of this
intersection.
Similarly, subtract the probability of that
intersection and of this one.
So we subtract the probabilities of these
intersections.
But, actually, the intersections are not just
what I drew here.
The intersections also involve this part.
So now, let us just focus on this part of the diagram here.
A typical element that belongs to all three of the sets will
show up once here, once here and once there.
But it will also show up in all of these intersections.
And so it shows up three times with a plus sign, three times
with a minus sign, which means that these elements will not
be counted at all.
In order to count them, we need to add one more term
which is the probability of the three way intersection.
So this is the formula for the probability of the union of
three events.
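
Written out, the three-event formula that this argument leads to is:

$$
\begin{aligned}
\mathbf{P}(A_1 \cup A_2 \cup A_3) ={}& \mathbf{P}(A_1) + \mathbf{P}(A_2) + \mathbf{P}(A_3) \\
&- \mathbf{P}(A_1 \cap A_2) - \mathbf{P}(A_1 \cap A_3) - \mathbf{P}(A_2 \cap A_3) \\
&+ \mathbf{P}(A_1 \cap A_2 \cap A_3).
\end{aligned}
$$
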
It has a rationale similar to this formula, and you can
convince yourself that it is a correct formula by just
looking at the different pieces of this diagram and
making sure that each one of them is accounted for properly.
But instead of working in terms of such a picture, let
us think about a more formal derivation.
And the formal derivation will use a beautiful trick.
Namely, indicator functions.
So here is the formula that we want to establish.
And let us remind ourselves what indicator functions are.
To any set or event, we can associate
an indicator function.
Let's say that this is the set Ai.
We're going to associate an indicator function, call it
Xi, which is equal to 1 when the outcome is inside this
set, and it's going to be 0 when the outcome is outside.
What is the indicator function of the complement?
The indicator function of the complement is 1 minus the
indicator of the event.
Why is this?
If the outcome is in the complement, then Xi is equal
to 0, and this expression is equal to 1.
On the other hand, if the outcome is inside Ai, then the
indicator function will be equal to 1 and this quantity
is going to be equal to 0.
If we have the intersection of two events, Ai and Aj, what is
their indicator function?
It is Xi times Xj.
This expression is equal to 1, if and only if, Xi is equal to
1 and Xj is equal to 1, which happens, if and only if, the
outcome is inside Ai and also inside Aj.
Now, what about the indicator of the intersection of the
complements?
Well, it's an intersection.
So the associated indicator function is going to be the
product of the indicator function of the first set,
which is 1 minus Xi times the indicator function of the
second set, which is 1 minus Xj.
And finally, what is the indicator
function of this event?
Here we remember De Morgan's Laws.
De Morgan's Laws tell us that the complement of this set--
the complement of a union--
is the intersection of the complements.
So this event here is the complement of that event.
And, therefore, the associated indicator function is going to
be 1 minus this expression.
And if we were dealing with more than two sets--
and here we had, for example, three way intersections--
you would get the product of three terms.
And if we had a three way union, we would get a similar
expression, except that here we would have, again, a
product of three terms instead of two.
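
The indicator identities above can be checked by listing all four combinations of values that two indicators can take. This is only a small sketch; the variable names are chosen here.

```{r}
# All four combinations of indicator values for two events Ai and Aj.
xi <- c(0, 0, 1, 1)
xj <- c(0, 1, 0, 1)
in_union <- as.numeric(xi == 1 | xj == 1)   # indicator of the union, computed directly
via_de_morgan <- 1 - (1 - xi) * (1 - xj)    # the expression obtained from De Morgan's laws
cbind(xi, xj, in_union, via_de_morgan)
all(in_union == via_de_morgan)
```

The last two columns of the printed table coincide in every row, which is the claim that the indicator of the union is 1 minus the product of the complements' indicators.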
So now, let us put to use what we have done so far.
We are interested in the probability that the outcome
falls in the union of three sets.
Now, an important fact to remember is that the
probability of an event is the same as the expected value of
the indicator of that event.
This is because the indicator is equal to 1, if and only if,
the outcome happens to be inside that set.
And so the contribution that we get to the expectation is 1
times the probability that the indicator is 1, which is just
this probability.
Now, the indicator of a three way union is going to be, by
what we just discussed, 1 minus a product of this kind,
but now with three terms.
Let us now calculate this expectation by expanding the
product involved.
We have this first term, then, when we multiply those three
terms together, we're going to get a bunch of contributions.
One contribution with a minus sign is 1 times 1 times 1.
Another contribution would be minus minus--
that's a plus--
X1 times 1 times 1.
And similarly, we get a contribution of X2 and X3.
And then we have a contribution such as X1 times
X2 times 1.
And if you look at the minus signs--
there are three minuses involved-- so, overall, it's
going to be a minus.
Minus X1 times X2.
And then there are going to be similar terms, such as
X1 X3 and X2 X3.
And, finally, there's going to be a term X1
times X2 times X3.
There's a total of four minus signs involved, so everything
shows up in the end with a plus sign.
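
Collecting the terms just listed, and letting the two constant terms cancel, the expansion reads:

$$
1 - (1 - X_1)(1 - X_2)(1 - X_3) = X_1 + X_2 + X_3 - X_1 X_2 - X_1 X_3 - X_2 X_3 + X_1 X_2 X_3.
$$
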
So the probability of this event is equal to the
expectation of this random variable here.
We notice that the ones cancel out.
The expected value of X1 for an indicator variable is just
the probability of that event.
And we get this term.
The expected value of X2 and X3 give us these terms.
The expected value of X1 times X2.
This is the indicator random variable of the intersection.
So the expected value of this term is just the probability
of the intersection of A1 and A2.
And, similarly, these terms here give rise to those two
terms here.
Finally, X1 times X2 times X3 is the indicator variable for
the event A1 intersection A2 intersection A3.
Therefore, the expected value of this term, is equal to this
probability.
And, therefore, we have established exactly the
formula that we wanted to establish.
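
The same calculation can be replayed numerically on a small sample space. The sample space and the three events below are assumptions made up for this sketch; with equally likely outcomes, the expectation of an indicator is simply its mean.

```{r}
omega <- 1:12                        # assumed sample space, all outcomes equally likely
x1 <- as.numeric(omega %% 2 == 0)    # indicator of A1: even outcomes
x2 <- as.numeric(omega %% 3 == 0)    # indicator of A2: multiples of 3
x3 <- as.numeric(omega <= 4)         # indicator of A3: outcomes 1 to 4
# Probability of the union as the expectation of its indicator.
direct <- mean(1 - (1 - x1) * (1 - x2) * (1 - x3))
# The same probability from the three-event inclusion-exclusion formula.
by_formula <- mean(x1) + mean(x2) + mean(x3) -
  mean(x1 * x2) - mean(x1 * x3) - mean(x2 * x3) +
  mean(x1 * x2 * x3)
c(direct = direct, by_formula = by_formula)
```

In this example both values equal 0.75, because 9 of the 12 outcomes belong to at least one of the three events.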
Now this derivation that we carried out here, there's
nothing special about the case of three.
We could have the union of many more events, we would
just have here the product of more terms, and we would need
to carry out the multiplication and we would
get cross terms of all types involving just one of the
indicator variables, or products of two indicator
variables, or products of three indicator
variables, and so on.
And after you carry out this exercise and keep track of the
various terms, you end up with this general version of what
is called the inclusion-exclusion formula.
So the probability of a union is--
there's the sum of the probabilities, but then you
subtract all possible probabilities of two way
intersections.
Then we add probabilities of three way intersections, then
you subtract probabilities of four way intersections, and
you keep going this way, alternating signs, until you
get to the last term, which is the probability of the
intersection of all the events involved.
And this exponent here of n minus 1 is the exponent that
you need so that the last term has the correct sign.
So, for example, if n is equal to 3, the exponent would be 2,
so this would be a plus sign, which is consistent with what
we got here.
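
The general formula being described can be written as follows, in the standard form assumed here since the slide is not reproduced:

$$
\mathbf{P}\Big(\bigcup_{i=1}^{n} A_i\Big) = \sum_{i} \mathbf{P}(A_i) - \sum_{i<j} \mathbf{P}(A_i \cap A_j) + \sum_{i<j<k} \mathbf{P}(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n-1}\,\mathbf{P}(A_1 \cap \cdots \cap A_n).
$$
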
So this is a formula that is quite useful when you want to
calculate probabilities of unions of events.
But also, this derivation using indicator functions, is
quite beautiful.
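
For a finite sample space with equally likely outcomes, the general formula can be sketched as a short R function. The function name `inclusion_exclusion`, the matrix layout, and the four example events are all assumptions made for this illustration.

```{r}
# Rows = equally likely outcomes, columns = indicators of the events (assumed setup).
inclusion_exclusion <- function(ind) {
  n <- ncol(ind)
  total <- 0
  for (k in 1:n) {
    subsets <- combn(n, k, simplify = FALSE)   # all k-element sets of event indices
    # Probability of each k-way intersection = mean of the product of its indicators.
    probs <- sapply(subsets, function(s) mean(apply(ind[, s, drop = FALSE], 1, prod)))
    total <- total + (-1)^(k - 1) * sum(probs)  # alternating signs
  }
  total
}
omega <- 1:30
ind <- cbind(omega %% 2 == 0, omega %% 3 == 0, omega %% 5 == 0, omega <= 7) * 1
ie_value <- inclusion_exclusion(ind)
direct_value <- mean(rowSums(ind) > 0)   # fraction of outcomes in at least one event
c(formula = ie_value, direct = direct_value)
```

The number of terms grows as 2^n - 1, so this direct evaluation is practical only for a small number of events.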


# 4. Independence of random variables versus independence of events

![](C:/Users/qp/Pictures/Screenshots/4. Independence of random variables versus independence of events - 1.png)
![](C:/Users/qp/Pictures/Screenshots/4. Independence of random variables versus independence of events - 2.png)
![](C:/Users/qp/Pictures/Screenshots/4. Independence of random variables versus independence of events - 3.png)
![](C:/Users/qp/Pictures/Screenshots/4. Independence of random variables versus independence of events - 4.png)
![](C:/Users/qp/Pictures/Screenshots/4. Independence of random variables versus independence of events - 5.png)
![](C:/Users/qp/Pictures/Screenshots/4. Independence of random variables versus independence of events - 6.png)
![](C:/Users/qp/Pictures/Screenshots/4. Independence of random variables versus independence of events - 7.png)
![](C:/Users/qp/Pictures/Screenshots/4. Independence of random variables versus independence of events - 8.png)
![](C:/Users/qp/Pictures/Screenshots/4. Independence of random variables versus independence of events - 9.png)
By now, we have defined the notion of independence of
events and also the notion of
independence of random variables.
The two definitions look fairly similar, but the
details are not exactly the same, because the two
definitions refer to different situations.
For two events, we know what it means for them to be
independent.
The probability of their intersection is the product of
their individual probabilities.
Now, to make a relation with random variables, we introduce
the so-called indicator random variables.
So for example, the random variable X is defined to be
equal to 1 if event A occurs and to be equal
to 0 if event A
does not occur.
And there is a similar definition for random variable
Y, in terms of event B.
In particular, the probability that random variable X takes
the value of 1, this is the probability
that event A occurs.
It turns out that the independence of the two
events, A and B, is equivalent to the independence of the two
indicator random variables.
And there is a similar statement, which
is true more generally.
That is, n events are independent if and only if the
associated n indicator random variables are independent.
This is a useful statement, because it sometimes
allows us, instead of manipulating events, to
manipulate random variables, and vice versa.
And depending on the context, one may be
easier than the other.
Now, the intuitive content is that events A and B are
independent if the occurrence of event A does not change
your beliefs about B. And in terms of random variables, one
random variable taking a certain value, which indicates
whether event A has occurred or not, does not give you
any information about the other random variable, which
would tell you whether event B has occurred or not.
It is instructive now to go through the derivation of this
fact, at least for the case of two events, because it gives
us perhaps some additional understanding about the
precise content of the definitions we have
introduced.
So let us suppose that random variables X and Y are
independent.
What does that mean?
Independence means that the joint PMF of the two random
variables, X and Y, factors as a product of the corresponding
marginal PMFs.
And this factorization must be true no matter what arguments
we use inside the joint PMF.
And the pair X and Y in this instance has a total
of four possible values.
These are the combinations of zeroes and
ones that we can form.
And for this reason, we have a total of four equations.
These four equalities are what is required for X and Y to be
independent.
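
In PMF notation, the four equalities referred to here are presumably the following (the slide is not reproduced in this transcript):

$$
p_{X,Y}(x, y) = p_X(x)\, p_Y(y) \quad \text{for each } (x, y) \in \{(1,1),\ (1,0),\ (0,1),\ (0,0)\}.
$$
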
So suppose that this is true, that the random variables are
independent.
Let us take this first relation and write it in
probability notation.
The random variable X taking the value of 1, that's the
same as event A occurring.
And random variable Y taking the value of 1, that's the
same as event B occurring.
So the joint PMF evaluated at 1, 1 is the probability that
events A and B both occur.
On the other side of the equation, we have the
probability that X is equal to 1, which is the probability
that A occurs, and similarly, the probability that B occurs.
But if this is true, then by definition, A and B are
independent events.
So we have verified one direction of this statement.
If the random variables are independent, then events A and
B are independent.
Now, we would like to verify the reverse statement.
So suppose that events A and B are independent.
In that case, this relation is true.
And as we just argued, this relation is the same as this
relation but just written in different notation.
So we have shown that if A and B are independent, this
relation will be true.
But how about the remaining three relations?
We have more work to do.
Here's how we can proceed.
If A and B are independent, we have shown some time ago that
events A and B complement will also be independent.
Intuitively, A doesn't tell you anything about
B occurring or not.
So A does not tell you anything about whether B
complement will occur or not.
Now, these two events being independent, by the definition
of independence, we have that the probability of A
intersection with B complement is the product of the
probabilities of A and of B complement.
And then we realize that this equality, if written in PMF
notation, corresponds exactly to this equation here.
Event A corresponds to X taking the value of 1, event B
complement corresponds to the event that Y takes
the value of 0.
By a similar argument, B and A complement will be
independent.
And we translate that into probability notation.
And then we translate this equality into PMF notation.
And we get this relation.
Finally, using the same property that we used to do
the first step here, we have that A complement and B
complement are also independent.
And by following the same line of reasoning, this implies the
fourth relation as well.
So we have verified that if events A and B are
independent, then we can argue that all of these four
equations will be true.
And therefore, random variables X and Y will also be
independent.
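
The four relations that the argument above points to are not reproduced in these notes. Assuming X is the indicator of event A and Y is the indicator of event B (so X = 1 exactly when A occurs, and similarly for Y), they can be written out as follows, each PMF statement next to the event statement it translates to:

$$
\begin{aligned}
p_{X,Y}(1,1) = p_X(1)\,p_Y(1) \quad &\Longleftrightarrow \quad \mathbf{P}(A \cap B) = \mathbf{P}(A)\,\mathbf{P}(B) \\
p_{X,Y}(1,0) = p_X(1)\,p_Y(0) \quad &\Longleftrightarrow \quad \mathbf{P}(A \cap B^c) = \mathbf{P}(A)\,\mathbf{P}(B^c) \\
p_{X,Y}(0,1) = p_X(0)\,p_Y(1) \quad &\Longleftrightarrow \quad \mathbf{P}(A^c \cap B) = \mathbf{P}(A^c)\,\mathbf{P}(B) \\
p_{X,Y}(0,0) = p_X(0)\,p_Y(0) \quad &\Longleftrightarrow \quad \mathbf{P}(A^c \cap B^c) = \mathbf{P}(A^c)\,\mathbf{P}(B^c)
\end{aligned}
$$

Independence of X and Y requires all four relations on the left, while independence of A and B is only the first relation on the right. The other three follow from it through the complement property used in the argument.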


## Course  /  Unit 4: Discrete random variables  /  Problem Set 4

# 1. Tosses of a biased coin

![](C:/Users/qp/Pictures/Screenshots/1. Tosses of a biased coin - 1.png)
![](C:/Users/qp/Pictures/Screenshots/1. Tosses of a biased coin - 2.png)
![](C:/Users/qp/Pictures/Screenshots/1. Tosses of a biased coin - 3.png)
![](C:/Users/qp/Pictures/Screenshots/1. Tosses of a biased coin - 4.png)
![](C:/Users/qp/Pictures/Screenshots/1. Tosses of a biased coin - 5.png)
![](C:/Users/qp/Pictures/Screenshots/1. Tosses of a biased coin - 6.png)
![](C:/Users/qp/Pictures/Screenshots/1. Tosses of a biased coin - 7.png)
![](C:/Users/qp/Pictures/20220930_220949.jpg)
![](C:/Users/qp/Pictures/20220930_221042.jpg)


# 2. Three-sided dice

![](C:/Users/qp/Pictures/Screenshots/2. Three-sided dice - 1.png)
![](C:/Users/qp/Pictures/Screenshots/2. Three-sided dice - 2.png)
![](C:/Users/qp/Pictures/Screenshots/2. Three-sided dice - 3.png)
![](C:/Users/qp/Pictures/20221001_011654.jpg)
![](C:/Users/qp/Pictures/20221001_011819.jpg)
![](C:/Users/qp/Pictures/20221001_012027.jpg)


# 3. PMF, expectation, and variance

![](C:/Users/qp/Pictures/Screenshots/3. PMF, expectation, and variance - 1.png)
![](C:/Users/qp/Pictures/Screenshots/3. PMF, expectation, and variance - 2.png)
![](C:/Users/qp/Pictures/Screenshots/3. PMF, expectation, and variance - 3.png)
![](C:/Users/qp/Pictures/Screenshots/3. PMF, expectation, and variance - 4.png)
![](C:/Users/qp/Pictures/Screenshots/3. PMF, expectation, and variance - 5.png)
![](C:/Users/qp/Pictures/Screenshots/3. PMF, expectation, and variance - 6.png)
![](C:/Users/qp/Pictures/Screenshots/3. PMF, expectation, and variance - 7.png)
![](C:/Users/qp/Pictures/Screenshots/3. PMF, expectation, and variance - 8.png)
![](C:/Users/qp/Pictures/20221001_013328.jpg)
![](C:/Users/qp/Pictures/20221001_013412.jpg)
![](C:/Users/qp/Pictures/20221001_013534.jpg)
![](C:/Users/qp/Pictures/20221001_013627.jpg)


# 4. Joint PMF

![](C:/Users/qp/Pictures/Screenshots/4. Joint PMF - 1.png)
![](C:/Users/qp/Pictures/Screenshots/4. Joint PMF - 2.png)
![](C:/Users/qp/Pictures/Screenshots/4. Joint PMF - 3.png)
![](C:/Users/qp/Pictures/Screenshots/4. Joint PMF - 4.png)
![](C:/Users/qp/Pictures/Screenshots/4. Joint PMF - 5.png)
![](C:/Users/qp/Pictures/Screenshots/4. Joint PMF - 6.png)
![](C:/Users/qp/Pictures/Screenshots/4. Joint PMF - 7.png)
![](C:/Users/qp/Pictures/Screenshots/4. Joint PMF - 8.png)
![](C:/Users/qp/Pictures/20221001_161958.jpg)
![](C:/Users/qp/Pictures/20221001_162032.jpg)
![](C:/Users/qp/Pictures/20221001_162103.jpg)


# 5. Indicator variables

![](C:/Users/qp/Pictures/Screenshots/5. Indicator variables - 1.png)
![](C:/Users/qp/Pictures/Screenshots/5. Indicator variables - 2.png)
![](C:/Users/qp/Pictures/Screenshots/5. Indicator variables - 3.png)
![](C:/Users/qp/Pictures/Screenshots/5. Indicator variables - 4.png)
![](C:/Users/qp/Pictures/Screenshots/5. Indicator variables - 5.png)
![](C:/Users/qp/Pictures/Screenshots/5. Indicator variables - 6.png)
![](C:/Users/qp/Pictures/Screenshots/5. Indicator variables - 7.png)
![](C:/Users/qp/Pictures/Screenshots/5. Indicator variables - 8.png)
![](C:/Users/qp/Pictures/Screenshots/5. Indicator variables - 9.png)


# 6. True or False

![](C:/Users/qp/Pictures/Screenshots/6. True or False - 1.png)
![](C:/Users/qp/Pictures/Screenshots/6. True or False - 2.png)
![](C:/Users/qp/Pictures/Screenshots/6. True or False - 3.png)
![](C:/Users/qp/Pictures/Screenshots/6. True or False - 4.png)
![](C:/Users/qp/Pictures/Screenshots/6. True or False - 5.png)
![](C:/Users/qp/Pictures/Screenshots/6. True or False - 6.png)


## Course  /  Unit 4: Discrete random variables  /  Unit summary


# 1. Unit 4 summary

In this unit, we introduced many different concepts,
definitions, and formulas, so it may be useful to put in one
place a summary of the key concepts and the key formulas
that we have developed.
We first defined random variables.
And then we discussed that random variables are described
in terms of a probability mass function that tells you the
probabilities of the different values that the random
variable can take.
And for the case of multiple random variables, we may use a
joint probability mass function.
We also defined conditional probability mass functions
that refer to the distribution of a random variable X in a
universe in which we are told that a certain event has
occurred or that a certain random variable takes on a
specific value.
A key concept that we introduced was the concept of
expectation.
We defined the notion of the expected
value of a random variable.
But if we're given some information, then we are
transported to a conditional universe, and we calculate the
so-called conditional expectation that takes into
account the information that we have available.
And this calculation makes use of the corresponding
conditional PMF of X, given an event or given the value of
another random variable.
The main facts about
expectations were the following.
We have the expected value rule for calculating the
expectation of a function of one or multiple random
variables without having to calculate the distribution of
this function of random variables.
Instead, we can do the calculations directly, using
the original PMF of the original random variables.
And once more we have conditional versions of the
expected value rule that take the same form, except that we
need to use conditional PMFs when we carry out the
calculations.
The second important fact about expectations is that
they're linear.
If we have a linear function, let's say, of two random
variables, then the expected value of this linear function
is the same linear function of the expectations.
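
As a small sanity check of these two facts, here is a minimal sketch in Python. The joint PMF and the coefficients `a`, `b`, `c` are made-up values used only for illustration; they do not come from the lecture. The function `expectation` applies the expected value rule directly to the joint PMF, and the last lines compare the expectation of a linear function with the same linear function of the expectations.

```python
from fractions import Fraction

# Hypothetical joint PMF of (X, Y); the numbers are illustrative only.
joint_pmf = {
    (0, 0): Fraction(1, 4),
    (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8),
    (1, 1): Fraction(3, 8),
}


def expectation(g, pmf):
    """Expected value rule: E[g(X, Y)] = sum over (x, y) of g(x, y) * p(x, y)."""
    return sum(g(x, y) * p for (x, y), p in pmf.items())


e_x = expectation(lambda x, y: x, joint_pmf)
e_y = expectation(lambda x, y: y, joint_pmf)

# Linearity: E[aX + bY + c] = a E[X] + b E[Y] + c
a, b, c = 2, 3, 5
lhs = expectation(lambda x, y: a * x + b * y + c, joint_pmf)
rhs = a * e_x + b * e_y + c

print(e_x, e_y, lhs, rhs)
```

Note that X and Y in this example are not independent (the joint PMF does not factor), and the two sides still agree: linearity does not need independence.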
Another concept that we introduced was the variance of
a random variable that measures the dispersion or the
spread of the distribution of a random variable.
And if we're talking about a conditional universe where
we're given some information, then we have the conditional
variance given that an event has occurred or given that a
random variable takes a specific value.
A useful formula that allows us to calculate in a somewhat
easier manner the variance of a random variable is this one.
And we had a few opportunities to use it.
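
The formula itself is not reproduced in these notes. Presumably it is the standard identity below, which follows from expanding the square in the definition of the variance and using linearity of expectations:

$$
\begin{aligned}
\operatorname{var}(X) &= \mathbf{E}\big[(X - \mathbf{E}[X])^2\big] \\
&= \mathbf{E}[X^2] - 2\,\mathbf{E}[X]\,\mathbf{E}[X] + (\mathbf{E}[X])^2 \\
&= \mathbf{E}[X^2] - (\mathbf{E}[X])^2
\end{aligned}
$$
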
Now, an important concept about random variables is the
notion of independence.
And independence basically means that the joint PMF
factors out as a product of marginal PMFs.
This is the mathematical definition.
The intuitive definition would be that information about one
of the random variables gives us no information about the
values of the other random variable.
Now, independence has some interesting, nice mathematical
consequences.
In particular, if X and Y are independent, the expected
value of the product is the product of the expectations.
And the variance of the sum is equal to
the sum of the variances.
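
These two consequences can be checked numerically. The sketch below assumes two made-up marginal PMFs (`p_x` and `p_y` are illustrative, not from the lecture), builds the joint PMF as their product, which is exactly the definition of independence, and then compares both sides of each identity using exact fractions.

```python
from fractions import Fraction
from itertools import product

# Hypothetical marginal PMFs, chosen only for illustration.
p_x = {1: Fraction(1, 2), 2: Fraction(1, 2)}
p_y = {0: Fraction(1, 3), 3: Fraction(2, 3)}

# Independence: the joint PMF is the product of the marginals.
joint = {(x, y): p_x[x] * p_y[y] for x, y in product(p_x, p_y)}


def expectation(g):
    """E[g(X, Y)] computed with the expected value rule on the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint.items())


def variance(g):
    """var(g(X, Y)) = E[(g(X, Y) - E[g(X, Y)])^2]."""
    mean = expectation(g)
    return expectation(lambda x, y: (g(x, y) - mean) ** 2)


e_x = expectation(lambda x, y: x)
e_y = expectation(lambda x, y: y)
e_xy = expectation(lambda x, y: x * y)

var_x = variance(lambda x, y: x)
var_y = variance(lambda x, y: y)
var_sum = variance(lambda x, y: x + y)

print(e_xy, e_x * e_y)          # E[XY] versus E[X] E[Y]
print(var_sum, var_x + var_y)   # var(X + Y) versus var(X) + var(Y)
```
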
Then we extended some of the basic skills that we had
introduced earlier in this course-- the multiplication
rule and the total probability theorem.
Here are two formulas of this kind that are exactly the same
as the analogous formulas that we had for probabilities,
except that now they're written in PMF notation.
So the multiplication rule tells us that the probability
of several things happening is the product of the probability
that one thing happens times the probability that the
second thing happens given that the first happened times
the probability that the third event happens given that the
first two events have happened.
And the total probability theorem allows us to calculate
the probability of an event happening by considering
different scenarios, different values of Y in this context,
looking at the probability of the event of interest under
each one of the different scenarios and forming a
weighted sum, where the weights are the probabilities
of the different scenarios.
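
The two formulas are not reproduced in these notes. In PMF notation, and assuming three random variables X, Y, Z for the multiplication rule, they presumably read:

$$
p_{X,Y,Z}(x,y,z) = p_X(x)\,p_{Y \mid X}(y \mid x)\,p_{Z \mid X,Y}(z \mid x,y)
$$

$$
p_X(x) = \sum_{y} p_Y(y)\,p_{X \mid Y}(x \mid y)
$$

In the second formula, the values of Y play the role of the scenarios, and the weights are the probabilities p_Y(y).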
An extension or variation of the total probability theorem
is the so-called total expectation theorem.
It is an analogous result.
But now, we deal with expectations.
We calculate the expected value of a random variable by
considering a number of scenarios, finding the
expected value of the random variable under each one of the
different scenarios, and then taking a weighted average of
these conditional expectations.
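
Here is a minimal sketch of the total expectation theorem, E[X] = sum over y of p_Y(y) E[X | Y = y]. The scenario labels and all probabilities are hypothetical values used only for illustration. The code computes E[X] in two ways: as a weighted average of conditional expectations, and directly from the marginal PMF of X obtained through the total probability theorem.

```python
from fractions import Fraction

# Hypothetical scenarios (values of Y) and their probabilities.
p_y = {"a": Fraction(1, 4), "b": Fraction(3, 4)}

# Hypothetical conditional PMFs of X given each value of Y.
p_x_given_y = {
    "a": {0: Fraction(1, 2), 10: Fraction(1, 2)},
    "b": {0: Fraction(1, 3), 6: Fraction(2, 3)},
}


def mean(pmf):
    """Expected value of a random variable with the given PMF."""
    return sum(x * p for x, p in pmf.items())


# Total expectation theorem: weighted average of conditional expectations.
cond_means = {y: mean(p_x_given_y[y]) for y in p_y}
e_x_total = sum(p_y[y] * cond_means[y] for y in p_y)

# Direct route: marginal PMF of X from the total probability theorem.
p_x = {}
for y, weight in p_y.items():
    for x, p in p_x_given_y[y].items():
        p_x[x] = p_x.get(x, 0) + weight * p
e_x_direct = mean(p_x)

print(cond_means, e_x_total, e_x_direct)
```
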
And finally, in the process of developing all those concepts,
we introduced a few special random variables and PMFs and
did some calculations with them: for example, we calculated
their means and variances, or derived certain
properties that they have.
And this is the list of the types of random variables that
we introduced.
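
The list is not reproduced in these notes. The Lecture 5 overview names the Bernoulli, uniform, binomial, and geometric random variables, so the sketch below assumes those four. It builds each PMF and compares the numerically computed mean and variance with the standard closed-form expressions. The parameter values are arbitrary, and the geometric PMF is truncated at a hypothetical cutoff `k_max`, so that comparison is approximate.

```python
from fractions import Fraction
from math import comb, isclose


def bernoulli_pmf(p):
    return {0: 1 - p, 1: p}


def uniform_pmf(a, b):
    """Discrete uniform on the integers a, a+1, ..., b."""
    n = b - a + 1
    return {k: Fraction(1, n) for k in range(a, b + 1)}


def binomial_pmf(n, p):
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}


def geometric_pmf_truncated(p, k_max):
    """Geometric PMF on 1, 2, ..., k_max; the tail beyond k_max is dropped."""
    return {k: (1 - p) ** (k - 1) * p for k in range(1, k_max + 1)}


def mean(pmf):
    return sum(x * q for x, q in pmf.items())


def variance(pmf):
    m = mean(pmf)
    return sum((x - m) ** 2 * q for x, q in pmf.items())


p = Fraction(1, 3)  # arbitrary illustrative parameter

bern = bernoulli_pmf(p)
print(mean(bern) == p, variance(bern) == p * (1 - p))

unif = uniform_pmf(1, 6)
print(mean(unif) == Fraction(7, 2), variance(unif) == Fraction(5 * 7, 12))

binom = binomial_pmf(5, p)
print(mean(binom) == 5 * p, variance(binom) == 5 * p * (1 - p))

geom = geometric_pmf_truncated(p, k_max=200)
print(isclose(float(mean(geom)), float(1 / p)),
      isclose(float(variance(geom)), float((1 - p) / p**2)))
```

Exact fractions are used so that the first three comparisons hold with equality rather than up to rounding.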
In the next unit, we're going to see counterparts of all of
these facts and properties but now, in the context of
continuous random variables.


## Course  /  Exam 1  /  Exam 1

![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 1.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 2.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 3.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 4.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 5.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 6.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 7.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 8.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 9.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 10.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 11.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 12.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 13.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 14.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 15.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 16.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 17.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 18.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 19.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 20.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 21.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 22.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 - exam 1 - 23.png)
![](C:/Users/qp/Pictures/20221005_212413.jpg)
![](C:/Users/qp/Pictures/20221005_212532.jpg)
![](C:/Users/qp/Pictures/20221005_212643.jpg)
![](C:/Users/qp/Pictures/20221005_212731.jpg)
![](C:/Users/qp/Pictures/20221005_212911.jpg)
![](C:/Users/qp/Pictures/20221005_213030.jpg)
![](C:/Users/qp/Pictures/20221005_213217.jpg)
![](C:/Users/qp/Pictures/20221005_213238.jpg)
![](C:/Users/qp/Pictures/20221005_213255.jpg)
![](C:/Users/qp/Pictures/20221005_213309.jpg)

[][************************************************************************************************************************]
![](C:/Users/qp/Pictures/Screenshots/Exam 1 after due question 5 - 1.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 after due question 5 - 2.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 after due question 5 - 3.png)
![](C:/Users/qp/Pictures/20221006_155304.jpg)
![](C:/Users/qp/Pictures/20221006_155327.jpg)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 after due question 6 - 1.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 after due question 6 - 2.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 after due question 6 - 3.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 after due question 6 - 4.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 after due question 6 - 5.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 after due question 6 - 6.png)
![](C:/Users/qp/Pictures/Screenshots/Exam 1 after due question 6 - 7.png)
![](C:/Users/qp/Pictures/20221006_185124.jpg)
[][************************************************************************************************************************]


## Course  /  Unit 5: Continuous random variables  /  Lec. 8: Probability density functions