---
title: "MITx 6.431x -- Probability - The Science of Uncertainty and Data + Unit_7.Rmd"
author: "John HHU"
date: "2022-12-03"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
```{r cars}
summary(cars)
```
## Including Plots
You can also embed plots, for example:
```{r pressure, echo=FALSE}
plot(pressure)
```
Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot.
## Course / Unit 7: Bayesian inference / Unit overview
# 1. Motivation






An imaging radar sends radio waves to objects
and uses the reflections of these waves
to determine properties of the objects.
For example, is it big or small?
Is it a boat or a large rock?
This method relies on the fact that the reflectivity
of different materials, as well as the distribution
of the associated noise, are known, based
on calibration experiments.
Suppose now that, before turning the radar on,
we know from past experience that 90% of the objects
surrounding a boat in the ocean are other boats and 10%
are large rocks.
Once we turn the radar on and measure reflected waves,
we should update our beliefs about the identity
and properties of any object that gets detected.
This update of our beliefs is the key step
in Bayesian inference.
In this unit, we will go over the ingredients
of Bayesian inference, a systematic way of calculating
distributions or expectations, while properly incorporating
any newly acquired information.
# 2. Unit 7 overview


By this point in this class, we have developed all of the
basic tools that we need to study and analyze
probabilistic models.
So this is a good time to move to a practical subject, the
subject of inference.
The general idea is that we have a probabilistic model
involving several random variables.
We observe the values of some of them.
And we want to make inferences on some of the others.
Note that the unknown quantities are modeled as
random variables, which means that we can
use the Bayes rule.
And so we will stay within the realm of
so-called Bayesian inference.
In the four lectures that follow, we will illustrate the
use of the Bayes rule in various settings.
We will discuss different methods of coming up with
estimates of unobserved random variables.
And we will illustrate the methodology
through several examples.
If you have mastered the material in previous units,
you should not face any challenges here.
We will only apply tools that we already have, together with
some new definitions and terminology.
However, this may be a good time to review the different
versions of the Bayes rule and the examples covered in the
second half of lecture 10.
And by the end of this unit, you should have a working
knowledge of the key elements of Bayesian inference.
And you should be ready to apply your knowledge to actual
problems, as they arise in the real world.
## Course / Unit 7: Bayesian inference / Lec. 14: Introduction to Bayesian inference
# 1. Lecture 14 overview and slides
In this lecture, we start by discussing the numerous domains in which inference is useful. We then develop the conceptual framework of Bayesian inference, and review the various forms of the Bayes rule. We discuss possible ways of arriving at a point estimate based on the posterior distribution, and present the relevant performance metrics, namely, the probability of error for hypothesis testing problems and the mean squared error for estimation problems.



In this lecture, we start our systematic
study of Bayesian inference.
We will first talk a little bit about the big picture,
about inference in general, the huge range of possible
applications, and the different types of problems
that one may encounter.
For example, we have hypothesis testing problems in
which we are trying to choose between a finite and usually
small number of alternative hypotheses or estimation
problems where we want to estimate as close as we can an
unknown numerical quantity.
We then move into the
specifics of Bayesian inference.
The central idea is that we always use the Bayes rule to
find the posterior distribution of an unknown
random variable based on observations of a related
random variable.
Depending on whether the random variables are discrete
or continuous, we must of course use the appropriate
version of the Bayes rule.
If we want to summarize the posterior in a single number,
that is, to come up with a numerical estimate of the
unknown random variable, we then have some options.
One is to report the value at which the
posterior is largest.
Another is to report the mean of the conditional
distribution.
These go under the acronyms MAP and LMS.
We will see shortly what these acronyms stand for.
Given any particular method for coming up with a point
estimate, there are certain performance metrics that tell
us how good the estimate is.
For hypothesis testing problems, the appropriate
metric is the probability of error, the probability of
making a mistake.
For problems of estimating a numerical quantity, an
appropriate metric that we will be using a lot is the
expected value of the squared error.
As we will see, there will be no new mathematics in this
lecture, just a few definitions, a few new terms,
and an application of the Bayes rule.
Nevertheless, it is important to be able to apply the Bayes
rule systematically and with confidence.
For this reason, we will be going over several examples.
Printable transcript available here.
https://courses.edx.org/assets/courseware/v1/c36abfb5db20cdb8428a87f6bb0ec37e/asset-v1:MITx+6.431x+2T2022+type@asset+block/transcripts_L14-Overview.pdf
Lecture slides: [clean] [annotated]
https://courses.edx.org/assets/courseware/v1/7cc7ffe1100786c2660ac3371b05252b/asset-v1:MITx+6.431x+2T2022+type@asset+block/lectureslides_L14-clean.pdf
https://courses.edx.org/assets/courseware/v1/d1146be0ccdf4f519873c048343938e9/asset-v1:MITx+6.431x+2T2022+type@asset+block/lectureslides_L14-annotated.pdf
More information is given in Sections 8.1 and 8.2 of the text.
https://courses.edx.org/courses/course-v1:MITx+6.431x+2T2022/pdfbook/0/chapter/1/61
You are also encouraged to review the different variants of the Bayes rule, in the last part of Lecture 10 and in Section 3.6 of the text.
https://courses.edx.org/courses/course-v1:MITx+6.431x+2T2022/pdfbook/0/chapter/1/27
Source attributions:
S&P 500 chart from http://finance.yahoo.com/ (fair use)
http://finance.yahoo.com/
Genomics graphic from unknown source (fair use)
Systems biology graphic from http://commons.wikimedia.org/wiki/File:Signal_transduction_v1.png (CC license)
http://commons.wikimedia.org/wiki/File:Signal_transduction_v1.png
Electoral vote distribution graphic copyright Stanford University, 2012 (fair use)
# 2. Overview of some application domains








Before we get going with our discussion
of inference methods, it is worth
looking at the big picture for some perspective.
So far, we have concentrated on ways
to analyze probability models.
This part of the picture.
If our model has been selected in a careful way,
then it should also be relevant to the real world
and help us make predictions or decisions.
But how do we know that this is the case?
This is why we need to look at data that are generated
by the real world, and then use these to come up with a model.
This is what the field of inference and statistics
is all about.
This field has undergone a radical transformation
in recent years.
I will exaggerate a little, but in the past,
a statistician might be called to look
at problems such as this one.
You're given data on a few patients,
and you need to figure out whether a certain treatment is
effective or not.
But today, a statistician lives in a dreamland.
There are tons of data that are generated everywhere.
These allow us to build quite detailed large models involving
thousands of parameters.
And we do have the computational power to do all that.
In this landscape, the opportunities
for a statistician are endless.
So let me give you a small representative sample.
In a somewhat traditional setting,
one designs a data collection method,
and then uses these data to make a simple prediction.
This is the case, for example, in polling,
where the purpose is to predict the outcome of an election.
Another field is marketing and advertising,
where the situation is somewhat similar,
except that now we want to make predictions
not for a population as a whole, but
for each individual consumer.
And a particular application has to do
with so-called recommendation systems.
You collect ratings that people give to movies, as
in a famous competition that was announced by Netflix.
So you have data for every movie and the people
who have watched them.
You make a note of what rating that person
gave to a particular movie.
And now after you collect huge amounts of data of this kind,
you try to use this information to guess
whether, for example, this person is
interested in this particular movie or not.
This is a quite difficult problem.
A quite complicated one.
And it gave the community an opportunity
to develop fancier and fancier combinations of methods
in order to come up with good predictions
of unknown entries in this table.
Another field is, of course, finance.
The markets are truly uncertain.
And there are quite complete historical data.
Lots of them.
How do we use these data to make predictions?
Coming now to the natural sciences,
a revolution has been taking place in the life sciences.
There are tons of genomic data to be processed to find out
what combination of genes causes what disease.
Or we may want to find out the details
of the chemical reactions inside a living cell.
And there is an upcoming new frontier, neuroscience,
where there will be vast amounts of data that will be generated.
These will consist of brain measurements.
Of measurements of what each neuron is doing.
And hopefully, these will lead us one day
to finding out what the brain really does and how it works.
In the sciences, the list is endless.
It goes on and on.
In modeling climate and the environment,
scientists are using huge models these days,
which they try to calibrate using lots of available data.
And in physics as well, scientists
use fancy inference methods to try to find
needles in a haystack,
like rare particles or remote planets.
Finally, engineering is a fight against noise.
Engineers try to make devices that
will work in uncertain environments.
The field of signal processing is a prime example
where the generic question is to recover
the content of a signal.
For example, the content of a radio transmission
when a signal is received after it gets corrupted by noise.
I could go on and on for hours generating lists of this kind,
but we have to stop somewhere.
The bottom line is that the opportunities and the needs
are vast.
For this reason, we will look into the core methodologies
that come into play.
Fortunately for us, the fundamental concepts
and approaches turn out to be the same independent
of the particular application.
# 3. Types of inference problems





Before we dive into the heart of the subject,
I want to make a few comments on the different problem
types that show up in the field of inference.
You can think of a general distinction
between model building versus making inferences
about unobserved variables.
We said a little earlier that one
of the main uses of the field of inference
is to construct models of certain situations.
But in many cases, we already have a model.
On the other hand, there may be variables that are unknown,
that are unobserved-- variables that are part of the model,
but whose values are not known.
In such cases, we still want to use
data to make some predictions or decisions
about those unobserved variables.
So model building might or might not be part of the problem
that we're dealing with.
To illustrate the difference between these two versions
of the problem, let us think of a concrete setting.
You have a transmitter that is sending a signal;
call it S. And that signal goes through some medium.
It could be just the atmosphere.
And what that medium does is that it attenuates
the signal by a certain factor, a.
And then as the signal travels, it also gets hit by some noise,
call it W, and what the receiver sees is an observation,
X. So the situation is described by the simple equation X = aS + W.
This situation often brings up the following inference
problem.
We want to find out what the medium is.
How do we do this?
We send a pilot signal, S, that is
a signal that we know what it is.
We observe X, and then using this equation,
and, knowing that W is random noise coming
from some distribution, we try to make
an inference about the variable a.
So this is an instance of model building.
We're trying to make a model of the medium that's involved.
But we can also think of a different problem.
Suppose that we know what the medium is.
Perhaps we already went through this particular phase here.
But we're sitting at the receiver,
and we do not know what has been sent.
And we want to find out what S is.
So we are looking again at this equation.
This time we know a, and we're trying
to make inferences about S.
You notice that these two versions of the problem
are essentially of the same mathematical structure.
We have a linear equation.
In one case, we know S. We want to find out a.
In the other case, we know a.
We want to find out what S is.
So even though the interpretation
of these two problems is quite different,
the mathematical structure is exactly the same.
This is fortunate.
It means that one and the same methodology
would be applicable to both types of problems.
There is another distinction between problem types
which turns out to be a little more substantial.
There are problems that we call hypothesis testing problems.
In those problems the unknown takes one out
of a few possible values.
That is, we may have a few different alternative
models of the world.
And we're trying to figure out which one of those models
is the correct one.
We're going to decide in favor of one of the candidate models,
and what we want to achieve is that we
make a correct decision.
Or if not, we want to have a small probability
of making an incorrect decision.
An example of this kind is the radar detection problem
that we had discussed in the very beginning of this course,
in which we were getting a signal.
We were getting a radar reading.
And the question was to make an inference
whether the radar is seeing an airplane
or whether an airplane is not present.
So in hypothesis testing problems,
we're essentially making a choice
out of a small number of discrete possible choices.
Instead, in estimation problems, the unknown quantities
are more of a numerical type.
They could even take continuous values.
And what we want to do is to come up
with an estimate of an unknown quantity that
is close to the true but unknown value of the quantity
that we're trying to estimate.
So here, our performance objective
is in terms of some kind of distance function.
We want to be close to the truth.
And typically, we have a continuum of possible choices
that is, our estimates can be general real numbers.
Generally speaking, these two types of problems, hypothesis
testing and estimation, have some significant differences
in the way that they are treated,
as we will be seeing next.
# 4. Exercise: Hypothesis testing versus estimation



# 5. The Bayesian inference framework










*Note to self: think about why it is called LMS.*
*Note to self: maximum a posteriori probability (MAP).*

We can finally go ahead and introduce the basic elements
of the Bayesian inference framework.
There is an unknown quantity, which
we treat as a random variable, and this is what's special
and why we call this the Bayesian inference framework.
This is in contrast to other frameworks
in which the unknown quantity theta is just
treated as an unknown constant.
But here, we treat it as a random variable,
and as such, it has a distribution.
This is the prior distribution.
This is what we believe about Theta
before we obtain any data.
And then, we obtain some data, in the form of an observation.
That observation is a random variable,
but when the process gets realized,
we observe an actual numerical value
of this random variable.
The observation process is modeled,
again in terms of a probabilistic model.
We specify the distribution of X; more precisely,
we specify the conditional distribution of X.
We say how X will behave if Theta happens
to take on a specific value.
These two pieces, the prior and the model of the observations,
are the two components of the model
that we will be working with.
Once we have obtained a specific value for the observations,
then we can use the Bayes rule to calculate
the conditional distribution of Theta,
either a conditional PMF if Theta is discrete
or a conditional PDF if Theta is continuous.
And this will be a complete solution, in some sense,
of the Bayesian inference problem.
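For reference, in the two pure cases the Bayes rule takes the following forms (the mixed discrete/continuous cases are obtained by swapping the corresponding PMF or PDF):
$$p_{\Theta\mid X}(\theta\mid x)=\frac{p_{\Theta}(\theta)\,p_{X\mid\Theta}(x\mid\theta)}{\sum_{\theta'}p_{\Theta}(\theta')\,p_{X\mid\Theta}(x\mid\theta')},
\qquad
f_{\Theta\mid X}(\theta\mid x)=\frac{f_{\Theta}(\theta)\,f_{X\mid\Theta}(x\mid\theta)}{\int f_{\Theta}(\theta')\,f_{X\mid\Theta}(x\mid\theta')\,d\theta'}.$$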
There's one philosophical issue about this framework, which
is where does this prior distribution come from?
How do we choose it?
Sometimes we can choose it using a symmetry argument.
If there is a number of possible choices for Theta
and we have no reason
to believe that one is more likely than another, then
a symmetry consideration gives us a uniform prior.
We definitely take into account any information
we have about the range of the parameter Theta,
so we use that range and we assign 0 prior probability
for values of Theta outside the range.
Sometimes, we have some knowledge about Theta
from previous studies of a certain problem, that tell us
a little bit about what Theta might be,
and then when we obtain new observations,
we refine those results that were obtained
from previous studies by applying the Bayes rule.
And in some cases, finally, the choice
could be arbitrary or subjective, just reflecting our beliefs
about Theta, some plausible judgment
about the relative likelihoods of different choices of Theta.
Now, as we just discussed, the complete solution
or the complete answer to a Bayesian inference problem
is just the specification of the posterior distribution
of Theta given the particular observation that we
have obtained.
Pictorially, if Theta is discrete,
a complete answer might be in the form of such a diagram that
tells us that certain values of Theta
are possible with certain probabilities.
Or if Theta is continuous, a complete solution
might be in the form of a conditional PDF that again
tells us the conditional distribution of Theta.
To appreciate the idea here, consider the problem
of guessing the number of electoral votes
that a candidate gets in the presidential election.
The electoral votes are certain votes
that the candidate gets from each one
of the states in the United States.
And there is a certain number that the candidate
needs to get in order to be elected president.
One possible prediction could be a statement
that I predict that candidate A will win,
but actually a more complete presentation
of the results of a poll could be
a diagram of this kind, which is essentially a PMF.
Here, a particular pollster collected all the data
and gave the posterior probability distribution
for the different possible numbers of electoral votes.
And this diagram is a lot more informative
than the simple statement that we expect a certain candidate
to get more than the required electoral votes.
So what is next?
As we just discussed, the complete solution
is in terms of a posterior distribution,
but sometimes, you may want to summarize this posterior
distribution in a single number or a single estimate,
and this could be a further stage
of processing of the results.
So let us talk about this.
Once you have in your hands the posterior distribution
of Theta, either in a discrete or in a continuous setting,
and if you're asked to provide a single guess about what
Theta is, how might you proceed?
In the discrete case, you could argue as follows.
These values of Theta all have some chance of occurring.
This value of Theta is the one which is the most likely,
so I'm going to report this value
as my best guess of what Theta is.
And using a similar philosophy, you
could look at the continuous case
and find the value of Theta at which the PDF is largest
and report that particular value.
This particular way of estimating Theta
is called the maximum a posteriori probability rule.
We already have in our hands the specific value of X,
and therefore, we have determined
the conditional distribution for Theta.
What we then do is to find the value of theta
that maximizes over all possible thetas the conditional PMF
of this random variable, capital Theta.
And similarly in the continuous case,
the value of theta that maximizes the conditional PDF
of the random variable Theta.
This is one way of coming up with an estimate.
One can think of other ways.
For example, I might want to report instead, the mean
of the conditional distribution, which in this diagram
might be somewhere here, and in this picture,
it might be somewhere here.
This way of estimating theta is the conditional expectation
estimator.
It just reports the value of the conditional expectation,
the mean of this conditional distribution.
It is called the least mean squares estimator,
because it has a certain useful and important property.
It is the estimator that gives you
the smallest mean squared error.
We will discuss this particular issue
in much more depth a little later.
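In symbols, the two point estimates just described are
$$\hat{\theta}_{\text{MAP}}=\arg\max_{\theta}\,p_{\Theta\mid X}(\theta\mid x)\ \text{(or}\ f_{\Theta\mid X}(\theta\mid x)\text{)},\qquad
\hat{\theta}_{\text{LMS}}=E[\Theta\mid X=x].$$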
Now, let me make two comments about terminology.
What we have produced here is an estimate.
I gave you the conditional PDF or conditional PMF,
and you tell me a number.
This number, the estimate, is obtained
by starting with the data, doing some processing to the data,
and eventually, coming up with a numerical value.
Now, writing the estimate as $\hat{\theta} = g(x)$,
g is the way that we process the data.
It's a certain rule.
Now, if we know the value of the data,
we know what the estimate is going to be.
But if I do not tell you the value of the data
and you look at the situation more abstractly,
then the only thing you can tell me
is that I will be seeing a random variable,
capital X, I will do some processing to it,
and then I will obtain a certain quantity.
Because capital X is random, the quantity that I will obtain
will also be random.
It's a random variable.
This random variable, capital Theta hat,
we call it an estimator.
Sometimes, we might also use the term estimator
to refer to the function g, which
is the way that we process the data.
In any case, it is important to keep this distinction in mind.
The estimator is the rule that we use to process the data,
and it is equivalent to a certain random variable.
An estimate is the specific numerical value
that we get when the data take a specific numerical value.
So if little x is the numerical value of capital X,
in that case, little theta hat is the numerical value
of the estimator capital Theta hat.
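In compact notation: the estimator is the random variable $\hat{\Theta} = g(X)$, while the estimate is the number $\hat{\theta} = g(x)$.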
So at this point, we have a complete conceptual framework.
We know, abstractly speaking, what
it takes to calculate conditional distributions,
and we have two specific estimators at hand.
All that's left for us to do now is
to consider various examples in which we can discuss what
it takes to go through these various steps.
# 6. Exercise Estimates and estimators





# 7. Discrete parameter, discrete observation






*Note to self: why is Theta hat capitalized? Think of it as grouping the values of little theta hat.*
*Note to self: total probability rule, based on conditional probability.*


Let us now discuss in some more detail
what it takes to carry out Bayesian inference,
when both random variables are discrete.
The unknown parameter, Theta, is a random variable
that takes values in a discrete set.
And we can think of these values as alternative hypotheses.
In this case, we know how to do inference.
We have in our hands the Bayes rule
and we have seen plenty of examples.
So instead of going through one more example in detail,
let us assume that we have a model, that we have observed
the value of X, and that we have already determined
the conditional PMF of the random variable Theta.
As a concrete example, suppose that Theta
can take values 1, 2, or 3.
We have obtained our observation,
and the conditional PMF takes this form.
We could stop at this point or we
could continue by asking for a specific estimate of Theta--
our best guess as to what Theta is.
One way of coming up with an estimate
is to use the **maximum a posteriori probability rule**, which looks for the value of theta that
has the largest posterior, or conditional, probability.
In this example, it is this value,
so our estimate is going to be equal to 2.
An alternative way of coming up with an estimate
could be the LMS rule, which calculates
an estimate equal to the conditional expectation
of the unknown parameter, given the observation that we
have made.
This is just the mean of this conditional distribution.
In this example, it would fall somewhere around here,
and the numerical value, as you can check, is equal to 2.2.
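The numbers quoted in this example pin down the posterior completely: the values 1, 2, and 3 must carry conditional probabilities 0.1, 0.6, and 0.3 (the 0.6 and 0.3 are stated below, and a conditional mean of 2.2 forces the remaining 0.1). As a minimal R sketch of both estimates:
```{r map-lms-discrete}
theta     <- c(1, 2, 3)        # possible values of Theta
posterior <- c(0.1, 0.6, 0.3)  # conditional PMF of Theta given the observation

theta[which.max(posterior)]    # MAP estimate: 2
sum(theta * posterior)         # LMS estimate (conditional expectation): 2.2
```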
Next, we may be interested in how good a certain estimate is.
And for the case where we interpret the values of Theta
as hypotheses, a relevant criterion
is the probability of error.
In this case, because we already have
some data available in our hands and we're
called to make an estimate, what we care about
is the conditional probability, given the information
that we have, that we're making an error.
Making an error means the following.
We have the observation, the value of the estimate
has been determined, it is now a number,
and that's why we write it with a lowercase theta hat.
But the parameter is still unknown.
We don't know what it is.
It is described by this distribution.
And there's a probability that it's
going to be different from our estimate.
What is this probability?
It depends on how we construct the estimates.
If in this example, we use the MAP rule
and we make an estimate of 2, there
is probability 0.6 that the true value of Theta
is also equal to 2, and we are fine.
But there's a remaining probability of 0.4
that the true value of Theta is different than our estimate.
So there's probability 0.4 of having made a mistake.
If, instead of an estimate equal to 2,
we had chosen an estimate equal to 3,
then the true parameter would be equal to our estimate
with probability 0.3, but we would have made an error
with probability 0.7, which would
be a bigger probability of error.
More generally, the probability of error
of a particular estimate is the sum
of the probabilities of the other values of Theta.
And if we want to keep the probability of error small,
we want to keep the sum of the probabilities
of the other values small, which means
we want to pick an estimate for which its own probability is
large.
And so by that argument, we see that the way
to achieve the smallest possible probability of error
is to employ the MAP rule.
This is a very important property of the MAP rule.
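Continuing the R sketch above, the conditional probability of error of each candidate estimate is one minus its own posterior probability:
```{r error-prob-discrete}
# P(error | X = x) when reporting each value of theta as the estimate
1 - posterior   # 0.9, 0.4, 0.7: smallest (0.4) for the MAP choice, 2
```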
Now, this is the conditional probability
of error, given that we already have data in our hands.
But more generally, we may want to compare estimators or talk
about their performance in terms of their overall probability
of error.
We're designing a decision-making system
that's going to process data and make decisions.
In order to say how good our system is,
we want to say that overall, whenever you use the system,
there's going to be some random parameter,
there's going to be some value of the estimate.
And we want to know what's the probability that these two will
be different.
We can calculate this overall probability of error
by using the total probability theorem
and the conditional probabilities of error, as follows.
We condition on the value of X. For any possible value of X,
we have a conditional probability of error.
And then we take a weighted average
of these conditional probabilities of error.
There's also an alternative way of using the total probability
theorem, which would be to first condition on Theta
and calculate the conditional probability of error
for a given choice of this unknown parameter.
And both of these formulas can be used.
Which one of the two is more convenient
really depends on the specifics of the problem.
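In formulas, writing $\hat{\Theta}$ for the estimator, the two ways of applying the total probability theorem are
$$P(\hat{\Theta}\ne\Theta)=\sum_{x}p_{X}(x)\,P(\hat{\Theta}\ne\Theta\mid X=x)=\sum_{\theta}p_{\Theta}(\theta)\,P(\hat{\Theta}\ne\Theta\mid\Theta=\theta).$$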
Finally, I would like to make an important observation.
We argued that for any particular choice
of an observation, the MAP rule achieves the smallest
possible probability of error.
So under the MAP rule, this term is as small
as possible for any given value of the random variable,
capital X.
Since each term of this sum is as small as possible
under the MAP rule, it means that the overall sum will also
be as small as possible.
And this means that the overall probability of error
is also smallest under the MAP rule.
In this sense, the MAP rule is the optimum way
of coming up with estimates in the hypothesis-testing context,
where we want to minimize the probability of error.
# 8. Exercise: Discrete unknowns





# 9. Discrete parameter, continuous observation




*Note to self: imagine how those two approaches help us do the calculation.*

In the next variation that we consider,
the random variable Theta is still discrete.
So it might, for example, represent
a number of alternative hypotheses.
But now our observation is continuous.
Of course, we do have a variation of the Bayes rule
that's applicable to this situation.
The only difference from the previous version of the Bayes
rule is that now the PMF of X, the unconditional
and the conditional one, is replaced by a PDF.
Otherwise, everything remains the same.
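Concretely, this version of the Bayes rule reads
$$p_{\Theta\mid X}(\theta\mid x)=\frac{p_{\Theta}(\theta)\,f_{X\mid\Theta}(x\mid\theta)}{\sum_{\theta'}p_{\Theta}(\theta')\,f_{X\mid\Theta}(x\mid\theta')}.$$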
A standard example is the following.
Here we're sending a signal that takes one of, let's say,
three alternative values.
And what we observe is the signal
that was sent plus some noise.
And the typical assumption here might
be that the noise has zero mean and a certain variance,
and is independent from the signal that was sent.
This is an example that we more or less studied some time ago.
Actually, at that time, we looked at an example
where Theta could only take one out of two values,
but the calculations and the methodology
remains essentially the same as for the case of three values.
So in principle, we do know at this point
how to apply the Bayes rule in this situation
to come up with a conditional PMF of theta.
And the key to that calculation was that the term that we need,
the conditional PDF of X, can be obtained from this equation
as follows.
If I tell you the value of Theta,
then X is essentially the same as W plus a certain constant.
Adding a constant just shifts the PDF of W
by an amount equal to that constant.
And, therefore, the conditional PDF of X
is the shifted PDF of the random variable W. Using
this particular fact, we can then apply the Bayes rule,
carry out the calculations, and suppose that in the end
we came up with these results.
That is, we obtain a specific observation
x, and based on that observation, we
calculate the conditional probabilities
of the different choices of Theta.
At this point, we may use the MAP rule
and come up with an estimate, which
is the most likely value of Theta.
And then we can continue exactly as in the case
of discrete measurements, of discrete observations,
and talk about conditional probabilities of error
and so on.
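As an R sketch of this computation (the prior, the three signal values, the noise standard deviation, and the observed x below are made-up numbers for illustration, not values from the lecture):
```{r discrete-continuous}
theta <- c(1, 2, 3)      # hypothetical signal values
prior <- rep(1/3, 3)     # hypothetical uniform prior on Theta
sigma <- 1               # hypothetical noise standard deviation
x_obs <- 2.4             # hypothetical observed value of X

# f_{X|Theta}(x | theta) is the PDF of W shifted by theta, here N(theta, sigma^2)
likelihood <- dnorm(x_obs, mean = theta, sd = sigma)

posterior <- prior * likelihood / sum(prior * likelihood)  # Bayes rule
posterior                     # conditional PMF of Theta given X = x_obs
theta[which.max(posterior)]   # MAP estimate
```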
Now, the fact that X is continuous
really makes no difference, once we arrive at this picture.
With the MAP rule we still choose the most likely value
of theta, and this is our estimate.
And we can calculate the conditional probability
of error, which with the MAP rule
would be 0.4. Exactly the same argument
as for the case of discrete observations
applies and shows that this conditional probability
of error is smallest under the MAP rule.
And then we can continue similarly
and talk about the overall probability of error, which
can be calculated using the total probability
theorem in two ways.
One way is to take the conditional probability
of error for any given value of X
and then average those conditional probabilities
of errors over all the possible choices of X.
Because X is now continuous, here
we're going to have an integral.
Alternatively, you can condition on the possible values
of Theta, calculate conditional probabilities of error
for any particular choice of theta,
and then take a weighted average of them.
In practice, this calculation sometimes
turns out to be the simpler one.
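In formulas, the two decompositions are
$$P(\hat{\Theta}\ne\Theta)=\int f_{X}(x)\,P(\hat{\Theta}\ne\Theta\mid X=x)\,dx=\sum_{\theta}p_{\Theta}(\theta)\,P(\hat{\Theta}\ne\Theta\mid\Theta=\theta).$$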
Finally, we can replicate the argument
that we had in the discrete case.
Since the MAP rule makes this term here as small as possible,
it is less than or equal to the probability of error
that you would get under any other estimate or estimator,
then it follows that the integral will also
be as small as possible.
And therefore, the conclusion is that the overall probability
of error is, again, the smallest possible
when we use the MAP rule.
And so the MAP rule remains the optimal way
of choosing between alternative hypotheses,
whether X is discrete or continuous.
# 10. Exercise: Discrete unknown and continuous observation




# 11. Continuous parameter, continuous observation




*Note to self: the main candidate for now is the MAP rule.*


In the next variation we consider, all random variables
are continuous.
For this case, we do have a Bayes rule, once more.
And we have worked out quite a few examples.
So there's no point, again, in going
through a detailed example.
Let us just discuss some of the issues.
One question is when do these models arise?
One particular class of models that is very useful and very
commonly used are so-called linear normal models.
In these models, we, basically, combine
various random variables in a linear function.
And all the random variables of interest are taken to be normal.
For instance, we might have a signal, a noisy signal,
call it Theta, which is now a continuous valued signal.
We receive that signal, but corrupted
by some noise, which is independent from what was sent.
And we wish to recover, on the basis of the observation X,
we wish to recover the value of Theta.
And then there are versions of this problem that
involve Theta vectors instead of single values,
so that Theta consists of multiple components,
and where we obtain many measurements X. We will
actually see, in the next lecture sequence,
a quite detailed discussion of models of this type.
And this will be one of our main examples
within our study of inference.
There will be another example that we will see a few times,
and this involves estimating the parameter
of a uniform distribution.
So X is a random variable that's uniform over a certain range.
But the range itself is random and unknown.
And on the basis of observations X,
we would like to estimate what
the true value of Theta is.
This is an example that you will see
in our collection of solved problems for this class.
So what are the questions in this setting? We wish to come up with ways of estimating Theta. We form an estimator, and the main candidates for estimators at this point are, once more, **the maximum a posteriori probability** estimator, which looks at this conditional density and picks a value of theta that makes this conditional density as large as possible, and the alternative one, the **least mean squares** estimator, which just computes the expected value of Theta given X.
For any given estimator, we then want to characterize its performance. In this case, a natural notion of performance is the distance between our estimate, or estimator, from the true value of Theta. And commonly we use the squared distance and then take the average of that squared distance.
So in a conditional universe where we have already observed some data, we might be interested in this particular expectation, $E[(\Theta-\hat{\theta})^2 \mid X=x]$, which is the mean squared error of this particular estimator, given that we have obtained some particular data.
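As an illustrative R sketch of the linear normal case (all numbers below are made up: prior Theta ~ N(0,1), noise W ~ N(0,1) independent of Theta, and X = Theta + W), computing the posterior on a fine grid and reading off both estimates numerically:
```{r continuous-continuous}
step       <- 0.001
theta_grid <- seq(-4, 4, by = step)  # discretization of the continuous parameter
x_obs      <- 1.5                    # hypothetical observed value of X

prior      <- dnorm(theta_grid, mean = 0, sd = 1)      # Theta ~ N(0, 1)
likelihood <- dnorm(x_obs, mean = theta_grid, sd = 1)  # X | Theta = theta ~ N(theta, 1)

unnorm    <- prior * likelihood
posterior <- unnorm / (sum(unnorm) * step)  # normalized posterior density on the grid

theta_grid[which.max(posterior)]    # MAP estimate
sum(theta_grid * posterior) * step  # LMS estimate, E[Theta | X = x_obs]
```
In this normal-normal case the posterior is itself normal, so the MAP and LMS estimates coincide (both come out to $x/2 = 0.75$ here).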