
STAT 101, Module 6: Probability, Expected Values, Variance

(Book: chapter 4)

Reasons for wanting probability statements

Quantify uncertainty: “For the manager the probability of a future event presents a level of knowledge” (p. 79).
Quantification makes uncertainty “predictable” in that probabilities describe what happens in large numbers. For example, a company can factor an assumed rate of 5.5% faulty shipments from its supplier into its operations planning.
Many statements about “rates” are hidden probability statements:

“The take-rate of our services from a telemarketing campaign is 2.1%.” This means that in the past and presumably in the future about 21 out of 1000 targeted households sign up for the offered service.

We can say the estimated probability of taking the service after targeting by telemarketing is 0.021.

Quantification with probabilities can give insight. Some examples found by googling “probability of successful corporate take-over”:
“The probability of merger success is inversely related to analyst coverage at the time of announcement.”
“The probability of takeover success rose to 71.3% when a bidder initially held the shares of the target firm.”

Probability as an idealization

Probability is an idealization. It is to statistics what straight lines are to geometry, point masses to Newtonian mechanics, point charges to electrodynamics,…
The idealization comes about for relative frequencies when the number of cases goes to infinity. The limit is the probability. (Book: p.90)
Probabilities do not “exist”, just like straight lines don’t “exist”. Relative frequencies and taut ropes exist, but probabilities and straight lines do not “exist”. Both are constructed by a thought process called “idealization”.
Why do we need idealizations? Idealizations simplify our thinking and focus on aspects of reality that are essential in a given context.
In geometry we don’t want to talk about the thickness of a line. Thickness matters for taut ropes, but when we do geometry we “idealize away” thickness by thinking of a line as “infinitely thin”.
In statistics we need probability to have a target that is estimated by relative frequencies. We can think of the probability as the relative frequency if we had an infinite amount of data.

Just as geometry does away with the thickness of taut ropes, probability does away with the limitations of finite data samples.

Foundations of frequentist probability

Probabilities are numbers between 0 and 1 (inclusive) assigned to outcomes of random events.
A random event can have one of two outcomes, which we may characterize as “yes” or “no”.
Examples of random events:
head on a coin toss,
6 on the rolling of a die,
ticket 806400058584 is drawn in a blind drawing from a well-tumbled box of lottery tickets,
any trader committing insider trading in 2007,
any targeted household signing up for my firm’s services,
a randomly caught bear is tagged and found next season again,
GM stock gains by more than 5% in a given week,
the S&P loses more than 3% in a given day,…
The term “random” does not mean haphazard; it means “repeatable in principle, but with unpredictable outcome”.
The term “unpredictable” is relative to what we know; if we know more, less is unpredictable, and vice versa.

For a random trader on Wall Street, we would not know whether he/she would do insider trading; if the trader is my dad or mom, I might be certain that he/she wouldn’t. Hence for a “random trader” we might estimate the probability of insider trading to be 1%; for mom or dad there is no probability because there is no repeatability, and I know.

It should be possible to “independently repeat” the random event. Independence means that knowing the outcome of the first realization of the event should not help in guessing the outcome of a second realization of the event.

Independence of realizations of a random event is sometimes difficult to justify. For example, if it rains today, chances are probably elevated that it will rain tomorrow, too. Hence there is information in the outcome of the first realization for the outcome of the second realization.

A counter-example: the event that a whale jumps on my yacht.

Why is this not a random event? It only defines the ‘yes’ side of an event but not the ‘no’ side. To actually have an event in our sense, one has to be able to determine when it did not occur. The ‘no’ side of an event is needed so that we have a notion of the total number of trials/observations/measurements.

To make “a whale jumps on my yacht” a well-specified event, one would have to modify it to something like “the event that a whale jumps on my yacht on a given day”, or “the event that a whale jumps on a given yacht throughout its lifetime”, or “the event that a whale jumps on any yacht anywhere on a given day”, or “the event that a whale jumps on a given yacht on a given day”. These specifications define the repeatability as across days, or yachts, or combinations of days and yachts. Day-to-day repetitions may be justified as independent if one thinks that the next day’s jumping of a whale on a yacht is not linked to today’s behaviors of whales and yachts.

A similar thought applies to events such as 5% daily gains of GM stock and 3% daily losses of the S&P index. Is it so clear that tomorrow’s likelihood of these events is unrelated to today’s movements of GM stock or the S&P index?

The preceding thoughts show that there exist deep problems in applying probability.

There exist more free-wheeling notions of probability, called “subjective probability”. It formalizes strength of belief and can be founded on betting: how much money one is willing to bet on an outcome implies a belief in a certain probability of the outcome. Our notion of probability, however, is “frequentist”.

Definition: Probability is the limit of relative frequencies of “yes” outcomes when the number of observed outcomes goes to infinity.

Recall the definition of relative frequency:

rel.freq.(yes) = #yes / (#yes + #no)

Do we ever know probabilities?

No we don’t. But for thought experiments we often assume them.

Example: Flipping a coin; we assume P(head) = P(tail) = ½.

This is actually taken as the definition or model of a fair coin. Similarly, we’d take P(any particular card) = 1/52 as the definition of a well-shuffled deck. Finally, P(any of 1…6) = 1/6 is the definition of a fair die.

For a finite sequence of independent repetitions of a random event, we can estimate the probability by the relative frequency. Later we will learn how to estimate the precision of this estimate with so-called “confidence intervals”.
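The idea of estimating a probability by a relative frequency can be tried out in a short simulation. The sketch below is only an illustration: it assumes a simulated fair coin (Python's pseudo-random generator with an arbitrary seed), and the function name relative_frequency_of_heads is made up for this example.

```python
import random

def relative_frequency_of_heads(n_flips, rng):
    """Return #heads / (#heads + #tails) for n_flips simulated fair coin flips."""
    n_heads = sum(1 for _ in range(n_flips) if rng.random() < 0.5)
    return n_heads / n_flips

rng = random.Random(101)  # arbitrary seed, used only for reproducibility

# Relative frequencies for increasingly long sequences of flips.
frequencies = {n: relative_frequency_of_heads(n, rng)
               for n in (10, 100, 10_000, 100_000)}

for n, freq in frequencies.items():
    print(f"{n:>7} flips: rel.freq.(head) = {freq:.4f}")
```

For short sequences the relative frequency typically wanders quite a bit; for long sequences it settles near the assumed probability ½, which is what the limit definition describes.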

PS: In past long-run experiments with flipping coins one has never seen relative frequencies as close to ½ as they should be. We can say that fair coins do not exist. Yet the deviations from fairness are so small that they don’t matter.

Notations, definitions, and properties of probabilities

A sample space is the set of all possible outcomes of a random experiment. Notation: Ω

A random event A is a subset of the set Ω of possible outcomes.
The set of outcomes that are not in A is denoted A^C (the complement of A).

Note that this is also a random event.

Events A and B are said to be disjoint if A ∩ B = Ø.

That is, events A and B cannot occur simultaneously.

The probability of a random event A is written P(A).
Probabilities must satisfy the following axioms:
P(A) ≥ 0
P(Ω) = 1
P(A or B) = P(A) + P(B) if A and B are disjoint.

Illustrations, using the example S&P losses and gains:
P(daily loss ≥ 3%) ≥ 0
P(daily loss or gain or same) = 1
P(daily loss ≥ 3% or daily gain ≥ 3%)

= P(daily loss ≥ 3%) + P(daily gain ≥ 3%)

The axioms are direct consequences of the fact that probabilities are limits of relative frequencies. The same properties hold for observed relative frequencies:
#(daily market losses ≥ 3% in last 1000 trading days)/1000 ≥ 0
#(daily losses or gains or same in…)/1000 = 1
#(daily losses ≥ 3%... or daily gains ≥ 3% in...)/1000

= #(daily losses ≥ 3% in...)/1000 + #(daily gains ≥ 3% in...)/1000

Properties of probabilities (Book: Sec. 4.3)

Again, note that all of the following properties also hold for relative frequencies.

Complement rule:

P(A^C) = 1 – P(A)

Proof: A and A^C are disjoint and their union is Ω, hence P(A) + P(A^C) = P(Ω) = 1

Illustration: P(daily loss ≥ 3%)

= 1 – P(daily loss < 3% or a daily gain or same)

Monotonicity rule:

A contained in B => P(A) ≤ P(B)

Proof: B = A or (B w/o A), where A and (B w/o A) are disjoint. Hence P(B) = P(A) + P(B w/o A) ≥ P(A) because P(B w/o A) ≥ 0.

Illustration: P(daily loss ≥ 5%) ≤ P(daily loss ≥ 3%)

General addition rule: For any two random events A and B, not necessarily disjoint, we have

P(A or B) = P(A) + P(B) – P(A and B)

Important: The conjunction “or” is always used in the inclusive sense, meaning “A or B or both”, not “either A or B but not both”. In the inclusive sense, “or” expresses set-theoretic union.

Proof: The intersection C = (A and B) is double-counted by P(A)+P(B).

Draw a Venn-diagram to make this clear. For a formal proof, write:

A = (A w/o B) or (A and B)

B = (B w/o A) or (A and B)

(A or B) = (A w/o B) or (A and B) or (B w/o A)

and note that the events on the right hand sides are all disjoint, hence their probabilities add up.

Illustration: P(earthquake or hurricane)

= P(earthquake) + P(hurricane) – P(earthquake and hurricane)
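The complement, monotonicity, and general addition rules can be checked by brute-force counting on a small sample space. The sketch below assumes a fair die as the model (each outcome has probability 1/6); the events A, B, C and the helper prob are chosen only for illustration.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # sample space: one roll of a fair die (assumed model)

def prob(event):
    """P(event) when all outcomes in OMEGA are equally likely."""
    return Fraction(len(event & OMEGA), len(OMEGA))

A = frozenset({1, 2, 3})  # "roll is at most 3"
B = frozenset({2, 4, 6})  # "roll is even"
C = frozenset({2})        # "roll is 2", contained in B

# Complement rule: P(A^C) = 1 - P(A)
complement_ok = prob(OMEGA - A) == 1 - prob(A)

# Monotonicity rule: C contained in B implies P(C) <= P(B)
monotone_ok = C <= B and prob(C) <= prob(B)

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
addition_ok = prob(A | B) == prob(A) + prob(B) - prob(A & B)

print(complement_ok, monotone_ok, addition_ok)
```

Here A and B overlap in the outcome 2, so simply adding P(A) and P(B) would count that outcome twice; subtracting P(A and B) corrects for it.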

The summation axiom can be generalized to more than two disjoint events. Here is the generalization to three:

P(A1 or A2 or A3) = P(A1) + P(A2) + P(A3)

if A1, A2, A3 are pairwise disjoint.

Example:

A1 = (S&P drops between 0 and 3% today)

A2 = (S&P stays same today)

A3 = (S&P rises between 0 and 3% today)

Partitioning rule: If B1, B2, B3 partition Ω in the sense that

B1 or B2 or B3 = Ω and

(B1 and B2) = (B1 and B3) = (B2 and B3) = Ø,

and if A is an arbitrary event, then

P(A) = P(A and B1) + P(A and B2) + P(A and B3)

This formula generalizes of course to partitions of Ω consisting of more than three events.

Example: B1 = market drop, B2 = market unchanged, B3 = market rise

A = (GM rises by more than 1%)
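Since the market example comes without numbers, the partitioning rule can instead be checked on a fair die. This is a minimal sketch under that assumed model; the particular partition and the event A are made up for illustration.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # one roll of a fair die (assumed model)

def prob(event):
    """P(event) when all outcomes in OMEGA are equally likely."""
    return Fraction(len(event), len(OMEGA))

# B1, B2, B3 partition OMEGA: pairwise disjoint, and their union is OMEGA.
partition = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
A = frozenset({2, 4, 6})  # "roll is even"

is_partition = (
    frozenset().union(*partition) == OMEGA
    and sum(len(b) for b in partition) == len(OMEGA)  # no outcome counted twice
)

# P(A and B1), P(A and B2), P(A and B3)
pieces = [prob(A & b) for b in partition]

print(is_partition, sum(pieces) == prob(A))
```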

Conditional probability (Book: p. 99, Sec. 4.4)

Start with relative frequencies and pictures, a familiar mosaic plot:

[Figure: mosaic plot of Titanic survival by class]

The skinny spine on the right shows in blue the overall relative frequency of survival, across classes.

The spines of the mosaic, however, show the relative frequencies of survival conditional on being in a given class.

Because concepts and properties of relative frequencies carry over to their limits for infinitely many cases (Titanic with infinitely many passengers, imagine!), the concept of conditional relative frequency carries over to the concept of conditional probability.

P(A|B) := P(A and B) / P(B)

That is, restrict the event A (“survival”) to the subset B (“1st Class”) by forming (A and B), and renormalize the probability to

P(A and B)/P(B), such that the event B has conditional probability 1:

P(B|B) = 1.

Example: What is the conditional probability of an outcome of A = {1} conditional on the outcomes B = {1,2,3}, assuming a fair die?

Example: What is the conditional probability of an outcome of A = {Ace} conditional on the outcome B = {spade}, assuming a well-shuffled deck?

Example: What is the conditional probability of a daily market drop of over 10% given that the market drop is over 3%? (There are no numbers given here; this is only a small conceptual point about two events where one implies the other.)
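The die and card examples can be worked out by counting outcomes. The sketch below applies the definition P(A|B) = P(A and B) / P(B) under the assumption that all outcomes are equally likely (fair die, well-shuffled deck); the helper cond_prob and the encoding of cards as (rank, suit) pairs are choices made for this illustration.

```python
from fractions import Fraction
from itertools import product

def cond_prob(a, b, omega):
    """P(A|B) = P(A and B) / P(B), with all outcomes in omega equally likely."""
    p_b = Fraction(len(b & omega), len(omega))
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    p_a_and_b = Fraction(len(a & b & omega), len(omega))
    return p_a_and_b / p_b

# Fair die: A = {1}, B = {1, 2, 3}
die = frozenset(range(1, 7))
p_one_given_low = cond_prob(frozenset({1}), frozenset({1, 2, 3}), die)

# Well-shuffled deck: A = {Ace}, B = {spade}
ranks = ("2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A")
suits = ("spade", "heart", "diamond", "club")
deck = frozenset(product(ranks, suits))
aces = frozenset(card for card in deck if card[0] == "A")
spades = frozenset(card for card in deck if card[1] == "spade")
p_ace_given_spade = cond_prob(aces, spades, deck)

print(p_one_given_low, p_ace_given_spade)  # 1/3 1/13
```

In the die example A is contained in B, so (A and B) = A and the conditional probability reduces to P(A)/P(B). The same reduction is the conceptual point of the market-drop example, where a drop of over 10% implies a drop of over 3%.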

Ramifications of conditional probability (Book: Sec. 4.4, 4.5)

Sometimes the conditional probability P(A|B) is given, and so is P(B). We can then calculate P(A and B):

P(A and B) = P(A|B) · P(B)

Example: Assume if it rains today it rains tomorrow with probability 0.6. The probability of rain is 0.1 on any day. What is the probability that it rains on any two consecutive days?

In other applications, the conditional probability P(A|B) is given, and so is P(A and B). Then one can calculate P(B):

P(B) = P(A and B) / P(A|B)

Example: If you know a plausible example, let me know…

A famous application of conditional probability is Bayes’ rule. In its simplest case, it allows us to infer P(A|B) from P(B|A), given also P(A) and P(B). (Book: Sec. 4.5)

P(A|B) = P(B|A) · P(A) / P(B)

Example: The conditional probability of getting disease Y given gene X is 0.8. The probability of someone having gene X is 0.0001, and the probability of having disease Y is 0.04. What is the conditional probability of having gene X given one has disease Y?

Instead of applying the above formula, it might be easier to go from P(B|A) to P(A and B) to P(A|B) in steps, with A = (gene X) and B = (disease Y):

P(gene X) = 0.0001

P(disease Y | gene X) = 0.8

P(gene X and disease Y) = 0.8 · 0.0001 = 0.00008

P(gene X | disease Y)

= P(gene X and disease Y) / P(disease Y)

= 0.00008 / 0.04 = 8/4000 = 0.002

Are you surprised? It seemed that if the gene causes the disease with such high probability, having the disease should be a pretty good indicator for the gene. What’s the catch?

Exercise: Change the numbers in the example. What happens if the probability of having disease Y is 0.0001? What if it were 0.00001?
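For experimenting with the numbers, the two-step calculation can be wrapped in a small function. This is a sketch, not part of the book's material: the name posterior and its argument names are made up, and the consistency check is an added assumption that rejects inputs for which P(A and B) would exceed P(B), since no valid conditional probability exists in that case.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' rule in two steps: P(A and B) = P(B|A) * P(A), then P(A|B) = P(A and B) / P(B)."""
    p_a_and_b = p_b_given_a * p_a
    if p_a_and_b > p_b:
        # (A and B) is contained in B, so monotonicity requires P(A and B) <= P(B).
        raise ValueError("inconsistent inputs: P(A and B) cannot exceed P(B)")
    return p_a_and_b / p_b

# A = (gene X), B = (disease Y), numbers from the example above.
p_gene_given_disease = posterior(p_b_given_a=0.8, p_a=0.0001, p_b=0.04)
print(f"P(gene X | disease Y) = {p_gene_given_disease:.4f}")  # 0.0020
```

Calling posterior with a different value for p_b is one way to explore the exercise.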

In the above Titanic example, one might think that it would be possible to calculate P(survival) from the mosaic plot alone, that is, from the conditional probabilities P(survival|class). That is only true if one is also given the probabilities of the classes, P(class). Then it works as follows (strictly speaking for relative frequencies, not probabilities):

P(survival) = P(survival | 1st class) · P( 1st class)

+ P(survival | 2nd class) · P( 2nd class)

+ P(survival | 3rd class) · P( 3rd class)

+ P(survival | crew) · P( crew)

Proof: The four terms on the right side equal

= P(survival and 1st class) + P(survival and 2nd class)

+ P(survival and 3rd class)+ P(survival and crew)

These four events are disjoint and their union is simply “survival”. QED

Numerically, this works out as follows:

0.3230 = 0.6246·0.1477 + 0.4140·0.1295 + 0.2521·0.3208 + 0.2395·0.4021

The numbers can be gleaned from the table below the mosaic plot in JMP.
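The arithmetic can be re-checked in a few lines. The sketch below simply re-enters the rounded values quoted above; the dictionary names are made up for this illustration.

```python
# Conditional survival rates P(survival | group), as quoted in the text.
p_survival_given = {
    "1st class": 0.6246,
    "2nd class": 0.4140,
    "3rd class": 0.2521,
    "crew": 0.2395,
}

# Group shares P(group), as quoted in the text (rounded to four decimals).
p_group = {
    "1st class": 0.1477,
    "2nd class": 0.1295,
    "3rd class": 0.3208,
    "crew": 0.4021,
}

# P(survival) = sum over groups of P(survival | group) * P(group)
p_survival = sum(p_survival_given[g] * p_group[g] for g in p_group)
print(f"P(survival) = {p_survival:.4f}")  # 0.3230
```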

The above example can be restated in general terms as follows:

P(A) = P(A|B1) · P(B1) + P(A|B2) · P(B2)

+ P(A|B3) · P(B3)+ P(A|B4) · P(B4)

For two conditioning events, which you can illustrate with B = (1st class) and B^C = (not 1st class), and A = (survival), this simplifies as follows:

P(A) = P(A|B) · P(B) + P(A|B^C) · P(B^C)

These are called marginalization formulas because they turn conditional probabilities of A into the marginal (= plain) probability of A.

Here is an example from the Titanic with conditioning on sex:

SEX By SURVIVED

Each cell lists, from top to bottom: Count, Total %, Col %, Row %.

          no        yes       Total
female    126       344       470
          5.72      15.63     21.35
          8.46      48.38
          26.81     73.19
male      1364      367       1731
          61.97     16.67     78.65
          91.54     51.62
          78.80     21.20
Total     1490      711       2201
          67.70     32.30

It takes some effort to decode the table above, but the four labels listed with it (Count, Total %, Col %, Row %) give the necessary clues:

P(survival) = 0.3230

= P(survived|female) · P(female) + P(survived|male) · P(male)

= 0.7319 · 0.2135 + 0.2120 · 0.7865
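The same decomposition can be recomputed directly from the four counts in the SEX By SURVIVED table, which also shows where the Row % and Total % entries come from. This is a sketch using exact fractions; the helper names p_sex and p_survived_given are made up.

```python
from fractions import Fraction

# Counts from the SEX By SURVIVED table: (sex, survived) -> number of people.
counts = {
    ("female", "no"): 126,
    ("female", "yes"): 344,
    ("male", "no"): 1364,
    ("male", "yes"): 367,
}
total = sum(counts.values())

def p_sex(sex):
    """Marginal relative frequency of a sex (row total / grand total)."""
    return Fraction(counts[(sex, "no")] + counts[(sex, "yes")], total)

def p_survived_given(sex):
    """Conditional relative frequency of survival within a sex (the Row % for 'yes')."""
    return Fraction(counts[(sex, "yes")], counts[(sex, "no")] + counts[(sex, "yes")])

# Marginalization: P(survived) = sum over sexes of P(survived | sex) * P(sex)
p_survived = sum(p_survived_given(s) * p_sex(s) for s in ("female", "male"))

print(f"P(survived | female) = {float(p_survived_given('female')):.4f}")
print(f"P(female)            = {float(p_sex('female')):.4f}")
print(f"P(survived)          = {float(p_survived):.4f}")
```

With exact fractions the marginalized value equals the column total for "yes" divided by the grand total, so no rounding error enters.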

Here is a hairy but typical legal application that involves both Bayes’ rule and the probability summation of the preceding bullet. It shows how methods of detection with very small error probabilities can fail to be conclusive (from

Assume a company is testing its employees for drugs, and assume the test is 99% accurate; also assume half a percent of employees take drugs. What is the probability of taking drugs given a positive test result?

A first problem is to make sense of the term “99% accurate”. We take it to mean this:

P(pos | drug) = 0.99 (true positives)

P(neg | no drug) = 0.99 (true negatives)

We infer the following, which is useful later:

P(pos | no drug) = 0.01 (false positives)

P(neg | drug) = 0.01 (false negatives)

Then we know this:

P(drug) = 0.005 (marginal probability of drug use)

P(no drug) = 0.995 (marginal probability of no drug use)

The question is this:

P(drug | pos) = ? (probability of being correct given a positive)

Solution:

P(drug | pos) = P(drug and pos) / P(pos)

The numerator is easy:

P(drug and pos) = P(pos | drug) · P(drug) = 0.99 · 0.005 = 0.00495

The denominator is messy and requires collecting the pieces with the marginalization formula:

P(pos) = P(pos | drug) · P(drug) + P(pos | no drug) · P(no drug)

= 0.99 · 0.005 + 0.01 · 0.995

= 0.0149 (marginal probability of a positive)

Finally:

P(drug | pos) = P(drug and pos) / P(pos)

= 0.00495 / 0.0149

= 0.3322148

Now that’s a disappointment! The probability that an employee with a positive test actually takes drugs is only about 1 in 3! About two thirds of the positives are law-abiding citizens!

Intuitively, what is going on? The gist is this: In order to catch a small minority of half a percent (the drug users), the detection method must have an error rate of much less than half a percent. The rate of false positives in the example is one percent, hence out of the majority of 99.5% non-drug users the test finds almost 1% false positives, which is large compared to the 0.5% actual drug users. Sure, the 0.5% drug users get reliably detected, but the true positives get swamped by the false positives…
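The whole calculation fits in one small function, which makes it easy to see how the answer depends on the share of drug users. This is a sketch only: the function name and the argument names prevalence, sensitivity, and specificity are labels chosen here for P(drug), P(pos | drug), and P(neg | no drug); they do not appear in the example above.

```python
def p_drug_given_positive(prevalence, sensitivity, specificity):
    """P(drug | pos) via Bayes' rule with the marginalization formula for P(pos)."""
    true_pos = sensitivity * prevalence                # P(pos | drug) * P(drug)
    false_pos = (1 - specificity) * (1 - prevalence)   # P(pos | no drug) * P(no drug)
    return true_pos / (true_pos + false_pos)           # denominator is P(pos)

# Numbers from the example: 99% accurate test, half a percent drug users.
p_example = p_drug_given_positive(prevalence=0.005, sensitivity=0.99, specificity=0.99)
print(f"P(drug | pos) = {p_example:.4f}")  # 0.3322

# Same test, different (hypothetical) shares of drug users.
for prevalence in (0.005, 0.05, 0.5):
    p = p_drug_given_positive(prevalence, 0.99, 0.99)
    print(f"P(drug) = {prevalence:<5}  ->  P(drug | pos) = {p:.4f}")
```

The loop illustrates the gist stated above: the rarer the condition, the more the false positives dominate the positives, even when the test itself is unchanged.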

Based on the above example, what do you think about a DNA test whose error rate is one in a million when used in a murder case? What would happen if the DNA test were administered to all adults in this country?

Independent events (Book: p.113)

Definition: Two events A and B are said to be independent if

P(A and B) = P(A) · P(B)

Note: This is a definition! Some pairs of events will be independent, others won’t!

Examples: 1) A = (head now) and B = (head next) are usually considered independent events, unless someone cheats.

2) Whether a rise in the S&P on any day is independent of a rise on the previous day cannot be known without doing data analysis. Most likely one will find some deviation from independence, although it might be small.

Independence in terms of conditional probability:

If P(B) ≠ 0, then A and B are independent if and only if

P(A|B) = P(A)

This makes intuitive sense: Knowing that B occurred does not give us any information about the frequency of A occurring.

What would this mean for the relative frequencies of survival and 1st class? It means that P(survival | 1st class ) = P(survival).

In other words, it doesn’t matter whether one was in 1st class or not; the survival probability is the same. (This is only a thought experiment, not the truth!)

What does it mean for two flips of a coin?

P(head now | head before) = P(head)

That is, any coin flip is independent of the preceding coin flip. We usually assume that coin flips are “independent”, that is, no one is rigging the flip such that a head is more or less likely next time.

(The idea that observing many consecutive heads makes a tail more likely next time is a typical form of magical thinking and is applied by many people when playing the lottery: “I’m getting close to winning the lottery anytime now…”)

Source of confusion: “Independent” is not the same as “disjoint”. In fact, two disjoint events A and B with positive probability are never independent:

P(A and B) = P(Ø) = 0 ≠ P(A) · P(B) > 0

Intuitively, knowing that B occurred gives a lot of knowledge about the frequency of A occurring: zero!

Example: A = (spade), B = (heart).
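The difference between “disjoint” and “independent” can be checked by counting cards. The sketch below assumes a well-shuffled 52-card deck with equally likely cards; the helper independent and the encoding of cards as (rank, suit) pairs are choices made for this illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ("2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A")
SUITS = ("spade", "heart", "diamond", "club")
DECK = frozenset(product(RANKS, SUITS))  # 52 equally likely cards (assumed model)

def prob(event):
    """P(event) for one card drawn from a well-shuffled deck."""
    return Fraction(len(event), len(DECK))

def independent(a, b):
    """Check the product formula P(A and B) = P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)

spades = frozenset(card for card in DECK if card[1] == "spade")
hearts = frozenset(card for card in DECK if card[1] == "heart")
aces = frozenset(card for card in DECK if card[0] == "A")

print(independent(spades, hearts))  # False: disjoint, so P(A and B) = 0
print(independent(aces, spades))    # True: 1/52 = (4/52) * (13/52)
```

Spades and hearts are disjoint and therefore not independent, while aces and spades overlap in exactly one card and satisfy the product formula.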

The notion of independence shows the advantage of using idealized probabilities as opposed to empirical relative frequencies. In any finite series of pairs of coin flips, the product condition will rarely be satisfied. Yet in the limit, for infinitely many pairs of coin flips, the product formula should hold.

The notion of more than two independent events is a little messy: We say A1, A2, …, An are independent if the product formulas hold for all subsets of all sizes. For three events we ask that all of the following hold:

P(A1 and A2) = P(A1) · P(A2)

P(A1 and A3) = P(A1) · P(A3)

P(A2 and A3) = P(A2) · P(A3)

P(A1 and A2 and A3) = P(A1) · P(A2) · P(A3)
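A small counting example shows why the last formula is needed in addition to the three pairwise ones. The sketch below assumes two flips of a fair coin with four equally likely outcomes; the three events are a standard textbook construction, not taken from the book chapter cited here, and the helper names are made up.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod

OMEGA = frozenset(product("HT", repeat=2))  # two fair coin flips: HH, HT, TH, TT

def prob(event):
    """P(event) when the four outcomes are equally likely."""
    return Fraction(len(event), len(OMEGA))

def all_product_formulas_hold(events):
    """Check P(intersection) = product of probabilities for every subset of size >= 2."""
    for size in range(2, len(events) + 1):
        for subset in combinations(events, size):
            intersection = frozenset.intersection(*subset)
            if prob(intersection) != prod(prob(e) for e in subset):
                return False
    return True

A1 = frozenset(o for o in OMEGA if o[0] == "H")   # first flip is a head
A2 = frozenset(o for o in OMEGA if o[1] == "H")   # second flip is a head
A3 = frozenset(o for o in OMEGA if o[0] == o[1])  # both flips show the same side

pairwise_ok = all(all_product_formulas_hold(pair) for pair in combinations((A1, A2, A3), 2))
triple_ok = all_product_formulas_hold((A1, A2, A3))
print(pairwise_ok, triple_ok)  # True False
```

Each pair of events satisfies its product formula, but P(A1 and A2 and A3) = 1/4 while P(A1) · P(A2) · P(A3) = 1/8, so the three events are not independent.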