STAT 2010, Business Stat 2006 Jaimie Kwon
STAT 2010, Elements of Statistics for
Business and Economics
Lecture Notes
Prof. Jaimie Kwon
Statistics Dept
Cal State East Bay
Disclaimer
These lecture notes are for the internal use of Prof. Jaimie Kwon, but are provided as potentially helpful material for students taking the course. A few things to note:
The lecture in class always supersedes what’s in the notes
These notes are provided “as-is” i.e. the accuracy and relevance of the contents are not guaranteed
The contents are fluid due to constant updates during the lectures
The contents may contain announcements etc. that are not relevant to the current quarter
Students are free to report typos or make suggestions to improve the notes, by email or in person, but they need to understand the above nature of the notes
Do not distribute these notes outside the class
Best Practice for note-taking in class
I do not recommend that students rely on these lecture notes in place of the notes they write down themselves
Bring a notepad and write down the material that I go over in class, using these lecture notes as an independent reference; you won’t miss a thing by not having a printout of these notes in (or outside) class
If you still want to print these notes, it’s better to print 4 pages per sheet (using the “pages per sheet” feature in MS Word), preferably double-sided (to save trees)
Some canonical examples:
Benefit of low-fat diet (Jan 2006)
# of supporters of Bush/Gore in Florida exit poll (Florida, 2000)
Is driving an SUV more dangerous than driving a passenger car?
To cash in now and retire or keep working, for GM workers (Mar 2006)?
When do I have to leave home to be at school on time (this morning)?
Has consumer confidence in the US increased or decreased from last to this month (March 2006)?
Where do I put this $1,000? Google stock? Coca-Cola stock? A mutual fund? Certificate of deposit (CD)? What are expected returns and risks? (pay day)
The number of mothers opting for cesarean birth is on the rise. On the other hand, cesarean babies have a higher risk of breathing problems (March 30, 2006)
Arnold is back (almost). The Californian governor’s approval rating is 47% now, a 7% increase in a single month. (March 30, 2006)
What’s the daily number of reports related to statistics? Interval variable? Categorical?
What the above examples have in common: decisions under uncertainty
1 What is statistics?
Statistics: a way to extract information from data
Descriptive statistics: methods of organizing, summarizing, and presenting data in such a way that useful information is produced
Graphical methods
Numerical summary of data
Inferential statistics: a body of methods used to draw conclusions or inferences about characteristics of a population based on sample data
Key paradigm of statistics
Population: the group of all items of interest
Parameter: a descriptive measure of a population
Sample: a set of data drawn from the population
Statistic: a descriptive measure of a sample
Statistical inference: the process of making an estimate, prediction, or decision about a population based on sample data
Exercises 1.3, 4
2 Graphical and tabular descriptive statistics
2.1 Types of data
Variable: some characteristic of a population or sample
The values of the variable are the possible observations of the variable (e.g. integers between 0 and 100, real numbers, M/F, A-F)
Data are the observed values of a variable (“data” is the plural of “datum”)
Types of data/variable
Interval data/variables are real numbers, a.k.a. quantitative or numerical
Nominal data/variables have categorical values with no natural order, a.k.a. qualitative or categorical
Ordinal data/variables are similar to nominal, but their values can be ordered
(“Categorical variable” is the generic name for nominal and ordinal variables)
Hierarchy? (Course grade: score to letter grade to pass/fail)
Exercises 2.1-2.3
2.2 Techniques for nominal data
Frequency distribution: a table of the categories and their counts
Relative frequency distribution: shows the proportion (not count) of each category
A bar chart is used to display frequencies
A pie chart shows relative frequencies
Exercises 2.11
2.3 Graphical techniques for interval data
How to visualize the data? Histogram
E.g. Items with defects (Xr02-35)
x=c(4, 9, 13, 7, 5, 8, 12, 15, 5, 7, 3, 8, 15, 17, 19, 6, 4, 10, 8, 22, 16, 9, 5, 3, 9, 19, 14, 13, 18, 7); hist(x)
Example (recycle below): mean time spent on the internet; 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 (hrs /month)
x=c(0, 7, 12, 5, 33, 14, 8, 0, 9, 22); hist(x, nclass=4)
We’ve all seen histograms. Here’s how you draw one:
Build class intervals, equally wide, non-overlapping intervals that cover the complete range of observations.
Create a frequency distribution, by counting the # of observations that fall into each class interval
Draw the histogram, rectangles whose bases are class intervals and heights are frequencies
How many class intervals?
More class intervals for {more, less} data points.
See Table 2.6 for rules of thumb
Sturges’ formula: number of classes = 1 + 3.3 log10(n), where n is the number of observations
My favorite: eyeballing
How wide is each interval? Round (range/# of classes) to something convenient.
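The steps above can be sketched in R on the defects data. This is only one reasonable set of choices: rounding Sturges' formula up, rounding the width up to a whole number, and starting the first interval at 0 are assumptions made for illustration, not rules from the notes.

```r
# Defect counts from the Xr02-35 example above
x <- c(4, 9, 13, 7, 5, 8, 12, 15, 5, 7, 3, 8, 15, 17, 19, 6, 4, 10, 8, 22,
       16, 9, 5, 3, 9, 19, 14, 13, 18, 7)

n <- length(x)                          # number of observations
k <- ceiling(1 + 3.3 * log10(n))        # Sturges' formula, rounded up (assumption)
width <- ceiling(diff(range(x)) / k)    # range / # of classes, rounded up to a whole number

# Equally wide, non-overlapping intervals; starting at 0 is an assumed convenience
breaks <- seq(from = 0, by = width, length.out = k + 1)

# Frequency distribution: count observations in each interval (a, b]
freq <- table(cut(x, breaks = breaks))
print(freq)

# Histogram drawn on the same class intervals
hist(x, breaks = breaks)
```

Changing `k` or the starting point changes the picture, which is why eyeballing a few alternatives is often worthwhile.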
Reading histograms…
Symmetry and Skewness (positively/negatively)
How many peaks? unimodal, bimodal
Bell shape (symmetric, unimodal; important)
Which variables are likely to have
A positively skewed distribution?
A negatively skewed distribution?
Symmetric distribution?
Symmetric, bell shaped distribution?
Bimodal distribution?
Stem-and-leaf display
Ogive
Ex. 2.33, 35(a)(c)
2.4 Describing the relationship between two variables
Bivariate methods are used to study the relationship between two variables (cf. univariate methods)
Dependent variable (Y) vs. independent variable (X)
Four possible combinations: {categorical, interval} × {X, Y} variable
Two categorical variables:
E.g. Gender and choice of doctorate, 1998 (Ex. 2.56, Xr02-56)
Example: blue collar/white collar/professional vs. NY Times/USA Today/SF Chronicle; ad targeting
A contingency table lists the frequency of each combination of the values of two categorical variables
To study how the row variable differs across the values of the column variable, compute the column totals and divide each frequency by its column total to obtain column relative frequencies
Two interval variables:
E.g. Size vs. price of a home (in 100 ft² vs. thousands of dollars). Which is the dependent and which is the independent variable? Use of X and Y. (e.g. Xm02-09)
Draw scatter diagram using X and Y
Interpreting scatter diagrams:
Linear relationship: most of the points fall close to a straight line through points (cf. least squares method)
Two main characteristics of linear relationship:
Strength (strong, medium, weak, none)
Direction (positively linear, negatively linear)
Nonlinear relationship
Ex. 2.55 (Xr02-55), 56 (Xr02-56)
2.5 Time series data
Bankrate, Hbrhomes graph (cf. cross-sectional data)
Ex 2.73 (Xr02-73)
3 Art and science of graphical presentations
graphical excellence
graphical deception
presenting statistics: writing reports and oral presentations
4 Numerical descriptive techniques
4.1 Measures of central location
Label observations in a sample as $x_1, x_2, \ldots, x_n$
We typically use n for the sample size, N for population size
Population quantities are usually not computable, especially when $N = \infty$
Example (recycle below): mean time spent on the internet; 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 (hrs /month)
x=c(0, 7, 12, 5, 33, 14, 8, 0, 9, 22);mean(x);hist(x)
Three measures of central location
Arithmetic mean:
sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$; population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
Median: the observation that falls in the middle of the sorted data
Mode: value that occurs with the greatest frequency
Which to use?
Mode is usually a poor measure.
Compared to mean, median is less sensitive to extreme observations and in many cases more interpretable
Geometric mean: useful for finance, when averaging growth rate over years
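Let $R_i$ be the rate of return in period $i$. The geometric mean $R_g$ of the returns $R_1, \ldots, R_n$ is defined by $(1+R_g)^n = (1+R_1)\cdots(1+R_n)$. Solving for $R_g$, we have $R_g = \sqrt[n]{(1+R_1)\cdots(1+R_n)} - 1$; example with $R_1 = 100\%$ and $R_2 = -50\%$ (an investment of 1,000 dollars grows to 2,000 and then falls back to 1,000)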
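As a worked version of the two-period example (R1 = 100%, R2 = -50%), the geometric mean comes out to zero, which matches the investment ending where it started. The arithmetic mean is shown only for contrast.

```latex
\begin{align*}
(1+R_g)^2 &= (1+R_1)(1+R_2) = (1+1.00)(1-0.50) = 1 \\
R_g &= \sqrt{1} - 1 = 0 \\
\bar{R} &= \frac{R_1 + R_2}{2} = \frac{1.00 + (-0.50)}{2} = 0.25
\end{align*}
```

The arithmetic mean of 25% would suggest growth that never happened, which is why the geometric mean is preferred for averaging growth rates.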
Ex 4.3, 4.10 (geometric mean)
4.2 Measures of variability
Measure of spread or variability of the data
Example: 8, 4, 9, 11, 13 (# of hours the students spent studying stat last week)
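Range = largest value observed - smallest value observed (too simple)
Variance: sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$, population variance $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2$
Why n-1? We will see in Chapter 10.1
Compute the “deviations” from the mean first, then square, sum, and divide.
Why squaring? (absolute value is also possible; MAD)
The unit? (square of the original unit)
Shortcut for sample variance: $s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$
Standard deviation (SD): sample standard deviation $s = \sqrt{s^2}$, population standard deviation $\sigma = \sqrt{\sigma^2}$
Same unit as the original data; easy to interpret
$s^2 = \sigma^2 = 0$ if and only if ___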
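A short R sketch of the deviations-square-sum-divide recipe on the study-hours example (8, 4, 9, 11, 13), checked against the usual shortcut formula and R's built-in `var()`. The variable names are illustrative only.

```r
x <- c(8, 4, 9, 11, 13)          # hours spent studying stat last week
n <- length(x)

xbar <- mean(x)                  # sample mean: 45 / 5 = 9
dev  <- x - xbar                 # deviations: -1, -5, 0, 2, 4
s2   <- sum(dev^2) / (n - 1)     # sample variance: 46 / 4 = 11.5
s    <- sqrt(s2)                 # sample standard deviation, in hours

# Shortcut form: (sum of squares - (sum)^2 / n) / (n - 1)
s2_shortcut <- (sum(x^2) - sum(x)^2 / n) / (n - 1)

# Alternative mentioned in the notes: mean absolute deviation
mean_abs_dev <- mean(abs(dev))   # 12 / 5 = 2.4

print(c(variance = s2, shortcut = s2_shortcut, builtin = var(x), sd = s))
```

Note that R's own `mad()` computes a median-based quantity, so it is not used here.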
Empirical Rule: Given a set of n measurements that is approximately normal (bell-shaped), it follows that the interval with endpoints
$\bar{x} \pm s$ contains ~ 68% of the measurements
$\bar{x} \pm 2s$ contains ~ 95% of the measurements
$\bar{x} \pm 3s$ contains almost all of the measurements
E.g. Analysis of the monthly returns on an investment shows the distribution is approximately bell shaped and mean=10% and sd=4%. What can you say about the distribution of the return?
hist(rnorm(240, 10, 4), col='red')
How often is the return between 6% and 14%?
How often is the return larger than 14%?
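For the investment example, 6% to 14% is the mean plus or minus one SD, so the Empirical Rule gives roughly 68%, and by symmetry roughly 16% of returns exceed 14%. The sketch below compares this with an exact normal model; treating the returns as exactly normal is an assumption, since the notes only say "approximately bell shaped".

```r
mu    <- 10   # mean monthly return, in percent
sigma <- 4    # standard deviation, in percent

# Share of returns between 6% and 14% (mean +/- 1 SD) under an exact normal model
p_within_1sd <- pnorm(14, mean = mu, sd = sigma) - pnorm(6, mean = mu, sd = sigma)

# Share of returns above 14%
p_above_14 <- 1 - pnorm(14, mean = mu, sd = sigma)

print(round(c(within = p_within_1sd, above = p_above_14), 3))
```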
Coefficient of variation (CV): $cv = s/\bar{x}$ (sample) or $CV = \sigma/\mu$ (population)
Ex 4.23, 24 ((b) and (c) only; also compute standard deviations), 27, 28
4.3 Percentiles and box plots
Percentiles are everywhere (test scores…)
The p’th percentile: the value for which p percent of observations are less than that value and (100-p)% are greater than that value
Quartiles are the 25th, 50th, and 75th percentiles (they divide the data into quarters),
called the first/lower quartile, the median, and the third/upper quartile, respectively
labeled Q1, Q2, Q3
(cf. quintiles and deciles)
Location of the p’th percentile in the sorted numbers is approximately $L_p = (n+1)\frac{p}{100}$
Recycle the internet data example:
Simple, rounding approach
Detailed approach
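A sketch of the detailed approach on the internet data, assuming the location rule (n+1)p/100 with linear interpolation between neighboring sorted values. The helper `percentile()` is a hypothetical name written for this example. For comparison, the simple approach rounds the Q1 location 2.75 to 3 and reports the third sorted value, 5.

```r
x <- c(0, 7, 12, 5, 33, 14, 8, 0, 9, 22)   # hrs/month on the internet

# Hypothetical helper: p'th percentile via location (n + 1) * p / 100
percentile <- function(x, p) {
  s <- sort(x)
  n <- length(s)
  loc  <- (n + 1) * p / 100     # e.g. 2.75 for p = 25, n = 10
  lo   <- floor(loc)            # position of the lower neighbor
  frac <- loc - lo              # how far to move toward the upper neighbor
  if (lo < 1) return(s[1])      # location falls before the first value
  if (lo >= n) return(s[n])     # location falls at or after the last value
  s[lo] + frac * (s[lo + 1] - s[lo])
}

q <- sapply(c(25, 50, 75), function(p) percentile(x, p))
print(q)   # Q1, Q2, Q3
```

R's built-in `quantile(x, type = 6)` uses the same location rule; other `type` values give slightly different answers on small samples.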
Relationship between the skewness and distribution of quartiles
If Q2 is closer to Q1 than Q3, then ____ skewed
If Q2 is closer to Q3 than Q1, then ____ skewed
Inter-quartile range (IQR) : Q3-Q1; spread of the middle 50% of the observations
(horizontal) Box plots:
Q1, Q2, Q3 for the box boundaries;
Left and right ‘whiskers’ extend outward from the box boundaries to the outermost values that are within 1.5 * IQR from the box boundaries
Points outside the whiskers are ‘outliers’ (>1.5*IQR outward from Q1 or Q3); interesting or incorrect points
Multiple box plots: Great tool for comparing distribution of multiple groups
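The whisker and outlier rule can be sketched on the internet data as follows. Using `quantile(..., type = 6)` for the quartiles is an assumption chosen to match the (n+1)p/100 location rule; software packages differ in how they compute the box boundaries.

```r
x <- c(0, 7, 12, 5, 33, 14, 8, 0, 9, 22)   # hrs/month on the internet

q   <- quantile(x, c(0.25, 0.75), type = 6, names = FALSE)  # Q1 and Q3
iqr <- q[2] - q[1]                                          # spread of the middle 50%

lower_fence <- q[1] - 1.5 * iqr   # points below this are flagged as outliers
upper_fence <- q[2] + 1.5 * iqr   # points above this are flagged as outliers

outliers <- x[x < lower_fence | x > upper_fence]
whiskers <- range(x[x >= lower_fence & x <= upper_fence])   # outermost non-outlier values

print(list(iqr = iqr, fences = c(lower_fence, upper_fence),
           whiskers = whiskers, outliers = outliers))

# R draws its box from "hinges", which may differ slightly from Q1 and Q3 above
boxplot(x, horizontal = TRUE)
```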
Ex 4.37, 4.43, 4.48 (do only the “describe your findings” part; the boxplot is provided in the handout; feel free to try Minitab to draw the boxplot following the in-class instructions, but it’s not required)
4.4 Measures of linear relationship
Numerical measure for direction and strength of the linear relationship
Example: (which are X and which are Y?)
baseball wins vs. home/road attendance (Baseball attendance);
GMAT score vs. MBA GPA (xm04-16)
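Covariance between variables X and Y:
Population covariance: $\sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu_x)(y_i-\mu_y)$
Sample covariance: $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$
Shortcut for sample covariance: $s_{xy} = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}\right]$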
Manual calculation:
i / xi / yi / / /
1 / 2 / 13 / / /
… / 6 / 20 / / /
n / 7 / 27 / / /
Total / / / / /
Average / / / / /
xi = 2, 6, 7; yi = 13, 20, 27
How about yi=27, 20, 13?
How about yi=20, 27, 13?
Look at the sign (direction) and magnitude (strength) –
How do we judge magnitude of covariance?
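The three manual calculations can be checked with a few lines of R. The helper `sample_cov()` is a hypothetical name that simply spells out the deviations-and-products recipe; R's built-in `cov()` gives the same numbers.

```r
x  <- c(2, 6, 7)
y1 <- c(13, 20, 27)   # y increases with x
y2 <- c(27, 20, 13)   # y decreases as x increases
y3 <- c(20, 27, 13)   # same y values, weaker pattern

# Hypothetical helper: sum of products of deviations, divided by n - 1
sample_cov <- function(x, y) {
  sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
}

covs <- c(sample_cov(x, y1), sample_cov(x, y2), sample_cov(x, y3))
print(covs)   # 17.5 -17.5 -3.5
```

The sign tracks the direction, but the size (17.5 vs. 3.5) depends on the units of x and y, which is what motivates the coefficient of correlation.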
Coefficient of correlation
Population correlation: $\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$; sample correlation: $r = \frac{s_{xy}}{s_x s_y}$
Correlation is between -1 and 1
Java Applet for correlation coefficient
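As a worked illustration, the sample correlation for the small data set xi = 2, 6, 7 and yi = 13, 20, 27 can be computed by hand using the standard definition r = s_xy / (s_x s_y), with means 5 and 20:

```latex
\begin{align*}
s_{xy} &= \frac{(-3)(-7) + (1)(0) + (2)(7)}{3-1} = \frac{35}{2} = 17.5 \\
s_x^2 &= \frac{(-3)^2 + 1^2 + 2^2}{2} = 7, \qquad
s_y^2 = \frac{(-7)^2 + 0^2 + 7^2}{2} = 49 \\
r &= \frac{s_{xy}}{s_x s_y} = \frac{17.5}{\sqrt{7}\cdot 7} \approx 0.945
\end{align*}
```

Reordering y to 27, 20, 13 flips the sign (r ≈ -0.945), while y = 20, 27, 13 gives r ≈ -0.189. Unlike the covariance, these values are unit-free and directly comparable.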
Least squares method: an objective way of producing a straight line through data points in scatter diagram
It produces a straight line such that the sum of squared deviations between the points and the line is minimized
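Equation for a line:
$\hat{y} = b_0 + b_1 x$,
where
$b_0$: intercept
$b_1$: slope
$\hat{y}$: the (predicted) value of y determined by the line
Use calculus to find the coefficients $b_0$, $b_1$ that minimize $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
Least squares line coefficients are given by
$b_1 = \frac{s_{xy}}{s_x^2}$ and $b_0 = \bar{y} - b_1\bar{x}$.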
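A minimal R sketch of the least squares line for xi = 2, 6, 7 and yi = 13, 20, 27, using the standard formulas slope = s_xy / s_x² and intercept = ȳ - slope · x̄, and then checking the result against R's `lm()`.

```r
x <- c(2, 6, 7)
y <- c(13, 20, 27)

b1 <- cov(x, y) / var(x)        # slope: 17.5 / 7 = 2.5
b0 <- mean(y) - b1 * mean(x)    # intercept: 20 - 2.5 * 5 = 7.5

y_hat <- b0 + b1 * x            # predicted values on the line
sse   <- sum((y - y_hat)^2)     # sum of squared deviations being minimized

fit <- lm(y ~ x)                # built-in least squares fit, for comparison
print(c(b0 = b0, b1 = b1, sse = sse))
print(coef(fit))
```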
Ex 4.55, 56, 58 (xr04-58; computer use is OK but show your work)
4.5 Comparing graphical and numerical techniques
Comparing returns on two investments; centers = expected returns; spreads = risks (low-risk vs. high-risk)
Business stat marks vs. math stat marks: unimodal, bimodal, …
Relationship b/w price and size of houses
4.6 General guidelines for exploring data
Look at the shape of the distribution; find Center; spread; peaks; skewness (bell curve?)
Shapes guide on which numerical techniques to use
Optional (won't be graded): Ex 4.84, 4.86 (you have to use the computer, preferably Minitab, for these two problems)
5 Data collection and sampling
5.1 Methods of collecting data
Direct observation (observational data): aspirin vs. heart attack example; limitations; inexpensive
Surveys: Gallup Poll example; market research; response rate
Personal interview
Telephone interview
Self-administered survey
Questionnaire design
Experiment (experimental data): same example
Ex 5.1
5.2 Sampling
The chief motive for using a sample rather than the whole population: cost
Use sample quantities as ‘estimates’ for the corresponding population quantities
E.g. Nielsen ratings (what is watched by 1000 television viewers); quality control
“Target population” (the population about which we want to draw inferences) vs. “sampled population” (the actual population from which the sample has been taken)
E.g. The Literary Digest (1936): predicted Alfred Landon’s 3-to-2 victory over the incumbent Franklin D. Roosevelt based on 10 million sample ballots
That were sampled from phone directories
Of which “only” 2.3 million were returned (‘self-selected samples’)
Ex. 5.6, 5.7
5.3 Sampling plans
A “simple random sample” is a sample selected in such a way that every possible sample with the same # of observations is equally likely to be chosen
Simple and good (do it “randomly”!!)
How to do it?? (random sample; jar; …)
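One way to "do it randomly" on a computer is to number the population and let software draw the labels, which plays the same role as drawing slips from a jar. The population size, sample size, and seed below are hypothetical values chosen only for illustration.

```r
set.seed(2010)        # arbitrary seed, only so the draw can be reproduced

N <- 1000             # hypothetical population size
n <- 25               # hypothetical sample size

population_ids <- 1:N # label every item in the population

# Draw n labels without replacement; every set of n labels is equally likely
srs <- sample(population_ids, size = n, replace = FALSE)
print(sort(srs))
```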
A “stratified random sample” is obtained by separating the population into mutually exclusive sets, or strata, and then drawing simple random samples from each stratum
To extract more information
Criteria for separating a population into strata include: gender, age, occupation,…
Sampling procedure and analysis can be complicated: plan ahead and consult stat pros!
A “cluster sample” is a simple random sample of groups or clusters of elements
Reduces the geographic distances the surveyor must cover to gather data (reduces cost)
Increases sampling error
Sample size and accuracy: the larger the sample size, the more accurate the sample estimates become
Details in Chapters 10 and 12
Ex 5.11, 14-16
5.4 Sampling and nonsampling errors
Sampling error: differences between the sample and the population that exist only because of the observations that happened to be selected for the sample
E.g. the mean annual income of North American blue-collar workers
Estimate the mean income $\mu$ of the population by the mean $\bar{x}$ of the sample. The value of $\bar{x}$ will deviate from $\mu$ simply by chance
This deviation can be large simply due to bad luck
The only way to reduce the expected size of this error is to take a larger sample
Given a fixed sample size, we state the probability that the sampling error is less than a certain amount (Ch. 10)
Nonsampling error: more serious; taking a larger sample won’t help here; due to mistakes made in the acquisition of data or due to the sample observations being selected improperly
Error in data acquisition
“Non-response error”: error or bias introduced when responses are not obtained from some members of the sample
Selection bias
Ex 5.17, 5.18
6 Probability
Probability is critical in statistical inference since it provides the link between the population and the sample
6.1 Assigning probability to events
A “random experiment” is a process that leads to one of several possible outcomes
E.g. coin flipping; grade on a stat test; time to assemble computer; party preference
A “sample space” of a random experiment is a set of all possible outcomes of the experiment (exhaustive and mutually exclusive)
Requirements of probabilities: given a sample space $S = \{O_1, O_2, \ldots, O_k\}$, the probabilities assigned to the outcomes must satisfy two requirements:
The probability of any outcome must be between 0 and 1, i.e. $0 \le P(O_i) \le 1$ for each $i$
The sum of the probabilities of all the outcomes in the sample space must be 1, i.e. $\sum_{i=1}^{k} P(O_i) = 1$
Three approaches to assigning probabilities
The classical approach
The relative frequency approach
The subjective approach
An “event” is a set of outcomes in a sample space
A “simple event” is an individual outcome
The “probability of an event” is the sum of probabilities of the simple events that constitute the event
The most useful way to interpret probability is the relative frequency approach for a hypothetical, infinite number of experiments
Ex. 6.1-3 (in class), 8
6.2 Joint, marginal, and conditional probability
Want to consider ‘combinations’ of events
Example: relationship between whether a mutual fund outperforms market and whether the manager of the fund has an MBA from a top-20 program
Consider a population of 1,000 mutual funds
/ Mutual fund outperforms market / Mutual fund does not outperform market / Totals
The manager has MBA / 110 / 290 /
The manager does not have MBA / 60 / 540 /
Totals / / / 1,000
The “intersection of events A and B,” denoted “A and B,” is the event that occurs when both A and B occur.
The probability of the intersection is called the “joint probability”
P(A randomly selected mutual fund outperforms and its manager has an MBA degree) =
What is the joint probability if we sample a mutual fund from the above population?
/ Mutual fund outperforms market / Mutual fund does not outperform market / Totals
The manager has MBA / .11 / .29 /
The manager does not have MBA / .06 / .54 /
Totals / / /
“Marginal probabilities” are computed by adding across rows or down columns
P(A randomly selected mutual fund manager has MBA degree) = ?
i.e., When a mutual fund is randomly selected, the probability that its manager has an MBA is ___
i.e., ___ of all mutual fund managers have an MBA
Try, P(A randomly selected mutual fund outperforms the market) = ?
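The joint and marginal probabilities can be reproduced from the table of 1,000 mutual funds with a few lines of R. The object and dimension names are illustrative choices, not notation from the notes.

```r
# Counts from the table of 1,000 mutual funds above
counts <- matrix(c(110, 290,
                    60, 540),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(manager = c("MBA", "No MBA"),
                                 fund    = c("Outperforms", "Does not outperform")))

joint <- counts / sum(counts)   # joint probabilities: each cell / 1,000

p_manager <- rowSums(joint)     # marginal probabilities, adding across rows
p_fund    <- colSums(joint)     # marginal probabilities, adding down columns

print(joint)
print(p_manager)
print(p_fund)
```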
“Given that a fund is managed by an MBA, what’s the probability that it outperforms the market?”
Given A, what’s the probability of B?
The “conditional probability of B given A”, written P(B|A), is the probability of event B given the occurrence of another related event A.
Formally, it can be computed as P(B|A)=P(A and B)/P(A)
Two events A and B are “independent” if P(A|B)=P(A) or P(B|A)=P(B)
i.e., the probability of one event is not affected by the occurrence of the other event
Checking dependence: For a table like the one above, we could check all four combinations, but showing it for only one of them [$P(B) \ne P(B|A)$ for some A and B] is enough. On the other hand, showing independence would be more work
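A worked check on the mutual fund table, taking A = "the manager has an MBA" and B = "the fund outperforms the market" (this labeling is chosen here for illustration):

```latex
\begin{align*}
P(A) &= 0.11 + 0.29 = 0.40, \qquad P(B) = 0.11 + 0.06 = 0.17 \\
P(B \mid A) &= \frac{P(A \text{ and } B)}{P(A)} = \frac{0.11}{0.40} = 0.275 \\
P(B \mid A) &= 0.275 \ne 0.17 = P(B)
\end{align*}
```

Since the two values differ, the events are dependent: in this population, funds managed by an MBA outperform the market more often than funds in general.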
The “union” of events A and B is the event that occurs when either A or B or both occur. It is denoted as “A or B”
E.g. determine P(A1 or B1)
Approach #1 : sum the components
#2 : 1- P(the other component)
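Both approaches can be written out on the mutual fund table. The notes do not define A1 and B1 at this point, so the sketch assumes A1 = "the manager has an MBA" and B1 = "the fund outperforms the market"; under that assumption the only cell outside the union is "no MBA and does not outperform".

```latex
\begin{align*}
\text{Approach 1:}\quad P(A_1 \text{ or } B_1) &= 0.11 + 0.29 + 0.06 = 0.46 \\
\text{Approach 2:}\quad P(A_1 \text{ or } B_1) &= 1 - 0.54 = 0.46
\end{align*}
```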
Ex 6.86
6.3 Probability rules and trees
Want to calculate the probability of more complex events from the probability of simpler events
Complement rule: the “complement” of event A is … and is denoted by $A^C$. The rule says $P(A^C) = 1 - P(A)$; e.g.