Introduction to Nonparametric Statistical Methods

STAT 2010, Business Stat, 2006, Jaimie Kwon

STAT 2010, Elements of Statistics for
Business and Economics

Lecture Notes

Prof. Jaimie Kwon

Statistics Dept

CalStateEastBay

Disclaimer

These lecture notes are for internal use of Prof. Jaimie Kwon, but are provided as potentially helpful material for students taking the course. A few things to note:

The lecture in class always supersedes what’s in the notes

These notes are provided “as-is” i.e. the accuracy and relevance of the contents are not guaranteed

The contents are fluid due to constant update during the lecture

The contents may contain announcements etc. that are not relevant to the current quarter

Students are free to report typos or make suggestions on the notes via emailing or in person to improve the material, but they need to understand the above nature of the notes

Do not distribute these notes outside the class

Best Practice for note-taking in class

I do not recommend that students rely on these lecture notes in place of the notes they write down themselves

Bring a notepad and write down the material that I go over in class, using these lecture notes as an independent reference; you won’t miss a thing by not having a printout of these lecture notes in (and outside) the class

If you still want to print these notes, it’d be better to print them 4 pages on a single page (using “pages per sheet” feature in MS Word), preferably double sided (to save trees)

Some canonical examples:

Benefit of low-fat diet (Jan 2006)

# of supporters of Bush/Gore in Florida exit poll (Florida, 2000)

Is driving an SUV more dangerous than driving a passenger car?

To cash in now and retire or keep working, for GM workers (Mar 2006)?

When do I have to leave home to be at school on time (this morning)?

Has consumer confidence in the US increased or decreased from last to this month (March 2006)?

Where do I put this $1,000? Google stock? Coca-Cola stock? A mutual fund? Certificate of deposit (CD)? What are expected returns and risks? (pay day)

The number of mothers opting for cesarean birth is on the rise. On the other hand, cesarean babies have higher risk of breathing problem (March 30, 2006)

Arnold is back (almost). The Californian governor’s approval rating is 47% now, a 7% increase in a single month. (March 30, 2006)

What’s the daily number of reports related to statistics? Interval variable? Categorical?

What’s common in the above examples: decisions under uncertainty

1 What is statistics?

Statistics: a way to extract information from data

Descriptive statistics: methods of organizing, summarizing, and presenting data in such a way that useful information is produced

Graphical methods

Numerical summary of data

Inferential statistics: a body of methods used to draw conclusions or inferences about characteristics of population based on sample data

Key paradigm of statistics

Population: the group of all items of interest

Parameter: a descriptive measure of a population

Sample: a set of data drawn from the population

Statistic: a descriptive measure of a sample

Statistical inference: the process of making an estimate, prediction, or decision about a population based on sample data

Exercises 1.3, 4

2 Graphical and tabular descriptive statistics

2.1 Types of data

Variable: some characteristic of a population or sample

The values of the variable are the possible observations of the variable. (Integers b/w 0-100, real numbers, M/F, A-F)

Data are the observed values of a variable (plural for datum)

Types of data/variable

Interval data/variable are real numbers, a.k.a. quantitative or numerical

Nominal data/variable have categorical values without orders, a.k.a. qualitative or categorical

Ordinal data/variable are similar to nominal but their values can be ordered

(“Categorical variable” is the generic name for nominal and ordinal variables)

Hierarchy? (Course grade: score to letter grade to pass/fail)

Exercises 2.1-2.3

2.2 Techniques for nominal data

Frequency distribution: a table of the categories and their counts

Relative frequency distribution: shows the proportion (not count) of each category

A bar chart is used to display frequencies

A pie chart shows relative frequencies

Exercises 2.11

2.3 Graphical techniques for interval data

How to visualize the data? Histogram

E.g. Items with defects (Xr02-35)

x=c(4, 9, 13, 7, 5, 8, 12, 15, 5, 7, 3, 8, 15, 17, 19, 6, 4, 10, 8, 22, 16, 9, 5, 3, 9, 19, 14, 13, 18, 7); hist(x)

Example (recycle below): mean time spent on the internet; 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 (hrs /month)

x=c(0, 7, 12, 5, 33, 14, 8, 0, 9, 22); hist(x, nclass=4)

We’ve all seen histograms. Here’s how you draw one:

Build class intervals, equally wide, non-overlapping intervals that cover the complete range of observations.

Create a frequency distribution, by counting the # of observations that fall into each class interval

Draw the histogram, rectangles whose bases are class intervals and heights are frequencies

How many class intervals?

More class intervals for {more, less} data points.

Table 2.6 for rules of thumb;

Sturges’ formula: number of class intervals ≈ $1 + 3.3 \log_{10}(n)$

My favorite: eyeballing

How wide is each interval? Round (range/# of classes) to something convenient.
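
The histogram recipe above can be sketched in R for the defect data (Xr02-35). This is only an illustration: rounding the number of classes and the class width up, and starting the first interval one unit below the minimum, are convenience choices assumed here, not rules from the textbook.

```r
# Defect counts from the Xr02-35 example above
x <- c(4, 9, 13, 7, 5, 8, 12, 15, 5, 7, 3, 8, 15, 17, 19, 6, 4, 10,
       8, 22, 16, 9, 5, 3, 9, 19, 14, 13, 18, 7)
n <- length(x)

# Sturges' formula with a base-10 logarithm, rounded up to a whole number
k <- ceiling(1 + 3.3 * log10(n))

# Class width: range / number of classes, rounded up to something convenient
width <- ceiling((max(x) - min(x)) / k)

# Equal-width, non-overlapping intervals that cover every observation
breaks <- seq(from = min(x) - 1, by = width, length.out = k + 1)

# Frequency distribution: count the observations in each class interval
freq <- table(cut(x, breaks = breaks))
print(freq)

# Histogram: rectangles whose bases are the class intervals
hist(x, breaks = breaks)
```

Eyeballing may well suggest a different number of classes; the formula is only a starting point.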

Reading histograms…

Symmetry and Skewness (positively/negatively)

How many peaks? unimodal, bimodal

Bell shape (symmetric, unimodal; important)

Which variables are likely to have

A positively skewed distribution?

A negatively skewed distribution?

Symmetric distribution?

Symmetric, bell shaped distribution?

Bimodal distribution?

Stem-and-leaf display

Ogive

Ex. 2.33, 35(a)(c)

2.4 Describing the relationship between two variables

Bivariate methods are used to study the relationship between two variables (Cf. Univariate methods)

Dependent variable (Y) vs. independent variable (X)

Four possible combinations: {categorical, interval} {X, Y} variable

Two categorical variables:

E.g. Gender and choice of doctorate, 1998 (Ex. 2.56, Xr02-56)

Example: Blue collar/white collar/professional vs NYTimes/USA today/SF Chronicles; ad targeting

A contingency table lists the frequency of each combination of the values of two categorical variables

To study how the row variable differs across the categories of the column variable, compute the column totals and divide each frequency by its column total to obtain column relative frequencies

Two interval variables:

E.g. Size vs. price of home (100 ft² vs. thousands of dollars): which is the dependent and which is the independent variable? Use of X and Y. (e.g. Xm02-09)

Draw scatter diagram using X and Y

Interpreting scatter diagrams:

Linear relationship: most of the points fall close to a straight line through points (cf. least squares method)

Two main characteristics of linear relationship:

Strength (strong, medium, weak, none)

Direction (positively linear, negatively linear)

Nonlinear relationship

Ex. 2.55 (Xr02-55), 56 (Xr02-56)

2.5 Time series data

Bankrate, Hbrhomes graph (cf. cross-sectional data)

Ex 2.73 (Xr02-73)

3 Art and science of graphical presentations

graphical excellence

graphical deception

presenting statistics: writing reports and oral presentations

4 Numerical descriptive techniques

4.1 Measures of central location

Label observations in a sample as $x_1, x_2, \dots, x_n$

We typically use n for the sample size, N for population size

Population quantities are usually not computable, especially when $N = \infty$

Example (recycle below): mean time spent on the internet; 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 (hrs /month)

x=c(0, 7, 12, 5, 33, 14, 8, 0, 9, 22);mean(x);hist(x)

Three measures of central location

Arithmetic mean:
sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$; population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

Median: the observation that falls in the middle of the sorted data

Mode: value that occurs with the greatest frequency

Which to use?

Mode is usually a poor measure.

Compared to mean, median is less sensitive to extreme observations and in many cases more interpretable
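
As a small check of the three measures on the internet-use data, the sketch below computes each one and then drops the extreme value 33 to see how much the mean and the median move. Dropping 33 is purely for illustration, not a recommended way to treat data.

```r
# Hours per month spent on the internet (example above)
x <- c(0, 7, 12, 5, 33, 14, 8, 0, 9, 22)

mean(x)      # 11
median(x)    # 8.5

# Mode: the value(s) with the greatest frequency
counts <- table(x)
mode_x <- as.numeric(names(counts)[counts == max(counts)])
mode_x       # 0

# Remove the extreme observation and recompute
x_trim <- x[x != 33]
mean(x_trim)     # about 8.56, a shift of roughly 2.4 hours
median(x_trim)   # 8, a shift of only 0.5 hours
```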

Geometric mean: useful for finance, when averaging growth rate over years

Let $R_i$ be the rate of return in period i. The geometric mean $R_g$ of the returns $R_1, \dots, R_n$ is defined by $(1+R_g)^n = (1+R_1)\cdots(1+R_n)$; solving for $R_g$, we have $R_g = \sqrt[n]{(1+R_1)\cdots(1+R_n)} - 1$; example with $R_1$ = 100% and $R_2$ = -50%. ($1,000 -> $2,000 -> $1,000 again)
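
The two-period example can be verified numerically. The sketch below compares the geometric mean with the arithmetic mean of the same returns; the variable names are just illustrative.

```r
# Returns: +100% in period 1, -50% in period 2
R <- c(1.00, -0.50)
n <- length(R)

# Geometric mean return: solve (1 + Rg)^n = (1 + R1) * ... * (1 + Rn)
Rg <- prod(1 + R)^(1 / n) - 1
Rg            # 0: the investment ends where it started

# Arithmetic mean of the same returns
Ra <- mean(R)
Ra            # 0.25, which overstates the growth actually achieved

# Value of $1,000 after both periods
1000 * prod(1 + R)   # 1000
```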

Ex 4.3, 4.10 (geometric mean)

4.2 Measures of variability

Measure of spread or variability of the data

Example: 8, 4, 9, 11, 13 (# of hours the students spent studying stat last week)

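Range = largest value observed - smallest value observed (too simple)

Variance: sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, population variance $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$

Why n-1? We will see in Chapter 10.1;

Compute the “deviations” $x_i - \bar{x}$ first, then square them, sum them, and divide by n-1.

Why squaring? (absolute value is also possible; MAD)

The unit? (square of the original unit)

Shortcut for sample variance: $s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$

Standard deviation (SD): sample standard deviation $s = \sqrt{s^2}$, population standard deviation $\sigma = \sqrt{\sigma^2}$

Same unit as the original data; easy to interpret

$s^2 = \sigma^2 = 0$ if and only if ___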
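
For the study-hours example (8, 4, 9, 11, 13), the definition and the shortcut can be worked through step by step. The sketch below assumes the usual n-1 divisor for the sample variance and compares the hand calculation with R's built-in var() and sd().

```r
# Hours spent studying stat last week
x <- c(8, 4, 9, 11, 13)
n <- length(x)

# Definition: deviations from the mean, squared, summed, divided by n - 1
dev <- x - mean(x)                 # -1 -5  0  2  4
s2 <- sum(dev^2) / (n - 1)         # 46 / 4 = 11.5

# Shortcut: uses only the sum of x and the sum of x^2
s2_short <- (sum(x^2) - sum(x)^2 / n) / (n - 1)   # (451 - 405) / 4 = 11.5

# Standard deviation, in the original unit (hours)
s <- sqrt(s2)                      # about 3.39

c(s2, s2_short, var(x))            # all three agree
```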

Empirical Rule: Given a set of n measurements that is approximately normal (bell-shaped), it follows that the interval with endpoints
$\bar{x} \pm s$ contains ~ 68% of the measurements
$\bar{x} \pm 2s$ contains ~ 95% of the measurements
$\bar{x} \pm 3s$ contains almost all of the measurements

E.g. Analysis of the monthly returns on an investment shows the distribution is approximately bell shaped and mean=10% and sd=4%. What can you say about the distribution of the return?

hist(rnorm(240, 10, 4), col='red')

How often is the return between 6% and 14%?

How often is the return larger than 14%?
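
If one is willing to assume the returns are exactly normal with mean 10% and SD 4% (the notes only say "approximately bell shaped"), the two questions can be checked with pnorm(). The results are close to what the Empirical Rule suggests.

```r
mu <- 10      # mean monthly return, in percent
sigma <- 4    # standard deviation, in percent

# Share of returns within one SD of the mean (6% to 14%)
p_within_1sd <- pnorm(14, mu, sigma) - pnorm(6, mu, sigma)
p_within_1sd   # about 0.68

# Share of returns above 14% (more than one SD above the mean)
p_above_14 <- 1 - pnorm(14, mu, sigma)
p_above_14     # about 0.16, i.e. half of the remaining 32%
```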

Coefficient of variation (CV): $cv = s / \bar{x}$ or $CV = \sigma / \mu$

Ex 4.23, 24 ((b) and (c) only; also compute standard deviations), 27, 28

4.3 Percentiles and box plots

Percentiles are everywhere (test scores…)

The p’th percentile: the value for which p percent of observations are less than that value and (100-p)% are greater than that value

Quartiles are 25th, 50th, 75th percentiles (divide the data into quarters),
each called first/lower quartile, median, and third/upper quartile
each labeled Q1, Q2, Q3
(cf. quintiles and deciles)

Location of the p’th percentile in the sorted numbers is approximately $L_p = (n+1)\frac{p}{100}$

Recycle the internet data example:

Simple, rounding approach

Detailed approach
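
The two approaches can be sketched for the internet data, assuming the location formula L_p = (n+1)p/100 that many introductory texts use. The "detailed" approach here is linear interpolation between the two neighboring sorted values, which matches R's quantile() with type = 6; other software may use slightly different conventions.

```r
# Internet data, sorted
x <- sort(c(0, 7, 12, 5, 33, 14, 8, 0, 9, 22))
n <- length(x)

p <- c(25, 50, 75)
L <- (n + 1) * p / 100        # locations 2.75, 5.5, 8.25

# Simple approach for Q1 and Q3: round the location to the nearest position
simple <- x[round(L[c(1, 3)])]                    # 5 and 14

# Detailed approach: interpolate between the neighboring sorted values
lo <- floor(L)
frac <- L - lo
detailed <- x[lo] + frac * (x[lo + 1] - x[lo])    # 3.75, 8.5, 16

# Same answers from the built-in function under the (n + 1) convention
quantile(x, p / 100, type = 6)
```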

Relationship between the skewness and distribution of quartiles

If Q2 is closer to Q1 than Q3, then ____ skewed

If Q2 is closer to Q3 than Q1, then ____ skewed

Inter-quartile range (IQR) : Q3-Q1; spread of the middle 50% of the observations

(horizontal) Box plots:

Q1, Q2, Q3 for the box boundaries;

Left and right ‘whiskers’ extend outward from the box boundaries to the outermost values that are within 1.5 * IQR from the box boundaries

Points outside the whiskers are ‘outliers’ (>1.5*IQR outward from Q1 or Q3); interesting or incorrect points

Multiple box plots: Great tool for comparing distribution of multiple groups
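
The whisker and outlier rule can be traced by hand on the internet data. This is a sketch that assumes quartiles computed with the (n+1)p/100 location convention (type = 6 in R).

```r
x <- sort(c(0, 7, 12, 5, 33, 14, 8, 0, 9, 22))

# Q1 and Q3 under the (n + 1) * p / 100 location convention
q <- unname(quantile(x, c(0.25, 0.75), type = 6))   # 3.75 and 16
iqr <- q[2] - q[1]                                  # 12.25

# Fences: 1.5 * IQR outward from the box boundaries
lower_fence <- q[1] - 1.5 * iqr                     # -14.625
upper_fence <- q[2] + 1.5 * iqr                     # 34.375

# Whiskers end at the outermost observations still inside the fences
inside <- x[x >= lower_fence & x <= upper_fence]
whiskers <- range(inside)                           # 0 and 33

# Anything beyond the fences is flagged as an outlier
outliers <- x[x < lower_fence | x > upper_fence]    # none here
```

R's own boxplot() computes the box boundaries as "hinges" (see fivenum()), so its quartiles, and therefore the points it flags, can differ slightly from this hand calculation.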

Ex 4.37, 4.43, 4.48 (do only the “describe your findings” part; the boxplot is provided in the handout; feel free to try Minitab to draw the boxplot per the in-class instructions, but it’s not required)

4.4 Measures of linear relationship

Numerical measure for direction and strength of the linear relationship

Example: (which are X and which are Y?)

baseball wins vs. home/road attendance (Baseball attendance);

GMAT score vs. MBA GPA (xm04-16)

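Covariance between variables X and Y:

Population covariance: $\sigma_{xy} = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)$

Sample covariance: $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$

Shortcut for sample covariance: $s_{xy} = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}\right]$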

Manual calculation:

i / xi / yi / xi − x̄ / yi − ȳ / (xi − x̄)(yi − ȳ)
1 / 2 / 13 / / /
… / 6 / 20 / / /
n / 7 / 27 / / /
Total / / / / /
Average / / / / /

xi = 2, 6, 7; yi = 13, 20, 27;
How about yi = 27, 20, 13?
How about yi = 20, 27, 13?
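
The manual calculation can be checked in R for all three orderings of y. The helper sample_cov() below is a hypothetical name that simply spells out the definition with the n-1 divisor; R's built-in cov() gives the same values.

```r
x  <- c(2, 6, 7)
y1 <- c(13, 20, 27)   # y rises with x
y2 <- c(27, 20, 13)   # y falls as x rises
y3 <- c(20, 27, 13)   # no clear pattern

# Sample covariance from the definition (illustrative helper)
sample_cov <- function(x, y) {
  sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
}

sample_cov(x, y1)   #  17.5: positive direction
sample_cov(x, y2)   # -17.5: negative direction
sample_cov(x, y3)   #  -3.5: weak, slightly negative
```

The sign is easy to read, but the magnitude depends on the units of x and y, which is why a standardized measure is needed.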

Look at the sign (direction) and magnitude (strength)

How do we judge magnitude of covariance?

Coefficient of correlation

Population correlation: $\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$; sample correlation: $r = \frac{s_{xy}}{s_x s_y}$

Correlation is between -1 and 1
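
Dividing the sample covariance by the two sample standard deviations puts it on a unit-free scale. A sketch for the small data set xi = 2, 6, 7 with the three orderings of y:

```r
x  <- c(2, 6, 7)
y1 <- c(13, 20, 27)
y2 <- c(27, 20, 13)
y3 <- c(20, 27, 13)

# Sample correlation: covariance divided by the product of the two SDs
r1 <- cov(x, y1) / (sd(x) * sd(y1))   # about  0.94: strong, positive
r2 <- cov(x, y2) / (sd(x) * sd(y2))   # about -0.94: strong, negative
r3 <- cov(x, y3) / (sd(x) * sd(y3))   # about -0.19: weak

# Same values from the built-in function
c(cor(x, y1), cor(x, y2), cor(x, y3))
```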

Java Applet for correlation coefficient

Least squares method: an objective way of producing a straight line through data points in scatter diagram

It produces a straight line such that the sum of squared deviations between the points and the line is minimized

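Equation for a line:
$\hat{y} = b_0 + b_1 x$,
where
$b_0$: intercept
$b_1$: slope
$\hat{y}$: the (predicted) value of y determined by the line

Use calculus to find the coefficients $b_0$, $b_1$ that minimize $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2$

Least squares line coefficients are given by
$b_1 = \frac{s_{xy}}{s_x^2}$ and $b_0 = \bar{y} - b_1 \bar{x}$.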
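
As a sketch, the least squares coefficients can be computed for the small data set xi = 2, 6, 7 and yi = 13, 20, 27 from the sample covariance and the sample variance of x (slope = covariance / variance of x, intercept = mean of y minus slope times mean of x), then compared with R's lm().

```r
x <- c(2, 6, 7)
y <- c(13, 20, 27)

# Slope and intercept from the summary statistics
b1 <- cov(x, y) / var(x)        # 17.5 / 7 = 2.5
b0 <- mean(y) - b1 * mean(x)    # 20 - 2.5 * 5 = 7.5

# Predicted values and the sum of squared deviations being minimized
y_hat <- b0 + b1 * x
sse <- sum((y - y_hat)^2)

# R's built-in least squares fit gives the same line
fit <- lm(y ~ x)
coef(fit)
```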

Ex 4.55, 56, 58 (xr04-58; computer use is OK but show your work)

4.5 Comparing graphical and numerical techniques

Comparing returns on two investments; centers=expected return; spreads=risks (low-risk vs high-risk)

Business stat marks vs. math stat marks: unimodal, bimodal, …

Relationship b/w price and size of houses

4.6 General guidelines for exploring data

Look at the shape of the distribution; find Center; spread; peaks; skewness (bell curve?)

Shapes guide on which numerical techniques to use

Optional (won't be graded): Ex 4.84, 4.86 (you have to use the computer, preferably Minitab, for these two problems)

5 Data collection and sampling

5.1 Methods of collecting data

Direct observation (observational data): aspirin vs. heart attack example; limitations; inexpensive

Surveys: Gallup Poll example; market research; response rate

Personal interview

Telephone interview

Self-administered survey

Questionnaire design

Experiment (experimental data): same example

Ex 5.1

5.2 Sampling

The chief motive for examining a sample rather than a population: cost

Use sample quantities as ‘estimates’ for the corresponding population quantities

E.g. Nielsen ratings (what is watched by 1000 television viewers); quality control

“Target population” (the population about which we want to draw inferences) vs. “sampled population” (the actual population from which the sample has been taken)

E.g. The Literary Digest (1936) predicted a 3-to-2 victory for Alfred Landon over the incumbent Franklin D. Roosevelt, based on 10 million sample ballots

That were sent to people sampled from telephone directories

Of which “only” 2.3 million were returned (‘self-selected sample’)

Ex. 5.6, 5.7

5.3 Sampling plans

A “simple random sample” is a sample selected in such a way that every possible sample with the same # of observations is equally likely to be chosen

Simple and good (do it “randomly”!!)

How to do it?? (random sample; jar; …)
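
One way to do it on a computer is to number the population and let software pick the numbers, the electronic version of drawing slips from a jar. The sketch below assumes a hypothetical population of 1,000 numbered items and a sample of 10; both sizes and the seed are arbitrary.

```r
set.seed(2010)   # arbitrary seed, only so the draw can be reproduced

N <- 1000        # hypothetical population size
n <- 10          # hypothetical sample size

# Draw n distinct ID numbers; every set of n IDs is equally likely
ids <- sample(1:N, size = n)   # sampling without replacement is the default
sort(ids)
```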

A “stratified random sample” is obtained by separating the population into mutually exclusive sets, or strata, and then drawing simple random samples from each stratum

To extract more information

Criteria for separating a population into strata include: gender, age, occupation,…

Sampling procedure and analysis can be complicated: plan ahead and consult stat pros!

A “cluster sample” is a simple random sample of groups or clusters of elements

Reduces the geographic distances the surveyor must cover to gather data (reduces cost)

Increases sampling error

Sample size and accuracy: The larger the sample size is, the more accurate the sample estimates become

Details in Chapters 10 and 12
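
A small simulation can make the link between sample size and accuracy concrete. Everything here is made up for illustration: the "population" is 10,000 simulated right-skewed values, and the sample sizes, number of repetitions, and seed are arbitrary.

```r
set.seed(2010)   # arbitrary seed so the run is reproducible

# Hypothetical right-skewed population (e.g. incomes in $1,000s); not real data
population <- rexp(10000, rate = 1 / 40)

# Spread of the sample mean over many repeated simple random samples of size n
spread_of_mean <- function(n, reps = 1000) {
  means <- replicate(reps, mean(sample(population, size = n)))
  sd(means)
}

sizes <- c(25, 100, 400)
spread <- sapply(sizes, spread_of_mean)
print(data.frame(n = sizes, sd_of_sample_mean = spread))
```

The spread of the sample means shrinks as n grows, so a typical sample mean lands closer to the population mean.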

Ex 5.11, 14-16

5.4 Sampling and nonsampling errors

Sampling error: differences between the sample and the population that exist only because of the observations that happened to be selected for the sample

E.g. the mean annual income of North American blue-collar workers

Estimate the mean income $\mu$ of the population by the mean $\bar{x}$ of the sample. The value of $\bar{x}$ will deviate from $\mu$ simply by chance

This deviation can be large simply due to bad luck

The only way to reduce the expected size of this error is to take a larger sample

Given a fixed sample size, we state the probability that the sampling error is less than certain amount (Ch. 10)

Nonsampling error: more serious; taking a larger sample won’t help here; due to mistakes made in the acquisition of data or due to the sample observations being selected improperly

Error in data acquisition

“Non-response error”: error or bias introduced when responses are not obtained from some members of the sample

Selection bias

Ex 5.17, 5.18

6 Probability

Probability is critical in statistical inference since it provides the link between the population and the sample

6.1 Assigning probability to events

A “random experiment” is a process that leads to one of several possible outcomes

E.g. coin flipping; grade on a stat test; time to assemble computer; party preference

A “sample space” of a random experiment is a set of all possible outcomes of the experiment (exhaustive and mutually exclusive)



Requirements of probabilities: given a sample space $S = \{O_1, O_2, \dots, O_k\}$, the probabilities assigned to the outcomes must satisfy two requirements:

The probability of any outcome must be between 0 and 1, i.e. $0 \le P(O_i) \le 1$ for each i

The sum of the probabilities of all the outcomes in the sample space must be 1, i.e. $\sum_{i=1}^{k} P(O_i) = 1$

Three approaches to assigning probabilities

The classical approach

The relative frequency approach

The subjective approach

An “event” is a set of outcomes in a sample space

A “simple event” is an individual outcome

The “probability of an event” is the sum of probabilities of the simple events that constitute the event

The most useful way to interpret probability is the relative frequency approach for a hypothetical, infinite number of experiments

Ex. 6.1-3 (in class), 8

6.2 Joint, marginal, and conditional probability

Want to consider ‘combinations’ of events

Example: relationship between whether a mutual fund outperforms market and whether the manager of the fund has an MBA from a top-20 program

Consider a population of 1,000 mutual funds

 / Mutual fund outperforms market / Mutual fund does not outperform market / Totals
The manager has MBA / 110 / 290 /
The manager does not have MBA / 60 / 540 /
Totals / / / 1,000

The “intersection of events A and B,” denoted “A and B,” is the event that occurs when both A and B occur.

The probability of the intersection is called the “joint probability”

P(A randomly selected mutual fund outperforms and its manager has an MBA degree) =

What is the joint probability if we sample a mutual fund from the above population?

 / Mutual fund outperforms market / Mutual fund does not outperform market / Totals
The manager has MBA / .11 / .29 /
The manager does not have MBA / .06 / .54 /
Totals / / /

“Marginal probabilities” are computed by adding across rows or down columns

P(A randomly selected mutual fund manager has MBA degree) = ?

i.e., When a mutual fund is randomly selected, the probability that its manager has an MBA is ___

i.e., ___ of all mutual fund managers have an MBA

Try, P(A randomly selected mutual fund outperforms the market) = ?
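
The joint and marginal probabilities can be derived from the table of counts in a few lines of R. This sketch uses the 1,000 mutual funds above; the row and column labels are shortened for convenience.

```r
# Counts for the 1,000 mutual funds (rows: manager, columns: fund performance)
counts <- matrix(c(110, 290,
                    60, 540),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(manager = c("MBA", "No MBA"),
                                 fund = c("Outperforms", "Does not outperform")))

# Joint probabilities: each count divided by the grand total
joint <- counts / sum(counts)
print(joint)

# Marginal probabilities: add across rows or down columns
marginal_manager <- rowSums(joint)   # MBA 0.40, No MBA 0.60
marginal_fund <- colSums(joint)      # Outperforms 0.17, Does not outperform 0.83
print(marginal_manager)
print(marginal_fund)
```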

“Given that a fund is managed by an MBA, what’s the probability that it outperforms the market?”

Given A, what’s the probability of B?

The “conditional probability of B given A”, written P(B|A), is the probability of event B given the occurrence of another related event A.

Formally, it can be computed as P(B|A)=P(A and B)/P(A)

Two events A and B are “independent” if P(A|B)=P(A) or P(B|A)=P(B)

i.e., the probability of one event is not affected by the occurrence of the other event

Checking dependence: for a table like the one above, we could check all four combinations, but showing dependence for only one of them [P(B) ≠ P(B|A) for some A and B] is enough. Showing independence, on the other hand, takes more work
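
For the mutual fund table, the conditional probability and the dependence check can be sketched as follows. Here A stands for "the manager has an MBA" and B for "the fund outperforms the market"; those labels are chosen for this illustration.

```r
# Joint probabilities from the mutual fund table
joint <- matrix(c(0.11, 0.29,
                  0.06, 0.54),
                nrow = 2, byrow = TRUE,
                dimnames = list(manager = c("MBA", "No MBA"),
                                fund = c("Outperforms", "Does not outperform")))

p_A <- sum(joint["MBA", ])                 # P(A) = 0.40
p_B <- sum(joint[, "Outperforms"])         # P(B) = 0.17
p_A_and_B <- joint["MBA", "Outperforms"]   # P(A and B) = 0.11

# Conditional probability: P(B|A) = P(A and B) / P(A)
p_B_given_A <- p_A_and_B / p_A             # 0.275

# Independence would require P(B|A) = P(B); here 0.275 differs from 0.17
independent <- isTRUE(all.equal(p_B_given_A, p_B))
independent                                # FALSE: the events are dependent
```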

The “union” of events A and B is the event that occurs when either A or B or both occur. It is denoted as “A or B”

E.g. determine P(A1 or B1)

Approach #1: sum the components

Approach #2: 1 - P(the other component)

Ex 6.86

6.3 Probability rules and trees

Want to calculate the probability of more complex events from the probability of simpler events

Complement rule: the “complement” of event A is … and is denoted by $A^C$. The rule says $P(A^C) = 1 - P(A)$; e.g.