SPS 580 Lecture 4 CI Case selection outliers T
I. PRECISION OF DATA -- The “plus or minus”
Research is all about analyzing the mean or the percentage. When you take a random sample for a survey, how much confidence do you have in the mean, or the percentage? The level of confidence is expressed as the “plus or minus” that goes along with the result, also called the “margin of error”.
Illustration: I designed a questionnaire with a scale of 0-10 to determine whether the students in this class agree or disagree with <something really important>. A score of 5 is neutral; above that = favorable opinion; below that = unfavorable opinion. I’d like to know the average score in the class (aka the TRUE MEAN).
← Universe = the 26 students in SPS 580. “Scale Score” shows what each person would say if they were asked the question. There is a “TRUE MEAN” -- I don’t know what it is; that’s why I’m doing the survey.
I didn’t have enough money to survey everyone in the course (i.e., conduct a census). I only have enough money to survey 4 people.
So I randomly selected 4 people and interviewed them.
← These are the answers I have for my survey.
← And here are the results from my data analysis: the observed mean is 3.25
STATISTICS ALERT . . . MEAN = Sum(x) / n = 13/4= 3.25
The 95% confidence interval equals “Observed Mean” +/- 2.93 … which means I am 95% certain that the TRUE MEAN – i.e., average score for all students in the class (for the UNIVERSE) is between 0.32 and 6.18
STATISTICS ALERT . . .
95% Confidence Interval = +/- 1.96 * Standard Error of the Mean = +/- 1.96 * 1.49
Standard Error of the Mean = Square Root( Variance / n ) = Sqrt (8.92/4) = 1.49
Variance = Sum of ( (individual score – MEAN)^2 ) / (n-1) = Sum ( (x – 3.25)^2 ) / 3 = 8.92
← The variance has to do with the amount of VARIETY in the scores – it bounces around the same value regardless of how many people you interview.
The standard error of the mean has to do with the variance and the SAMPLE SIZE; it gets smaller as the sample gets larger.
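The three formulas above chain together in a few lines. Here is a minimal Python sketch (not part of the PASW workflow) that starts from the summary figures quoted in these notes: mean 3.25, variance 8.92, n = 4. The function name `ci_95` is only an illustrative choice.

```python
import math

def ci_95(mean, variance, n):
    """Return (sem, margin, low, high) for a 95% CI using the 1.96 multiplier."""
    sem = math.sqrt(variance / n)   # standard error of the mean
    margin = 1.96 * sem             # the "plus or minus"
    return sem, margin, mean - margin, mean + margin

# Summary figures taken from the notes above
sem, margin, low, high = ci_95(mean=3.25, variance=8.92, n=4)
print(f"SEM = {sem:.2f}, +/- {margin:.2f}, 95% CI = {low:.2f} to {high:.2f}")
```

Rounded to two decimals this reproduces the figures in the notes: SEM 1.49, plus or minus 2.93, interval 0.32 to 6.18.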
A. The meaning of the 95% confidence interval . . . The 95% CI is a way of saying we are 95% certain that the “REAL MEAN” – i.e., the one we would get if we surveyed everybody -- is within the interval . . . “Observed Mean” +/- 1.96 * SEM
Observed mean = 3.25
95% CI = 0.32 to 6.18
95% of what? Well, if we did 100 surveys with the same sample size, then 95% of the time – i.e., about 95 times out of 100 – the 95% confidence interval would contain the “TRUE MEAN”.
← To test this, I did four more surveys, each based on a random sample of the same size from the same universe.
← These are the results of surveys 2, 3, 4, and 5.
Here are the mean and 95% CI for each of the 5 samples . . .
From the TOTAL data base we calculate that the “TRUE MEAN” is 3.60
However, in a research setting you don’t know this; you just have an observed sample mean and a 95% confidence interval.
← In my 5 samples the 95% CI included the “TRUE MEAN” every time. If I had done 100 samples, I would expect the 95% CI to include the true mean about 95 times.
B. Interpreting the 95% confidence interval
In my survey I found out that the “True Opinion” is likely (95%) to be between 0.32 and 6.18
Observed mean = 3.25
95% CI = 0.32 to 6.18
Q: So how helpful was my survey? A: Not very. Another way to say it is that I’m 95% certain that the “True Opinion” is either negative (< 5), neutral (5), or positive (> 5).
Q: How could I do a survey that is more helpful? A: Increase the sample size.
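To see why a larger sample helps, here is a small Python sketch of the “plus or minus” at different sample sizes. It assumes the variance stays near the 8.92 observed in the 4-person survey; the sample sizes 8 and 16 are hypothetical, chosen only for illustration.

```python
import math

variance = 8.92  # sample variance from the 4-person survey above

def margin_of_error(variance, n):
    """95% plus-or-minus for a mean, using the 1.96 multiplier."""
    return 1.96 * math.sqrt(variance / n)

# Hypothetical sample sizes; only n = 4 was actually surveyed
for n in (4, 8, 16):
    print(f"n = {n:2d}: +/- {margin_of_error(variance, n):.2f}")
```

Quadrupling the sample from 4 to 16 cuts the margin in half (from about 2.93 to about 1.46). If the observed mean stayed at 3.25, the interval would run from roughly 1.79 to 4.71, which no longer straddles the neutral score of 5.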
II. PRECISION OF PERCENTAGES
When Y is a dichotomy coded (0,1), the mean is the proportion in category 1.
Don’t believe me? Add up these 10 responses and divide by 10 to get the average (mean).
← mean = proportion coded (1)
← also can be expressed as % coded (1)
← formula for variance is simpler
← formula for SEM is THE SAME
← formula for +/- is THE SAME
← formula for 95% CI is THE SAME
← interpretation of results is THE SAME
Observed mean = 60.0%
95% CI = 29.6% to 90.4%
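The same arithmetic for a percentage can be sketched in Python. The notes do not spell out the “simpler” variance formula; this sketch assumes it is p * (1 - p), which does reproduce the 29.6% to 90.4% interval. The count of 6 ones out of 10 is inferred from the 60.0% observed mean.

```python
import math

n = 10      # number of responses
ones = 6    # responses coded 1 (inferred from the 60.0% in the notes)

p = ones / n                    # mean of a (0,1) variable = proportion coded 1
variance = p * (1 - p)          # assumed "simpler" variance formula for a dichotomy
sem = math.sqrt(variance / n)   # same SEM formula as for any mean
margin = 1.96 * sem             # same plus-or-minus formula

print(f"{p:.1%} +/- {margin:.1%}  ->  {p - margin:.1%} to {p + margin:.1%}")
```

With only 10 responses the margin is about 30 percentage points, which is why the interval cannot settle the “above or below 50%” question.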
Q: Is the majority opinion in the class above or below 50% ?
A: I don’t know, but I’m sure it’s between 29.6% and 90.4% !!!
Q: What can you do to make this more precise? A:
→ LOOK AT ASSGT 4 Part 1
III. EXPLORING Confidence Intervals with Live Data
WBEZ marketing committee wants to know how to increase revenue from its younger audience
→ Listenership, familiarity w/WBEZ; membership in NFPs; usual payment for membership
A. What % listen to the WBEZ radio station?
FREQUENCIES VARIABLES= wbezrng /ORDER=ANALYSIS.
A. NOTE: the marketing committee wants to target its research on population 45 and under.
1. Define a selection variable . . .
RECODE age01 (10 thru 45=1) (46 thru 98=2) (ELSE=9) INTO AGE2.
VARIABLE LABELS AGE2 'age2 '.
VALUE LABELS age2 1 '18- 45' 2 '46+' 9 'not valid'.
MISSING VALUES age2 (9) .
2. Select that data only for analysis . . .
DATA / SELECT CASES / IF CONDITION IS SATISFIED / IF age2=1 OK/PASTE
Other selection variables mentioned in SPS570/580. . . transit riders, low income
Univariate selection variables (for now) . . . typology construction later
USE ALL.
COMPUTE filter_$=(AGE2 = 1).
VARIABLE LABEL filter_$ 'AGE2 = 1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.
NEW DATA FOR AGE 18-45 ONLY
[Table labels: Potential listeners; Current listeners]
B. What % belong to non-profit arts or cultural organizations?
RECODE mems1 mems10 mems11 mems12 mems13 mems2 mems3 mems4 mems5
mems6 mems7 mems8 mems9 (1=1) (2=0) (8=0).
VALUE LABELS mems1 mems10 mems11 mems12 mems13 mems2 mems3 mems4 mems5
mems6 mems7 mems8 mems9 0 ' no DK ' 1 ' yes member'.
COMPUTE ARTCULTMEMBERSHIPS = mems1+ mems2+ mems3+ mems4+ mems5
+ mems6 + mems7 + mems8 + mems9+ mems10+ mems11+ mems12+ mems13.
KEEP THE SELECTION VARIABLE OPERATING
FREQUENCIES VARIABLES=ARTCULTMEMBERSHIPS /ORDER=ANALYSIS.
Inspection => useful only as a (0,1) variable
One or more memberships (1 – 6) = 18% +/- 1.8%
Note: you can get different data points on the same population even though memberships and WBEZ listenership are not asked the same year
C. How much are people willing to pay for a membership in an arts/cultural organization?
MEMSPED How much would you be willing to spend for a one-year family membership in one of these types of organizations?
MEAN = $82 +/- $14
← Anybody see any problems here?
← Anything above $500 is quite possibly a mistake
← Anything above $200 is an OUTLIER
Rule of thumb: values whose distance from the mean is more than 150% of the mean are going to cause trouble
D. Deal with OUTLIERS in one of two ways
- Work with the median (as opposed to the mean). . . Between $45 and $50 -- where the 50th percentile falls
- But the median can’t be used for many statistical procedures
- So we RECODE OUTLIERS to an acceptable maximum value and then re-calculate the mean
RECODE memsped (225 thru 6000=200) (ELSE=Copy) INTO memsped2.
VARIABLE LABELS memsped2 'money for membership'.
MISSING VALUES MEMSPED2 (9997, 9998) .
FREQUENCIES VARIABLES= MEMSPED2 /ORDER=ANALYSIS.
DESCRIPTIVES VARIABLES=MEMSPED2
/STATISTICS=MEAN STDDEV VARIANCE MIN MAX SEMEAN.
MEAN = $60 +/- $3
Look at the impact of trimming the outliers: $82 vs. $60
Rule of thumb: the reason you trim outliers is that means (and lots of other really important statistics) are VERY STRONGLY INFLUENCED by extreme values. You get a more stable, and therefore more accurate picture of the REAL WORLD by trimming (not eliminating) outliers.
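Here is a minimal Python sketch of the trimming idea. The dollar amounts are hypothetical (NOT the survey data), and for simplicity the sketch caps everything above 200, whereas the RECODE above starts at 225.

```python
from statistics import mean, median

# Hypothetical willingness-to-pay answers in dollars (NOT the survey data)
amounts = [25, 40, 50, 50, 60, 75, 100, 150, 500, 6000]

CAP = 200  # acceptable maximum value, as in the RECODE above

# Recode anything above the cap to the cap; every case stays in the data
trimmed = [min(x, CAP) for x in amounts]

print("mean before trimming:", mean(amounts))
print("mean after trimming: ", mean(trimmed))
print("median before/after: ", median(amounts), median(trimmed))
```

Two extreme answers are enough to drag the mean far above what most people said, while the median does not move at all. Capping keeps all ten cases but stops the extremes from dominating the mean.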
IV. Should WBEZ market differently in the city vs. suburbs?
RECODE region (1=1) (2 thru 7=2) (ELSE=9) INTO region2.
VARIABLE LABELS region2 'region recoded'.
VALUE LABELS region2 1 'Chicago' 2 'Suburbs'.
missing values region2 (9).
KEEP THE SELECTION VARIABLE OPERATING
A. Is WBEZ listenership higher in the city or in the suburbs?
Chi Sq(3) = 98
Phi = .22
But chi square and phi are blanket tests; WBEZ wants to know specifically about listenership.
What is the CONFIDENCE INTERVAL for the difference of means?
If it includes the value ZERO then the difference of the means is NOT SIGNIFICANT
STATISTICAL THEORY ALERT . . .
Mean 1 has its uncertainty (SEM1)
Mean 2 has its uncertainty (SEM2)
Logical conclusion → Wouldn’t it make sense that the uncertainty of the difference is equal to the sum of the two uncertainties? Well it is, sort of . . .
STATISTICS ALERT →
95% CONFIDENCE INTERVAL for the difference of means = +/- 1.96 * SEDiff
← the CI(Diff) does NOT include ZERO, so we conclude that there is a SIGNIFICANT DIFFERENCE in listenership by place . . .
In the city, listenership is 10% higher than in the suburbs
Another way to look at this: the difference is significant if the t value (difference / SEDiff) is > 1.96 (or < -1.96 for negative differences), with df = INFINITE.
E. Is the percent who belong to NFP arts/cultural organizations higher in the city?
KEEP THE SELECTION VARIABLE OPERATING
region2 (region recoded) * ARTCULTMEMBERSHIPS Crosstabulation (% within region2)

                   ARTCULTMEMBERSHIPS
region2            .00      1.00     2.00     3.00     4.00     6.00     Total
.00 Suburbs        80.5%    15.2%    3.5%     .4%      .3%               100.0%
1.00 Chicago       83.2%    12.5%    3.0%     1.1%     .1%      .1%      100.0%
Total              81.8%    14.0%    3.2%     .7%      .2%      .1%      100.0%
→ grouping together to focus on any memberships vs. none
Chi square (5) = 7.7 p > .05
phi = .077
← Not Sig, answer is NO: the % who belong is the same in the city and in the suburbs
F. Is the amount people are willing to pay for a membership higher in the suburbs?
Report: memsped2 (money for membership)

region2 (region recoded)    Mean       Std. Error of Mean
.00 Suburbs                 59.8833    1.81970
1.00 Chicago                60.3991    2.16942
Total                       60.1166    1.39812
ANALYZE/ COMPARE MEANS / MEANS / Dependent Memsped2 / Independent Region2 / OPTIONS Mean Std Error of Mean /
← Not Sig, answer is NO: people are willing to pay the same in the suburbs as in the city
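The “not significant” call can be checked by hand from the Report table. The sketch below assumes the usual formula for independent groups, SEDiff = square root of (SEM1 squared + SEM2 squared); the notes only say the uncertainties add “sort of”, so treat that formula as an assumption rather than the lecture's own.

```python
import math

# Means and standard errors copied from the Report table (memsped2 by region2)
mean_suburbs, sem_suburbs = 59.8833, 1.81970
mean_chicago, sem_chicago = 60.3991, 2.16942

diff = mean_chicago - mean_suburbs

# Assumed formula: standard error of the difference between two independent means
se_diff = math.sqrt(sem_suburbs**2 + sem_chicago**2)

low = diff - 1.96 * se_diff
high = diff + 1.96 * se_diff
t = diff / se_diff

print(f"difference = {diff:.2f}, SEDiff = {se_diff:.2f}")
print(f"95% CI = {low:.2f} to {high:.2f}, t = {t:.2f}")
print("significant" if low > 0 or high < 0 else "not significant")
```

The interval for the difference runs from below zero to above zero, and the t value is far inside +/- 1.96, so both ways of looking at it agree with the conclusion above.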
Assignment 4:
Part 1: The Excel spreadsheet for Assignment 4 contains a list of the students in SPS 580 and their opinions on two really important issues. Opinion Item 2 is measured on a (1,10) scale. Opinion Item 3 is measured as a (0,1) dichotomy.
1. Randomly select 10 people from the list; analyze the scores for the answers they gave to Opinion Item 2 and Opinion Item 3.
2. For Opinion Item #2: What are the observed mean, the variance, the SEM, and the 95% CI? What do you conclude from your survey?
3. For Opinion Item #3: ditto
Part 2: Define a policy research problem on a TARGETED POPULATION using a univariate selection variable (recoded, but no typologies)
- TARGET POPULATION: Use PASW to select the targeted population, describe how this is done
- DEPENDENT VARIABLES . . . Define one dichotomous (0,1) outcome variable (Y1), and one interval scale outcome variable (Y2) -- can be a scale you compute or an interval variable on the data set
- For Y1, what is the 95% Confidence Interval for the percent?
- For Y2 . . .
- Is there a need to trim outliers? If so, take the necessary action and explain it.
- What are the low/high (trimmed) values, what is the mean and the 95% Confidence Interval for the mean?
- INDEPENDENT VARIABLE: Define a (0,1) dichotomous independent variable (X1) that classifies the target population according to a characteristic of policy interest, explain the variable and categories
- What is the theory being tested for X1 → Y1?
- Crosstabulate X1 and Y1, show a PQ table of percents, with added columns/rows as needed to show the steps in calculating a T-test for the difference in percentages
- What do you conclude from the data and the T-test?
- What is the theory being tested for X1 → Y2?
- Calculate a table of means for X1 and Y2, show a PQ table of means, with added columns/rows as needed to show the steps in calculating a T-test for the difference in means
- What do you conclude from the data and the T-test?