Relationship between confidence interval and two-sided test
Say (L, U) is a (1 − α)100% CI for p and we want to test H0: p = p0 vs Ha: p ≠ p0.
=> If p0 ∈ (L, U), then the data are consistent with H0: do not reject H0.
If p0 is outside the interval, then we reject H0 at level α.
090317 T
Suppose we have a sample of 25 observations from a normal distribution and we obtained x̄ = 100, s = 20. We wish to test H0: μ = 95 vs Ha: μ ≠ 95 with α = 0.05
Test statistic: t = (x̄ − μ0)/(s/√n) = (100 − 95)/(20/√25) = 1.25
Reject H0 if |t| > t(0.025, 24) = 2.064; here 1.25 < 2.064, so we don't reject H0
What would the confidence interval be? x̄ ± t(0.025, 24)·s/√n = 100 ± 2.064·(20/√25) = (91.7, 108.3)
What's the relationship? μ0 = 95 ∈ (91.7, 108.3) = 95% CI, so we are confident that μ0 = 95 is a plausible value for μ.
The confidence interval is precisely the non-rejection region of the two-sided test.
Note p-value > α, since we can't reject H0
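As a check, here is a minimal R sketch of this example computed from the summary statistics alone (x̄ = 100, s = 20, n = 25 are from the notes; the rest is standard R):
#one-sample t-test from summary statistics
xbar = 100; s = 20; n = 25; mu0 = 95; alpha = 0.05
tstat = (xbar - mu0)/(s/sqrt(n))        #1.25
tcrit = qt(1 - alpha/2, df = n - 1)     #2.064
pval = 2*pt(-abs(tstat), df = n - 1)    #0.223 > alpha, don't reject H0
xbar + c(-1, 1)*tcrit*s/sqrt(n)         #95% CI: (91.7, 108.3), contains 95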
Bootstrap Testing
Suppose we want to test H0: μ = μ0 vs Ha: μ > μ0
As a test statistic we will use x̄ (the estimate); the corresponding estimator is X̄
Reject H0 in favour of Ha if X̄ is large.
Data: x1, …, xn from some population with mean μ
p-value (recall: the probability that the test statistic will be as or more extreme than what we observed) will be P(X̄ ≥ x̄ | H0 is true)
Want bootstrap estimate of p-value.
Need to generate bootstrap samples with H0 being true (since the p-value is based on this assumption)
Instead of re-sampling from the original data [non-parametric] (we don't know whether they come from a population described by H0; that's what we're testing), we re-sample from yi = xi − x̄ + μ0, 1 ≤ i ≤ n
Check: the mean of the shifted data is ȳ = x̄ − x̄ + μ0 = μ0, so H0 is true for the yi
Draw B bootstrap samples from y1, …, yn and for each bootstrap sample we calculate its mean
ȳ*j, j = 1, …, B
Note:
-for testing, B is typically 3 000
-sampling is with replacement for non-parametric bootstrap
The bootstrap estimate of the p-value:
p* = #{ȳ*j ≥ x̄} / B
If Ha: μ ≠ μ0, count both tails, reflecting the observed mean around μ0 (see the R implementation under 090319 R below):
Case 1: x̄ > μ0: p* = (#{ȳ*j ≥ x̄} + #{ȳ*j ≤ 2μ0 − x̄}) / B
Case 2: x̄ < μ0: p* = (#{ȳ*j ≤ x̄} + #{ȳ*j ≥ 2μ0 − x̄}) / B
Test for Population Variance
Suppose x1, …, xn is a sample from a N(μ, σ²) distribution where both μ and σ² are unknown
We are interested in testing H0: σ² = σ0² vs Ha: σ² ≠ σ0²
A good point estimator would be s². The test statistic we use should be a function of the point estimator.
So the test statistic is (n − 1)s²/σ0², which under H0 has a χ²(n − 1) distribution
p-value = P(χ²(n − 1) ≥ (n − 1)s²/σ0²) (take the upper tail; the chi-squared distribution is skewed, so the tails are not symmetric)
What if the sample is not normal, i.e. not coming from a normal population? Use the bootstrap.
Want bootstrap samples under H0.
So we draw bootstrap samples from
yi = σ0·xi/sx, 1 ≤ i ≤ n
since: suppose var(xi) = σx²; then
var(σ0·xi/σx) = (σ0/σx)²·σx² = σ0²
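A minimal R sketch of this variance bootstrap, in the same style as the mean example under 090319 R below; the data vector x and σ0 = 2 are hypothetical, and the p-value shown is for the upper-tail alternative:
#bootstrap test for a population variance (hypothetical data and sigma0)
x = c(4.1, 5.3, 2.8, 6.0, 4.4, 3.9, 7.2, 5.1, 4.8, 3.5)
sigma0 = 2                          #hypothesized standard deviation
y = sigma0*x/sd(x)                  #rescale so that sd(y) = sigma0, i.e. H0 is true
B = 3000
bootsamples = matrix(sample(y, B*length(y), replace=T), nrow=B)
bootvars = apply(bootsamples, 1, var)
#bootstrap estimate of the upper-tail p-value
sum(bootvars >= var(x))/B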
Comparing Two Samples
Suppose X1, …, Xn1 are underlying iid random variables with mean μX and variance σX², with observations x1, …, xn1
and Y1, …, Yn2 are underlying iid random variables with mean μY and variance σY², with observations y1, …, yn2
Further, the Xi and Yj are independent of each other (but not necessarily identically distributed)
We are interested in testing whether the distributions of X and Y have the same mean
i.e. H0: μX = μY <=> μX − μY = 0
Consider an estimator for the quantity of interest (μX − μY)
Estimator: X̄ − Ȳ
Estimate: x̄ − ȳ
Need the sampling distribution of X̄ − Ȳ.
E(X̄ − Ȳ) = E(X̄) − E(Ȳ) = μX − μY (unbiased)
Var(X̄ − Ȳ) = Var(X̄) + Var(−Ȳ) = Var(X̄) + (−1)²·Var(Ȳ) (since X̄, Ȳ are independent)
= Var(X̄) + Var(Ȳ)
= σX²/n1 + σY²/n2
Estimate of the standard error of x̄ − ȳ: sqrt(sX²/n1 + sY²/n2)
If
X1, …, Xn1 are normally distributed N(μX, σX²)
and Y1, …, Yn2 are normally distributed N(μY, σY²)
and everything is independent,
then X̄ − Ȳ ~ N(μX − μY, σX²/n1 + σY²/n2)
Hypotheses being tested here are
H0: μX = μY <=> μX − μY = 0
Ha: μX > μY <=> μX − μY > 0
or μX < μY <=> μX − μY < 0
or μX ≠ μY <=> μX − μY ≠ 0
What are the test statistics?
Consider a few cases:
1) If σX² and σY² are known
use the standardized version of the point estimator:
Z = (X̄ − Ȳ)/sqrt(σX²/n1 + σY²/n2) ~ N(0,1)
2) If σX² and σY² are unknown
use a t distribution, replacing them with the sample variances:
t = (X̄ − Ȳ)/sqrt(sX²/n1 + sY²/n2)
which has an approximate t distribution with degrees of freedom approximated by
df = (sX²/n1 + sY²/n2)² / [ (sX²/n1)²/(n1 − 1) + (sY²/n2)²/(n2 − 1) ]
(Satterthwaite's approximation)
*a conservative approach if using tables: use min(n1 − 1, n2 − 1) for the df
generally gives larger p-values and wider CIs
3) If σX² and σY² are unknown, but σX² = σY² = σ² (common assumption)
Both groups have distributions with the same variance.
Why make this assumption?
-the resulting distribution of the test statistic under H0 is exactly a t distribution
-relationship with analysis of variance (coming soon…)
If σX² = σY² = σ², then Var(X̄ − Ȳ) = σ²·(1/n1 + 1/n2)
To estimate σ², pool the samples and use
sp² = [(n1 − 1)·sX² + (n2 − 1)·sY²] / (n1 + n2 − 2)
So the test statistic will be
t = (X̄ − Ȳ) / (sp·sqrt(1/n1 + 1/n2))
with an exact t distribution with df = n1 + n2 − 2
In practice, when can we assume this?
Rule of thumb: when the ratio of the sample variances is < 4, the equal-variance assumption is reasonable
so when sX²/sY² < 4 and sY²/sX² < 4, we can assume σX² = σY² = σ²
(i.e. largest sample variance ÷ smallest sample variance < 4)
Can also perform a hypothesis test to check whether the variances are the same.
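In R, both versions are available through t.test; a minimal sketch with hypothetical data vectors x and y:
#two-sample t-tests on hypothetical data
x = c(102, 98, 110, 105, 99, 103, 108)
y = c(95, 101, 92, 97, 100, 94)
var(x)/var(y)                  #check the rule of thumb (ratio < 4)
t.test(x, y)                   #Welch test with Satterthwaite df (the default)
t.test(x, y, var.equal=T)      #pooled t-test, df = n1 + n2 - 2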
090319 R
Bootstrap testing in R
have data with n = 19, x̄ ≈ 100
H0: μ = 90 vs. Ha: μ ≠ 90
#generate data
data=c(119.7, 100.0, 104.1, 114.2, 92.8, 150.3, 85.4, 102.3, 108.6, 105.8, 93.4, 107.5, 67.1, 0.9, 88.4, 101.0, 97.2, 95.4, 77.2)
#shift data so that H0 is true
y=data - mean(data) + 90
#generate 3 000 bootstrap samples from y
B=3000
bootsamples = matrix(sample(y,B*length(y),replace=T),nrow=B)
#calculate the means
bootmeans = apply(bootsamples,1,mean)
#hist of boot means
hist(bootmeans)
#mean of data
m=mean(data)
#bootstrap estimate of the p-value
sum(bootmeans >= m)/B + sum(bootmeans <= 2*90 - m)/B
#conclusion:
# data are consistent with a mean of 90 (cannot reject H0)
#what if we do a t-test (note: specify the null value mu = 90)
t.test(data, mu=90)
Comparing Two Means
Example
Data: lengths of the humerus in adult male sparrows that survived and sparrows that perished after a storm.
Question: are the mean humerus lengths different between these two groups?
090324 T
since the sample variances have a ratio less than 4, we can use the pooled t-test
if the rule of thumb applies, use the pooled test since it provides more accurate results
to perform these tests, the underlying assumption about the data is that they come from normal populations
Example
Are there physiological indicators of schizophrenia?
15 pairs of identical twins, one schizophrenic the other healthy
measured brain volume (cm^3), data: left hippocampus
want to know if the mean left hippocampus volume is the same for schizophrenic and non-schizophrenic twins
note these are not independent; each pair is independent, but in each pair the two data are not independent (“paired data” or “matched pairs”)
Answer: take differences and use 1-sample t-test to test
H0: μdiff = 0
Ha: μdiff ≠ 0
di = xi − yi, for i = 1, …, n
Test statistic: t = d̄ / (sd/√n), which under H0 has a t(n − 1) distribution
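In R this is t.test with paired = TRUE, which is equivalent to a one-sample t-test on the differences; a sketch with hypothetical volumes (cm^3):
#paired t-test on hypothetical twin data
healthy = c(1.94, 1.44, 1.56, 1.58, 2.06, 1.66, 1.75, 1.77, 1.78, 1.92)
schiz   = c(1.27, 1.63, 1.47, 1.39, 1.93, 1.26, 1.71, 1.67, 1.28, 1.85)
t.test(healthy, schiz, paired=T)    #paired test
t.test(healthy - schiz)             #same test on the differences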
Two-sample Bootstrap Test
Test for equality of location
Two samples: X1, …, Xn1 of size n1, and Y1, …, Yn2 of size n2
Suppose we want to test H0: μX = μY vs Ha: μX > μY
Test statistic: V = X̄ − Ȳ, observed value: v = x̄ − ȳ
p-value = P(V ≥ x̄ − ȳ | H0 true)
want a bootstrap estimate of the p-value
must generate bootstrap samples under the assumption that H0 is true
one way to do this: assume X and Y have the same distribution under H0
-combine the two samples into one sample of size n1 + n2
-resample with replacement from this combined sample
-each resample is split into two groups:
one of size n1: x1*, …, xn1*
one of size n2: y1*, …, yn2*
-for each bootstrap sample, calculate the bootstrap value of the test statistic v*j = x̄*j − ȳ*j, j = 1, …, B
-then the bootstrap estimate of the p-value is #{v*j ≥ v} / B (a sketch in R follows)
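A minimal R sketch of this pooled two-sample bootstrap (the data vectors are hypothetical):
#two-sample bootstrap test of equal means (hypothetical data)
x = c(102, 98, 110, 105, 99, 103, 108)
y = c(95, 101, 92, 97, 100, 94)
v = mean(x) - mean(y)          #observed test statistic
pool = c(x, y)
n1 = length(x); n2 = length(y)
B = 3000
vstar = replicate(B, {
  s = sample(pool, n1 + n2, replace=T)
  mean(s[1:n1]) - mean(s[(n1+1):(n1+n2)])
})
sum(vstar >= v)/B              #bootstrap p-value for Ha: muX > muY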
Data Collection
Three methods:
- Observational studies
- Sample surveys (also observational but random selection of participants)
- Experiments
The strength of the conclusions goes from lowest to highest in that order (experiments offer the strongest conclusions because more variables are under the control of the experimenter)
- Observational studies
Data collected without intervention
ex: observe admissions to graduate school, gender, age, GRE, etc…
090326 R
Problems:
*Confounding – can't separate the effect of one variable from another
e.g. breastfeeding and babies' intelligence:
may be the effect of some other variable (for example, the mothers who breastfeed may have higher intelligence to begin with => babies smarter from genetics)
e.g. smoking, coffee drinking, and heart disease
*results can't be generalized beyond the group observed
- Sample Surveys
Still observational
Data collected on a random sample from a population
Confounding is still a problem
results can be generalized to the population
To allow generalization and to avoid bias, samples must be chosen randomly
Simplest method: simple random sample
-each member of the population is equally likely to be chosen
- Experiments
researchers randomly assign a treatment to the subjects or “experimental units”
can support cause-and-effect conclusions (don't have to worry about confounding variables)
because we are imposing the treatments, we can control for those confounding variables
treatments in experiment are sometimes called “factors” and sometimes “predictor variables”
called predictor variables since we are trying to see whether they predict a certain outcome in another variable
the values of a factor are its “levels”
a design (of experiment) is “balanced” if each treatment has same number of experimental units
key step: randomization
each subject is randomly assigned a treatment
-so no bias in treatment assignment
-eliminating the effects of confounding variables; differences among the treatment groups are random
Principles of Design of Experiments
-Control – a group for comparison
-randomization – randomly assign treatments to experimental units
-replication – need multiple observations per treatment
– allows measurement of variability
Problem: can’t always carry out an experiment
In experiments, it’s nice to have if possible (especially for experiments on humans),
- Placebo – “fake” treatment; sometimes people do better just by being treated, they know someone is watching them, giving them attention, etc…
“Placebo Effect” – psychological
- Double-blind – neither the experimental units nor the researchers know which treatment is being received/administered
Comment:
all statistical techniques we have learned assume the observations are independent; if they aren't but are treated as if they were, we get narrower CIs and more power than we should.
---
Some Normal based distribution theory
Suppose X1, …, Xn are iid N(μ, σ²)
any linear combination of the Xi's is normally distributed
in particular we know X̄ ~ N(μ, σ²/n)
Let Zi = (Xi − μ)/σ; then Zi ~ N(0,1)
and Zi² ~ χ²(1)
(Z1² + Z2² + … + Zk²) ~ χ²(k), which will be less skewed than χ²(1)
For the sample variance S²,
(n − 1)S²/σ² ~ χ²(n − 1)
If Z ~ N(0,1) independent of Y ~ χ²(m)
then Z/sqrt(Y/m) ~ t(m)
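A quick simulation check of the (n − 1)S²/σ² ~ χ²(n − 1) result (a sketch; the values of n, μ, σ are arbitrary):
#simulate (n-1)S^2/sigma^2 from normal samples, compare to chi-squared(n-1)
n = 10; mu = 5; sigma = 2
stats = replicate(5000, (n - 1)*var(rnorm(n, mu, sigma))/sigma^2)
hist(stats, freq=F, breaks=40)
curve(dchisq(x, df = n - 1), add=T)   #theoretical density overlays the histogram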
090331 T
Professor Moshonov SICK
090402 R
Normal based distribution theory continued
If Z ~ N(0,1) is independent of Y ~ χ²(m), then Z/sqrt(Y/m) ~ t(m)
In particular, X̄ and s² are independent, so
(X̄ − μ)/(s/√n) = [(X̄ − μ)/(σ/√n)] / sqrt{[(n − 1)s²/σ²]/(n − 1)} ~ t(n − 1)
If Y1 ~ χ²(m) and Y2 ~ χ²(n), with Y1 and Y2 independent, then
(Y1/m)/(Y2/n) ~ F(m, n)
the F-distribution with m and n degrees of freedom
Also a right-skewed distribution
note: the square of a random variable which has a t(m) distribution has an F(1, m) distribution
recall that Z² ~ χ²(1), so t(m)² = Z²/(Y/m) ~ (χ²(1)/1)/(χ²(m)/m) = F(1, m)
The F distribution is useful when testing the equality of variances from two independent distributions.
F-test for equality of variance
X1, …, Xn1 are iid N(μ1, σ1²), independent of Y1, …, Yn2 which are iid N(μ2, σ2²)
We calculate s1² and s2²
Suppose we want to test:
H0: σ1² = σ2² <=> σ1²/σ2² = 1
Ha: σ1² ≠ σ2²
Test statistic: (n1 − 1)s1²/σ1² ~ χ²(n1 − 1), independent of (n2 − 1)s2²/σ2² ~ χ²(n2 − 1)
Use F = s1²/s2², which under H0 (where σ1² = σ2²) is ~ F(n1 − 1, n2 − 1)
Good Statistic:
-distribution under H0 is known
-not a function of any unknown parameter
Problems:
-very sensitive to departure from normality
*the assumption that both Xi and Yi are from normal distributions is very strong (there exist specific tests for normality of samples)
-a small p-value could mean that σ1² ≠ σ2², or that the data are not from normal distributions
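In R, this F-test is var.test; a minimal sketch with simulated (hypothetical) normal data:
#F-test for equality of variances on simulated data
set.seed(1)
x = rnorm(20, mean=0, sd=1)
y = rnorm(25, mean=0, sd=1.5)
var.test(x, y)    #H0: variance ratio = 1; reports F = s1^2/s2^2, df, p-value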
Better test: Levene’s Test
Analysis of Variance – ANOVA
Generalization of 2-sample equal variance t-test
Are there differences in means of more than 2 groups?
Variable: Y (response variable)
Measurements: Yij = response of the jth subject in the ith group
Statistical Model: Yij = μi + εij for 1 ≤ i ≤ k, 1 ≤ j ≤ ni
μi is the mean response in the ith group (unknown)
εij is the random error for the jth subject in the ith group
the εij are random variables, so Yij is also a random variable
Assumptions
εij are iid N(0, σ²)
3 key assumptions
-independence
-normally distributed
-same variance in each group
=> Yij ~ N(μi, σ²)
Want to test
H0: μ1 = μ2 = … = μk
Ha: at least one pair of means is not equal (at least one mean is different)
090407 T
(ex: Yij is the running time of a program on three different machines)
Derivation of test statistic
Decomposition of sum of squares
Yij = Ȳi· + (Yij − Ȳi·)
where Ȳi· = estimate of μi = (1/ni)·Σj Yij (average of the observations in the ith group)
Subtract the grand mean Ȳ·· from both sides,
where Ȳ·· = (1/N)·Σi Σj Yij with N = n1 + … + nk
So, Yij − Ȳ·· = (Ȳi· − Ȳ··) + (Yij − Ȳi·)
= (between groups) + (within groups)
Square both sides and sum over all observations (the cross term vanishes):
Σi Σj (Yij − Ȳ··)² = Σi ni·(Ȳi· − Ȳ··)² + Σi Σj (Yij − Ȳi·)²
So then,
total sum of squares = between-groups SS + within-groups SS
SSTotal = SSTr (treatment SS) + SSE (error SS, residual SS)
SSTr = variation in response variable due to differences between groups
SSE = variation in response variable due to variations (errors) within each group
Summarized in ANOVA Table:
Source              | SS    | df    | MS                | F        | p
Groups (treatments) | SSTr  | k − 1 | MSTr = SSTr/(k−1) | MSTr/MSE | P(F(k−1, N−k) ≥ F)
Error               | SSE   | N − k | MSE = SSE/(N−k)   |          |
Total               | SSTot | N − 1 |                   |          |
MS = mean square
Want to test:
H0: μ1 = … = μk
Ha: at least one μi is different
Idea of F test:
compare the variability between groups to the variability within the groups
if they are similar, the data are consistent with equal group means
MSE = [(n1 − 1)s1² + … + (nk − 1)sk²] / (N − k) is the pooled sample variance
where si² is the variance calculated on the observations in the ith group
It can be shown that under H0: μ1 = … = μk, the test statistic F = MSTr/MSE has an F distribution with (k − 1, N − k) degrees of freedom
If H0 is false, F-test statistic will be large because we have more variability between groups than within groups.
So reject H0 if the test statistic is large.
So calculate p-value in right tail of distribution (even though alternative is 2-sided)
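In R, the whole table is produced by aov; a minimal sketch using a hypothetical version of the running-time example above (all numbers made up):
#one-way ANOVA: running times (seconds) on three machines, hypothetical data
time = c(10.2, 11.1, 9.8, 10.5, 12.0, 12.4, 11.7, 9.1, 9.6, 8.8, 9.4)
machine = factor(rep(c("A", "B", "C"), times=c(4, 3, 4)))
fit = aov(time ~ machine)
summary(fit)    #prints the ANOVA table: SS, df, MS, F, p-value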
ANOVA Assumptions – Check
- Observations are independent
-most important because analyses we have learned don’t work otherwise
-hard to check; must understand how the data was collected
-if the data was collected over time or space, can check for correlation over time or space
- All groups have same variance
-check boxplots of groups (visual)
-rule of thumb: calculate the variances of the observations in each group; if the largest variance is less than 4 times the smallest, this assumption is reasonable
-ANOVA is robust against unequal variances if the number of observations in each group is approximately equal
- Errors are Normally distributed
-check the residuals, which should be approximately N(0, σ²) distributed
-by CLT, ANOVA is robust against departures from normality except when data are very skewed or there are extreme outliers
When ANOVA F-test is significant (small p-value) examine which means are significantly different.
H0: μa = μb <=> μa − μb = 0
Ha: μa ≠ μb <=> μa − μb ≠ 0
for a, b ∈ {1, …, k}, so there are (k choose 2) pairs of means to test
Test statistic:
t = (Ȳa· − Ȳb·) / (sp·sqrt(1/na + 1/nb)) ~ t(N − k) (if μa = μb)
use the df associated with sp (the error degrees of freedom)
sp = sqrt(MSE)
But the chance of making at least one Type I error when doing so many tests is high.
Two possible solutions:
- Bonferroni Method
If you want an overall Type I error rate α, declare a pair of means statistically significantly different if the p-value of the t-test is less than α/(k choose 2)
Or construct 100(1 − α/(k choose 2))% CIs for the differences in pairs of means and see which don't include 0
(this makes sure that the overall significance level is at most α, but doesn't say exactly what it is)
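In R, pairwise.t.test applies the Bonferroni adjustment directly (continuing the hypothetical machine example from the ANOVA sketch above):
#pairwise t-tests with Bonferroni-adjusted p-values; uses the pooled sd by default
pairwise.t.test(time, machine, p.adjust.method="bonferroni")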
- Tukey's procedure
-less conservative than the Bonferroni Method
-CIs for the differences of all pairs of means such that the confidence level is the coverage probability of all (k choose 2) CIs simultaneously, i.e. the probability that all CIs capture the true difference in their respective pairs of means is 100(1 − α)%
-based on the “studentized range distribution”, the distribution of the difference between the maximum and minimum sample means
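In R, Tukey's procedure is TukeyHSD applied to an aov fit (again continuing the hypothetical machine example):
#simultaneous Tukey CIs for all pairwise differences in means
TukeyHSD(fit, conf.level=0.95)
plot(TukeyHSD(fit))    #CIs that miss 0 indicate significantly different pairs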