Relationship between confidence interval and two-sided test

Say (L, U) is a (1 – α)100% CI for p and we want to test H0: p = p0 vs Ha: p ≠ p0.

=> If p0 ∈ (L, U) then p0 is a plausible value for p, so we don’t reject H0 at significance level α.

If p0 is outside the interval then we reject H0 at significance level α.

090317 T

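Suppose we have a sample of 25 observations from a normal distribution and we obtained x̄ = 100, s = 20. We wish to test H0: μ = 95 vs Ha: μ ≠ 95 with α = 0.05

Test statistic: t = (x̄ – μ0)/(s/√n) = (100 – 95)/(20/√25) = 1.25

Reject H0 if |t| > t(24, 0.025) = 2.064, the upper 0.025 critical value of the t distribution with 24 df (here 1.25 < 2.064, so we do not reject H0)

What would confidence interval be? x̄ ± t(24, 0.025)·s/√n = 100 ± 2.064(4) = (91.7, 108.3)

What’s the relationship? μ0 ∈ (91.7, 108.3) = 95% CI, so μ0 = 95 is a plausible value for μ.

The confidence interval is precisely the set of values μ0 that the two-sided test would not reject at level α.

Note p-value > α since we can’t reject H0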
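The numbers in this example can be checked in R. The following is a minimal sketch that uses only the summary statistics given above (n = 25, sample mean 100, s = 20, null value 95, level 0.05); the variable names are my own.

```r
# Summary statistics from the example
n <- 25; xbar <- 100; s <- 20
mu0 <- 95; alpha <- 0.05

se <- s / sqrt(n)                              # estimated standard error of the mean
t_stat <- (xbar - mu0) / se                    # observed test statistic
t_crit <- qt(1 - alpha / 2, df = n - 1)        # two-sided critical value
reject <- abs(t_stat) > t_crit                 # decision of the two-sided test
ci <- xbar + c(-1, 1) * t_crit * se            # (1 - alpha)100% confidence interval
p_value <- 2 * pt(-abs(t_stat), df = n - 1)    # two-sided p-value

# The test rejects exactly when mu0 falls outside the interval
inside <- mu0 > ci[1] && mu0 < ci[2]
print(c(t = t_stat, crit = t_crit, lower = ci[1], upper = ci[2], p = p_value))
```

Here `inside` is TRUE and `reject` is FALSE, which is the duality described above.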

Bootstrap Testing

Suppose we want to test H0: μ = μ0 vs Ha: μ > μ0

As a test statistic we will use x̄ (estimate), X̄ (estimator)

Reject H0 in favour of Ha if X̄ is large.

Data: x1, …, xn from some population with mean μ

p-value (recall = probability that test statistic will be as or more extreme than what we observed) will be P(X̄ ≥ x̄ | H0 is true)

Want bootstrap estimate of p-value.

Need to generate bootstrap samples with H0 being true (since p-value is based on this assumption)

Instead of re-sampling from the original data [non-parametric] (we don’t know whether they come from a population described by H0; that’s what we’re testing), we re-sample from yi = xi – x̄ + μ0, 1 ≤ i ≤ n

E[Yi] = E[Xi – X̄ + μ0] = E[Xi] – E[X̄] + μ0 = μ – μ + μ0 = μ0

Draw B bootstrap samples from y1, …, yn and for each bootstrap sample we calculate its mean

ȳj*, j = 1, …, B

Note:

-for testing, B is typically 3 000

-sampling is with replacement for non-parametric bootstrap

The bootstrap estimate of the p-value

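p* = (number of ȳj* ≥ x̄) / B

If Ha: μ ≠ μ0

Case 1: x̄ > μ0, p* = [(number of ȳj* ≥ x̄) + (number of ȳj* ≤ 2μ0 – x̄)] / B

Case 2: x̄ < μ0, p* = [(number of ȳj* ≤ x̄) + (number of ȳj* ≥ 2μ0 – x̄)] / B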
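A minimal sketch of the two-sided bootstrap test as an R function. The function name `boot_mean_test` and the data are hypothetical, not from the lecture. Counting bootstrap means at least as far from the null value as the observed mean covers Case 1 and Case 2 at once, since the second tail is the reflection of the observed mean about the null value.

```r
# Hypothetical helper: two-sided bootstrap p-value for H0: mu = mu0
boot_mean_test <- function(x, mu0, B = 3000) {
  n <- length(x)
  xbar <- mean(x)
  y <- x - xbar + mu0                      # shifted data, so that mean(y) equals mu0
  # each row of the matrix is one bootstrap sample drawn with replacement from y
  boot <- matrix(sample(y, B * n, replace = TRUE), nrow = B)
  ybar_star <- rowMeans(boot)
  dist_obs <- abs(xbar - mu0)              # distance of the observed mean from mu0
  # proportion of bootstrap means at least that far from mu0, in either tail
  mean(abs(ybar_star - mu0) >= dist_obs)
}

# Hypothetical data with sample mean 10.05
x <- c(10.1, 9.8, 10.3, 9.9, 10.2, 10.0, 9.7, 10.4)
set.seed(1)
p_far <- boot_mean_test(x, mu0 = 5)          # null value far from the data
p_same <- boot_mean_test(x, mu0 = mean(x))   # null value equal to the sample mean
print(c(p_far = p_far, p_same = p_same))
```

For the one-sided alternative, the last line of the function would instead count only the bootstrap means that are at least as large as the observed mean.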

Test for Population Variance

Suppose x1, …, xn is a sample from a N(μ, σ²) distribution where both μ and σ² are unknown

We are interested in testing H0: σ² = σ0² vs Ha: σ² > σ0²

A good point estimator would be s². The test statistic we use should be a function of the point estimator.

So the test statistic is (n – 1)s²/σ0², which under H0 has a χ²(n – 1) distribution

p-value = P(χ²(n – 1) ≥ (n – 1)s²/σ0²) (only take upper tail, chi-squared is skewed)

What if sample is not normal, i.e. not coming from a normal population? Use bootstrap.

Want bootstrap samples under H0.

So we draw bootstrap samples from,

yi = σ0 xi/sx, 1 ≤ i ≤ n

since: suppose var(xi) = σx² (estimated by sx²)

var(σ0 xi/σx) = (σ0/σx)² σx² = σ0²
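A minimal sketch of both versions of the variance test in R, with an upper-tail alternative. The data and the null value `sigma0_sq` are hypothetical. The first part uses the chi-squared distribution (appropriate for normal data); the second rescales the data so that the null hypothesis holds and bootstraps the sample variance.

```r
# Hypothetical data and null value for the variance
x <- c(4.1, 5.3, 6.2, 3.8, 5.9, 4.7, 5.1, 6.6, 4.4, 5.5)
sigma0_sq <- 0.5
n <- length(x)

# Chi-squared test (assumes the data are normal)
chisq_stat <- (n - 1) * var(x) / sigma0_sq
p_chisq <- pchisq(chisq_stat, df = n - 1, lower.tail = FALSE)  # upper tail

# Bootstrap test (no normality assumption)
y <- sqrt(sigma0_sq) * x / sd(x)          # rescaled data have sample variance sigma0_sq
B <- 3000
set.seed(1)
boot_var <- replicate(B, var(sample(y, n, replace = TRUE)))
p_boot <- mean(boot_var >= var(x))        # proportion at least as large as observed

print(c(statistic = chisq_stat, p_chisq = p_chisq, p_boot = p_boot))
```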

Comparing Two Samples

Suppose X1, …, Xn1 are iid random variables with mean μX and variance σX², with observations x1, …, xn1

And Y1, …, Yn2 are iid random variables with mean μY and variance σY², with observations y1, …, yn2

Further, the Xi and Yj are independent of each other (but not necessarily identically distributed)

We are interested in testing whether distributions of X and Y have the same mean

i.e. H0: μX = μY => μX – μY = 0

Consider an estimator for the quantity of interest (μX – μY)

Estimator: X̄ – Ȳ

Estimate: x̄ – ȳ

Need the sampling distribution of X̄ – Ȳ.

E(X̄ – Ȳ) = E(X̄) – E(Ȳ) = μX – μY (unbiased)

Var(X̄ – Ȳ) = Var(X̄) + Var(–Ȳ) = Var(X̄) + (–1)² Var(Ȳ) (since X̄, Ȳ are independent)

= Var(X̄) + Var(Ȳ)

= σX²/n1 + σY²/n2

Estimate of the standard error of x̄ – ȳ: sqrt(sX²/n1 + sY²/n2)

If

X1, …, Xn1 are normally distributed N(μX, σX²)

and Y1, …, Yn2 are normally distributed N(μY, σY²)

everything independent

Then X̄ – Ȳ ~ N(μX – μY, σX²/n1 + σY²/n2)

Hypotheses being tested here are,

H0: μX = μY => μX – μY = 0

Ha: μX > μY => μX – μY > 0

or μX < μY => μX – μY < 0

or μX ≠ μY => μX – μY ≠ 0

What are the test statistics?

Consider a few cases:

1) If σX² and σY² are known

use standardized version of point estimator

Z = (X̄ – Ȳ)/sqrt(σX²/n1 + σY²/n2) ~ N(0,1) under H0

2) If σX² and σY² are unknown

use t distribution, replacing them with the sample variances

T = (X̄ – Ȳ)/sqrt(sX²/n1 + sY²/n2)

which has an approximate t distribution with degrees of freedom approximated by

df ≈ (sX²/n1 + sY²/n2)² / [(sX²/n1)²/(n1 – 1) + (sY²/n2)²/(n2 – 1)]

(Satterthwaite’s approximation)

*a conservative approach if using tables: use min(n1 – 1, n2 – 1) for df

generally gives larger p-values and wider CIs

3) If σX² and σY² are unknown, but σX² = σY² = σ² (common assumption)

Both groups have distribution with same variance.

Why make this assumption?

-resulting distribution of test statistic under H0 is exactly a t distribution

-relationship with analysis of variance (coming soon…)

If σX² = σY² = σ² then Var(X̄ – Ȳ) = σ²(1/n1 + 1/n2)

To estimate σ², pool the samples and use sp² = [(n1 – 1)sX² + (n2 – 1)sY²]/(n1 + n2 – 2)

So the test statistic will be T = (X̄ – Ȳ)/(sp sqrt(1/n1 + 1/n2))

with exact t distribution with df = n1 + n2 – 2

In practice, when can we assume this?

Rule of thumb: when the ratio of the sample variances is less than 4, the equal-variance assumption is a reasonable approximation

so when sX²/sY² < 4 and sY²/sX² < 4, can assume σX² = σY² = σ²

(i.e. largest sample variance / smallest sample variance < 4)

Can also perform a hypothesis test to check whether the variances are the same.
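Cases 2 and 3 can be computed by hand in R and compared with the built-in `t.test`. This is a minimal sketch on hypothetical data (the two vectors are made up for illustration); `t.test` uses the unequal-variance version with the Satterthwaite df by default, and the pooled version when `var.equal = TRUE`.

```r
# Hypothetical measurements for two independent groups
x <- c(18.5, 19.1, 18.8, 19.4, 18.2, 19.0, 18.7, 19.3)
y <- c(18.9, 19.6, 19.2, 19.8, 19.5, 19.1, 19.9)
n1 <- length(x); n2 <- length(y)
diff_means <- mean(x) - mean(y)

# Case 2: variances unknown and not assumed equal (Satterthwaite df)
vx <- var(x) / n1
vy <- var(y) / n2
t_welch <- diff_means / sqrt(vx + vy)
df_welch <- (vx + vy)^2 / (vx^2 / (n1 - 1) + vy^2 / (n2 - 1))
p_welch <- 2 * pt(-abs(t_welch), df_welch)

# Rule of thumb: larger sample variance over smaller sample variance
var_ratio <- max(var(x), var(y)) / min(var(x), var(y))

# Case 3: variances unknown but assumed equal (pooled)
sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
t_pooled <- diff_means / sqrt(sp2 * (1 / n1 + 1 / n2))
df_pooled <- n1 + n2 - 2
p_pooled <- 2 * pt(-abs(t_pooled), df_pooled)

# Built-in equivalents for comparison
welch_fit <- t.test(x, y)                     # var.equal = FALSE is the default
pooled_fit <- t.test(x, y, var.equal = TRUE)

print(c(t_welch = t_welch, df_welch = df_welch, p_welch = p_welch))
print(c(t_pooled = t_pooled, df_pooled = df_pooled, p_pooled = p_pooled))
```

The Satterthwaite df always lies between min(n1 – 1, n2 – 1) and n1 + n2 – 2, which is why using the minimum is the conservative choice.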

090319 R

Bootstrap testing in R

have data with n = 19, μ = 100

H0: μ = 90 vs. Ha: μ ≠ 90

#generate data

data=c(119.7, 100.0, 104.1, 114.2, 92.8, 150.3, 85.4, 102.3, 108.6, 105.8, 93.4, 107.5, 67.1, 0.9, 88.4, 101.0, 97.2, 95.4, 77.2)

#shift data so that H0 is true

y=data - mean(data) + 90

#generate 3 000 bootstrap samples from y

B=3000

bootsamples = matrix(sample(y, B*length(y), replace=TRUE), nrow=B)

#calculate the mean of each bootstrap sample (one sample per row)

bootmeans = rowMeans(bootsamples)

#hist of boot means

hist(bootmeans)

#mean of data

m=mean(data)

#bootstrap estimate of the two-sided p-value (this form assumes m > 90; 2*90 - m is m reflected about 90)

sum(bootmeans >= m)/B + sum(bootmeans <= 2*90 - m)/B

#conclusion:

# data are consistent with a mean of 90 (cannot reject H0)

#what if we do t-test

t.test(data, mu=90)

Comparing Two Means

Example

Data: length of humerus in adult male sparrows who survived and sparrows who perished after a storm.

Question: are the mean lengths of humerus different between these two groups?

090324 T

since the sample variances have ratio less than 4, can use pooled t-test

if the rule of thumb applies, use the pooled test, since its distribution under H0 is exactly t rather than approximately t

underlying assumption for these tests: the data come from normal distributions

Example

Are there physiological indicators of schizophrenia?

15 pairs of identical twins, one schizophrenic the other healthy

measured brain volume (cm^3), data: left hippocampus

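want to know if the mean left hippocampus volume is the same for the schizophrenic and non-schizophrenic twins

note the two samples are not independent: the pairs are independent of each other, but the two measurements within each pair are not (“paired data” or “matched pairs”)

Answer: take differences and use 1-sample t-test to test

H0: μdiff = 0

Ha: μdiff ≠ 0

di = xi – yi, for i = 1, …, n

Test statistic: t = d̄/(sd/√n), where d̄ and sd are the mean and standard deviation of the differences; under H0 it has a t distribution with n – 1 df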
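A minimal sketch of the paired analysis in R. The numbers are hypothetical and are NOT the twin data from the example. It shows that the one-sample t-test on the differences and `t.test` with `paired = TRUE` give the same result.

```r
# Hypothetical paired measurements (not the twin data)
x <- c(5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 5.3, 6.1)
y <- c(4.7, 4.9, 5.4, 5.0, 4.6, 5.6, 4.8, 5.5)

d <- x - y                                   # within-pair differences
n <- length(d)
t_stat <- mean(d) / (sd(d) / sqrt(n))        # one-sample t statistic on the differences
p_value <- 2 * pt(-abs(t_stat), df = n - 1)  # two-sided p-value

# Built-in equivalents
one_sample_fit <- t.test(d, mu = 0)
paired_fit <- t.test(x, y, paired = TRUE)

print(c(t = t_stat, df = n - 1, p = p_value))
```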

Two-sample Bootstrap Test

Test for equality of location

Two samples X1, …, Xn1 of size n1, and Y1, …, Yn2 of size n2

Suppose we want to test H0: μX = μY vs Ha: μX > μY

Test statistic: V = X̄ – Ȳ, observed: v = x̄ – ȳ

p-value = P(V ≥ x̄ – ȳ | H0 true)

want a bootstrap estimate of p-value

must generate bootstrap samples under assumption that H0 is true

one way to do this: assume that under H0, X and Y have the same distribution

-combine two samples into one sample of size n1 + n2

-resample with replacement from this

-each resampling will be split into two groups:

one of size n1: x1*, …, xn1*

one of size n2: y1*, …, yn2*

-for each bootstrap sample, calculate bootstrap estimate of test statistic vj* = x̄j* – ȳj*, j = 1, …, B

-then bootstrap estimate of p-value is (# of vj* ≥ v)/B
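The steps above can be sketched in R as follows. The two samples are hypothetical, and B = 3000 follows the value suggested earlier for testing. The p-value is for the upper-tail alternative.

```r
# Hypothetical samples
x <- c(18.5, 19.1, 18.8, 19.4, 18.2, 19.0, 18.7, 19.3)
y <- c(18.1, 18.6, 18.9, 18.3, 18.8, 18.4, 18.0)
n1 <- length(x); n2 <- length(y)
v_obs <- mean(x) - mean(y)                  # observed test statistic

combined <- c(x, y)                         # under H0 both samples share one distribution
B <- 3000
set.seed(2)
v_star <- replicate(B, {
  resample <- sample(combined, n1 + n2, replace = TRUE)
  x_star <- resample[1:n1]                  # first n1 values play the role of the X sample
  y_star <- resample[(n1 + 1):(n1 + n2)]    # remaining n2 values play the role of the Y sample
  mean(x_star) - mean(y_star)
})

p_boot <- mean(v_star >= v_obs)             # (# of v* >= v) / B
print(c(v_obs = v_obs, p_boot = p_boot))
```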

Data Collection

Three methods:

  1. Observational studies
  2. Sample surveys (also observational but random selection of participants)
  3. Experiments

The methods are listed in order of strength of conclusions, from weakest to strongest (experiments support the strongest conclusions because more variables are under the control of the experimenter)

  1. Observational studies

Data collected without intervention

ex: observe admissions to graduate school, gender, age, GRE, etc…

090326 R

Problems:

*Confounding – can’t separate effect of one variable from another

e.g. breastfeeding and babies’ intelligence

may be the effect of some other variable (for example, mothers who breastfeed may have higher intelligence to begin with => babies smarter from genetics)

e.g. smoking, coffee drinking and heart disease

*results can’t be generalized beyond the subjects observed

  2. Sample Surveys

Still observational

Data collected on a random sample from a population

Confounding is still a problem

results can be generalized to the population

To allow generalization and to avoid bias, samples must be chosen randomly

Simplest method: simple random sample

-every possible sample of the chosen size is equally likely, so each member of the population is equally likely to be chosen

  3. Experiments

researchers randomly assign a treatment to the subjects or “experimental units”

can draw cause-and-effect conclusions (don’t have to worry about confounding variables)

because we are imposing the treatments, we can control for those confounding variables

treatments in experiment are sometimes called “factors” and sometimes “predictor variables”

predictor variables since trying to see if they predict a certain outcome in another variable

the values of a factor are its “levels”

a design (of experiment) is “balanced” if each treatment has same number of experimental units

key step: randomization

each subject is randomly assigned a treatment

-so no bias in treatment assignment

-eliminates systematic effects of confounding variables; any remaining differences among the treatment groups are due to chance
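A minimal sketch of a balanced, randomized assignment in R. The unit labels and treatment names are hypothetical. Shuffling a vector that contains each treatment the same number of times gives a random assignment in which every treatment gets the same number of experimental units.

```r
# Hypothetical experimental units and treatments
units <- paste0("unit", 1:12)
treatments <- c("A", "B", "C")
per_group <- length(units) / length(treatments)   # 4 units per treatment

set.seed(42)
# rep() builds a balanced vector of labels, sample() puts it in random order
assignment <- sample(rep(treatments, each = per_group))
design <- data.frame(unit = units, treatment = assignment)

print(design)
print(table(design$treatment))                    # each treatment appears 4 times
```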

Principles of Design of Experiments

-Control – a group for comparison

-randomization – randomly assign treatments to experimental units

-replication – need multiple observations per treatment

– allows measurement of variability

Problem: can’t always carry out an experiment

In experiments, it’s nice to have if possible (especially for experiments on humans),

  1. Placebo – “fake” treatment; sometimes people do better just by being treated, they know someone is watching them, giving them attention, etc…

“Placebo Effect” – psychological

  2. Double-blind – neither the experimental units nor the researchers know which treatment is being received/administered

Comment:

All statistical techniques we have learned assume the observations are independent. If they aren’t, but are treated as if they are, then we get narrower CIs and more apparent power than we should.

---

Some Normal based distribution theory

Suppose X1, …, Xn are iid N(μ, σ²)

any linear combination of the Xi’s is normally distributed

in particular we know X̄ ~ N(μ, σ²/n)

Let Zi = (Xi – μ)/σ, Zi ~ N(0,1)

And Zi² ~ χ²(1)

(Z1² + Z2² + … + Zk²) ~ χ²(k) for independent Zi; this will be less skewed than χ²(1)

If the sample variance is S², then

(n-1)S²/σ² ~ χ²(n-1)

If Z ~ N(0,1) independent of Y ~ χ²(m)

then Z/sqrt(Y/m) ~ t(m)

090331 T

Professor Moshonov SICK

090402 R

Normal based distribution theory continued

If Z ~ N(0,1) independent of Y ~ χ²(m) then Z/sqrt(Y/m) ~ t(m)

In particular, X̄ and S² are independent, so (X̄ – μ)/(S/√n) ~ t(n – 1)

If Y1 ~ χ²(m) and Y2 ~ χ²(n), Y1 and Y2 independent, then (Y1/m)/(Y2/n) ~ F(m, n)

F-distribution with m and n degrees of freedom

Also right-skewed distribution

note: The square of a random variable which has a t(m) distribution has an F(1, m) distribution

recall that Z² ~ χ²(1)
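These two relationships can be checked numerically with R’s quantile functions. A minimal sketch follows; the choice m = 10 is arbitrary. If the square of a t(m) variable is F(1, m), then the squared two-sided 5% critical value of t(m) must equal the upper 5% critical value of F(1, m), and likewise for N(0,1) and the chi-squared distribution with 1 df.

```r
m <- 10                                   # arbitrary degrees of freedom

t_crit <- qt(0.975, df = m)               # two-sided 5% critical value of t(m)
f_crit <- qf(0.95, df1 = 1, df2 = m)      # upper 5% critical value of F(1, m)

z_crit <- qnorm(0.975)                    # two-sided 5% critical value of N(0,1)
chisq_crit <- qchisq(0.95, df = 1)        # upper 5% critical value of chi-squared(1)

print(c(t_squared = t_crit^2, f = f_crit))
print(c(z_squared = z_crit^2, chisq = chisq_crit))
```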

The F distribution is useful when testing the equality of the variances of two independent populations.

F-test for equality of variance

X1, …, Xn1 are iid N(μ1, σ1²), independent of Y1, …, Yn2, which are iid N(μ2, σ2²)

We calculate s1² and s2²

Suppose we want to test:

H0: σ1² = σ2² <=> σ1²/σ2² = 1

Ha: σ1² ≠ σ2²

Test statistic is…

(n1 – 1)s1²/σ1² ~ χ²(n1 – 1) independent from (n2 – 1)s2²/σ2² ~ χ²(n2 – 1)

Use F = s1²/s2², which under H0 is F(n1 – 1, n2 – 1)
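A minimal sketch of the F-test for equal variances in R, on hypothetical data, compared with the built-in `var.test`. For the two-sided alternative the p-value is taken as twice the smaller tail probability, which is also what `var.test` reports.

```r
# Hypothetical samples from two independent groups
x <- c(18.5, 19.1, 18.8, 19.4, 18.2, 19.0, 18.7, 19.3)
y <- c(18.9, 19.6, 19.2, 19.8, 19.5, 19.1, 19.9)

f_stat <- var(x) / var(y)                 # ratio of sample variances
df1 <- length(x) - 1                      # numerator df
df2 <- length(y) - 1                      # denominator df

lower_tail <- pf(f_stat, df1, df2)
p_value <- 2 * min(lower_tail, 1 - lower_tail)   # two-sided p-value

builtin <- var.test(x, y)                 # same test from base R (stats package)
print(c(F = f_stat, df1 = df1, df2 = df2, p = p_value))
```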

Good Statistic:

-distribution under H0 is known

-not a function of any unknown parameter

Problems:

-very sensitive to departure from normality

*the assumption that both Xi and Yi are from normal distributions is very strong (there exist specific tests for normality of samples)

-a small p-value could mean that σ1² ≠ σ2² or that the data are not from normal distributions

Better test: Levene’s Test

Analysis of Variance – ANOVA

Generalization of 2-sample equal variance t-test

Are there differences in means of more than 2 groups?

Variable: Y (response variable)

Measures: Yij = response for the jth subject in the ith group

Statistical Model: Yij = μi + εij for 1 ≤ i ≤ k, 1 ≤ j ≤ ni

μi is the mean response in the ith group (unknown)

εij is the random error for the jth subject in the ith group

these are random variables so Yij is also a random variable

Assumptions

εij are iid N(0, σ²)

3 key assumptions

-independence

-normally distributed

-same variance in each group

=> Yij ~ N(μi, σ²)

Want to test

H0: μ1 = μ2 = … = μk

Ha: at least one pair of means is not equal (at least one mean is different)

090407 T

(ex: Yij is running time of program on three different machines)

Derivation of test statistic

Decomposition of sum of squares

Yij = Yi- + Yij - Yi-

where Yi- = estimate of i = (average of observations in the ith group)

Subtract Y--

where

So,

= (between groups) – (within groups)

Square both sides and take sums

So then,

total sum of squares = between groups SS + within group SS

SSTotal = SSTr (treatment SS) + SSE (error SS, residual SS)

SSTr = variation in response variable due to differences between groups

SSE = variation in response variable due to variations (errors) within each group

Summarized in ANOVA Table:

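| Source | SS | df | MS | F | p |
|---|---|---|---|---|---|
| Groups (treatments) | SSTr | k – 1 | MSTr = SSTr/(k – 1) | MSTr/MSE | P(F(k – 1, N – k) ≥ observed F) |
| Error | SSE | N – k | MSE = SSE/(N – k) | | |
| Total | SSTot | N – 1 | | | |

MS = mean square = sum of squares divided by its df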

Want to test:

H0: μ1 = … = μk

Ha: at least one μi is different

Idea of F test:

compare variability between groups to variability within the groups

if similar, then the data are consistent with the groups having the same mean

MSE = sp² = [(n1 – 1)s1² + … + (nk – 1)sk²]/(N – k) is the pooled sample variance

where si² is the variance calculated on the observations in the ith group

It can be shown that under H0: μ1 = … = μk, the test statistic F = MSTr/MSE has an F distribution with (k-1, N-k) degrees of freedom

If H0 is false, F-test statistic will be large because we have more variability between groups than within groups.

So reject H0 if the test statistic is large.

So calculate p-value in right tail of distribution (even though alternative is 2-sided)
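A minimal sketch of the sum-of-squares decomposition and the F-test in R, compared with the table produced by `anova(lm(...))`. The data are hypothetical running times of a program on three machines, in the spirit of the example mentioned earlier; the values and the names `time` and `machine` are made up.

```r
# Hypothetical running times on three machines, 4 runs each
time <- c(12.1, 11.8, 12.6, 12.3,
          13.4, 13.9, 13.1, 13.6,
          12.0, 12.4, 11.7, 12.2)
machine <- factor(rep(c("M1", "M2", "M3"), each = 4))
N <- length(time)
k <- nlevels(machine)

grand_mean <- mean(time)
group_mean <- ave(time, machine)          # each observation's own group mean

SSTr <- sum((group_mean - grand_mean)^2)  # between groups (summed over all observations)
SSE <- sum((time - group_mean)^2)         # within groups
SSTot <- sum((time - grand_mean)^2)       # total

MSTr <- SSTr / (k - 1)
MSE <- SSE / (N - k)
F_stat <- MSTr / MSE
p_value <- pf(F_stat, k - 1, N - k, lower.tail = FALSE)   # right tail only

fit <- anova(lm(time ~ machine))          # built-in ANOVA table
print(fit)
print(c(SSTr = SSTr, SSE = SSE, SSTot = SSTot, F = F_stat, p = p_value))
```

Summing the squared deviation of the group mean over every observation is the same as weighting each group’s squared deviation by its size ni.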

ANOVA Assumptions – Check

  1. Observations are independent

-most important because analyses we have learned don’t work otherwise

-hard to check; must understand how the data was collected

-if the data was collected over time or space, can check for correlation over time or space

  2. All groups have same variance

-check boxplots of groups (visual)

-rule of thumb: calculate the variance of the observations in each group; if the largest variance is less than 4 times the smallest variance then this assumption is reasonable

-ANOVA is robust against unequal variances if the number of observations in each group is approximately equal

  3. Errors are Normally distributed

-check the residuals which should be approx N(0, σ²) distributed

-by CLT, ANOVA is robust against departures from normality except when data are very skewed or there are extreme outliers

When ANOVA F-test is significant (small p-value) examine which means are significantly different.

H0: μa = μb => μa – μb = 0

Ha: μa ≠ μb => μa – μb ≠ 0

for a, b ∈ {1, …, k} with a ≠ b, so there are k choose 2 pairs of means to test

Test statistic:

T = (Ȳa· – Ȳb·)/(sp sqrt(1/na + 1/nb)) ~ t(N – k) (if μa = μb)

use df associated with sp (error degrees of freedom)

sp = sqrt(MSE)

But the chance of making at least one Type I error when doing so many tests is high.

Two possible solutions:

  1. Bonferroni Method

If you want overall Type I error rate α, declare a pair of means statistically significantly different if the p-value of its t-test is less than α/(k choose 2)

Or construct 100(1 – α/(k choose 2))% CIs for the differences in pairs of means and consider which don’t include 0

(this makes sure that the overall significance level is at most α, but doesn’t tell us exactly what it is)

  2. Tukey’s procedure

-less conservative than Bonferroni Method

-CIs for the differences of all pairs of means such that the confidence level is the coverage probability of all (k choose 2) CIs simultaneously, i.e. the probability that all CIs capture the true difference in their respective pairs of means is 100(1 – α)%

-based on the “studentized range distribution”, i.e. the distribution of the difference between the largest and smallest sample means, scaled by an estimate of their standard error
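Both procedures are available in base R. The sketch below uses hypothetical running-time data for three machines (values and names are made up) and checks one pair by hand: the Bonferroni-adjusted p-value is the raw p-value multiplied by the number of pairs (capped at 1), which is equivalent to comparing the raw p-value with α/(k choose 2), and the Tukey interval half-width uses the studentized range quantile from `qtukey`.

```r
# Hypothetical running times on three machines, 4 runs each
time <- c(12.1, 11.8, 12.6, 12.3,
          13.4, 13.9, 13.1, 13.6,
          12.0, 12.4, 11.7, 12.2)
machine <- factor(rep(c("M1", "M2", "M3"), each = 4))
N <- length(time)
k <- nlevels(machine)
n_pairs <- choose(k, 2)

fit <- aov(time ~ machine)
sp <- sqrt(sum(residuals(fit)^2) / (N - k))        # sp = sqrt(MSE)

# Pair (M1, M2) by hand, using the pooled sd and the error df
means <- sapply(split(time, machine), mean)
ns <- sapply(split(time, machine), length)
se_12 <- sp * sqrt(1 / ns[["M1"]] + 1 / ns[["M2"]])
t_12 <- (means[["M2"]] - means[["M1"]]) / se_12
p_raw_12 <- 2 * pt(-abs(t_12), df = N - k)
p_bonf_12 <- min(1, n_pairs * p_raw_12)            # Bonferroni-adjusted p-value

# Built-in procedures
bonf <- pairwise.t.test(time, machine, p.adjust.method = "bonferroni")
tukey <- TukeyHSD(fit, conf.level = 0.95)

# Half-width of the Tukey interval for the same pair
half_width_12 <- qtukey(0.95, nmeans = k, df = N - k) / sqrt(2) * se_12

print(bonf$p.value)
print(tukey$machine)
```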