Chapter 1 Statistical Inference: One and Two Sample Problems

Three topics will be discussed:

-  exploratory data analysis (EDA).

-  statistical inference.

-  nonparametric methods.

1.  Exploratory data analysis (EDA):

Motivation: the classical methods of statistical inference depend heavily on the assumptions that the data are outlier-free and nearly normal, and that the data are serially uncorrelated. Exploratory data analysis (EDA) uses graphical displays to help you assess whether such assumptions hold. The basic graphical EDA tools include

-  histogram and boxplot: used to examine the shape of the distribution and to detect the presence of outliers.

-  normal qq-plot: used to check whether the normal assumption is sensible and also to detect outliers.

-  time series plot and autocorrelation function (ACF) plot: used to check for serial correlation.

Now, we introduce how the qq-plot is constructed via a simple example.

Example:

n = 4

x                      -1.0   -0.1   0.3    1.1
i                       1      2     3      4
Empirical CDF (i/n)     0.25   0.5   0.75   1

Therefore,

Percent                 25      50     75     100
Percentile of x         -1.0    -0.1   0.3    1.1
Percentile of N(0,1)    -0.67   0      0.67   3

(Strictly speaking, the 100th percentile of N(0,1) is infinite; the value 3 is used here only for illustration.)

Intuitively, if x is nearly standard normal, the percentiles of x should be very close to the corresponding percentiles of the standard normal random variable. Then, if we plot the percentiles of the standard normal variable in the above table versus the corresponding percentiles of x, the points (-0.67, -1.0), (0, -0.1), (0.67, 0.3), and (3, 1.1) should fall around the line y=x.
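A minimal sketch of this naive i/n construction in S-Plus; note that the 100th percentile of N(0,1) is infinite, which is one reason the empirical CDF formula is modified below:

>px<-(1:4)/4 # the empirical CDF i/n

>qnorm(px) # the N(0,1) percentiles: -0.6745, 0, 0.6745, and an infinite last value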

In S-Plus, the above process has been modified. For n ≤ 10, the empirical CDF formula used in S-Plus is (i - 3/8)/(n + 1/4). For n > 10, the empirical CDF is (i - 1/2)/n. Therefore, in S-Plus,

x                -1.0    -0.1    0.3     1.1
i                 1       2      3       4
Empirical CDF     0.147   0.382  0.617   0.852

Thus,

Percent                 14.7    38.2    61.7   85.2
Percentile of x         -1.0    -0.1    0.3    1.1
Percentile of N(0,1)    -1.04   -0.29   0.29   1.04
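These values can be reproduced directly with the built-in ppoints() function (a minimal sketch):

>ppoints(4) # the empirical CDF values in the table above, (i-3/8)/(4+1/4)

>qnorm(ppoints(4)) # the corresponding percentiles of N(0,1)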

The plot of the percentiles of N(0,1) versus the percentiles of x is shown below. [Figure: percentiles of N(0,1) versus percentiles of x, with the line y = x.]

In S-Plus, qqnorm(x) generates the same plot as the plot of the percentiles of N(0,1) versus the percentiles of x above. The following S-Plus commands compare the plot generated by qqnorm(x) directly with the plot constructed by the above process.

Example (Splus):

>par(mfrow=c(2,1))

>x<-c(-1.0,-0.1,0.3,1.1)

>qqnorm(x) # the qq-normal plot

>lines(seq(-1.1,1.1,by=0.01),seq(-1.1,1.1,by=0.01)) # the line y=x

>px<-ppoints(x) # px: the (modified) empirical CDF values for x

>qn<-qnorm(px) # qn: the corresponding percentiles of N(0,1)

>plot(qn,x) # the same plot as the one generated by qqnorm(x)

>lines(seq(-1.1,1.1,by=0.01),seq(-1.1,1.1,by=0.01)) # the line y=x

We now introduce the autocorrelation function (ACF), which is an important tool for describing the serial (or temporal) dependence structure. Suppose we have the data x_1, x_2, …, x_n generated by the random variables X_1, X_2, …, X_n. The autocovariance function at lag k is defined as r(k) = Cov(X_t, X_{t+k}). The estimated autocovariance function at lag k is

r̂(k) = (1/n) Σ_{t=1}^{n-k} (x_t - x̄)(x_{t+k} - x̄), where x̄ = (1/n) Σ_{t=1}^{n} x_t. Note that r̂(0) is the biased estimate of the population variance r(0). The autocorrelation function at lag k is then defined as ρ(k) = r(k)/r(0). Therefore, a natural estimate of ρ(k) is ρ̂(k) = r̂(k)/r̂(0). Intuitively, if there is no serial correlation, then, except ρ̂(0) = 1, the other estimated autocorrelation functions ρ̂(k), k ≥ 1, should be very close to 0. Thus, the estimated autocorrelation function plot should look like the figure below. [Figure: ACF plot with all spikes at lags k ≥ 1 close to zero.]

On the other hand, if some ρ̂(k) stands out, this might imply that there is serial correlation at lag k.
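As a small illustration (a sketch, assuming simulated data), the ACF plot of serially uncorrelated data can be generated as follows:

>z<-rnorm(100) # 100 serially uncorrelated N(0,1) observations

>acf(z) # all estimated autocorrelations at lags k>=1 should be near 0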

We now apply the EDA methods to the famous experimental data, measurements of the speed of light, collected by the first American Nobel Prize winner, A. A. Michelson.

Example (Splus):

>light<-c(850,740,900,1070,930,850,950,980, # the true speed = light

+980,880,1000,980,930,650,760,810,1000,1000,960,960) # + 299000

>par(mfrow=c(2,2))

>plot(light) # time series plot

>hist(light,nclass=5) # histogram

>boxplot(light) # boxplot

>qqnorm(light) # normal qq-plot

>qqline(light) # reference line through the quartiles

Note: the above plots seem to imply that the distribution might not be normal. The histogram shows a distribution that is skewed to the left but fairly normal in the middle region. The boxplot suggests that 650 is an outlier. The qq-plot also indicates that the data might not be normally distributed.

2.  Statistical Inference (Parametric Test Under Normal Assumption):

Two problems will be considered:

  1. One sample problem
  2. Two sample problem

(a) One Sample Problem:

Suppose X_1, X_2, …, X_n are i.i.d. normal random variables with mean μ and variance σ². For testing H_0: μ = μ_0, the statistic t = (X̄ - μ_0)/(s/√n) can be used, where X̄ is the sample mean, s² = Σ (X_i - X̄)²/(n-1) is the sample variance, and n is the sample size. Theoretically, as H_0 is true, t is distributed as a t distribution with n - 1 degrees of freedom.
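As a check, the t statistic and its two-sided p-value can also be computed by hand; a minimal sketch using the light data defined above:

>n<-length(light) # n = 20

>tstat<-(mean(light)-990)/sqrt(var(light)/n) # the t statistic for H0: mu=990

>2*(1-pt(abs(tstat),n-1)) # two-sided p-value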

Example (Splus):

>t.test(light,mu=990) # test H0: mu=990 vs H1: mu≠990

>t.test(light,conf.level=0.9,mu=990) # 90% confidence interval can be obtained

>t.test(light,alternative="greater",mu=990) # test H0: mu=990 vs H1: mu>990

>help(t.test)

>qt(0.975,19) # the 97.5th percentile of the t distribution with 19 degrees of freedom

(b) Two Sample Problem:

Suppose X_1, X_2, …, X_n are i.i.d. normal random variables with mean μ_1 and variance σ_1², and Y_1, Y_2, …, Y_m are i.i.d. normal random variables with mean μ_2 and variance σ_2². The X's and Y's are independent. To test H_0: μ_1 = μ_2, we first need to check whether H_0: σ_1² = σ_2².

(i) Variance test:

For testing H_0: σ_1² = σ_2², the statistic F = s_1²/s_2² can be used, where s_1² and s_2² are the estimates of σ_1² and σ_2², respectively: s_1² = Σ (X_i - X̄)²/(n-1) and s_2² = Σ (Y_j - Ȳ)²/(m-1). Intuitively, if H_0 is true, the F statistic should be close to 1. On the other hand, if the F statistic is much larger than 1, this might imply that H_0 is not true. Theoretically, as H_0 is true, F is distributed as an F distribution with n - 1 and m - 1 degrees of freedom.

Example (Splus):

Protein (high)   134   146   104   119   124   161   107   83   113   129   97   123
Protein (low)     70   118   101    85   107   132    94

Nineteen rats were divided into two groups, one group with 12 rats and the other with 7. The larger group was given high-protein food while the smaller group was given low-protein food. The weight gains for these rats under the two diets are given in the table above. We now demonstrate the variance test using these data in S-Plus.

>rat1<-c(134,146,104,119,124,161,107,83,113,129,97,123)

>rat2<-c(70,118,101,85,107,132,94)

>var.test(rat1,rat2) # H0: sigma1^2=sigma2^2 vs H1: sigma1^2≠sigma2^2

>var.test(rat1,rat2,conf.level=0.9,alt="g") # H0: sigma1^2=sigma2^2 vs H1: sigma1^2>sigma2^2

>qf(0.95,11,6) # the 95th percentile of the F distribution with (11,6) degrees of freedom
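The F statistic and its two-sided p-value can also be computed by hand (a minimal sketch; the degrees of freedom are n-1 = 11 and m-1 = 6):

>Fstat<-var(rat1)/var(rat2) # the F statistic s1^2/s2^2

>2*min(pf(Fstat,11,6),1-pf(Fstat,11,6)) # two-sided p-value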

(ii) T test:

If the variance test indicates σ_1² = σ_2², then for testing H_0: μ_1 = μ_2 the following statistic can be used: t = (X̄ - Ȳ)/(s_p √(1/n + 1/m)), where s_p² = ((n-1)s_1² + (m-1)s_2²)/(n+m-2) is the pooled estimate of σ² = σ_1² = σ_2². Theoretically, as H_0 is true, t is distributed as a t distribution with n + m - 2 degrees of freedom.
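A minimal sketch of this computation by hand, using the rat data defined above:

>n<-length(rat1) # n = 12

>m<-length(rat2) # m = 7

>sp2<-((n-1)*var(rat1)+(m-1)*var(rat2))/(n+m-2) # pooled variance estimate

>tstat<-(mean(rat1)-mean(rat2))/sqrt(sp2*(1/n+1/m)) # the pooled t statistic

>2*(1-pt(abs(tstat),n+m-2)) # two-sided p-value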

Example (Splus):

>t.test(rat1,rat2) # H0: mu1=mu2 vs H1: mu1≠mu2

>t.test(rat1,rat2,alt="g") # H0: mu1=mu2 vs H1: mu1>mu2

If the variance test indicates σ_1² ≠ σ_2², then for testing H_0: μ_1 = μ_2 the following statistic can be used: t = (X̄ - Ȳ)/√(s_1²/n + s_2²/m), where s_1²/n + s_2²/m is the estimate of Var(X̄ - Ȳ) = σ_1²/n + σ_2²/m. Theoretically, as H_0 is true, the t statistic is approximately t-distributed with (non-integral) degrees of freedom ν, where

ν = (s_1²/n + s_2²/m)² / [ (s_1²/n)²/(n-1) + (s_2²/m)²/(m-1) ].
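The approximate degrees of freedom can be computed by hand (a minimal sketch):

>v1<-var(rat1)/length(rat1) # s1^2/n

>v2<-var(rat2)/length(rat2) # s2^2/m

>(v1+v2)^2/(v1^2/(length(rat1)-1)+v2^2/(length(rat2)-1)) # the approximate degrees of freedom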

Example (Splus):

>t.test(rat1,rat2,var.equal=F) # t test without the equal-variance assumption

3.  Nonparametric Methods:

Both the variance test and the t test used in the previous section depend heavily on the common normal assumption for the observations. In addition, these tests are very sensitive to unusual observations, such as outliers. By contrast, the conclusions drawn from nonparametric methods are not much affected when the normal assumption fails or when outliers exist.

(i) One sample problem:

Suppose X_1, X_2, …, X_n are random variables with a common distribution (not necessarily a normal distribution). Suppose M is the median of the distribution. We want to test H_0: M = M_0, where M_0 is some given number. We now introduce the Wilcoxon signed rank test.

Wilcoxon signed rank test:

Let D_i = X_i - M_0 and let R_i be the rank of |D_i| among |D_1|, |D_2|, …, |D_n|. Let W_+ be the sum of all the ranks R_i corresponding to D_i > 0, while W_- is the sum of all the ranks R_i corresponding to D_i < 0.

Example:

Take M_0 = 8:

X_i           7.1   13.5   5.2   4.2   6.4   10.4   5.7   5.4   6.3   7.5
|D_i|         0.9    5.5   2.8   3.8   1.6    2.4   2.3   2.6   1.7   0.5
R_i            2     10     8     9     3      6     5     7     4     1
Sign of D_i    -      +     -     -     -      +     -     -     -     -

Then, W_+ = 10 + 6 = 16 and W_- = 2 + 8 + 9 + 3 + 5 + 7 + 4 + 1 = 39.
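The ranks and W_+ can be computed by hand in S-Plus (a minimal sketch):

>s<-c(7.1,13.5,5.2,4.2,6.4,10.4,5.7,5.4,6.3,7.5)

>d<-s-8 # D_i = X_i - M0

>r<-rank(abs(d)) # R_i: the ranks of |D_i|

>sum(r[d>0]) # W+ = 16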

The following are the Wilcoxon signed rank tests under the three alternatives:

(i) H_0: M = M_0 vs H_1: M > M_0

reject H_0 as W_- ≤ d_α, where d_α can be found from the table.

(ii) H_0: M = M_0 vs H_1: M < M_0

reject H_0 as W_+ ≤ d_α, where d_α can be found from the table.

(iii) H_0: M = M_0 vs H_1: M ≠ M_0

reject H_0 as min(W_+, W_-) ≤ d_{α/2}, where d_{α/2} can be found from the table.

Note: W_+ + W_- = 1 + 2 + … + n = n(n+1)/2.

Example (continued):

Test H_0: M = 8 vs H_1: M < 8, α = 0.05.

[sol]:

W_+ = 16 > d_{0.05} = 10, so we do not reject H_0.

Example (Splus):

>s<-c(7.1,13.5,5.2,4.2,6.4,10.4,5.7,5.4,6.3,7.5)

>wilcox.test(s,mu=8,alt="less") # signed rank test of H0: M=8 vs H1: M<8

>help(wilcox.test)

As the sample size is large, the p-value can be obtained via a normal approximation. As n is large, W_+ is approximately normal with mean n(n+1)/4 and variance n(n+1)(2n+1)/24. Thus, for H_0: M = M_0 vs H_1: M < M_0, the p-value ≈ Φ((W_+ - n(n+1)/4)/√(n(n+1)(2n+1)/24)), where Φ is the standard normal CDF. The p-values under the other alternatives can be obtained similarly.
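A minimal sketch of this approximation (without continuity correction) for the example above (n = 10, W_+ = 16):

>n<-10 # sample size

>w<-16 # the observed W+

>pnorm((w-n*(n+1)/4)/sqrt(n*(n+1)*(2*n+1)/24)) # approximate p-value for H1: M < 8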

(ii) Two sample problem:

Suppose X_1, X_2, …, X_n ~ F and Y_1, Y_2, …, Y_m ~ G, where F and G are some distributions. Let M_X and M_Y be the medians of F and G, respectively. We want to test H_0: M_X = M_Y (F = G).

Wilcoxon rank sum test:

I.  Combine X_1, X_2, …, X_n, Y_1, Y_2, …, Y_m and find the ranks of all these data.

II.  Find the sum W of the ranks corresponding to X_1, X_2, …, X_n.

III.  The following are the Wilcoxon rank sum tests under three alternatives:

i.  H_0: M_X = M_Y vs H_1: M_X > M_Y

reject H_0 as W ≥ c_{U,α}, where c_{U,α} can be found from the table.

ii. H_0: M_X = M_Y vs H_1: M_X < M_Y

reject H_0 as W ≤ c_{L,α}, where c_{L,α} can be found from the table.

iii. H_0: M_X = M_Y vs H_1: M_X ≠ M_Y

reject H_0 as W ≥ c_{U,α/2} or W ≤ c_{L,α/2}.

Example:

South    60   100   150   290
North   110   130   140   170   200   310

The above data are incomes in the South and North areas. Test H_0: M_South = M_North vs H_1: M_South ≠ M_North with α = 0.05.

[sol]:

Income   60   100   150   290   110   130   140   170   200   310
Rank      1     2     6     9     3     4     5     7     8    10

II.  W = 1 + 2 + 6 + 9 = 18.

III.  Since W = 18 falls between the lower and upper 5% critical values from the table (12 and 32 for n = 4, m = 6), we do not reject H_0.

Conclusion: there is no difference in income between the two areas.

Example (Splus):

>in1<-c(60,100,150,290)

>in2<-c(110,130,140,170,200,310)

>wilcox.test(in1,in2) # rank sum test of H0: M_X=M_Y vs H1: M_X≠M_Y
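The rank sum W can be verified by hand (a minimal sketch):

>income<-c(in1,in2) # combine the two samples

>rank(income) # the ranks in the table above

>sum(rank(income)[1:4]) # W = 18, the sum of the ranks of the South incomes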
