1
8.ONE WAY ANALYSIS OF VARIANCE
Example – 1: Designing a new package for a certain product (from 7.1 in Ilk (2011)) - Marketing
- A food company. They want to have a new package designed for a certain product they produce (say a chocolate bar) to help increase the sale.
- The packaging company designs 4 different types of package for the chocolate bar and they are all sold in the stores.
- Stores: 5 different stores, in the same area and in the same size.
- Marketing question of interest:Does the package design play a role on the sales?
- Stating the marketing question in statistical terms: μ1, μ2, μ3, μ4 denote the average sale of chocolate bar in each design. Then hypothesis of interest: H0: μ1= μ2= μ3= μ4, H1: at least one of them is different than the rest
- Data:
design / store / sales
1 / 1 / 11
1 / 2 / 17
1 / 3 / 16
1 / 4 / 14
1 / 5 / 15
2 / 1 / 12
2 / 2 / 10
2 / 3 / 15
2 / 4 / 19
2 / 5 / 11
3 / 1 / 23
3 / 2 / 20
3 / 3 / 18
3 / 4 / 17
4 / 1 / 27
4 / 2 / 33
4 / 3 / 22
4 / 4 / 26
4 / 5 / 28
- R codes:
> sales = read.csv("kenton.csv",header=T)
> aov1 = aov(sales~design,data=sales) #or aov1 = aov(sales$sales~ sales$design)
> summary(aov1)
Df Sum Sq Mean Sq F value Pr(>F)
design 1 483.08 483.08 31.186 3.289e-05 ***
Residuals 17 263.34 15.49
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
There are 4 designs and yet the df for design is 1! WHY?
- Corrected R codes:
> class(sales$design)
[1] "integer"
> sales$design=as.factor(sales$design)
> class(sales$design)
[1] "factor"
> aov1 = aov(sales~design,data=sales)
> summary(aov1)
Df Sum Sq Mean Sq F value Pr(>F)
design 3 588.22 196.07 18.591 2.585e-05 ***
Residuals 15 158.20 10.55
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
There are 4 designs and now the df is 3.
- Alternative coding:
> aov1 = aov(sales~factor(design),data=sales)
> summary(aov1)
Df Sum Sq Mean Sq F value Pr(>F)
factor(design) 3 588.22 196.07 18.591 2.585e-05 ***
Residuals 15 158.20 10.55
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- Interpretation of the results:
P value being 2.585e-05: there is a difference between different package designs on the sale.
But some may be different; some may be the same among the four designs. Which are the same, which are different? PAIRWISE COMPARISONS.
- Pairwise comparisons: e.g.
H0,1: μ1= μ2
H0,2: μ1= μ3
H0,3: μ1= μ4
Resulting in testing these hypotheses simultaneously.
How to do pairwise testing in R? e.g.
- The R function TukeyHSD
- Pairwise comparison in R by TukeyHSD
> TukeyHSD(aov1,"design",conf.level=0.9)
Tukey multiple comparisons of means
90% family-wise confidence level
Fit: aov(formula = sales ~ design, data = sales)
$design
diff lwr upr p adj
2-1 -1.2 -6.3411769 3.941177 0.9352978
3-1 4.9 -0.5530416 10.353042 0.1548895
4-1 12.6 7.4588231 17.741177 0.0001013
3-2 6.1 0.6469584 11.553042 0.0582866
4-2 13.8 8.6588231 18.941177 0.0000368
4-3 7.7 2.2469584 13.153042 0.0142180
9. WRITING YOUR OWN FUNCTION
- Example – 1: Write a function that calculates the square of any given number
> sqr = function(x) { # Input: x
a = x*x
print(a) # Output: a
}
> sqr(2) # User defined x=2
[1] 4
- Example – 2: Finding the larger number among two numbers
> larger=function(x,y){ # Input: x and y
ifelse(y>x,y,x) # Output: If the condition is right, output is y; o.w. x.
}
> larger(1,8)# User defined x=1 and y=8.
[1] 8
- Example – 3: Calculating t statistic in atwo sample t-test (from An Introduction to R Manual)
> twosam <- function(y1, y2) {
n1 <- length(y1); n2 <- length(y2)
yb1 <- mean(y1); yb2 <- mean(y2)
s1 <- var(y1); s2 <- var(y2)
s <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2)
tst <- (yb1 - yb2)/sqrt(s*(1/n1 + 1/n2))
tst
}
Consider the weights of two groups:
> data1=c(54.3,54.7,54.9,54.9,53.5,54.9,54.8,53.3,55.0,54.5,54.4,55.0,55.8,54.9,54.7,55.2,55.3,55.0,55.1,55.4)
> data2=c(66.2,65.7,65.9,62.8,62.6,65.7,65.3,65.5,64.6,66.9,67.1,64.2,64.9,65.3,66.2,66.0,66.2,64.0,65.5,65.6)
> twosam(data1,data2)
- Example – 4: use of if and else
- In a Notepad, write the code below:
fact = function(n)
{
if(n<=1) 1
else n*fact(n-1)
}
- Compiling:
> source('fact.R')
> fact(4)
[1] 24
Note that there is a built-in function in R called factorial, which does the same job.
- Example – 5: use of if and else if (based on Ilk 2011)
- Converting the numerical grades to letter grades
- In a Notepad, write the code below
# input arguments:
#g: numerical grades
# output arguments:
#l: letter grades
letters = function(g)
{
if(g<50) l="FF"
else if(g<60) l="FD"
else if(g<64) l="DD"
else if(g<69) l="DC"
else if(g<74) l="CC"
else if(g<79) l="CB"
else if(g<84) l="BB"
else if(g<89) l="BA"
else l="AA"
list(grades=g,letters=l)
}
- Compiling:
> source('fromgrade_toletter.R')
- Running:
> letters(35)
$grades
[1] 35
$letters
[1] "FF"
- Example – 6: Just another example for returning multiple objects
powers = function(x) {
z = list(x2=x*x, x3=x*x*x, x4=x*x*x*x);
return(z) }
- Example – 7:
f.good <- function(x, y) {
z1 <- 2*x + y
z2 <- x + 2*y
z3 <- 2*x + 2*y
z4 <- x/y
return( z1, z2, z3, z4)
}
> f.good(1,2)
Error in return(z1, z2, z3, z4) :
multi-argument returns are not permitted
- Example – 8:
> f2 <- function(x, y) {
z1 <- x + y
z2 <- x + 2*y
list(z1, z2)
}
> f2(2, 5)
[[1]]:
[1] 7
[[2]]:
[1] 12
> f2(2, 5)[[1]]
[1] 7
> f2(2, 5)$z1
NULL
- Example – 9:
> f3 <- function(x, y) {
z1 <- x + y
z2 <- x + 2*y
list(result1=z1, result2=z2)
}
> f3(2, 5)
$result1:
[1] 7
$result2:
[1] 12
> f3(2, 5)$result1
[1] 7
> f3(2, 5)$result2
[1] 12
> d = f3(2,2)
> d$z1
NULL
> d$result1
[1] 4
> d
$result1
[1] 4
$result2
[1] 6
> y <- f3(1, 4)
> names(y)
[1] "result1" "result2"
> y$result2
[1] 9
- Example – 10:
#Using vectors
> v1 <- 1:6
> v1
[1] 1 2 3 4 5 6
> v2 <- seq(2, 12, 2)
> v2
[1] 2 4 6 8 10 12
> f3(v1, v2)
$result1:
[1] 3 6 9 12 15 18
$result2:
[1] 5 10 15 20 25 30
- Example – 11:
#Using matrices
> mat1 <- matrix( c(1, 2, 3, 4, 5, 6), ncol=2)
> mat1
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> mat2 <- matrix(c(2, 4, 6, 8, 10, 12), ncol = 2)
> mat2
[,1] [,2]
[1,] 2 8
[2,] 4 10
[3,] 6 12
> f3(mat1, mat2)
$result1:
[,1] [,2]
[1,] 3 12
[2,] 6 15
[3,] 9 18
$result2:
[,1] [,2]
[1,] 5 20
[2,] 10 25
[3,] 15 30
- Example – 12:
> f4 <- function(x=3, y=2) {
z1 <- x + y
z2 <- x + 2*y
list(result1=z1, result2=z2)
}
#using the defaults values for the x and y arguments
> f4()
$result1:
[1] 5
$result2:
[1] 7
#using the default value for the y argument
> f4(1, )
> f4(x=1)
#using the default value for the x argument
> f4(, 1)
> f4(y=1)
#switching the order of the arguments
> f4(y = 1, x = 2)
- Example – 13: Calling a function inside another function
> mysum=function(y){
sum(1:y)
}
mysum(3)
mysumsqr=function(x){
m=mysum(x)
a=m*m
print(a)
}
mysumsqr(4)
- Example-14:Finding the number of nonmissing and missing observations in a given vector
samp.size = function(x)
{
n = length(x) - sum(is.na(x))
nas = sum(is.na(x))
out = c(n, nas)
names(out) = c("", "NAs")
out #it is same as return(out)
}
nums=c(1:24,NA)
> samp.size(nums)
NAs
24 1
> sqrt(var(nums,na.rm=T)/samp.size(nums)[1])
Notes:
In, e.g., C++ the return from the function is done via the arguments of the function.This is not the case in R. It is the information coded on the final line of the function that is returned. E.g. see Example 5 above.
You can use listto pass multiple arguments out of the function (see Example 5 above).
You can also use return to pass multiple arguments out of the function. E.g. replace the last line of Example 5 by
return(c(g,l))
When you call this function by, e.g. a = letters(35)on R console you get,
> a #a is a 1 by 2 vector
[1] "FF" "35"
If the function has just one line you do not have to use the curly brackets.
10. SOME FINALNOTES ON R
You have to get used to write your programs in an external editor such as Notepad. This is supported by the experts of R as well. For instance:
From “Using R for Introductory Statistics”: If you plan on doing any significant programming in R, then it is worthwhile to familiarize yourself with using external files to store your work (rather than the default workspace) and using a different editor to facilitate interaction between the external files and an R process.
From “A Beginner’s Guide to R”: Typing the code into a special text editor for copying and pasting into R is strongly recommended. This allows the user to easily save code, document it, and rerun it at a later stage.
WORKSPACE: Instead of saving your work in the workspace, I recommend that you save your code in a text editor.
In your data file, if you have column names such as “cell type” change it to e.g. “cell.type” or “cell_type”.
THE ATTACH FUNCTION: e.g. you have a data frame called trial1. To refer to the data contained in one of the compartments of trial1, say y, we have to do trial1$y. Other way: >attach(trial1). Then we can refer to y not with trial1$y but just with y. e.g. boxplot(y), mean(y) instead of boxplot(trial1$y) and mean(trial1$y). You can take this back by >detach(trial1).
O.Ilk Dag, STAT 291 lecture notes, 2013, 12-13 November