Training and Testing Sets, Crossvalidation

Training and testing sets:

·  What percentage of the data should go to the training set and what to the testing set?

·  Should I reverse the roles of the two sets and repeat the analysis?

·  Should I partition the data into many parts? This sounds more and more like crossvalidation.
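The first two questions can be tried out directly. The following is only a sketch: the 50/50 split, the seed, the built-in mtcars data, and the linear model are assumptions made for illustration, not choices prescribed by these notes.

```r
# Illustrative sketch: split the data in two halves, train on one and test
# on the other, then reverse the roles of the two halves.
set.seed(1)
n <- nrow(mtcars)
idx <- sample(n, size = n %/% 2)   # rows that go to the first half
half_a <- mtcars[idx, ]
half_b <- mtcars[-idx, ]

# Root mean squared prediction error of a model fitted on `train`
# and evaluated on `test` (hypothetical helper)
rmse <- function(train, test) {
  fit <- lm(mpg ~ wt + hp, data = train)
  sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
}

err_ab <- rmse(half_a, half_b)     # train on A, test on B
err_ba <- rmse(half_b, half_a)     # reverse the sets
c(A_to_B = err_ab, B_to_A = err_ba, average = mean(c(err_ab, err_ba)))
```

The two errors generally differ, which is one motivation for splitting the data into more than two parts and averaging.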

Crossvalidation

Consider a class of models M1, …, Mc. The objective could be to:

·  Choose the best model for our data

·  Calculate residuals that are not biased

·  Calculate a measure of performance.
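As an illustration of the second objective, the sketch below compares ordinary residuals with residuals obtained by leaving each observation out before predicting it. The data set (mtcars) and the model are assumptions chosen only to make the example runnable.

```r
# In-sample residuals: every observation helped fit the model that predicts it
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Leave-one-out residuals: observation i is predicted by a model
# fitted without observation i
n <- nrow(mtcars)
loo_resid <- numeric(n)
for (i in 1:n) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  loo_resid[i] <- mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ])
}

# The in-sample mean squared residual is the smaller (too optimistic) one
c(in_sample = mean(resid(fit)^2), left_out = mean(loo_resid^2))
```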

The algorithm:

1.  Divide the data into k parts {P1, …, Pk} (say k = 10) and choose one of the c models.

2.  For each part, fit the model without using that part and then predict the part; for example, predict P1 from a model fitted on P2, …, Pk.

3.  Calculate a measure of performance for the model.

4.  Repeat this for each of the c models.

5.  Choose the one that gives best performance.
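The five steps can be sketched in a few lines of base R. The candidate models, the mtcars data, and the use of root mean squared error as the performance measure are assumptions for illustration; any model class and measure could take their place.

```r
set.seed(1)
k <- 10
n <- nrow(mtcars)

# Step 1: divide the rows at random into k parts of near-equal size
fold <- sample(rep(1:k, length.out = n))

# The candidate models M1, ..., Mc (hypothetical choices, c = 3)
models <- list(
  M1 = mpg ~ wt,
  M2 = mpg ~ wt + hp,
  M3 = mpg ~ wt + hp + qsec + drat
)

# Steps 2 to 4: for each model, predict every part from the other parts
cv_rmse <- sapply(models, function(f) {
  pred <- numeric(n)
  for (i in 1:k) {
    test <- fold == i
    fit <- lm(f, data = mtcars[!test, ])               # fit without part i
    pred[test] <- predict(fit, newdata = mtcars[test, ])
  }
  sqrt(mean((mtcars$mpg - pred)^2))                    # performance measure
})

cv_rmse

# Step 5: choose the model with the best (smallest) crossvalidated error
best <- names(which.min(cv_rmse))
best
```

Because every prediction comes from a model that never saw the predicted row, the errors of the three models are comparable even though the models differ in size.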

This can easily be implemented in R as follows:

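library(nnet)   # provides nnet() and its predict() method

f.cvANN <- function(X, Y, ng = 10) {
  # Make sure that Y is a factor
  if (!is.factor(Y)) Y <- factor(Y)

  # Initialization: z1 holds the predicted classes, z2 the class probabilities
  n <- nrow(X)
  ng <- min(ng, n)                       # no more groups than observations
  z1 <- character(n)
  z2 <- array(NA, c(n, nlevels(Y)))
  X$Y <- Y

  # Make a variable with the groups (near-equal sizes)
  # and make sure that it is random
  j <- sample(rep(1:ng, length.out = n))

  # Loop through the groups
  for (i in 1:ng) {
    # Estimate the model without group i. Y is already a factor, so it is
    # used directly and "." expands to the predictors only.
    z <- nnet(Y ~ ., data = X[j != i, ], maxit = 300, skip = TRUE, size = 10)

    # Prediction for the held-out group: class labels, then probabilities
    z1[j == i] <- predict(z, X[j == i, , drop = FALSE], type = "class")
    z2[j == i, ] <- predict(z, X[j == i, , drop = FALSE])
  }

  # Here you can continue with a different model if needed, and so on

  # Here we output the prediction
  # (or you may output some performance measure)
  data.frame(class = z1, p = round(z2, 4))
}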

## FUNCTION TO PREDICT THE CLASS FROM NUMERIC ANSWERS
## LIKE WHAT YOU GET FROM MARS OR ANN

classpred <- function(x) {
  # For each row, return the index of the column with the largest value
  # (the first such column if there are ties)
  unname(apply(as.matrix(x), 1, which.max))
}
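To see what this kind of function returns, here is a tiny worked example on a made-up matrix of class scores (the numbers are invented for illustration). It uses base R's max.col(), which performs the same row-wise "index of the largest value" operation.

```r
# Three cases (rows) scored on three classes (columns); values are made up
p <- rbind(c(0.90, 0.05, 0.05),
           c(0.10, 0.20, 0.70),
           c(0.30, 0.40, 0.30))

# Index of the largest score in each row; ties go to the first column
pred <- max.col(p, ties.method = "first")
pred
# 1 3 2
```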

> table(gr,classpred( f.cvANN(x[,1:4],gr,50)[,-1]))

     1  2  3
 1  11  0  0
 2   0 11  0
 3   1  0 18

# DO A SMALL SIMULATION: COUNT THE MISCLASSIFIED CASES IN EACH OF 100 RUNS.

> j = sum(gr != classpred(f.cvANN(x[,1:4], gr, 50)[,-1]))

> for (i in 2:100) j[i] = sum(gr != classpred(f.cvANN(x[,1:4], gr, 50)[,-1]))

MARS example

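library(mda)   # provides mars()

x.mars = mars(x[,1:4], x$gg)

round(predict(x.mars, type="class"))

You can do the same here as you did earlier with neural nets: crossvalidate the MARS fit and turn its numeric predictions into classes.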