Training and testing sets, Cross-validation
Training and testing sets:
· Which percentage of the data should go to the training and testing sets?
· Should I reverse the sets and repeat the analysis?
· Partition the data into many parts? This sounds more and more like cross-validation.
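The questions above can be tried out on a small example. This is only a sketch: the built-in iris data, the 70% training fraction, and the linear model are illustrative assumptions, not recommendations from these notes.

```r
# Hold-out split on the built-in iris data (assumed example data)
set.seed(1)                       # reproducible split
n <- nrow(iris)
train_frac <- 0.7                 # assumed fraction, chosen only for illustration
train_idx <- sample(n, size = round(train_frac * n))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit on the training set only, evaluate on the held-out test set
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = train)
rmse_test <- sqrt(mean((test$Sepal.Length - predict(fit, newdata = test))^2))

# "Reverse the sets": fit on the test set, evaluate on the training set
fit_rev <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = test)
rmse_rev <- sqrt(mean((train$Sepal.Length - predict(fit_rev, newdata = train))^2))

c(rmse_test = rmse_test, rmse_rev = rmse_rev)
```

Repeating this with different seeds shows how much the answer depends on one particular split, which is the motivation for using many parts.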
Cross-validation
Consider a class of models M1, …, Mc. The objective could be:
· Choose the best model for our data
· Calculate residuals that are not biased
· Calculate a measure of performance.
The algorithm:
1. Divide the data at random into k parts (say k=10), {P1,…, Pk}, and choose one of the c models.
2. For each part, fit the model without using that part and then predict the part; for example, predict P1 from a model fitted on P2,…, Pk.
3. Calculate a measure of performance for the model from these out-of-sample predictions.
4. Repeat this for each of the c models.
5. Choose the one that gives the best performance.
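As a minimal sketch of steps 1 to 5, assume three linear regression models on the built-in mtcars data stand in for M1,…, Mc. The formulas, the RMSE performance measure, and the names part and cv_rmse are illustrative choices, not part of the notes.

```r
set.seed(1)
dat <- mtcars                     # assumed example data
k <- 10
n <- nrow(dat)

# Candidate models M1..Mc (illustrative formulas)
models <- list(
  M1 = mpg ~ wt,
  M2 = mpg ~ wt + hp,
  M3 = mpg ~ wt + hp + qsec
)

# Step 1: assign every observation at random to one of k parts
part <- sample(rep(1:k, length.out = n))

# Steps 2-4: for each model, predict each part from a fit on the other parts
cv_rmse <- sapply(models, function(f) {
  pred <- numeric(n)
  for (i in 1:k) {
    fit <- lm(f, data = dat[part != i, ])
    pred[part == i] <- predict(fit, newdata = dat[part == i, ])
  }
  # dat$mpg - pred are the cross-validated residuals
  sqrt(mean((dat$mpg - pred)^2))
})
cv_rmse

# Step 5: choose the model with the smallest cross-validated RMSE
best <- names(which.min(cv_rmse))
best
```

The same loop works for any learner with a fit and a predict step; only the two lines inside the for loop change.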
The algorithm can easily be implemented in R, here for a neural network classifier (nnet), as follows:
library(nnet)   # provides nnet()
f.cvANN <- function(X, Y, ng = 10) {
  # Make sure that Y is a factor and X is a data frame
  if (!is.factor(Y)) Y <- factor(Y)
  X <- as.data.frame(X)
  # Initialization: z1 holds the predicted classes, z2 the class probabilities
  n <- nrow(X); z1 <- character(n); z2 <- array(NA, c(n, nlevels(Y)))
  X$Y <- Y
  # Make a variable with the groups and
  # make sure that it is random
  j <- sample(rep(1:ng, length.out = n))
  # Loop through the groups
  for (i in 1:ng) {
    if (!any(j == i)) next   # skip empty groups (possible when ng > n)
    # Estimate the model without group i
    z <- nnet(Y ~ ., data = X[j != i, ], maxit = 300, skip = TRUE, size = 10)
    # Prediction for group i
    z1[j == i] <- predict(z, X[j == i, , drop = FALSE], type = "class")
    z2[j == i, ] <- predict(z, X[j == i, , drop = FALSE])
  }
  # Here you can continue with a different model
  # if needed and so on
  # Here we output the prediction
  data.frame(class = z1, p = round(z2, 4))
  # Or you may output some performance measure
}
## FUNCTION TO PREDICT THE CLASS FROM NUMERIC ANSWERS
## LIKE WHAT YOU GET FROM MARS OR ANN
classpred <- function(x) {
  # Column index of the largest value in each row;
  # ties go to the first column so exactly one class is returned per row
  max.col(as.matrix(x), ties.method = "first")
}
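To make the row-wise rule concrete, here is a small sketch on a made-up matrix of class scores. The values in p are purely illustrative and do not come from the data used in these notes. It relies only on base R's max.col and which.max.

```r
# Each row holds the numeric answers for one observation, one column per class
p <- matrix(c(0.10, 0.70, 0.20,
              0.80, 0.15, 0.05,
              0.30, 0.30, 0.40),
            ncol = 3, byrow = TRUE)

# Predicted class = column with the largest value in each row
pred_a <- max.col(p, ties.method = "first")
pred_b <- apply(p, 1, which.max)   # equivalent here, row by row
pred_a                              # 2 1 3
```

With ties.method = "first" a row with tied maxima still yields a single class, which avoids assigning several values to one element.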
> table(gr, classpred(f.cvANN(x[,1:4], gr, 50)[,-1]))
     1  2  3
  1 11  0  0
  2  0 11  0
  3  1  0 18
# DO A SMALL SIMULATION AND GET RESULTS.
# j[i] is the number of misclassified observations in repetition i
> j = sum(gr != classpred(f.cvANN(x[,1:4], gr, 50)[,-1]))
> for (i in 2:100) j[i] = sum(gr != classpred(f.cvANN(x[,1:4], gr, 50)[,-1]))
MARS example
library(mda)   # provides mars()
x.mars = mars(x[,1:4], x$gg)
round(predict(x.mars, type="class"))
The same cross-validation procedure used earlier with neural nets can be applied here.
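One way to reuse the cross-validation loop for a learner that returns numeric answers is to pass the fitting and prediction steps as functions. The sketch below is hypothetical: cv_numeric, fit_lm and pred_lm are names invented here, and lm on the iris data stands in for mars so that the example runs with base R only.

```r
# Generic k-fold prediction for learners with numeric output.
# fit_fun(X, y) returns a model; pred_fun(model, X) returns numeric predictions.
cv_numeric <- function(X, y, fit_fun, pred_fun, ng = 10) {
  n <- nrow(X)
  grp <- sample(rep(1:ng, length.out = n))   # random group labels
  out <- numeric(n)
  for (i in 1:ng) {
    held <- grp == i
    if (!any(held)) next                     # skip empty groups
    model <- fit_fun(X[!held, , drop = FALSE], y[!held])
    out[held] <- pred_fun(model, X[held, , drop = FALSE])
  }
  out
}

# Stand-in learner (assumption): linear regression on numeric class codes
fit_lm  <- function(X, y) lm(y ~ ., data = cbind(X, y = y))
pred_lm <- function(model, X) predict(model, newdata = X)
# For MARS one would presumably pass wrappers around mda::mars and its
# predict method instead; that variant is not tested here.

set.seed(1)
X <- iris[, 1:4]
y <- as.numeric(iris$Species)                # classes coded 1, 2, 3
cv_pred  <- cv_numeric(X, y, fit_lm, pred_lm, ng = 10)
cv_class <- pmin(pmax(round(cv_pred), 1), 3) # round and clip to valid codes
table(observed = y, predicted = cv_class)
```

Rounding the numeric prediction mirrors the round(predict(...)) step above; clipping keeps rounded values inside the range of class codes.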