Data Analysis and Modeling using R

1)  Read the data set

Data can be read from Excel files, HTML pages, text files, web URLs, database files, XML, and other sources.
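As a minimal sketch of the simplest case, a delimited text file can be read with base R's read.csv(). The temporary file, the object names (csv_path, demo_data) and the column names below are made up for illustration. The same function also accepts a URL in place of a file path, while Excel files generally need an add-on package.

```r
# Write a small CSV to a temporary file (illustrative data only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,score,group",
             "1,10.5,A",
             "2,NA,B",
             "3,7.25,A"), csv_path)

# Read it back into a data frame; the string "NA" is treated as missing by default
demo_data <- read.csv(csv_path, stringsAsFactors = FALSE)

str(demo_data)   # structure: column names and types
```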

2)  Load the data set

load("Full path to rda data set")

3)  Summarize the data

summary(Data set name)

4)  Count the missing values

sapply(data set name, function(x)(sum(is.na(x)))) # NA counts
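The command above can be tried on airquality, a data set that ships with R and contains missing values. This is only an illustrative sketch; the object name na_counts is an assumption, not part of the original notes.

```r
# Count missing values in every column (same pattern as above)
na_counts <- sapply(airquality, function(x) sum(is.na(x)))
na_counts

# Equivalent vectorized form
colSums(is.na(airquality))

# Share of missing values per column, in percent
round(100 * colMeans(is.na(airquality)), 1)
```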

5)  Computing new variable:

Data Set name$new variable name <- with(data set name, calculation/logic (using if, while, for loops, etc.))
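A short sketch of this pattern on the built-in mtcars data. The new column names (hp_per_wt, mpg_class) and the cutoff of 20 are arbitrary choices for illustration. For row-wise conditions, the vectorized ifelse() is usually simpler than an explicit loop.

```r
cars_df <- mtcars   # work on a copy of a built-in data set

# Arithmetic on existing columns: horsepower per unit of weight
cars_df$hp_per_wt <- with(cars_df, hp / wt)

# Conditional logic, applied to every row at once with ifelse()
cars_df$mpg_class <- with(cars_df, ifelse(mpg >= 20, "high", "low"))

head(cars_df[, c("hp", "wt", "hp_per_wt", "mpg", "mpg_class")])
```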

6)  Adding Observation Number:

Data set Name$ObsNumber <- 1:nrow(Data set Name) # e.g. 1:100 for a data set with 100 observations

7)  Standardize variable:

.Z <- scale(Data Set name[,c("variable name")])

Data Set name$Z.variable name <- .Z[,1]

To standardize more than one variable, pass all of them to scale() and repeat the second command, changing 1 to 2, 3, etc.
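A minimal sketch of standardizing two variables at once with base R's scale(), using the built-in mtcars data. The object names (vars, z, std_data) are assumptions made for the example.

```r
vars <- c("mpg", "hp")
z <- scale(mtcars[, vars])   # matrix of z-scores, one column per variable

std_data <- mtcars
std_data$Z.mpg <- z[, 1]     # first standardized variable
std_data$Z.hp  <- z[, 2]     # second standardized variable

# Each standardized column should have mean 0 and standard deviation 1
round(colMeans(std_data[, c("Z.mpg", "Z.hp")]), 10)
sapply(std_data[, c("Z.mpg", "Z.hp")], sd)
```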

8)  Converting continuous variable to bin

Data Set name$derived bin variable <- bin.var(Data Set name$continuous variable name, bins=# of bins, method='intervals', labels=FALSE)
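Where bin.var is not available, base R's cut() can produce similar bins. The sketch below is an approximation under that assumption, not the same function: it shows equal-width bins and, as an alternative, equal-count bins on the built-in mtcars data. The object names are illustrative.

```r
x <- mtcars$mpg

# Equal-width bins; labels = FALSE returns the bin number (1 to 4)
width_bin <- cut(x, breaks = 4, labels = FALSE)
table(width_bin)

# Equal-count bins, using the quartiles as break points
count_bin <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
                 include.lowest = TRUE, labels = FALSE)
table(count_bin)
```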

9)  Dropping variable:

Data Set Name$variable name <- NULL

10) Renaming variable

names(data set name)[c(Col#)] <- c("New name")

11) Decile analysis

numSummary(Data Set name[,"variable"], statistics=c("mean", "sd", "quantiles"), quantiles=c(0,.25,.5,.75,1))

The quantiles shown are quartiles; for deciles use quantiles=seq(0,1,.1). The cut points can be changed according to the business need and the purpose of the analysis.
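For true deciles, one option is base R's quantile() with ten equal probability steps. The following sketch uses the built-in mtcars data; the object names (deciles, decile_group) are assumptions for the example.

```r
x <- mtcars$mpg

# Decile cut points: 0%, 10%, ..., 100%
deciles <- quantile(x, probs = seq(0, 1, by = 0.1))
deciles

# Assign each observation to a decile group (1 = lowest, 10 = highest)
decile_group <- cut(x, breaks = deciles, include.lowest = TRUE, labels = FALSE)

# Mean of the variable within each decile group
tapply(x, decile_group, mean)
```

If the variable has many tied values, the cut points may not be unique and cut() will stop with an error; in that case fewer groups are needed.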

12) Frequency Analysis for categorical variables

.Table <- table(Data Set Name$variable Name)

.Table # counts for variable Name

100*.Table/sum(.Table) # percentages for variable name

remove(.Table)
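The same frequency analysis, sketched on the cyl column of the built-in mtcars data. prop.table() is used here as a shorthand for dividing by the total; the object names are illustrative.

```r
counts <- table(mtcars$cyl)                # counts per category
counts

percentages <- 100 * prop.table(counts)    # same as 100 * counts / sum(counts)
round(percentages, 1)
```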

13) Correlation Matrix

cor(Data Set Name[,c("var_1","var_2","var_3",…,"var_n")], use="complete.obs")
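A small sketch of the correlation matrix on the built-in airquality data, which contains missing values. It also shows the pairwise alternative for comparison; the object names are assumptions for the example.

```r
vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# "complete.obs" drops every row that has an NA in any selected column
cor_matrix <- cor(airquality[, vars], use = "complete.obs")
round(cor_matrix, 2)

# "pairwise.complete.obs" uses all rows available for each pair of variables
cor_pairwise <- cor(airquality[, vars], use = "pairwise.complete.obs")
round(cor_pairwise, 2)
```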

14) Principal Component Analysis using Scree plot

.PC <- princomp(~Var_1+Var_2+…+Var_n, cor=TRUE, data=data set name)

unclass(loadings(.PC)) # component loadings

.PC$sd^2 # component variances

screeplot(.PC)

data set name$PC1 <- .PC$scores[,1]

Data Set Name$PC2 <- .PC$scores[,2]

Data Set name$PC3 <- .PC$scores[,3]

remove(.PC)

Only three principal components are retained here. The number of components to retain depends on the data size and the analysis requirements.
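One common way to decide how many components to keep is to look at the cumulative proportion of variance alongside the scree plot. The sketch below does this on the built-in USArrests data and keeps two components; that choice and the object names are illustrative assumptions.

```r
pc <- princomp(~ Murder + Assault + UrbanPop + Rape, data = USArrests, cor = TRUE)

variances <- pc$sdev^2                   # component variances (eigenvalues)
prop_var  <- variances / sum(variances)  # proportion of variance per component
round(cumsum(prop_var), 3)               # cumulative proportion

screeplot(pc, type = "lines")            # look for the "elbow" in the plot

# Keep the first two component scores as new columns
arrests <- USArrests
arrests$PC1 <- pc$scores[, 1]
arrests$PC2 <- pc$scores[, 2]
```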

15) Factor Analysis

.FA <- factanal(~Var_1+Var_2+…+Var_n, factors=# of factors to retain, rotation="varimax", scores="none", data=Data Set Name)

.FA

remove(.FA)
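A runnable sketch of factanal() using ability.cov, a covariance matrix of six ability tests that ships with R. Because only the covariance matrix is available for that data set, the covmat argument is used here instead of the formula interface shown above; the choice of two factors is an assumption for illustration.

```r
# Two-factor solution with varimax rotation, fitted from a covariance matrix
fa <- factanal(factors = 2, covmat = ability.cov, rotation = "varimax")

print(fa$loadings, cutoff = 0.3)   # hide small loadings for readability
round(fa$uniquenesses, 3)          # variance not explained by the common factors
```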

16) Cluster Analysis

A) K Means Cluster

.cluster <- KMeans(model.matrix(~-1 + Var_1 + Var_2+…+ Var_n, Data Set Name), centers = # of clusters, iter.max = 10, num.seeds = 10)

.cluster$size # Cluster Sizes

.cluster$centers # Cluster Centroids

.cluster$withinss # Within Cluster Sum of Squares

.cluster$tot.withinss # Total Within Sum of Squares

.cluster$betweenss # Between Cluster Sum of Squares

biplot(princomp(model.matrix(~-1 + Var_1 + Var_2+…+ Var_n, Data Set Name)), xlabs = as.character(.cluster$cluster))

Data Set Name$KMeans <- assignCluster(model.matrix(~-1 + Var_1 + Var_2+…+ Var_n, Data Set Name), Data Set Name, .cluster$cluster)

remove(.cluster)
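KMeans and assignCluster above are not base R functions. As a rough equivalent under that assumption, the sketch below uses kmeans() from the stats package on the built-in USArrests data, with nstart = 10 standing in for num.seeds = 10 (several random starts, best solution kept). Three clusters and the object names are arbitrary choices for the example.

```r
set.seed(42)                      # k-means starts from random centers
x <- scale(USArrests)             # put all variables on a common scale

km <- kmeans(x, centers = 3, iter.max = 10, nstart = 10)

km$size                           # cluster sizes
km$centers                        # cluster centroids
km$withinss                       # within-cluster sum of squares, per cluster
km$tot.withinss                   # total within-cluster sum of squares
km$betweenss                      # between-cluster sum of squares

# Attach the cluster membership to a copy of the data
clustered <- USArrests
clustered$cluster <- factor(km$cluster)
```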

B) Hierarchical Cluster:

Cluster Solution Name <- hclust(dist(model.matrix(~-1 + Var_1 + Var_2+…+ Var_n, Data Set Name)), method= "ward.D")

plot(Cluster Solution Name, main= "Cluster Dendrogram for Solution Cluster Solution Name", xlab= "Observation Number in Data Set Name", sub="Method=ward.D; Distance=euclidean")

In R 3.1.0 and later, the method formerly called "ward" is named "ward.D".

Note: Considerable experimentation is needed to select the number of clusters, the clustering method, etc., so that the solution addresses the actual business need and the clustering can be justified.
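A sketch of that kind of experimentation on the built-in USArrests data: the tree is built once and then cut at several candidate numbers of clusters to compare group sizes. The range 2 to 5, the final choice of 3 and the object names are assumptions for illustration; "ward.D" is the current name in R for the method older versions called "ward".

```r
d  <- dist(scale(USArrests))           # Euclidean distances on scaled data
hc <- hclust(d, method = "ward.D")

plot(hc, main = "Cluster Dendrogram", xlab = "State", sub = "")

# Cluster sizes for several candidate numbers of clusters
for (k in 2:5) {
  cat("k =", k, ": sizes", as.vector(table(cutree(hc, k = k))), "\n")
}

# Membership for one chosen solution
groups <- cutree(hc, k = 3)
table(groups)
```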

Logistic Regression Model

1)Fitting the Model:

Model Name <- glm(Dependent Variable ~ Var_1+Var_2+…+Var_n, family=binomial(logit), data=Data Set Name)

summary(Model Name)
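A minimal worked example on the built-in mtcars data, modeling the binary variable am (transmission type) from wt (weight). The choice of variables, the 0.5 cutoff and the object names are assumptions made for illustration.

```r
fit <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)
summary(fit)

exp(coef(fit))                            # coefficients as odds ratios

# Predicted probabilities and a simple classification table
prob <- predict(fit, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)
table(observed = mtcars$am, predicted = pred)
```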

2)Confidence Intervals for Model Coefficients

Confint(Model Name, level=.95, type="LR")

The level can be adjusted based on the sample size and model requirements.
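Where Confint is not available, base R offers comparable intervals. The sketch below assumes a small example model on mtcars: confint() gives profile-likelihood intervals for a glm, and confint.default() gives the simpler Wald intervals, shown at two levels for comparison.

```r
fit <- glm(am ~ wt, family = binomial, data = mtcars)

# Profile-likelihood intervals (likelihood-ratio based)
ci_lr <- confint(fit, level = 0.95)
ci_lr

# Wald intervals at two confidence levels; the 90% interval is narrower
ci_wald_95 <- confint.default(fit, level = 0.95)
ci_wald_90 <- confint.default(fit, level = 0.90)
ci_wald_95
ci_wald_90
```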

3)Model Accuracy Test:

a)  AIC Value:

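AIC(Model Name) # a smaller value indicates a better fit when comparing models fitted to the same data

b)  BIC Value:

BIC(Model Name) # a smaller value indicates a better fit when comparing models fitted to the same data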
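AIC and BIC are only meaningful in comparison, so a typical use is to fit two candidate models to the same data and compare the values. The two models below, on the built-in mtcars data, are arbitrary examples chosen for illustration.

```r
m1 <- glm(am ~ wt,      family = binomial, data = mtcars)
m2 <- glm(am ~ wt + hp, family = binomial, data = mtcars)

AIC(m1, m2)   # data frame with degrees of freedom and AIC per model
BIC(m1, m2)   # BIC penalizes extra parameters more heavily than AIC

# Both are computed from the log-likelihood and the number of parameters k:
# AIC = -2 * logLik + 2 * k,  BIC = -2 * logLik + log(n) * k
k <- length(coef(m1))
n <- nrow(mtcars)
-2 * as.numeric(logLik(m1)) + 2 * k
-2 * as.numeric(logLik(m1)) + log(n) * k
```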

4)Visualization of Model:

a)  Basic Diagnostic Plots:

oldpar <- par(oma=c(0,0,3,0), mfrow=c(2,2))

plot(Model Name)

par(oldpar)

b)  Component + Residual Plot:

crPlots(Model Name, ask=FALSE)

c)  Influence Plot:

influencePlot(Model Name)

d)  Effect Plot:

trellis.device(theme="col.whitebg")

plot(allEffects(Model Name), ask=FALSE)