Introduction to R
Data Set: Galapagos Islands
Variables:
Species: the number of plant species found on the island
Endemics: the number of endemic species
Area: the area of the island (km^2)
Elevation: the highest elevation of the island (m)
Nearest: the distance from the nearest island (km)
Scruz: the distance from Santa Cruz island (km)
Adjacent: the area of the adjacent island (km^2)
Reading the data in
The first step is to read the data in. You will need to obtain the data file gala.data and save it in R's working directory.
> gala <- read.table("gala.data")
> gala
             Species Endemics    Area Elevation Nearest Scruz Adjacent
Baltra            58       23   25.09       346     0.6   0.6     1.84
Bartolome         31       21    1.24       109     0.6  26.3   572.33
Caldwell           3        3    0.21       114     2.8  58.7     0.78
Champion          25        9    0.10        46     1.9  47.4     0.18
Coamano            2        1    0.05        77     1.9   1.9   903.82
Daphne.Major      18       11    0.34       119     8.0   8.0     1.84
Daphne.Minor      24        0    0.08        93     6.0  12.0     0.34
Darwin            10        7    2.33       168    34.1 290.2     2.85
Eden               8        4    0.03        71     0.4   0.4    17.95
Enderby            2        2    0.18       112     2.6  50.2     0.10
Espanola          97       26   58.27       198     1.1  88.3     0.57
Fernandina        93       35  634.49      1494     4.3  95.3  4669.32
Gardner1          58       17    0.57        49     1.1  93.1    58.27
Gardner2           5        4    0.78       227     4.6  62.2     0.21
Genovesa          40       19   17.35        76    47.4  92.2   129.49
Isabela          347       89 4669.32      1707     0.7  28.1   634.49
Marchena          51       23  129.49       343    29.1  85.9    59.56
Onslow             2        2    0.01        25     3.3  45.9     0.10
Pinta            104       37   59.56       777    29.1 119.6   129.49
Pinzon           108       33   17.95       458    10.7  10.7     0.03
Las.Plazas        12        9    0.23        94     0.5   0.6    25.09
Rabida            70       30    4.89       367     4.4  24.4   572.33
SanCristobal     280       65  551.62       716    45.2  66.6     0.57
SanSalvador      237       81  572.33       906     0.2  19.8     4.89
SantaCruz        444       95  903.82       864     0.6   0.0     0.52
SantaFe           62       28   24.08       259    16.5  16.5     0.52
SantaMaria       285       73  170.92       640     2.6  49.2     0.10
Seymour           44       16    1.84       147     0.6   9.6    25.09
Tortuga           16        8    1.24       186     6.8  50.9    17.95
Wolf              21       12    2.85       253    34.1 254.7     2.33
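As a self-contained sketch of the same step, the lines below write a three-island excerpt of the listing above to a temporary file and read it back. The names excerpt, tmp and gala_small are only illustrative and do not come from the original data file.

```r
# Three islands copied from the listing above.
excerpt <- c(
  "Species Endemics Area Elevation Nearest Scruz Adjacent",
  "Baltra 58 23 25.09 346 0.6 0.6 1.84",
  "Bartolome 31 21 1.24 109 0.6 26.3 572.33",
  "Caldwell 3 3 0.21 114 2.8 58.7 0.78"
)

# Write the excerpt to a temporary file so the example runs anywhere.
tmp <- tempfile(fileext = ".data")
writeLines(excerpt, tmp)

# The first line has one fewer field than the data lines, so read.table()
# uses it as the column names and uses the first column as the row names.
gala_small <- read.table(tmp)
print(gala_small)
```

Because the header line has only seven names for eight fields, no header or row.names argument is needed here.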
If your data file is stored in the folder Stat242, for example, and was created as a tab-delimited file using an editor, you may enter
> gala <- read.table("C:/Stat242/gala.data", sep="\t", quote="", header=TRUE, row.names=NULL)
The "<-" is the assignment operator, which stores the result of read.table() in the object gala. You can use "=" (the equals sign) as an alternative to "<-".
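A small sketch of the two assignment forms, using made-up object names x and y. It also shows the one place where "=" behaves differently: inside a function call it names an argument and does not create an object.

```r
x <- c(58, 31, 3)    # assignment with the arrow
y = c(58, 31, 3)     # assignment with the equals sign
identical(x, y)      # both objects hold the same values

# Inside a call, "=" only names an argument of mean();
# no object called na.rm is created in the workspace.
m <- mean(x, na.rm = TRUE)
exists("na.rm")
```

Most R code sticks to "<-" for assignment and keeps "=" for function arguments, which avoids any confusion between the two uses.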
We can check the dimensions of the data (number of rows, then number of columns):
> dim(gala)
[1] 30 7
If we don’t remember the variable (column) names we can enter:
> names(gala)
[1] "Species"   "Endemics"  "Area"      "Elevation" "Nearest"   "Scruz"
[7] "Adjacent"
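A few related inspection functions are sketched below on a small data frame built from four islands of the listing. The name gala_small is only illustrative.

```r
# Four islands and three columns copied from the listing above.
gala_small <- data.frame(
  Species   = c(58, 31, 3, 25),
  Endemics  = c(23, 21, 3, 9),
  Area      = c(25.09, 1.24, 0.21, 0.10),
  row.names = c("Baltra", "Bartolome", "Caldwell", "Champion")
)

nrow(gala_small)       # number of rows (islands)
ncol(gala_small)       # number of columns (variables)
rownames(gala_small)   # the island names
str(gala_small)        # compact overview: type and first values of each column
head(gala_small, 2)    # the first two rows only
```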
We can access the variables in gala directly by name after entering
> attach(gala)
Then entering the name of a variable, e.g. Species, shows all of its values:
> Species
 [1]  58  31   3  25   2  18  24  10   8   2  97  93  58   5  40 347  51   2 104
[20] 108  12  70 280 237 444  62 285  44  16  21
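An attached data frame stays on the search path until it is detached, which can cause confusion if another object has the same name as a column. One possible alternative, sketched here on an illustrative four-island data frame, is with(), which makes the columns visible only inside a single call.

```r
gala_small <- data.frame(
  Species = c(58, 31, 3, 25),
  Area    = c(25.09, 1.24, 0.21, 0.10)
)

# Columns are visible by name only while this one expression is evaluated.
m_with <- with(gala_small, mean(Species))

# The attach() route: remember to detach() when finished.
attach(gala_small)
m_attach <- mean(Species)
detach(gala_small)

c(m_with, m_attach)   # both give the same mean
```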
Numerical Summaries
One easy way to get the basic numerical summaries is:
> summary(gala)
   Species         Endemics            Area             Elevation
Min.   :  2.00   Min.   : 0.00   Min.   :   0.0100   Min.   :  25.00
1st Qu.: 13.00   1st Qu.: 7.25   1st Qu.:   0.2575   1st Qu.:  97.75
Median : 42.00   Median :18.00   Median :   2.5900   Median : 192.00
Mean   : 85.23   Mean   :26.10   Mean   : 261.7000   Mean   : 368.00
3rd Qu.: 96.00   3rd Qu.:32.25   3rd Qu.:  59.2400   3rd Qu.: 435.30
Max.   :444.00   Max.   :95.00   Max.   :4669.0000   Max.   :1707.00
   Nearest          Scruz           Adjacent
Min.   : 0.20   Min.   :  0.00   Min.   :   0.03
1st Qu.: 0.80   1st Qu.: 11.02   1st Qu.:   0.52
Median : 3.05   Median : 46.65   Median :   2.59
Mean   :10.06   Mean   : 56.98   Mean   : 261.10
3rd Qu.:10.02   3rd Qu.: 81.08   3rd Qu.:  59.24
Max.   :47.40   Max.   :290.20   Max.   :4669.00
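We can also compute these numbers separately. Here gala$Sp is short for gala$Species, because "$" accepts any unambiguous abbreviation of a column name: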
> gala$Species
 [1]  58  31   3  25   2  18  24  10   8   2  97  93  58   5  40 347  51   2 104
[20] 108  12  70 280 237 444  62 285  44  16  21
> mean(gala$Sp)
[1] 85.23333
> median(gala$Sp)
[1] 42
> min(gala$Sp)
[1] 2
> range(gala$Sp)
[1] 2 444
> quantile(gala$Sp)
  0%  25%  50%  75% 100%
   2   13   42   96  444
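quantile() can also return other percentiles through its probs argument. The sketch below re-enters the Species values printed above as a plain vector (the name species is illustrative) so that it runs on its own.

```r
# The 30 Species values printed above.
species <- c(58, 31, 3, 25, 2, 18, 24, 10, 8, 2, 97, 93, 58, 5, 40,
             347, 51, 2, 104, 108, 12, 70, 280, 237, 444, 62, 285, 44, 16, 21)

quartiles <- quantile(species, probs = c(0.25, 0.75))  # lower and upper quartiles
quartiles
IQR(species)                    # interquartile range: upper minus lower quartile
quantile(species, probs = 0.9)  # the 90th percentile
```

The two quartiles agree with the 25% and 75% entries of the default output, 13 and 96.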
We can get the variance and the standard deviation (sd):
> var(gala$Sp)
[1] 13140.74
> sqrt(var(gala$Sp))
[1] 114.6331
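We can write our own function to compute standard deviations. R already has a built-in sd(), so we give ours a different name to avoid masking it:
> mysd <- function(x) sqrt(var(x))
> mysd(gala$Sp)
[1] 114.6331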
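As a check on what var() computes, the sketch below works out the sample standard deviation directly from its definition (divisor n - 1) and compares it with R's built-in stats::sd(). The vector species repeats the values printed earlier; by_hand is an illustrative name.

```r
# The 30 Species values printed above.
species <- c(58, 31, 3, 25, 2, 18, 24, 10, 8, 2, 97, 93, 58, 5, 40,
             347, 51, 2, 104, 108, 12, 70, 280, 237, 444, 62, 285, 44, 16, 21)

n <- length(species)

# Sample standard deviation from the definition, with divisor n - 1.
by_hand <- sqrt(sum((species - mean(species))^2) / (n - 1))

by_hand
stats::sd(species)                      # built-in version
all.equal(by_hand, stats::sd(species))  # the two agree
```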
The correlations:
> cor(gala)
              Species     Endemics       Area   Elevation      Nearest
Species    1.00000000  0.970876516  0.6178431  0.73848666 -0.014094067
Endemics   0.97087652  1.000000000  0.6169791  0.79290437  0.005994286
Area       0.61784307  0.616979087  1.0000000  0.75373492 -0.111103196
Elevation  0.73848666  0.792904369  0.7537349  1.00000000 -0.011076984
Nearest   -0.01409407  0.005994286 -0.1111032 -0.01107698  1.000000000
Scruz     -0.17114244 -0.154264319 -0.1007849 -0.01543829  0.615410357
Adjacent   0.02616635  0.082658026  0.1800376  0.53645782 -0.116247885
                Scruz    Adjacent
Species   -0.17114244  0.02616635
Endemics  -0.15426432  0.08265803
Area      -0.10078493  0.18003759
Elevation -0.01543829  0.53645782
Nearest    0.61541036 -0.11624788
Scruz      1.00000000  0.05166066
Adjacent   0.05166066  1.00000000
Or, more neatly, rounded to three decimal places:
> round(cor(gala),3)
          Species Endemics   Area Elevation Nearest  Scruz Adjacent
Species     1.000    0.971  0.618     0.738  -0.014 -0.171    0.026
Endemics    0.971    1.000  0.617     0.793   0.006 -0.154    0.083
Area        0.618    0.617  1.000     0.754  -0.111 -0.101    0.180
Elevation   0.738    0.793  0.754     1.000  -0.011 -0.015    0.536
Nearest    -0.014    0.006 -0.111    -0.011   1.000  0.615   -0.116
Scruz      -0.171   -0.154 -0.101    -0.015   0.615  1.000    0.052
Adjacent    0.026    0.083  0.180     0.536  -0.116  0.052    1.000
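cor() also accepts two vectors and then returns a single correlation. The sketch below re-enters the Species and Endemics values printed in this handout (under the illustrative names species and endemics) and compares cor() with the textbook formula, covariance divided by the product of the standard deviations.

```r
# Species and Endemics for the 30 islands, in the order printed above.
species  <- c(58, 31, 3, 25, 2, 18, 24, 10, 8, 2, 97, 93, 58, 5, 40,
              347, 51, 2, 104, 108, 12, 70, 280, 237, 444, 62, 285, 44, 16, 21)
endemics <- c(23, 21, 3, 9, 1, 11, 0, 7, 4, 2, 26, 35, 17, 4, 19,
              89, 23, 2, 37, 33, 9, 30, 65, 81, 95, 28, 73, 16, 8, 12)

r <- cor(species, endemics)   # one number instead of a matrix
round(r, 3)

# The same quantity from its definition.
r_by_hand <- cov(species, endemics) /
  (stats::sd(species) * stats::sd(endemics))
all.equal(r, r_by_hand)
```

The rounded value matches the Species/Endemics entry of the matrix above.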
Another numerical summary with a graphical element is the stem and leaf plot:
> gala$En
 [1] 23 21  3  9  1 11  0  7  4  2 26 35 17  4 19 89 23  2 37 33  9 30 65 81 95
[26] 28 73 16  8 12
> stem(gala$En)
  The decimal point is 1 digit(s) to the right of the |

  0 | 01223447899
  1 | 12679
  2 | 13368
  3 | 0357
  4 |
  5 |
  6 | 5
  7 | 3
  8 | 19
  9 | 5
Graphical Summaries
We can make histograms and boxplots, and specify the title and axis labels if we like:
> hist(gala$Sp)
> hist(gala$Sp,main="Histogram of Species",xlab="number of Species")
> boxplot(gala$Sp)
Scatterplots are just as easy. In the second plot we put Area on a log scale because it is strongly skewed:
> plot(gala$Area,gala$Sp)
> plot(log(gala$Area),gala$Sp,xlab="log(Area)",ylab="Species")
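Another way to get a log scale, sketched below on six islands from the listing, is the log argument of plot(). It draws the axis on a log scale but labels the ticks in the original units. The vector names area and species are illustrative.

```r
# Baltra, Bartolome, Caldwell, Champion, Fernandina, Isabela (from the listing).
area    <- c(25.09, 1.24, 0.21, 0.10, 634.49, 4669.32)
species <- c(58, 31, 3, 25, 93, 347)

# log = "x" puts only the x-axis on a log scale; all x values must be positive.
plot(area, species, log = "x",
     xlab = "Area (km^2, log scale)", ylab = "Species")
```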
We can make a scatterplot matrix:
> pairs(gala)
> plot(gala) # also a scatterplot matrix
We can put several plots in one display:
> par(mfrow=c(2,2))
> boxplot(gala$Ar)
> boxplot(gala$Adj)
> boxplot(gala$Elev)
> boxplot(gala$Sc)
> par(mfrow=c(1,1)) # back to 1 plot display
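par() returns the settings it replaces, so one possible way to get back to the previous layout, whatever it was, is to save that return value and pass it to par() again. A small sketch with illustrative names area and old_par:

```r
# Six island areas taken from the listing above.
area <- c(25.09, 1.24, 0.21, 0.10, 634.49, 4669.32)

old_par <- par(mfrow = c(1, 2))   # 1 row, 2 columns; old settings are saved
boxplot(area, main = "Area")
boxplot(log(area), main = "log(Area)")
par(old_par)                      # restore the layout that was in use before
```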
Selecting subsets of the data
Second row:
> gala[2,]
          Species Endemics Area Elevation Nearest Scruz Adjacent
Bartolome      31       21 1.24       109     0.6  26.3   572.33
Third column:
> gala[,3]
 [1]   25.09    1.24    0.21    0.10    0.05    0.34    0.08    2.33    0.03
[10]    0.18   58.27  634.49    0.57    0.78   17.35 4669.32  129.49    0.01
[19]   59.56   17.95    0.23    4.89  551.62  572.33  903.82   24.08  170.92
[28]    1.84    1.24    2.85
The element in row 2, column 3:
> gala[2,3]
[1] 1.24
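Rows and columns can also be selected by name, which is often easier to read than positions. A minimal sketch on three islands from the listing (gala_small is an illustrative name):

```r
gala_small <- data.frame(
  Species   = c(58, 31, 3),
  Endemics  = c(23, 21, 3),
  Area      = c(25.09, 1.24, 0.21),
  row.names = c("Baltra", "Bartolome", "Caldwell")
)

gala_small["Bartolome", ]         # same row as gala_small[2, ]
gala_small[, "Area"]              # same column as gala_small[, 3]
gala_small["Bartolome", "Area"]   # same element as gala_small[2, 3]

# Names and c() combine in the same way as positions do.
gala_small[c("Baltra", "Caldwell"), c("Species", "Area")]
```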
c() is a function for making vectors, e.g.
> c(1,4,8)
[1] 1 4 8
Select the first, fourth and eighth rows:
> gala[c(1,4,8),]
         Species Endemics  Area Elevation Nearest Scruz Adjacent
Baltra        58       23 25.09       346     0.6   0.6     1.84
Champion      25        9  0.10        46     1.9  47.4     0.18
Darwin        10        7  2.33       168    34.1 290.2     2.85
The : operator is good for making sequences, e.g.
> 3:11
[1] 3 4 5 6 7 8 9 10 11
We can select the third through eleventh rows:
> gala[3:11,]
             Species Endemics  Area Elevation Nearest Scruz Adjacent
Caldwell           3        3  0.21       114     2.8  58.7     0.78
Champion          25        9  0.10        46     1.9  47.4     0.18
Coamano            2        1  0.05        77     1.9   1.9   903.82
Daphne.Major      18       11  0.34       119     8.0   8.0     1.84
Daphne.Minor      24        0  0.08        93     6.0  12.0     0.34
Darwin            10        7  2.33       168    34.1 290.2     2.85
Eden               8        4  0.03        71     0.4   0.4    17.95
Enderby            2        2  0.18       112     2.6  50.2     0.10
Espanola          97       26 58.27       198     1.1  88.3     0.57
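For sequences that do not go up in steps of one, seq() is a more general companion to the : operator. The sketch below uses an illustrative four-island data frame, gala_small, to pick every other row.

```r
3:11                          # steps of one
seq(3, 11, by = 2)            # 3 5 7 9 11
seq(1, 10, length.out = 4)    # 1 4 7 10

gala_small <- data.frame(
  Species   = c(58, 31, 3, 25),
  Area      = c(25.09, 1.24, 0.21, 0.10),
  row.names = c("Baltra", "Bartolome", "Caldwell", "Champion")
)

# Every other row, starting with the first.
odd_rows <- gala_small[seq(1, nrow(gala_small), by = 2), ]
odd_rows
```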
We can use "-" to indicate "everything but", e.g. all the data except the first two columns is:
> gala[,-c(1,2)]
                Area Elevation Nearest Scruz Adjacent
Baltra         25.09       346     0.6   0.6     1.84
Bartolome       1.24       109     0.6  26.3   572.33
Caldwell        0.21       114     2.8  58.7     0.78
Champion        0.10        46     1.9  47.4     0.18
Coamano         0.05        77     1.9   1.9   903.82
Daphne.Major    0.34       119     8.0   8.0     1.84
Daphne.Minor    0.08        93     6.0  12.0     0.34
Darwin          2.33       168    34.1 290.2     2.85
Eden            0.03        71     0.4   0.4    17.95
Enderby         0.18       112     2.6  50.2     0.10
Espanola       58.27       198     1.1  88.3     0.57
Fernandina    634.49      1494     4.3  95.3  4669.32
Gardner1        0.57        49     1.1  93.1    58.27
Gardner2        0.78       227     4.6  62.2     0.21
Genovesa       17.35        76    47.4  92.2   129.49
Isabela      4669.32      1707     0.7  28.1   634.49
Marchena      129.49       343    29.1  85.9    59.56
Onslow          0.01        25     3.3  45.9     0.10
Pinta          59.56       777    29.1 119.6   129.49
Pinzon         17.95       458    10.7  10.7     0.03
Las.Plazas      0.23        94     0.5   0.6    25.09
Rabida          4.89       367     4.4  24.4   572.33
SanCristobal  551.62       716    45.2  66.6     0.57
SanSalvador   572.33       906     0.2  19.8     4.89
SantaCruz     903.82       864     0.6   0.0     0.52
SantaFe        24.08       259    16.5  16.5     0.52
SantaMaria    170.92       640     2.6  49.2     0.10
Seymour         1.84       147     0.6   9.6    25.09
Tortuga         1.24       186     6.8  50.9    17.95
Wolf            2.85       253    34.1 254.7     2.33
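The minus sign works with positions only, not with column names. Two possible ways to drop columns by name are sketched below on an illustrative three-island data frame: setdiff() on the column names, and the select argument of subset().

```r
gala_small <- data.frame(
  Species   = c(58, 31, 3),
  Endemics  = c(23, 21, 3),
  Area      = c(25.09, 1.24, 0.21),
  Elevation = c(346, 109, 114),
  row.names = c("Baltra", "Bartolome", "Caldwell")
)

by_position <- gala_small[, -c(1, 2)]   # as in the text above

# Keep every column whose name is not in the list to drop.
keep    <- setdiff(names(gala_small), c("Species", "Endemics"))
by_name <- gala_small[, keep]

# subset() accepts unquoted column names, with "-" meaning "drop".
by_subset <- subset(gala_small, select = -c(Species, Endemics))

names(by_name)
```

All three results contain only the Area and Elevation columns.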
We may also want to select subsets on the basis of some criterion, e.g. which islands exceed 500 km^2 in area:
> gala[gala$Area > 500,]
             Species Endemics    Area Elevation Nearest Scruz Adjacent
Fernandina        93       35  634.49      1494     4.3  95.3  4669.32
Isabela          347       89 4669.32      1707     0.7  28.1   634.49
SanCristobal     280       65  551.62       716    45.2  66.6     0.57
SanSalvador      237       81  572.33       906     0.2  19.8     4.89
SantaCruz        444       95  903.82       864     0.6   0.0     0.52
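Criteria can be combined with & (and) and | (or). The sketch below uses six islands from the listing in an illustrative data frame, gala_small, and also shows subset(), which lets the columns be written without the gala_small$ prefix.

```r
gala_small <- data.frame(
  Area      = c(25.09, 634.49, 4669.32, 551.62, 572.33, 903.82),
  Elevation = c(346, 1494, 1707, 716, 906, 864),
  row.names = c("Baltra", "Fernandina", "Isabela",
                "SanCristobal", "SanSalvador", "SantaCruz")
)

# Islands that are both large and high.
big_and_high <- gala_small[gala_small$Area > 500 & gala_small$Elevation > 1000, ]
rownames(big_and_high)

# The same selection written with subset().
same <- subset(gala_small, Area > 500 & Elevation > 1000)

# Islands that are either small or low.
small_or_low <- gala_small[gala_small$Area <= 500 | gala_small$Elevation < 800, ]
rownames(small_or_low)
```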
Learning more about R
While running R you can get help about a particular command. For example, if you want help about the stem() command, just type
> help(stem)
If you don't know the name of the command that you want to use, then type
> help.start()
and then browse.
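A few other standard ways of finding out about a command are sketched below. What apropos() lists depends on which packages are loaded, so its output will vary between sessions.

```r
?stem             # shorthand for help(stem)
args(stem)        # the argument list of stem()
apropos("stem")   # names of visible objects that contain "stem"
example(stem)     # run the examples from the stem() help page
```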
A short introduction to R is given at
A detailed introduction to R can be found at