Data Set: Galapagos Islands

Introduction to R

Variables:

Species: the number of plant species found on the island

Endemics: the number of endemic species

Area: the area of the island (km^2)

Elevation: the highest elevation of the island (m)

Nearest: the distance from the nearest island (km)

Scruz: the distance from Santa Cruz island (km)

Adjacent: the area of the adjacent island (km^2)

Reading the data in

The first step is to read the data into R. You'll first need to get the data file and save it where R can find it (for the command below, in R's working directory).

> gala <- read.table("gala.data")

> gala

             Species Endemics    Area Elevation Nearest Scruz Adjacent
Baltra            58       23   25.09       346     0.6   0.6     1.84
Bartolome         31       21    1.24       109     0.6  26.3   572.33
Caldwell           3        3    0.21       114     2.8  58.7     0.78
Champion          25        9    0.10        46     1.9  47.4     0.18
Coamano            2        1    0.05        77     1.9   1.9   903.82
Daphne.Major      18       11    0.34       119     8.0   8.0     1.84
Daphne.Minor      24        0    0.08        93     6.0  12.0     0.34
Darwin            10        7    2.33       168    34.1 290.2     2.85
Eden               8        4    0.03        71     0.4   0.4    17.95
Enderby            2        2    0.18       112     2.6  50.2     0.10
Espanola          97       26   58.27       198     1.1  88.3     0.57
Fernandina        93       35  634.49      1494     4.3  95.3  4669.32
Gardner1          58       17    0.57        49     1.1  93.1    58.27
Gardner2           5        4    0.78       227     4.6  62.2     0.21
Genovesa          40       19   17.35        76    47.4  92.2   129.49
Isabela          347       89 4669.32      1707     0.7  28.1   634.49
Marchena          51       23  129.49       343    29.1  85.9    59.56
Onslow             2        2    0.01        25     3.3  45.9     0.10
Pinta            104       37   59.56       777    29.1 119.6   129.49
Pinzon           108       33   17.95       458    10.7  10.7     0.03
Las.Plazas        12        9    0.23        94     0.5   0.6    25.09
Rabida            70       30    4.89       367     4.4  24.4   572.33
SanCristobal     280       65  551.62       716    45.2  66.6     0.57
SanSalvador      237       81  572.33       906     0.2  19.8     4.89
SantaCruz        444       95  903.82       864     0.6   0.0     0.52
SantaFe           62       28   24.08       259    16.5  16.5     0.52
SantaMaria       285       73  170.92       640     2.6  49.2     0.10
Seymour           44       16    1.84       147     0.6   9.6    25.09
Tortuga           16        8    1.24       186     6.8  50.9    17.95
Wolf              21       12    2.85       253    34.1 254.7     2.33

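If your data file is stored in another folder, for example C:/Stat242, and the file was created using an editor that separates the fields with tabs, you may enter

> gala <- read.table("C:/Stat242/gala.data", sep="\t", quote="", header=TRUE, row.names=NULL)

Here sep="\t" says that the fields are separated by tabs, header=TRUE says that the first line holds the column names, and row.names=NULL numbers the rows instead of taking row names from the file.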

The "<-" is the assignment operator: it stores the data frame returned by read.table() in the object gala. You can use "=" (the equals sign) as an alternative to "<-".
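
As a minimal sketch of what read.table() does with a file laid out like the listing above, the lines below read a three-row excerpt typed inline instead of from disk. The object names txt, mini and mini2 are made up for this illustration, and the excerpt is assumed to be separated by spaces. Because the header line has one fewer field than the data lines, read.table() treats the first line as a header and uses the first field of each row as the row name.

```r
# Hypothetical three-row excerpt of the data, typed inline so the
# example runs without the file gala.data being present.
txt <- "Species Endemics Area Elevation Nearest Scruz Adjacent
Baltra 58 23 25.09 346 0.6 0.6 1.84
Bartolome 31 21 1.24 109 0.6 26.3 572.33
Caldwell 3 3 0.21 114 2.8 58.7 0.78"

# The header has 7 names but each data row has 8 fields, so the
# first field (the island name) becomes the row name.
mini <- read.table(text = txt)
print(dim(mini))        # rows, then columns
print(rownames(mini))   # island names

# "=" assigns just as "<-" does
mini2 = read.table(text = txt)
print(identical(mini, mini2))
```

The same call with a file name in place of text = txt should behave the same way for a file in this layout.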

We can check the dimension of the data:

> dim(gala)

[1] 30 7

If we don’t remember the variable (column) names we can enter:

> names(gala)

[1] "Species"   "Endemics"  "Area"      "Elevation" "Nearest"   "Scruz"

[7] "Adjacent"

We can access the variables in gala directly by name after entering

> attach(gala)

Then entering the name of a variable, e.g. Species, prints all of its values:

> Species

[1] 58 31 3 25 2 18 24 10 8 2 97 93 58 5 40 347 51 2 104

[20] 108 12 70 280 237 444 62 285 44 16 21
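
The sketch below shows attach() on a small made-up data frame (mini is a hypothetical three-island excerpt, not the full data), followed by detach() to undo it. It also shows with(), an alternative that this handout does not otherwise use, which evaluates one expression inside the data frame without attaching anything.

```r
# Hypothetical three-island excerpt, built by hand
mini <- data.frame(Species  = c(58, 31, 3),
                   Endemics = c(23, 21, 3),
                   row.names = c("Baltra", "Bartolome", "Caldwell"))

attach(mini)      # columns can now be used by name
print(Species)
detach(mini)      # undo the attach when finished

# with() gives the same convenience for a single expression
m <- with(mini, mean(Species))
print(m)
```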

Numerical Summaries

One easy way to get the basic numerical summaries is:

> summary(gala)

    Species          Endemics          Area             Elevation
 Min.   :  2.00   Min.   : 0.00   Min.   :   0.0100   Min.   :  25.00
 1st Qu.: 13.00   1st Qu.: 7.25   1st Qu.:   0.2575   1st Qu.:  97.75
 Median : 42.00   Median :18.00   Median :   2.5900   Median : 192.00
 Mean   : 85.23   Mean   :26.10   Mean   : 261.7000   Mean   : 368.00
 3rd Qu.: 96.00   3rd Qu.:32.25   3rd Qu.:  59.2400   3rd Qu.: 435.30
 Max.   :444.00   Max.   :95.00   Max.   :4669.0000   Max.   :1707.00
    Nearest          Scruz           Adjacent
 Min.   : 0.20   Min.   :  0.00   Min.   :   0.03
 1st Qu.: 0.80   1st Qu.: 11.02   1st Qu.:   0.52
 Median : 3.05   Median : 46.65   Median :   2.59
 Mean   :10.06   Mean   : 56.98   Mean   : 261.10
 3rd Qu.:10.02   3rd Qu.: 81.08   3rd Qu.:  59.24
 Max.   :47.40   Max.   :290.20   Max.   :4669.00

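We can also compute these numbers separately. (Below, gala$Sp is shorthand for gala$Species: the $ operator accepts any unambiguous abbreviation of a column name.)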

> gala$Species

[1] 58 31 3 25 2 18 24 10 8 2 97 93 58 5 40 347 51 2 104

[20] 108 12 70 280 237 444 62 285 44 16 21

> mean(gala$Sp)

[1] 85.23333

> median(gala$Sp)

[1] 42

> min(gala$Sp)

[1] 2

> range(gala$Sp)

[1] 2 444

> quantile(gala$Sp)

0% 25% 50% 75% 100%

2 13 42 96 444
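
To see where the 25% value of 13 comes from, here is a small worked sketch. It assumes R's default quantile rule (type 7), which interpolates linearly between the sorted values; the Species values are copied from the output above, and the names s, h, lo and q25 are just for this illustration.

```r
species <- c(58, 31, 3, 25, 2, 18, 24, 10, 8, 2, 97, 93, 58, 5, 40,
             347, 51, 2, 104, 108, 12, 70, 280, 237, 444, 62, 285,
             44, 16, 21)

s <- sort(species)
n <- length(s)

# Default rule: the p-th quantile sits at position 1 + p * (n - 1)
h  <- 1 + 0.25 * (n - 1)   # 8.25 for n = 30
lo <- floor(h)             # 8

# Interpolate between the 8th and 9th sorted values (12 and 16)
q25 <- s[lo] + (h - lo) * (s[lo + 1] - s[lo])
print(q25)
print(quantile(species, 0.25))
```

Both lines print 13, a quarter of the way from 12 to 16.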

We can get the variance and sd:

> var(gala$Sp)

[1] 13140.74

> sqrt(var(gala$Sp))

[1] 114.6331

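R already provides sd(), but as an exercise we can write our own function to compute sd's (it is named mysd here so that it does not mask the built-in function):

> mysd <- function(x) sqrt(var(x))

> mysd(gala$Sp)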

[1] 114.6331
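
As a quick check, the sketch below compares a hand-written function with R's built-in sd() on the Species values listed earlier. The name sd_manual is made up for this example and is chosen so that the built-in sd() stays available.

```r
species <- c(58, 31, 3, 25, 2, 18, 24, 10, 8, 2, 97, 93, 58, 5, 40,
             347, 51, 2, 104, 108, 12, 70, 280, 237, 444, 62, 285,
             44, 16, 21)

# Standard deviation as the square root of the sample variance
sd_manual <- function(x) sqrt(var(x))

print(sd_manual(species))
print(sd(species))                         # built-in version
print(all.equal(sd_manual(species), sd(species)))
```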

The correlations:

> cor(gala)

              Species     Endemics       Area   Elevation      Nearest
Species    1.00000000  0.970876516  0.6178431  0.73848666 -0.014094067
Endemics   0.97087652  1.000000000  0.6169791  0.79290437  0.005994286
Area       0.61784307  0.616979087  1.0000000  0.75373492 -0.111103196
Elevation  0.73848666  0.792904369  0.7537349  1.00000000 -0.011076984
Nearest   -0.01409407  0.005994286 -0.1111032 -0.01107698  1.000000000
Scruz     -0.17114244 -0.154264319 -0.1007849 -0.01543829  0.615410357
Adjacent   0.02616635  0.082658026  0.1800376  0.53645782 -0.116247885
                Scruz    Adjacent
Species   -0.17114244  0.02616635
Endemics  -0.15426432  0.08265803
Area      -0.10078493  0.18003759
Elevation -0.01543829  0.53645782
Nearest    0.61541036 -0.11624788
Scruz      1.00000000  0.05166066
Adjacent   0.05166066  1.00000000

Or, more neatly:

> round(cor(gala),3)

          Species Endemics   Area Elevation Nearest  Scruz Adjacent
Species     1.000    0.971  0.618     0.738  -0.014 -0.171    0.026
Endemics    0.971    1.000  0.617     0.793   0.006 -0.154    0.083
Area        0.618    0.617  1.000     0.754  -0.111 -0.101    0.180
Elevation   0.738    0.793  0.754     1.000  -0.011 -0.015    0.536
Nearest    -0.014    0.006 -0.111    -0.011   1.000  0.615   -0.116
Scruz      -0.171   -0.154 -0.101    -0.015   0.615  1.000    0.052
Adjacent    0.026    0.083  0.180     0.536  -0.116  0.052    1.000
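
Each entry of this matrix is an ordinary (Pearson) correlation between two columns. As a sketch, the lines below compute the Species and Endemics entry directly from the definition and compare it with cor(). The two vectors are copied from the printed output in this handout; the names dx, dy and r_manual are made up for the example.

```r
species  <- c(58, 31, 3, 25, 2, 18, 24, 10, 8, 2, 97, 93, 58, 5, 40,
              347, 51, 2, 104, 108, 12, 70, 280, 237, 444, 62, 285,
              44, 16, 21)
endemics <- c(23, 21, 3, 9, 1, 11, 0, 7, 4, 2, 26, 35, 17, 4, 19,
              89, 23, 2, 37, 33, 9, 30, 65, 81, 95, 28, 73, 16, 8, 12)

# Deviations from the means
dx <- species  - mean(species)
dy <- endemics - mean(endemics)

# Sum of cross products divided by the root of the product of
# the sums of squares
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

print(round(r_manual, 3))
print(round(cor(species, endemics), 3))
```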

Another numerical summary with a graphical element is the stem and leaf plot:

> gala$En

[1] 23 21 3 9 1 11 0 7 4 2 26 35 17 4 19 89 23 2 37 33 9 30 65 81 95

[26] 28 73 16 8 12

> stem(gala$En)

  The decimal point is 1 digit(s) to the right of the |

  0 | 01223447899
  1 | 12679
  2 | 13368
  3 | 0357
  4 |
  5 |
  6 | 5
  7 | 3
  8 | 19
  9 | 5
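
To see how the display is built, the sketch below splits each Endemics value into its tens digit (the stem) and its units digit (the leaf). This only illustrates the idea and is not how stem() itself is implemented; the names s and leaves are made up here.

```r
endemics <- c(23, 21, 3, 9, 1, 11, 0, 7, 4, 2, 26, 35, 17, 4, 19,
              89, 23, 2, 37, 33, 9, 30, 65, 81, 95, 28, 73, 16, 8, 12)

s <- sort(endemics)

# s %/% 10 is the tens digit (stem), s %% 10 the units digit (leaf);
# paste the leaves that share a stem into one string
leaves <- tapply(s %% 10, s %/% 10, paste, collapse = "")
print(leaves)
```

Unlike stem(), this sketch leaves out stems that have no leaves (4 and 5 here).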

Graphical Summaries

We can make histograms and boxplots, and specify the labels if we like:

> hist(gala$Sp)

> hist(gala$Sp,main="Histogram of Species",xlab="number of Species")

> boxplot(gala$Sp)

Scatterplots are just as easy. In the second plot we put the x-axis on a log scale because Area is strongly skewed:

> plot(gala$Area,gala$Sp)

> plot(log(gala$Area),gala$Sp,xlab="log(Area)",ylab="Species")
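
The skewness of Area can also be seen numerically. As a rough sketch, the lines below compare the mean and the median before and after taking logs; the Area values are copied from the data listing, and the names raw_gap and log_gap are made up for this example.

```r
area <- c(25.09, 1.24, 0.21, 0.10, 0.05, 0.34, 0.08, 2.33, 0.03,
          0.18, 58.27, 634.49, 0.57, 0.78, 17.35, 4669.32, 129.49,
          0.01, 59.56, 17.95, 0.23, 4.89, 551.62, 572.33, 903.82,
          24.08, 170.92, 1.84, 1.24, 2.85)

# On the raw scale a few large islands pull the mean far above
# the median
raw_gap <- mean(area) / median(area)
print(c(mean = mean(area), median = median(area)))

# After the log transform the mean and median are much closer
log_area <- log(area)
log_gap  <- mean(log_area) - median(log_area)
print(c(mean = mean(log_area), median = median(log_area)))
```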

We can make a scatterplot matrix:

> pairs(gala)

> plot(gala) # also a scatterplot matrix

We can put several plots in one display:

> par(mfrow=c(2,2)) # 2 rows and 2 columns of plots

> boxplot(gala$Ar)

> boxplot(gala$Adj)

> boxplot(gala$Elev)

> boxplot(gala$Sc)

> par(mfrow=c(1,1)) # back to 1 plot display

Selecting subsets of the data

Second row:

> gala[2,]

          Species Endemics Area Elevation Nearest Scruz Adjacent
Bartolome      31       21 1.24       109     0.6  26.3   572.33

Third column:

> gala[,3]

[1] 25.09 1.24 0.21 0.10 0.05 0.34 0.08 2.33 0.03

[10] 0.18 58.27 634.49 0.57 0.78 17.35 4669.32 129.49 0.01

[19] 59.56 17.95 0.23 4.89 551.62 572.33 903.82 24.08 170.92

[28] 1.84 1.24 2.85

The element in row 2, column 3:

> gala[2,3]

[1] 1.24

c() is a function for making vectors, e.g.

> c(1,4,8)

[1] 1 4 8

Select the first, fourth and eighth rows:

> gala[c(1,4,8),]

         Species Endemics  Area Elevation Nearest Scruz Adjacent
Baltra        58       23 25.09       346     0.6   0.6     1.84
Champion      25        9  0.10        46     1.9  47.4     0.18
Darwin        10        7  2.33       168    34.1 290.2     2.85

The : operator is good for making sequences, e.g.

> 3:11

[1] 3 4 5 6 7 8 9 10 11

We can select the third through eleventh rows:

> gala[3:11,]

             Species Endemics  Area Elevation Nearest Scruz Adjacent
Caldwell           3        3  0.21       114     2.8  58.7     0.78
Champion          25        9  0.10        46     1.9  47.4     0.18
Coamano            2        1  0.05        77     1.9   1.9   903.82
Daphne.Major      18       11  0.34       119     8.0   8.0     1.84
Daphne.Minor      24        0  0.08        93     6.0  12.0     0.34
Darwin            10        7  2.33       168    34.1 290.2     2.85
Eden               8        4  0.03        71     0.4   0.4    17.95
Enderby            2        2  0.18       112     2.6  50.2     0.10
Espanola          97       26 58.27       198     1.1  88.3     0.57

We can use "-" to indicate "everything but", e.g. all the data except the first two columns is:

> gala[,-c(1,2)]

                Area Elevation Nearest Scruz Adjacent
Baltra         25.09       346     0.6   0.6     1.84
Bartolome       1.24       109     0.6  26.3   572.33
Caldwell        0.21       114     2.8  58.7     0.78
Champion        0.10        46     1.9  47.4     0.18
Coamano         0.05        77     1.9   1.9   903.82
Daphne.Major    0.34       119     8.0   8.0     1.84
Daphne.Minor    0.08        93     6.0  12.0     0.34
Darwin          2.33       168    34.1 290.2     2.85
Eden            0.03        71     0.4   0.4    17.95
Enderby         0.18       112     2.6  50.2     0.10
Espanola       58.27       198     1.1  88.3     0.57
Fernandina    634.49      1494     4.3  95.3  4669.32
Gardner1        0.57        49     1.1  93.1    58.27
Gardner2        0.78       227     4.6  62.2     0.21
Genovesa       17.35        76    47.4  92.2   129.49
Isabela      4669.32      1707     0.7  28.1   634.49
Marchena      129.49       343    29.1  85.9    59.56
Onslow          0.01        25     3.3  45.9     0.10
Pinta          59.56       777    29.1 119.6   129.49
Pinzon         17.95       458    10.7  10.7     0.03
Las.Plazas      0.23        94     0.5   0.6    25.09
Rabida          4.89       367     4.4  24.4   572.33
SanCristobal  551.62       716    45.2  66.6     0.57
SanSalvador   572.33       906     0.2  19.8     4.89
SantaCruz     903.82       864     0.6   0.0     0.52
SantaFe        24.08       259    16.5  16.5     0.52
SantaMaria    170.92       640     2.6  49.2     0.10
Seymour         1.84       147     0.6   9.6    25.09
Tortuga         1.24       186     6.8  50.9    17.95
Wolf            2.85       253    34.1 254.7     2.33
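
The indexing forms above can be tried on a small made-up data frame (mini below is a hypothetical four-island, three-column excerpt). One detail worth knowing: when a selection leaves only one column, R returns a plain vector unless we add drop=FALSE.

```r
mini <- data.frame(Species  = c(58, 31, 3, 25),
                   Endemics = c(23, 21, 3, 9),
                   Area     = c(25.09, 1.24, 0.21, 0.10),
                   row.names = c("Baltra", "Bartolome",
                                 "Caldwell", "Champion"))

print(mini[2, ])          # second row
print(mini[, 3])          # third column, as a vector
print(mini[2, 3])         # row 2, column 3
print(mini[c(1, 4), ])    # first and fourth rows
print(mini[2:3, ])        # second through third rows

# Dropping two of the three columns leaves a single column,
# which R simplifies to a vector
print(mini[, -c(1, 2)])

# drop = FALSE keeps the result as a data frame with row names
print(mini[, -c(1, 2), drop = FALSE])
```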

We may also want to select subsets on the basis of some criterion, e.g. which islands exceed 500 km^2 in area:

> gala[gala$Area > 500,]

             Species Endemics    Area Elevation Nearest Scruz Adjacent
Fernandina        93       35  634.49      1494     4.3  95.3  4669.32
Isabela          347       89 4669.32      1707     0.7  28.1   634.49
SanCristobal     280       65  551.62       716    45.2  66.6     0.57
SanSalvador      237       81  572.33       906     0.2  19.8     4.89
SantaCruz        444       95  903.82       864     0.6   0.0     0.52
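
This works because the comparison gala$Area > 500 produces a vector of TRUE and FALSE values, one per row, and only the TRUE rows are kept. The sketch below shows the intermediate step on a made-up five-island excerpt (mini and big are hypothetical names). It also shows subset(), an alternative not used elsewhere in this handout.

```r
mini <- data.frame(Species = c(58, 93, 347, 2, 444),
                   Area    = c(25.09, 634.49, 4669.32, 0.01, 903.82),
                   row.names = c("Baltra", "Fernandina", "Isabela",
                                 "Onslow", "SantaCruz"))

# One TRUE or FALSE per row
big <- mini$Area > 500
print(big)

# Rows where the condition is TRUE
print(mini[big, ])

# Just the names of the matching islands
print(rownames(mini)[big])

# subset() gives the same rows
print(subset(mini, Area > 500))
```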

Learning more about R

While running R you can get help about a particular command. For example, if you want help about the stem() command, just type

> help(stem)

If you don't know the name of the command that you want to use, then type:

> help.start()

and then browse.

A short introduction to R is given at

A detailed introduction to R can be found at