Smartersig Conference September 2012
Introduction to Betting Data Modelling with R
Dr Alun Owen
Part 1: Using R, Managing Data Files, Initial Data Exploration and a First Look at Logistic Regression
1.1 Preamble and Pre-Conference Work!
At the conference we will not work through the material on pages 1 to 15, as we will assume that you have worked through it before you arrive. These pages have therefore been supplied to you in advance. Please work through this material prior to arriving at the conference.
If there are any aspects of the material on pages 1 to 15 that you would like some clarification on please email me (Alun Owen) at
Instructions for downloading and installing R should already have been emailed to you or if not can be obtained from Mark.
We will be using a data set here called aiplus2004.csv, which is available on the utilities section of the Smartersig website or again can be obtained from Mark.
When you arrive at the conference, you will be given a longer document that includes the material here but extends this work from page 16 onwards. We will kick off the first R session with a 5 minute resume of pages 1 to 15, but will then go almost straight into page 16 and move onwards from there.
1.2 Using R
R can actually be used in a number of ways; it can be used interactively by typing commands in the R Console window (after the prompt) shown below:
Alternatively, R can be used in “batch” mode, by typing a series of commands in an external script (text) file and then running some/all the commands in that file at once. This option opens up the real power of the software.
1.3 Using R Interactively
To submit commands to R interactively, simply type the relevant command (after the prompt) in the R Console window and hit the enter button.
As an example, type in the following commands (hitting the enter key after each line):
x <- c(6,7.8,7.7,10.8,6.4,5.7,6.6,12.4)
y <- c(1,2,3,4,5,6,7,8)
plot(x,y)
The <- symbol in the first two lines of commands above is used to assign values to “objects” and can be thought of as representing “=”. In the first line above, we assigned values to a column of data named x. These values were combined together using the c() function. Such columns of data are usually referred to as “variables”. Similarly, in the second line above, we combined a list of values together and assigned these to another variable we named y.
The plot(x,y) command opens up an additional (graphics) window and displays a simple scatter plot of y against x, with y on the vertical axis and x on the horizontal axis.
You can see what x and y look like simply by typing their names in R (again hitting the enter key after each one) as follows:
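As a small illustration of the points above, the sketch below builds the same two variables and inspects them. The calls to length(), the use of = for assignment and the division by 100 are additions for illustration only and are not part of the original exercise.

```r
# Combine individual values into a single variable with c()
x <- c(6, 7.8, 7.7, 10.8, 6.4, 5.7, 6.6, 12.4)
y <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Both variables hold eight values
length(x)
length(y)

# The = sign also assigns at the top level, although <- is the usual convention
z = c(1, 2, 3)

# Arithmetic is applied to every value in the variable at once
x / 100
```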
x
y
If we wanted to assign a single value to a variable (i.e. not a column of data), for example a starting betting bank of £1,000, then we could assign the value 1,000 to a variable we will call bank as follows:
bank <- 1000
If we wanted to increase the value of bank by, say, £500, we could type:
bank <- bank+500
To see that the value of the variable bank has increased to £1,500, try typing:
bank
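The same idea extends to any calculation on the bank. The sketch below is only an illustration: the stake, the decimal odds and the names bank.if.lose and bank.if.win are made up for this example and do not come from the data set.

```r
# Starting betting bank of 1,000 pounds
bank <- 1000

# Hypothetical bet: both values below are invented for illustration
stake <- 50
decimal.odds <- 3.0

# If the bet loses, the stake is deducted from the bank
bank.if.lose <- bank - stake

# If the bet wins, the profit is stake * (decimal odds - 1)
bank.if.win <- bank + stake * (decimal.odds - 1)

bank.if.lose
bank.if.win
```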
At this point it may be worth mentioning that R provides a mechanism for recalling, editing and re-executing previous commands. The vertical arrow keys on the keyboard can usually be used to scroll forward and backward through a command history. Once a command is located in this way, the cursor can be moved within the command using the horizontal arrow keys, so that you can edit the command if required and then hit the enter key to re-execute the revised command. Try this to see what x looks like again.
Now, the variables x and y and the variable bank are all what R refers to as objects. We can take a look at what variables (objects) we have created during this R session by typing the following command (followed by the enter key remember!):
ls()
Note this is a lower case “L” and a lower case “S” as in “lima” “sierra”!
This lists all the objects we have created in the current R session. You should see bank, x and y listed on the screen in alphabetical order.
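A short sketch of managing objects in a session is given below. The use of exists() and rm() goes slightly beyond the text above and is included only as an illustration.

```r
# Create the three objects used so far
x <- c(6, 7.8, 7.7, 10.8, 6.4, 5.7, 6.6, 12.4)
y <- c(1, 2, 3, 4, 5, 6, 7, 8)
bank <- 1500

# List the objects in the current session
ls()

# Check whether a particular object exists
exists("bank")

# Remove an object that is no longer needed, then list again
rm(bank)
ls()
```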
Next close R down so we can start again.
Do this either by clicking on File > Exit, or by clicking the usual close button (X) in the top-right of the R screen. Note that clicking the close button in the R Console window will only close the console window and will not close R itself.
When you try to close R you will probably see the dialogue box below:
Answer “No” to this as we do not need to save the session or any session history as such – we will discuss this in more detail at the conference.
1.4 Using R in Batch Mode and Writing R Code
Instead of typing commands and executing these one at a time in interactive mode, we can write a series of commands in an R script file, which is simply a text file holding our R commands. This is useful if we want to save our work and re-use it later, or to develop models written in R code that we can run repeatedly. To create a new blank R script file, start R again and click on File > New Script as shown below:
This will open up a new window into which you can write your R commands:
In the next section we will enter our R commands into this script window and run them from there, rather than typing them directly into the R Console window, so keep your R session open.
1.5 Using External Data Files (and Folders) with R
So far we have entered our data by hand via the keyboard, but what about reading data in from an external source, such as an Excel spreadsheet, comma separated (.csv) file or even a text file?
In fact the data we entered for x and y above comes from the first few rows of the comma separated file named aiplus2004.csv. This relates to data on horse racing and contains data on over 22,000 horses. This data set is available on the utilities section of the Smartersig website (or can be obtained from Mark).
This data contains six columns of data as follows:
- position3 - finishing position three races ago (1, 2, 3 or 4, 0 = anywhere else)
- position2 - finishing position two races ago (1, 2, 3 or 4, 0 = anywhere else)
- position1 - finishing position in the previous race (1, 2, 3 or 4, 0 = anywhere else)
- days - days since last race
- sireSR - win rate achieved by the horse's sire with its offspring prior to this race
- position - finishing position in this race
Each row refers to a single horse in each race.
We will look at how to read this data into R. First you need to create a new folder somewhere in the directory structure on your own PC. You will use this to hold the data file aiplus2004.csv, and also to save your work during the conference and during any further work you do in this pre-conference session. For example you might create a folder called “C:\R Stuff”.
Once you have created your new folder, copy or move the file named aiplus2004.csv to this new folder, noting the directory reference of where this folder is located on your PC, for example “C:\R Stuff”.
When using R, we need to direct R to point to this directory. To do this we will set the Working Directory we want R to use. In your R script file window (i.e. NOT the R console window we have used so far!) type the following command:
setwd("C:/R Stuff")
noting the use of the forward slash instead of the usual back slash!
Also replace C:/R Stuff with whatever directory you are using if this is different!
To then execute this command in R, highlight the command in your script file (in the usual way using your mouse) and then click on Edit > Run line or selection. Alternatively, you can use the short-cut, which is to hold down the CTRL and R keys together (with your command highlighted).
If this has been successful, all you will see in the console window is that same command in red having been run as follows:
> setwd("C:/R Stuff")
If this was not successful, then you will probably see the following statement:
> setwd("C:/R Stuff")
Error in setwd("C:/R Stuff") : cannot change working directory
The most likely reason is that you haven’t stated the name of the directory correctly or may have used a back slash in the command when you need to use a forward slash.
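One way to guard against this error is to check that the folder exists before changing to it. The sketch below assumes the example folder name used above; the variable names my.folder, forward and doubled are hypothetical and used only for illustration.

```r
# Example folder path; replace with the folder you created
my.folder <- "C:/R Stuff"

# Only change directory if the folder really exists
if (dir.exists(my.folder)) {
  setwd(my.folder)
} else {
  message("Folder not found: ", my.folder)
}

# Show the working directory R is currently using
getwd()

# A single backslash is an escape character inside an R string, so a
# Windows path needs either forward slashes or doubled backslashes
forward <- "C:/R Stuff"
doubled <- "C:\\R Stuff"
```

If the folder is not found, getwd() shows where R is currently pointing, which is often enough to spot a mistyped path.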
This section is optional!
Another approach you could use instead of setting the working directory is to put the full directory location of the source file in the actual read.csv command. For example:
horse.data <- read.csv("C:/R Stuff/aiplus2004.csv")
However, the advantage of using the setwd() command approach is that for the current R session, R will always look in that directory for data and other files and will save your work to this directory also, keeping this tidy for us!
There is also a third approach you could use instead of setting the working directory. In this case we modify the properties of the short-cut icon on the desktop that we use to launch R, in order to change the default directory that R points to. I’m not suggesting you use this approach, but for information, if you wanted to follow it, right-click on the R short-cut icon on your desktop to display the properties dialogue box shown overleaf. On the Shortcut tab, type the name of the new sub-directory in the “Start in:” field as shown below. If you have created a sub-directory with a different name and/or in a different location, use that instead.
Click Apply and OK to save this change.
Once you are happy that the setwd() command has been run successfully, type the following command into your script window.
horse.data <- read.csv("aiplus2004.csv")
Then run it as before by highlighting the command and then clicking on Edit > Run line or selection (or using CTRL and R).
The read.csv command actually reads data in a table-like format and from this creates what R refers to as a data frame. We have chosen to call this data frame horse.data. You can think of a data frame as being a data set that contains a number of columns of data. In this case, the data frame horse.data contains the six columns of data from aiplus2004.csv.
There are other variants of the read function, should you need them. For example, read.table() can be used if your data have previously been saved in ASCII (flat file) format, such as that created by Windows Notepad. There is also read.csv2() for when the data are separated by semi-colons as opposed to commas (with a comma used as the decimal point).
Okay, let’s see what our data in the horse.data data frame looks like. To do this, type the following command into your script file and then run it (Edit > Run line or selection, or use CTRL and R):
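The differences between these read functions can be sketched without any external files by passing the data as text. The three text strings and the object names below are made up for illustration; with a real file you would give the file name instead of the text argument.

```r
# The same two rows of illustrative values in three layouts
comma.text     <- "days,sireSR\n60,6.0\n13,7.8"
semicolon.text <- "days;sireSR\n60;6,0\n13;7,8"
space.text     <- "days sireSR\n60 6.0\n13 7.8"

# read.csv: comma separator, full stop as the decimal point
from.comma <- read.csv(text = comma.text)

# read.csv2: semi-colon separator, comma as the decimal point
from.semicolon <- read.csv2(text = semicolon.text)

# read.table: white space separator by default; header = TRUE keeps the column names
from.space <- read.table(text = space.text, header = TRUE)

# All three produce the same data frame
all.equal(from.comma, from.semicolon)
all.equal(from.comma, from.space)
```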
horse.data
You may see that only the first 16,666 rows (of the 22,000+ in the data frame) are displayed! This is simply the default maximum number of rows that will be printed, which can be altered if required.
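The figure of 16,666 rows is consistent with a limit on the total number of values printed rather than on rows. The sketch below assumes the standard default of 99999 for the max.print option (some R front ends may use a different value) and uses a small made-up data frame called demo so that it runs without the data file.

```r
# With six columns, a limit of 99999 values allows 16666 complete rows
99999 %/% 6

# Stand-in data frame so the example runs without aiplus2004.csv
demo <- data.frame(id = 1:30, value = (1:30) / 10)

# Number of rows and columns
dim(demo)

# Current limit on the number of individual values printed
getOption("max.print")

# Temporarily lower the limit; with 2 columns, 20 values allows about 10 rows
old <- options(max.print = 20)
demo
options(old)   # restore the previous setting
```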
To view just the first 20 rows, say, of the data frame horse.data, copy the last command you entered into your script file (horse.data), paste it onto a new line further down in your script file, and edit it so that it reads as follows:
horse.data[1:20,]
(making sure you use square brackets, and include the comma!)
Run this command (Edit > Run line or selection, or use CTRL and R) and it will show rows 1 to 20, but all columns, of horse.data. You should therefore see the output shown below (note that this has been edited here to save space):
   position3 position2 position1 days sireSR position
1          1         4         2   60    6.0        1
2          3         1         3   13    7.8        2
3          0         4         2   43    7.7        3
......
......
18        NA         1         0  566   10.8       18
19         0         2         0   20    7.5       19
20         0         0         4    9    4.8       20
Note that the file aiplus2004.csv had column names in the first row and so by default R has used these as the column names for the data frame that it created.
The first number that appears in each row is simply a row number reference that has been printed by R and is not actually stored as an actual column in the data frame.
Note also that the column position3 has a value of NA in row 18. This is how R represents missing values: NA stands for “Not Available”. In the original data set this value would simply have been missing.
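Missing values need a little care in calculations. The sketch below uses the position3 values from rows 18 to 20 of the listing; the object name p3 is hypothetical. It also shows that NA (a missing value) is distinct from NaN (“Not a Number”), which R produces for undefined results such as 0/0.

```r
# position3 values from rows 18 to 20 of the listing
p3 <- c(NA, 0, 0)

# is.na() flags the missing entries
is.na(p3)

# Count how many values are missing
sum(is.na(p3))

# Summaries return NA unless told to drop the missing values
mean(p3)
mean(p3, na.rm = TRUE)

# NA is a missing value; NaN is the result of an undefined calculation
is.nan(NA)
is.nan(0 / 0)
```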
You may have noticed that the first 8 rows of sireSR and position are actually the same as those that we typed in manually for the variables x and y previously. To see just the first 8 rows of data for sireSR and position (i.e. columns 5 and 6) copy and edit the last line in your R script file so that it reads:
horse.data[1:8,5:6]
and then run this (Edit > Run line or selection, or use CTRL and R).
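Columns can be selected by name as well as by number, which is often easier to read. The sketch below uses a small stand-in data frame called toy, built from the first three rows of the listing shown earlier, so that it runs without the data file.

```r
# Stand-in built from the first three rows of the listing
toy <- data.frame(
  position3 = c(1, 3, 0),
  position2 = c(4, 1, 4),
  position1 = c(2, 3, 2),
  days      = c(60, 13, 43),
  sireSR    = c(6.0, 7.8, 7.7),
  position  = c(1, 2, 3)
)

# Rows 1 to 3 of columns 5 and 6, selected by number
by.number <- toy[1:3, 5:6]

# The same columns selected by name
by.name <- toy[1:3, c("sireSR", "position")]

by.number
all.equal(by.number, by.name)

# head() is a convenient way to see just the first few rows
head(toy, 2)
```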
Two ways of viewing say, the last column in the data frame, are to use
horse.data[,6]
or
horse.data$position
Try adding these to your R script file and then running them.
In both cases you should see that the R console window has been filled with all the 22,000+ values in the column named position from the data frame horse.data.
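To avoid filling the console, the column can be previewed with head(). The sketch below again uses a small stand-in data frame called toy (hypothetical name) and also shows a third equivalent way of extracting a column, using double square brackets.

```r
# Stand-in with two of the columns from the first three rows of the listing
toy <- data.frame(sireSR = c(6.0, 7.8, 7.7), position = c(1, 2, 3))

# Three equivalent ways of extracting the column named position
by.number <- toy[, 2]
by.dollar <- toy$position
by.name   <- toy[["position"]]

identical(by.number, by.dollar)
identical(by.dollar, by.name)

# Preview only the first two values rather than the whole column
head(toy$position, 2)
```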
See if you can now view the data in the column of our data set named position by adding the following command to your R script file and then running it
position
You should find that R cannot find this column. This is because at the moment the columns in the data frame horse.data are not yet visible to R as objects in their own right. They are simply columns of the data frame object called horse.data. We can use the attach() function to make the columns of the data frame “visible to R”. In your script file add the following command and then run it.
attach(horse.data)
Now see if you can view the column named position by running the following command again from your script file:
position
You should see that this column is now visible to R!
Note that we need to take care with the attach() function. Whilst this has made the columns of the data frame horse.data visible to R, it does not create them as individual variables or objects. Note also that if we already had an object named position, the attach() function would not overwrite the old one with the new one, as you might expect it to; typing position would still return the old object.
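The masking behaviour described above can be seen in a small sketch. The data frame toy and the text value given to position are made up for illustration. The with() function, which is not used elsewhere in this document, is shown as one alternative that avoids attaching anything.

```r
# Stand-in data frame
toy <- data.frame(sireSR = c(6.0, 7.8, 7.7), position = c(1, 2, 3))

# An object called position already exists before the data frame is attached
position <- "I was here first"

attach(toy)   # R reports that position is masked by the existing object

position      # still the existing object, not the column
sireSR        # no clash, so the column is found

detach(toy)   # undo the attach when finished

# with() evaluates an expression using the columns of the data frame
with(toy, mean(position))
```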
1.6 Saving and Opening R Script Files
Before we go any further, let’s save our script file. To do this, make sure your script file is the “active window” in R; you can make sure of this by clicking anywhere in the script file window. Then click on File > Save as and save it using a suitable name, including a .R file extension at the end of the file name so that R knows this to be an R script file.
Once you have saved your script file, close R down, again answering “No” when you see the dialogue box below:
Now re-start R and click on File > Open script, and locate the script file you saved above to open it again.
If you can’t locate it and you are sure you are looking in the right place, you may have forgotten to add the .R file extension to the name of your script file when you saved it. Since R by default only looks for script files with a .R file extension, this may be why it isn’t listed. You can check this by changing the “Files of type” option it is using to all files as follows:
Okay, we will now assume you have managed to locate your R script file.
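Script files can also be found and run from within R itself. The sketch below is only an illustration: it writes a small script to a temporary folder (the file name my_first_script.R is hypothetical), lists the files there with a .R extension, and then runs the whole script with source().

```r
# Write a two-line script to a temporary folder
script.path <- file.path(tempdir(), "my_first_script.R")
writeLines(c("bank <- 1000", "bank <- bank + 500"), script.path)

# List the files in that folder whose names end in .R
list.files(tempdir(), pattern = "\\.R$")

# Run every command in the script file in one go
source(script.path)

# The script has created the object bank
bank
```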
1.7 Using R to Obtain Simple Data Summaries
Throughout the rest of this document I will assume that all the R commands you run will first be typed into your R script file and then run in batch mode (using Edit > Run line or selection, or CTRL and R).
Assuming you now have started a new R session and you have managed to open up your saved R script file, run the following commands from that R script file to make sure we have set our working directory, read in our aiplus data set as a data frame called horse.data and have used the attach() command so that all the columns of horse.data are visible to R:
setwd("C:/R Stuff")
(using a forward slash and replacing C:/R Stuff as required)
horse.data <- read.csv("aiplus2004.csv")
attach(horse.data)
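After reading in a data set it is worth confirming that it has the expected shape before going further. The sketch below uses a small stand-in data frame called toy, built from the first three rows of the listing shown earlier; with the real data you would apply the same functions to horse.data.

```r
# Stand-in built from the first three rows of the listing
toy <- data.frame(
  position3 = c(1, 3, 0),
  position2 = c(4, 1, 4),
  position1 = c(2, 3, 2),
  days      = c(60, 13, 43),
  sireSR    = c(6.0, 7.8, 7.7),
  position  = c(1, 2, 3)
)

# Number of rows and columns
dim(toy)

# Column names, which should match those in the first row of the .csv file
names(toy)

# Compact description of each column and its type
str(toy)
```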
During any data analysis or model development work, you should start with an examination of your data. Let’s start by obtaining a frequency table on the data in the column named position in our data set by typing the following: