DSCI 325: R Handout 5 -- Web scraping with R using APIs
In this handout, we will explore scraping data from the web (in R) using APIs. The information in this handout is largely based on a tutorial prepared by Bradley Boehmke.
What is an API?
Many websites and companies allow programmatic access to their data via web APIs (API stands for application programming interface). Essentially, this gives a web-based database a way to communicate with another program (such as R).
Each organization’s API is unique, but you typically need to know the following for your specific data set of interest:
- URL for the organization and data you are pulling
- Data set you are pulling from
- Data content (you need to be familiar with the data library so that you can specify the variables you want the API to retrieve)
In addition, you may also need the following:
- API key (i.e., API token). This is typically obtained by supplying your name/email to the organization.
- OAuth. This is an authorization framework that provides credentials as proof for access to certain information.
The remainder of this handout will introduce these topics with the use of examples.
Example: Pulling U.S. Bureau of Labor Statistics data
No API key or OAuth is required, but a key is recommended.
To register for a key, visit this website:
All available data sets from the Bureau of Labor Statistics and their series code information are available here:
In this example, we will consider pulling Local Area Unemployment Statistics:
The BLS provides this breakdown:
Understanding the series ID: LAUCN271690000000003

Prefix:                   LA
Seasonal Adjustment Code: U
Area Code:                CN2716900000000
Measure Code:             03
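To make this structure concrete, the pieces can be pasted together in R to reproduce the full series ID used below. This is a small illustrative sketch; the meanings in the comments follow standard Local Area Unemployment Statistics (LAUS) coding conventions:

R code:
prefix   <- "LA"               # Local Area Unemployment Statistics
seasonal <- "U"                # not seasonally adjusted
area     <- "CN2716900000000"  # Winona County, MN
measure  <- "03"               # unemployment rate
paste0(prefix, seasonal, area, measure)
# [1] "LAUCN271690000000003"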
You can view the data being pulled via the BLS dataViewer website.
The following link contains some answers to FAQs:
To read this data into R from the web, you should make use of the following package.
Request Data from the U.S. Bureau Of Labor Statistics API
Description
Allows users to request data for one or multiple series through the U.S. Bureau of Labor Statistics API. Users provide parameters as specified in the BLS API documentation, and the function returns a JSON string or data frame.

Usage

blsAPI(payload = NA, api.version = 1, return.data.frame = FALSE)

Arguments

payload - a string or a list containing data to be sent to the API.
api.version - an integer specifying which API version to use (i.e., 1 for v1, 2 for v2).
return.data.frame - a boolean indicating whether the function returns JSON (default) or a data frame. If the data frame option is used, the series ID will be added as a column, which is helpful if multiple series are selected.
R code:
install.packages("blsAPI")library(blsAPI)
# Supply series identifier to pull data
payload <- list('seriesid'=c('LAUCN271690000000003'),'registrationKey'='your registration key here')
unemployment_winona <- blsAPI(payload, api.version = 2, return.data.frame = TRUE)
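You can inspect the result with the usual tools; the returned data frame should contain one row per monthly observation:

R code:
head(unemployment_winona)  # first few rows
str(unemployment_winona)   # structure and column types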
Task: Use the blsAPI package to pull data from the National Employment, Hours, and Earnings data set. Pull the average weekly hours of all employees in the construction sector.

Data for a three-year period is returned by default. To pull data from a specified set of years, you can specify additional arguments:
R code:
payload <- list('seriesid' = c('LAUCN271690000000003'),
                'registrationKey' = 'your registration key here',
                'startyear' = 2010,
                'endyear' = 2017)
unemployment_winona <- blsAPI(payload, 2, return.data.frame=TRUE)
Finally, note that you can also pull multiple series at a time:
R code:
payload <- list('seriesid' = c('LAUCN271690000000003', 'LAUCN271690000000004'),
                'registrationKey' = 'your registration key here',
                'startyear' = 2010,
                'endyear' = 2017)
unemployment_winona_rateandnumber <- blsAPI(payload, 2, return.data.frame=TRUE)
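Because the series ID is added as a column when return.data.frame=TRUE, the two series can be separated after the pull. A minimal sketch, assuming the column is named seriesID as in the package documentation excerpted above:

R code:
# Keep only the unemployment-rate series (seriesID column name assumed)
rate_only <- subset(unemployment_winona_rateandnumber,
                    seriesID == 'LAUCN271690000000003')
head(rate_only)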
Example: Pulling NOAA Data
The rnoaa package can be used to request data from the National Climatic Data Center (now known as the National Centers for Environmental Information) API. This package requires you to have an API key. To request a key, provide your email address at NOAA's token request page.
NOAA data set descriptions are available here:
Description
rnoaa is an R interface to NOAA climate data.

Data Sources

Many functions in this package interact with the National Climatic Data Center application programming interface (API); all of these functions start with ncdc_. An access token, or API key, is required to use all the ncdc_ functions. The key is required by NOAA, not by the package authors. Go to the link given above to get an API key.

More NOAA data sources are being added over time. Data sources and their function prefixes are:
- buoy_* - NOAA Buoy data from the National Buoy Data Center
- gefs_* - GEFS forecast ensemble data
- ghcnd_* - GHCND daily data from NOAA
- isd_* - ISD/ISH data from NOAA
- homr_* - Historical Observing Metadata Repository (HOMR)
- ncdc_* - NOAA National Climatic Data Center (NCDC)
- seaice - Sea ice data
- storm_ - Storms (IBTrACS) data
- swdi - Severe Weather Data Inventory (SWDI)
- tornadoes - Tornado data from the NOAA Storm Prediction Center
- argo_* - Argo buoys
- coops_search - NOAA CO-OPS tides and currents data
For example, let's start by pulling all weather stations in Winona County, MN, using Winona County's FIPS code. We will focus on the GHCND data set, which contains records on daily measurements such as maximum and minimum temperature, total daily precipitation, etc.
R code:
install.packages("rnoaa")library(rnoaa)
stations <- ncdc_stations(datasetid=’GHCND’,
locationid=’FIPS:27169’,
token ='your registration key here')
To pull data from one of these stations, we need the station ID. Suppose we want to pull all available data from the “Winona, MN US” station. The following commands supply the data to pull, the start and end dates (you are restricted to a one-year limit), station ID, and your key.
R code:
# Pull all available 2007 data from the Winona, MN US station
climate <- ncdc(datasetid = 'GHCND',
                startdate = '2007-01-01',
                enddate = '2007-12-31',
                stationid = 'GHCND:USC00219067',
                token = 'your registration key here')
climate$data
date datatype station value fl_m fl_q fl_so fl_t
1 2007-01-01T00:00:00 PRCP GHCND:USC00219067 41 0 1830
2 2007-01-01T00:00:00 SNOW GHCND:USC00219067 18 0
3 2007-01-01T00:00:00 SNWD GHCND:USC00219067 25 0
4 2007-01-01T00:00:00 TMAX GHCND:USC00219067 33 0 1830
5 2007-01-01T00:00:00 TMIN GHCND:USC00219067 -22 0 1830
6 2007-01-01T00:00:00 TOBS GHCND:USC00219067 -22 0 1830
7 2007-01-02T00:00:00 PRCP GHCND:USC00219067 0 0 1830
8 2007-01-02T00:00:00 SNOW GHCND:USC00219067 0 0
9 2007-01-02T00:00:00 SNWD GHCND:USC00219067 0 0
10 2007-01-02T00:00:00 TMAX GHCND:USC00219067 56 0 1830
11 2007-01-02T00:00:00 TMIN GHCND:USC00219067 -72 0 1830
12 2007-01-02T00:00:00 TOBS GHCND:USC00219067 17 0 1830
13 2007-01-03T00:00:00 PRCP GHCND:USC00219067 0 0 1830
14 2007-01-03T00:00:00 SNOW GHCND:USC00219067 0 0
15 2007-01-03T00:00:00 SNWD GHCND:USC00219067 0 0
16 2007-01-03T00:00:00 TMAX GHCND:USC00219067 78 0 1830
17 2007-01-03T00:00:00 TMIN GHCND:USC00219067 6 0 1830
18 2007-01-03T00:00:00 TOBS GHCND:USC00219067 72 0 1830
19 2007-01-04T00:00:00 PRCP GHCND:USC00219067 0 0 1830
20 2007-01-04T00:00:00 SNOW GHCND:USC00219067 0 0
21 2007-01-04T00:00:00 SNWD GHCND:USC00219067 0 0
22 2007-01-04T00:00:00 TMAX GHCND:USC00219067 78 0 1830
23 2007-01-04T00:00:00 TMIN GHCND:USC00219067 39 0 1830
24 2007-01-04T00:00:00 TOBS GHCND:USC00219067 72 0 1830
25 2007-01-05T00:00:00 PRCP GHCND:USC00219067 0 0 1830
Next, let’s pull data on precipitation for 2007 (note the use of the datatypeid argument). By
default, ncdc limits the results to 25, but we can adjust the limit argument as shown below.
R code:
precip <- ncdc(datasetid = 'GHCND',
               startdate = '2007-01-01',
               enddate = '2007-12-31',
               limit = 365,
               stationid = 'GHCND:USC00219067',
               datatypeid = 'PRCP',
               token = 'your registration key here')
Finally, we can sort the observations to see which days in 2007 experienced the greatest rainfall.
R code:
library(dplyr)  # provides arrange() and the %>% pipe

precip.data <- precip$data
precip.data %>%
  arrange(desc(value))
date datatype station value fl_m fl_q fl_so fl_t
1 2007-08-19T00:00:00 PRCP GHCND:USC00219067 1257 0 1830
2 2007-08-20T00:00:00 PRCP GHCND:USC00219067 1016 0 1830
3 2007-08-11T00:00:00 PRCP GHCND:USC00219067 424 0 1830
4 2007-02-24T00:00:00 PRCP GHCND:USC00219067 404 0 1830
5 2007-08-18T00:00:00 PRCP GHCND:USC00219067 396 0 1830
6 2007-09-07T00:00:00 PRCP GHCND:USC00219067 391 0 1830
7 2007-05-24T00:00:00 PRCP GHCND:USC00219067 381 0 1830
8 2007-08-14T00:00:00 PRCP GHCND:USC00219067 361 0 1830
.
.
.
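As a sanity check on these values, note that GHCND reports PRCP in tenths of a millimeter, so the largest value above (1257 on 2007-08-19) corresponds to 125.7 mm, or roughly 4.9 inches of rain. You can convert the column directly:

R code:
# Convert PRCP from tenths of a millimeter to millimeters
precip.data <- precip.data %>%
  mutate(value_mm = value / 10)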
Example – Leveraging an Organization’s API without an R Package
In some situations, an R package may not exist to communicate with an organization's API. Hadley Wickham developed the httr package to make working with web APIs easy. It offers multiple functions; in this example, we will focus on using the GET() function to access an API, supply some request parameters, and retrieve output.
Suppose we wanted to pull College Scorecard data from the Department of Education. Though an R package does in fact exist to facilitate such a data pull, we will illustrate the use of the httr package instead. Start by requesting a key:
Data library:
Query explanation:
Suppose we wanted to retrieve all information on Winona State University. A URL of the following form will retrieve this information; paste it into a browser, specifying your registration key with the api_key parameter:

https://api.data.gov/ed/collegescorecard/v1/schools?school.name=Winona%20State%20University&api_key=your_registration_key_here
The following R code can be used to bring this into R.
install.packages("httr")library(httr)
URL <- "
# import all available data for Winona State University
wsu_request <- httr::GET(URL, query = list(api_key = 'your registration key here',
school.name = "Winona State University"))
This request provides us with all information collected on Winona State University. JSON is the data format the API uses to send results back to R. The hierarchical structure of JSON does not work well in R, which is best suited for rectangular data.
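Before parsing the response, it is worth a quick check that the request succeeded and that JSON was in fact returned:

R code:
httr::status_code(wsu_request)  # 200 indicates success
httr::http_type(wsu_request)    # should be "application/json"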
To retrieve the contents, use the content() function. This returns an R object (specifically a list).
wsu_data <- content(wsu_request)
names(wsu_data)
[1] "metadata" "results"
Note that the data is segmented into metadata and results; we are interested in the results.
names(wsu_data$results[[1]])
To see what data are available, we can look at a single year:
names(wsu_data$results[[1]]$'2015')
With such a large data set containing many embedded lists, we can explore the data by examining names at different levels:
wsu_data$results[[1]]$'2015'$cost
names(wsu_data$results[[1]]$'2015'$cost)
wsu_data$results[[1]]$'2015'$cost$attendance$academic_year
Getting cost over a sequence of years…
wsu_data$results[[1]]$'2014'$cost$attendance$academic_year
wsu_data$results[[1]]$'2013'$cost$attendance$academic_year
wsu_data$results[[1]]$'2012'$cost$attendance$academic_year
Getting the median cumulative ACT score…
names(wsu_data$results[[1]]$'2015'$admissions)
[1] "sat_scores" "admission_rate" "act_scores"
names(wsu_data$results[[1]]$'2015'$admissions$act_scores)
[1] "midpoint" "25th_percentile" "75th_percentile"
names(wsu_data$results[[1]]$'2015'$admissions$act_scores$midpoint)
[1] "math" "cumulative" "writing" "english"
wsu_data$results[[1]]$'2015'$admissions$act_scores$midpoint$cumulative
[1] 23
We can pull data collected on a certain variable over many years as follows:
install.packages("dplyr")library(dplyr)
# subset list for annual student data only
wsu_yr <- wsu_data$results[[1]][c(as.character(2000:2015))]
# extract median cumulative ACT score data for each year
wsu_yr %>%
sapply(function(x){x$admissions$act_scores$midpoint$cumulative}) %>%
unlist()
#extract net price for each year
wsu_yr %>%
sapply(function(x) x$cost$avg_net_price$overall) %>%
unlist()
# extract median debt data for each year
wsu_yr %>%
sapply(function(x) x$aid$median_debt$completers$overall) %>%
unlist()
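If you would rather have these yearly values in a single data frame than in a named vector, one approach (a small sketch building on the extract above) is:

R code:
# Build a year-by-value data frame from the named vector of ACT scores
act <- wsu_yr %>%
  sapply(function(x) x$admissions$act_scores$midpoint$cumulative) %>%
  unlist()
act_df <- data.frame(year = names(act), act_cumulative = act, row.names = NULL)
act_df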
CSV Version of file…
If you want to obtain a CSV file instead of JSON, change the URL as follows:

After the CSV file has been downloaded via a browser, you can read it into R using Import Dataset within RStudio.
library(readr)
wsu_data <- read_csv("D:/Teaching/DSCI325/Data/wsu_data.csv")
View(wsu_data)
Notice that the CSV file is very wide (about 17,000 variables and only one observation). To pull off certain columns, we will use the contains() function in conjunction with the dplyr::select() function. Here, the variables associated with median_debt.completers.overall are being selected.
dplyr::select(wsu_data, contains("median_debt.completers.overall"))
The gather() function in the tidyr package can be used to transpose this row of data into a column of data.

dplyr::select(wsu_data, contains("median_debt.completers.overall")) %>% tidyr::gather()
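Because each gathered key embeds the year, you can also recover a year column. A minimal sketch, assuming the flattened column names begin with the year (e.g., 2015.aid.median_debt.completers.overall):

R code:
dplyr::select(wsu_data, contains("median_debt.completers.overall")) %>%
  tidyr::gather() %>%
  dplyr::mutate(year = substr(key, 1, 4))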
Function to pull off cost for a particular year…
# Note: this assumes wsu_data is the list returned by content() in the JSON
# example above (not the CSV version read in with read_csv()).
mypull <- function(year){
  wsu_data$results[[1]][[as.character(year)]]$cost$attendance$academic_year
}
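For example, applying mypull() over a sequence of years:

R code:
sapply(2012:2015, mypull)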