DSCI 325: R Handout 5 -- Web scraping with R using APIs

In this handout, we will explore scraping data from the web (in R) using APIs. The information in this handout is largely based on the following tutorial prepared by Bradley Boehmke:

What is an API?

Many websites and companies allow programmatic access to their data via web APIs (API stands for application programming interface). Essentially, this gives a web-based database a way to communicate with another program (such as R).

Each organization’s API is unique, but you typically need to know the following for your specific data set of interest:

  • URL for the organization and data you are pulling
  • Data set you are pulling from
  • Data content (to specify the variables you want the API to retrieve, you need to be familiar with the data library)

In addition, you may also need the following:

  • API key (i.e., API token). This is typically obtained by supplying your name/email to the organization.
  • OAuth. This is an authorization framework that supplies credentials proving you are authorized to access certain information.

The remainder of this handout will introduce these topics with the use of examples.

Example: Pulling U.S. Bureau of Labor Statistics data
No API key or OAuth is strictly required, but a registration key is recommended.


To register for a key, visit this website:

All available data sets from the Bureau of Labor Statistics and their series code information are available here:
In this example, we will consider pulling Local Area Unemployment Statistics:

The BLS provides this breakdown:

Understanding the SERIES ID: LAUCN271690000000003

Prefix / Seasonal Adjustment Code / Area Code / Measure Code
LA / U / CN2716900000000 / 03

Source:

You can view the data being pulled via the BLS dataViewer website.

The following link contains some answers to FAQs:

To read these data into R from the web, make use of the blsAPI package; an excerpt of its documentation follows.

blsAPI {blsAPI} / R Documentation

Request Data from the U.S. Bureau Of Labor Statistics API

Description

Allows users to request data for one or multiple series through the U.S. Bureau of Labor Statistics API. Users provide parameters as specified in the BLS API signature documentation, and the function returns a JSON string or a data frame.

Usage

blsAPI(payload = NA, api.version = 1, return.data.frame = FALSE)

Arguments

payload / a string or a list containing data to be sent to the API.
api.version / an integer specifying which API version to use (1 for v1, 2 for v2).
return.data.frame / a Boolean indicating whether the function returns JSON (the default) or a data frame. If the data frame option is used, the series ID is added as a column, which is helpful when multiple series are selected.

R code:

install.packages("blsAPI")
library(blsAPI)
# Supply series identifier to pull data
payload <- list('seriesid' = c('LAUCN271690000000003'),
                'registrationKey' = 'your registration key here')
unemployment_winona <- blsAPI(payload, api.version = 2, return.data.frame = TRUE)
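
Once the call returns, it is worth a quick look at what came back. A minimal check (the exact column names are produced by blsAPI's data-frame conversion and may vary by API version):

R code:

str(unemployment_winona)
head(unemployment_winona)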

Task: Use the blsAPI package to pull data from the National Employment, Hours, and Earnings data set. Pull the average weekly hours of all employees in the construction sector. (A sketch for this task appears at the end of this example.)

Data for a three-year period is returned by default. To pull data from a specified set of years, you can specify additional arguments:

R code:

payload <- list('seriesid' = c('LAUCN271690000000003'),
                'registrationKey' = 'your registration key here',
                'startyear' = 2010,
                'endyear' = 2017)
unemployment_winona <- blsAPI(payload, 2, return.data.frame = TRUE)

Finally, note that you can also pull multiple series at a time:

R code:

payload <- list('seriesid' = c('LAUCN271690000000003', 'LAUCN271690000000004'),
                'registrationKey' = 'your registration key here',
                'startyear' = 2010,
                'endyear' = 2017)
unemployment_winona_rateandnumber <- blsAPI(payload, 2, return.data.frame = TRUE)
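
Returning to the task above, one possible sketch is shown below. The CES series ID used here (CES2000000002, average weekly hours of all employees in construction) is our reading of the CES series-ID format; verify it against the BLS data library before relying on it.

R code:

# pull average weekly hours of all employees in construction (CES data set)
# NOTE: the series ID below is an assumption -- confirm it in the BLS series directory
payload <- list('seriesid' = c('CES2000000002'),
                'registrationKey' = 'your registration key here')
construction_hours <- blsAPI(payload, api.version = 2, return.data.frame = TRUE)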

Example: Pulling NOAA Data

The rnoaa package can be used to request data from the National Climatic Data Center (now known as the National Centers for Environmental Information) API. This package requires an API key. To request a key, visit the NOAA token request page and provide your email address.

NOAA data set descriptions are available here:

Description

rnoaa is an R interface to NOAA climate data.

Data Sources

Many functions in this package interact with the National Climatic Data Center application programming interface (API); these functions all start with ncdc_. An access token, or API key, is required to use all the ncdc_ functions. The key is required by NOAA, not by the package authors. Go to the link given above to get an API key.
More NOAA data sources are being added over time. Data sources and their function prefixes are:
  • buoy_* - NOAA buoy data from the National Buoy Data Center
  • gefs_* - GEFS forecast ensemble data
  • ghcnd_* - GHCND daily data from NOAA
  • isd_* - ISD/ISH data from NOAA
  • homr_* - Historical Observing Metadata Repository (HOMR)
  • ncdc_* - NOAA National Climatic Data Center (NCDC)
  • seaice - sea ice data
  • storm_* - storm (IBTrACS) data
  • swdi - Severe Weather Data Inventory (SWDI)
  • tornadoes - tornado data from the NOAA Storm Prediction Center
  • argo_* - Argo buoy data
  • coops_search - NOAA CO-OPS tides and currents data

For example, let's start by pulling all weather stations in Winona County, MN, using Winona County's FIPS code. We will focus on the GHCND data set, which contains records of daily measurements such as maximum and minimum temperature, total daily precipitation, etc.

R code:

install.packages("rnoaa")
library(rnoaa)
stations <- ncdc_stations(datasetid = 'GHCND',
                          locationid = 'FIPS:27169',
                          token = 'your registration key here')
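
You can then look up a station's ID by name. A minimal sketch, assuming ncdc_stations() returned its usual data element containing name and id columns:

R code:

# find the row for the Winona station in the returned data frame
subset(stations$data, grepl("WINONA", name))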

To pull data from one of these stations, we need the station ID. Suppose we want to pull all available data from the “Winona, MN US” station. The following command supplies the data set to pull, the start and end dates (requests are limited to a one-year range), the station ID, and your key.

R code:

climate <- ncdc(datasetid = 'GHCND',
                startdate = '2007-01-01',
                enddate = '2007-12-31',
                stationid = 'GHCND:USC00219067',
                token = 'your registration key here')

climate$data

   date                datatype station           value fl_m fl_q fl_so fl_t
1  2007-01-01T00:00:00 PRCP     GHCND:USC00219067    41             0    1830
2  2007-01-01T00:00:00 SNOW     GHCND:USC00219067    18             0
3  2007-01-01T00:00:00 SNWD     GHCND:USC00219067    25             0
4  2007-01-01T00:00:00 TMAX     GHCND:USC00219067    33             0    1830
5  2007-01-01T00:00:00 TMIN     GHCND:USC00219067   -22             0    1830
6  2007-01-01T00:00:00 TOBS     GHCND:USC00219067   -22             0    1830
7  2007-01-02T00:00:00 PRCP     GHCND:USC00219067     0             0    1830
8  2007-01-02T00:00:00 SNOW     GHCND:USC00219067     0             0
9  2007-01-02T00:00:00 SNWD     GHCND:USC00219067     0             0
10 2007-01-02T00:00:00 TMAX     GHCND:USC00219067    56             0    1830
11 2007-01-02T00:00:00 TMIN     GHCND:USC00219067   -72             0    1830
12 2007-01-02T00:00:00 TOBS     GHCND:USC00219067    17             0    1830
13 2007-01-03T00:00:00 PRCP     GHCND:USC00219067     0             0    1830
14 2007-01-03T00:00:00 SNOW     GHCND:USC00219067     0             0
15 2007-01-03T00:00:00 SNWD     GHCND:USC00219067     0             0
16 2007-01-03T00:00:00 TMAX     GHCND:USC00219067    78             0    1830
17 2007-01-03T00:00:00 TMIN     GHCND:USC00219067     6             0    1830
18 2007-01-03T00:00:00 TOBS     GHCND:USC00219067    72             0    1830
19 2007-01-04T00:00:00 PRCP     GHCND:USC00219067     0             0    1830
20 2007-01-04T00:00:00 SNOW     GHCND:USC00219067     0             0
21 2007-01-04T00:00:00 SNWD     GHCND:USC00219067     0             0
22 2007-01-04T00:00:00 TMAX     GHCND:USC00219067    78             0    1830
23 2007-01-04T00:00:00 TMIN     GHCND:USC00219067    39             0    1830
24 2007-01-04T00:00:00 TOBS     GHCND:USC00219067    72             0    1830
25 2007-01-05T00:00:00 PRCP     GHCND:USC00219067     0             0    1830

Next, let’s pull data on precipitation for 2007 (note the use of the datatypeid argument). By
default, ncdc limits the results to 25, but we can adjust the limit argument as shown below.

R code:

precip <- ncdc(datasetid = 'GHCND',
               startdate = '2007-01-01',
               enddate = '2007-12-31',
               limit = 365,
               stationid = 'GHCND:USC00219067',
               datatypeid = 'PRCP',
               token = 'your registration key here')

Finally, we can sort the observations to see which days in 2007 experienced the greatest rainfall.

R code:

library(dplyr)  # for arrange() and the %>% pipe
precip.data <- precip$data
precip.data %>%
  arrange(desc(value))

   date                datatype station           value fl_m fl_q fl_so fl_t
1  2007-08-19T00:00:00 PRCP     GHCND:USC00219067  1257             0    1830
2  2007-08-20T00:00:00 PRCP     GHCND:USC00219067  1016             0    1830
3  2007-08-11T00:00:00 PRCP     GHCND:USC00219067   424             0    1830
4  2007-02-24T00:00:00 PRCP     GHCND:USC00219067   404             0    1830
5  2007-08-18T00:00:00 PRCP     GHCND:USC00219067   396             0    1830
6  2007-09-07T00:00:00 PRCP     GHCND:USC00219067   391             0    1830
7  2007-05-24T00:00:00 PRCP     GHCND:USC00219067   381             0    1830
8  2007-08-14T00:00:00 PRCP     GHCND:USC00219067   361             0    1830
...
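
As a quick visual check, you can also plot the daily precipitation series. A base-R sketch, assuming (per the GHCND documentation) that PRCP values are reported in tenths of a millimeter:

R code:

# convert the ISO date strings to Date objects and plot daily precipitation
precip.data$date <- as.Date(substr(precip.data$date, 1, 10))
plot(precip.data$date, precip.data$value / 10, type = "h",
     xlab = "Date", ylab = "Precipitation (mm)",
     main = "Daily precipitation, Winona, MN (2007)")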

Example: Leveraging an Organization's API without an R Package

In some situations, an R package may not exist to communicate with an organization's API. Hadley Wickham developed the httr package to make working with web APIs easy. It offers multiple functions, but in this example we will focus on using the GET() function to access an API, supply some request parameters, and retrieve the output.

Suppose we want to pull College Scorecard data from the Department of Education. Though an R package does in fact exist to facilitate such a data pull, we will illustrate the use of the httr package instead. Start by requesting a key:

Data library:
Query explanation:

Suppose we want to retrieve all information about Winona State University. The following URL will retrieve this information; paste it into a browser, specifying your registration key with the api_key parameter.

your_registration_key_here

The following R code can be used to bring this into R.

install.packages("httr")
library(httr)
URL <- "
# import all available data for Winona State University
wsu_request <- httr::GET(URL, query = list(api_key = 'your registration key here',
school.name = "Winona State University"))

This request provides all information collected on Winona State University. The API sends its data back to R as JSON. The hierarchical structure of JSON does not work well in R, which is best suited to rectangular data.

To retrieve the contents of the request, use the content() function. This returns an R object (specifically, a list).

wsu_data <- content(wsu_request)

names(wsu_data)

[1] "metadata" "results"

Note that the data is segmented into metadata and results; we are interested in the results.
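
The metadata element is worth a quick glance first, since it typically echoes the query and reports paging information such as the total number of matching records:

wsu_data$metadata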

names(wsu_data$results[[1]])

To see what data are available, we can look at a single year:

names(wsu_data$results[[1]]$'2015')

With such a large data set containing many embedded lists, we can explore the data by examining names at different levels:

wsu_data$results[[1]]$'2015'$cost


names(wsu_data$results[[1]]$'2015'$cost)

wsu_data$results[[1]]$'2015'$cost$attendance$academic_year

Getting cost over a sequence of years…

wsu_data$results[[1]]$'2014'$cost$attendance$academic_year

wsu_data$results[[1]]$'2013'$cost$attendance$academic_year

wsu_data$results[[1]]$'2012'$cost$attendance$academic_year
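
Since these calls differ only in the year, the repetition can be collapsed with sapply(); indexing with [[ ]] and the year as a character string is equivalent to the $'year' form used above:

R code:

# academic-year cost of attendance for several years in one call
sapply(as.character(2012:2015),
       function(yr) wsu_data$results[[1]][[yr]]$cost$attendance$academic_year)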

Getting the median cumulative ACT score:

names(wsu_data$results[[1]]$'2015'$admissions)

[1] "sat_scores" "admission_rate" "act_scores"

names(wsu_data$results[[1]]$'2015'$admissions$act_scores)

[1] "midpoint" "25th_percentile" "75th_percentile"

names(wsu_data$results[[1]]$'2015'$admissions$act_scores$midpoint)

[1] "math" "cumulative" "writing" "english"

wsu_data$results[[1]]$'2015'$admissions$act_scores$midpoint$cumulative

[1] 23

We can pull data collected on a certain variable over many years as follows:

install.packages("dplyr")
library(dplyr)
# subset list for annual student data only
wsu_yr <- wsu_data$results[[1]][c(as.character(2000:2015))]
# extract median cumulative ACT score data for each year
wsu_yr %>%
sapply(function(x){x$admissions$act_scores$midpoint$cumulative}) %>%
unlist()
#extract net price for each year
wsu_yr %>%
sapply(function(x) x$cost$avg_net_price$overall) %>%
unlist()
# extract median debt data for each year
wsu_yr %>%
sapply(function(x) x$aid$median_debt$completers$overall) %>%
unlist()
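
To combine these extractions into one rectangular object, one possible sketch follows; pull_stat() is a hypothetical helper (not part of any package) that returns NA whenever a year is missing a value:

R code:

# hypothetical helper: apply an accessor to each year's list, NA if absent
pull_stat <- function(lst, accessor) {
  vapply(lst, function(x) {
    v <- accessor(x)
    if (is.null(v)) NA_real_ else as.numeric(v)
  }, numeric(1))
}

wsu_trends <- data.frame(
  year         = as.integer(names(wsu_yr)),
  act_midpoint = pull_stat(wsu_yr, function(x) x$admissions$act_scores$midpoint$cumulative),
  net_price    = pull_stat(wsu_yr, function(x) x$cost$avg_net_price$overall),
  median_debt  = pull_stat(wsu_yr, function(x) x$aid$median_debt$completers$overall)
)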


CSV version of the file

If you want to obtain a CSV file instead of JSON, change the URL as follows:

After the CSV file has been downloaded via a browser, you can read it into R using Import Dataset within RStudio.


library(readr)
wsu_data <- read_csv("D:/Teaching/DSCI325/Data/wsu_data.csv")
View(wsu_data)

Notice that the CSV file is very wide (about 17,000 variables and only 1 observation). To pull off certain columns, we will use the contains() function in conjunction with the dplyr::select() function. Here, the variables associated with median_debt.completers.overall are being selected.

dplyr::select(wsu_data, contains("median_debt.completers.overall"))

The gather() function in tidyr can be used to transpose this row of data into a column of data.

dplyr::select(wsu_data, contains("median_debt.completers.overall")) %>% tidyr::gather()
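
The gathered key column still carries the year as a prefix (names such as 2015.aid.median_debt.completers.overall). A sketch that splits the year out, assuming that naming pattern holds:

R code:

library(tidyr)
dplyr::select(wsu_data, contains("median_debt.completers.overall")) %>%
  tidyr::gather(key = "variable", value = "median_debt") %>%
  dplyr::mutate(year = substr(variable, 1, 4))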

Function to pull off the cost for a particular year (note that this indexes the list version of wsu_data returned by content(), not the CSV version read in above):

mypull <- function(year){
  # build the list-indexing expression for the requested year and evaluate it
  return(eval(parse(text = paste("wsu_data$results[[1]]$'", year, "'$cost$attendance$academic_year", sep = ""))))
}
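
For example, applying it to one year or across several:

R code:

mypull(2015)
sapply(2010:2015, mypull)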
