A detailed guide on how I scraped opioid prescription history from the CDC and created an animated map to highlight changing behavior.
Categories: R, dataviz, scraping, maps

Author: Mirza S. Khan (Vanderbilt University)
Published: January 28, 2021
For some reason, I really enjoy making maps as a way to illustrate state and regional variation. One area that has appropriately gotten a lot of press is opioid prescribing. The CDC is a great data resource, and maintains state and county-level prescribing data.
Unfortunately, the data is not available in an easily readable format such as CSV or JSON. What follows is a walk-through of how I went about scraping the CDC website to get it.
Scraping
May I scrape?
It is good practice to check a site’s robots.txt to make sure that they are cool with you accessing and scraping their data. The polite package makes this pretty simple.
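The call that produced the session summary below looked roughly like this (the user-agent string is simply whatever you choose to identify yourself as):

```r
library(polite)

# Introduce yourself to the host and check its robots.txt
session <- bow("https://www.cdc.gov/", user_agent = "M Khan")
session
```

Printing the session object reports the robots.txt rules, the crawl delay, and whether the path is scrapable for your user agent.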
<polite session> https://www.cdc.gov/
User-agent: M Khan
robots.txt: 44 rules are defined for 4 bots
Crawl delay: 5 sec
The path is scrapable for this user-agent
Test case with 2017
Now that I see that it is ok for me to scrape the data, I’d like to start out with a single case - the year 2017 - to test things out before replicating it for all available years.
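As a sketch of this step (the exact state-page URL is an assumption based on the link pattern scraped later on, and the deprecation warning below suggests the original call passed `fill = TRUE`):

```r
library(rvest)
library(magrittr)

# Page holding the 2017 state-level table (URL assumed from the link pattern below)
url_2017 <- "https://www.cdc.gov/drugoverdose/rxrate-maps/state2017.html"

opioidRx_2017 <- read_html(url_2017) %>%
  html_table() %>%
  .[[1]]  # first (and only) table on the page
```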
Warning: The `fill` argument of `html_table()` is deprecated as of rvest 1.0.0.
An improved algorithm fills by default so it is no longer needed.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
# Tidy up the column names
col_names <- c("state", "state_abbrev", "opioid_Rx_rate")
colnames(opioidRx_2017) <- col_names

opioidRx_2017 %>%
  head() %>%
  gt::gt()
| state         | state_abbrev | opioid_Rx_rate |
|---------------|--------------|----------------|
| United States | US           | 59.0           |
| Alaska        | AK           | 52.0           |
| Alabama       | AL           | 108.8          |
| Arkansas      | AR           | 106.1          |
| Arizona       | AZ           | 60.8           |
| California    | CA           | 39.8           |
Things seem reasonable for 2017, so I can now apply the same process to all of the available years.
Get the links for each year
We find that the links for the years 2006 to 2017 are available at this link. I can use the same SelectorGadget method to identify and subsequently scrape the links for each of these years.
# Collect list of sublinks
sublinks <- ralger::scrap(
  "https://www.cdc.gov/drugoverdose/rxrate-maps/index.html",
  node = ".mb-3+ .mb-3 a"
)
Looks like I’ll need to prefix each of these with “https://www.cdc.gov/drugoverdose/maps/rxcounty”. Let’s create a dataframe that has the year and the corresponding link.
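A sketch of that step, assuming the scraped `sublinks` are relative paths (the column names and the year-parsing trick are my own choices):

```r
library(dplyr)
library(tibble)

# Assuming `sublinks` holds relative paths like "/drugoverdose/rxrate-maps/state2006.html"
links_df <- tibble(link = paste0("https://www.cdc.gov", sublinks)) %>%
  mutate(year = readr::parse_number(link)) %>%  # pull the 4-digit year out of each URL
  select(year, link)
```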
I’d like to point out that I used possibly. This is to make sure that a single failed request doesn’t render the entire mapping run all for naught. Check out the R4DS section on Iteration: Dealing with Failure, which also covers safely and quietly.
Explanation of possibly arguments used here:
otherwise = NULL, i.e. if there is a failure, it will assign a value of NULL.
quiet = FALSE because I wanted a verbose output to see what’s going on
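Putting possibly to work, the download step might look like this (the name `links_df` for the year/link dataframe and the per-page table-reading helper are assumptions about how each page was parsed):

```r
library(purrr)
library(rvest)
library(dplyr)
library(magrittr)

# Wrap the per-page scraper so one bad link returns NULL instead of killing the run
scrape_table <- possibly(
  function(url) read_html(url) %>% html_table() %>% .[[1]],
  otherwise = NULL,
  quiet = FALSE
)

opioid_df <- links_df %>%
  mutate(data_df = map(link, scrape_table))
```

Because map() returns a list, data_df ends up as a list-column of nested tibbles, one per year.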
# Notice how the 'data_df' column is just a nested df
opioid_df %>%
  head(3)
# A tibble: 3 × 3
year link data_df
<dbl> <chr> <list>
1 2006 https://www.cdc.gov/drugoverdose/rxrate-maps/state2006.html <tibble>
2 2007 https://www.cdc.gov/drugoverdose/rxrate-maps/state2007.html <tibble>
3 2008 https://www.cdc.gov/drugoverdose/rxrate-maps/state2008.html <tibble>
For each year, I have a corresponding tibble that contains state level opioid prescribing data. I will have to use unnest() to expand this.
complete_df <- opioid_df %>%
  unnest(cols = c(data_df)) %>%
  rename(
    "state" = "State",
    "state_abbrev" = "State Abbreviation",
    "opioid_Rx_rate" = "Opioid Dispensing Rate per 100"
  ) %>%
  # Name of abbreviation column changed from pre- and post-2017
  mutate(state_abbrev = coalesce(state_abbrev, Abbreviation)) %>%
  select(-Abbreviation, -link)

complete_df %>%
  head() %>%
  gt()
| year | state      | state_abbrev | opioid_Rx_rate |
|------|------------|--------------|----------------|
| 2006 | Alabama    | AL           | 115.6          |
| 2006 | Alaska     | AK           | 63.4           |
| 2006 | Arizona    | AZ           | 74.3           |
| 2006 | Arkansas   | AR           | 98.3           |
| 2006 | California | CA           | 51.0           |
| 2006 | Colorado   | CO           | 62.2           |
Mapping
Above, I outlined how I went about scraping the CDC website for opioid prescribing data. Now, I’ll describe how I turned this data into a map and used gganimate to make a GIF that better shows the state-level trends from 2006-2019.
You may notice that the relevant geometry for each state is provided by albersusa::usa_sf(). We must therefore attach it to the data of interest with a join, e.g. one of dplyr’s *_join functions or base merge.
A map using 2006 data
I’ll start by making a map of prescribing rates for 2006.
usa_sf() %>%
  inner_join(
    complete_df %>% filter(year == 2006),  # filter for the year 2006
    by = c("iso_3166_2" = "state_abbrev")
  ) %>%
  ggplot() +
  geom_sf(aes(fill = opioid_Rx_rate)) +
  scale_fill_gradient2(
    name = "Opioid Rx rate",
    low = "blue", high = "red",
    midpoint = median(complete_df$opioid_Rx_rate)
  ) +
  ggthemes::theme_map() +
  coord_sf(crs = albersusa::us_aeqd_proj) +
  theme(legend.position = "right") +
  labs(
    title = "Opioid Prescribing Rates in 2006",
    subtitle = "Rates reported as retail opioid prescriptions per 100 persons",
    caption = "by Mirza Khan (@mirzakhan_) \n Source: CDC National Center for Injury Prevention and Control"
  )
old-style crs object detected; please recreate object with a recent sf::st_crs()
Animated Map
Now, I’d like to use gganimate to get a sense of how things are changing from 2006 to 2019, i.e. better, worse, the same?
This is simpler than one would think. All I added to the previous map was transition_states(year), and in the title I replaced 2006 with {closest_state}.
# remotes::install_github("thomasp85/transformr")
usa_sf() %>%
  inner_join(complete_df, by = c("iso_3166_2" = "state_abbrev")) %>%
  ggplot() +
  geom_sf(aes(fill = opioid_Rx_rate)) +
  scale_fill_gradient2(
    name = "Opioid Rx rate",
    low = "blue", high = "red",
    midpoint = median(complete_df$opioid_Rx_rate)
  ) +
  ggthemes::theme_map() +
  coord_sf(crs = albersusa::us_aeqd_proj) +
  transition_states(year) +
  theme(legend.position = "right") +
  labs(
    title = "Opioid Prescribing Rates in {closest_state}",
    subtitle = "Rates reported as retail opioid prescriptions per 100 persons",
    caption = "by Mirza Khan (@mirzakhan_) \n Source: CDC National Center for Injury Prevention and Control"
  )
old-style crs object detected; please recreate object with a recent sf::st_crs()
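To actually render and save the GIF, gganimate’s animate() and anim_save() do the work. A sketch, assuming the animated plot above is stored in a variable I’ll call anim_plot (the frame count, fps, and output filename are my own choices):

```r
library(gganimate)

# Render the animation with the gifski backend, then write it to disk
anim <- animate(anim_plot, nframes = 100, fps = 10, renderer = gifski_renderer())
anim_save("opioid_rx_rates.gif", animation = anim)
```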