Catching up on package blogging, and juuuust managing to equal the low, low bar of four posts per year that I appear to have set myself.

This post is about slga, a data-access package I wrote a month or so ago to facilitate R-based access to the Soil and Landscape Grid of Australia, a set of geostatistically-modelled soil attributes and accompanying environmental covariate datasets.

slga is another one of those packages that happened because I read some interesting code (in this case, Ross Searle’s WCS access demo script) and decided to tinker a bit and then… failed to stop. Whoops. The basic idea is to hook into the set of OGC Web Coverage Services available for the SLGA and make it as easy as possible to retrieve subsets of the parent datasets. My only requirement was that the subsets be ‘clean’; i.e. a perfect match in terms of cell value, cell size and alignment against the parent dataset. And thus the following:

library(raster)
library(slga)
library(ggplot2)

# get surface clay content for King Island
aoi <- c(143.75, -40.17, 144.18, -39.57)
ki_surface_clay <- get_slga_data(product = 'TAS', attribute = 'CLY',
                                 component = 'all', depth = 1,
                                 aoi = aoi, write_out = FALSE)

retrieves this:

plot of King Island surface clay content

WCS services aren’t quite designed for this task - they’re mainly geared towards dynamic data access via web map portals or GIS GUIs, so they default to a lot of dynamic rescaling and resampling to make that efficient. Still, with a bit of mucking about it’s possible to bend them towards simple subsetting of very large raster datasets, at pretty reasonable speed (depending, of course, on one’s internet connection, insert NBN joke here).
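
For the curious, a subset request ultimately boils down to a WCS GetCoverage call with an explicit bounding box and cell size. The sketch below uses WCS 1.0.0 parameter names with a placeholder endpoint and coverage ID - the real SLGA services (and the exact parameters slga sends) may differ:

library(httr)

# hypothetical endpoint; the real SLGA services live on CSIRO/TERN servers
wcs_base <- 'https://example.org/slga/clay/WCSServer'

query <- list(
  service  = 'WCS',
  version  = '1.0.0',
  request  = 'GetCoverage',
  coverage = '1',                       # placeholder coverage ID
  crs      = 'EPSG:4326',
  bbox     = paste(c(143.75, -40.17, 144.18, -39.57), collapse = ','),
  resx     = 3 / 3600,                  # ask for the parent grid's ~3 arc-second cells
  resy     = 3 / 3600,
  format   = 'GeoTIFF')

# httr::GET(wcs_base, query = query, write_disk('ki_surface_clay.tif'))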

My other goal for this project was to figure out pkgdown, so I’m not going to reiterate how slga works in this post. I’m just going to smugly link to slga's vignette where it sits on the package website.

Thanks to

I really didn’t get OGC web services at all before diving into this. The official documentation is pretty comprehensive, but I couldn’t find much higher level material about working with them. I definitely wouldn’t have known where to start without picking through Ross’ script; slga only exists because of his work (as does the grid itself!).

I found Lorenzo Busetto’s tutorial post immensely helpful when getting started with pkgdown. Datacamp’s deployment tutorial was also super good. For customising my site, I left the default bootstrap theme in place and just overrode some CSS for nicer colours. Lest anyone think I’m actually good at CSS, this was largely accomplished by clicking ‘Inspect Element’ in Firefox and copypasting the relevant CSS code out. It is therefore probable that my extra.css file is an affront to god and man alike, but whatever, it works.

Lastly, this trick for mosaicing a list of rasters is the bees’ knees, and enabled tiling requests over larger areas. That said, I still wouldn’t advise trying to download massive data extents at once with this package. Once you start going after a whole state’s worth of data, you’re better off downloading the entire parent dataset from the CSIRO Data Access Portal and cropping it.
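
A do.call + raster::merge sketch gets at the same idea, in case that link ever dies (the linked trick may differ in detail). The tiles here are fake, pre-aligned stand-ins for per-tile WCS responses:

library(raster)

# three fake, pre-aligned tiles standing in for per-tile WCS responses
tiles <- lapply(0:2, function(i) {
  r <- raster(xmn = i, xmx = i + 1, ymn = 0, ymx = 1, res = 0.01)
  setValues(r, runif(ncell(r)))
})

# merge() accepts any number of rasters, so do.call() can feed it the whole list
mosaicked <- do.call(merge, tiles)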

Gripes

The only thing I don’t love about this project is the low-ish unit test coverage. I’m not sure how best to cover some of the core functions, since they hit web services, and the tests shouldn’t be eating bandwidth or relying on an HTTP 200 return. If anyone has any advice, fire away.
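
One option (sketched below, not what slga currently ships) is to keep the web-dependent tests but skip them whenever they can’t or shouldn’t run, so CRAN and offline machines never touch the service. The expectations here are just for illustration, reusing the King Island request from up top:

library(testthat)

test_that("WCS requests return a usable raster", {
  skip_on_cran()       # never hit the service during CRAN checks
  skip_if_offline()    # quietly bail out when there's no connection

  result <- slga::get_slga_data(product = 'TAS', attribute = 'CLY',
                                component = 'all', depth = 1,
                                aoi = c(143.75, -40.17, 144.18, -39.57),
                                write_out = FALSE)
  expect_gt(raster::ncell(result), 0)
})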

I also really wish the WCS spec had some kind of source-align/target-align flag a la gdalwarp’s ‘-tap’ option, because that would remove the need for about half the code I wrote for this package. Might be stretching the concept too far though, idk.
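
For anyone who hasn’t met it: -tap (‘target aligned pixels’) forces gdalwarp’s output grid onto multiples of the resolution given via -tr, which is roughly the alignment behaviour slga has to reproduce in R. Something like the following, with hypothetical filenames:

# snap a crop of the parent grid to cell boundaries at ~3 arc-second resolution
system2('gdalwarp',
        args = c('-tr', '0.000833333', '0.000833333',   # target resolution
                 '-tap',                                 # align output cells to that grid
                 '-te', '143.75', '-40.17', '144.18', '-39.57',  # xmin ymin xmax ymax
                 'parent_clay.tif', 'ki_clay_aligned.tif'))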


Anyway, so there that is - if you’re working in Aus and you need a bit of quick soil and/or terrain data, this may be useful. I’m also drafting up a matching package for Geoscience Australia’s web services, so watch this space.

Discrete Global Grids! They’re pretty cool, and slowly starting to catch on. Google’s been plugging away at S2 for a while now, and Uber recently released H3. Both libraries are open-sourced and have their own sets of interesting features, but don’t seem to have found their way into traditional GIS software yet, so you need some coding skill to access them.

I could see some interesting potential use cases for H3 as soon as I read the documentation, so I was super keen to start playing with it ASAP. There were some barriers between me and all the hexagons I could eat though, so I had to do a little work first.

My process here was basically:

  • realise that no R bindings were available for H3 (apart from this attempt, which appears to be slightly abandoned and doesn’t work on Windows), sulk a little
  • realise there’s a transpiled version, h3-js, and that V8 is a thing (hot damn!)
  • spend a Saturday morning figuring out how to get h3-js bundled into a v8 session
  • realise I now have to learn some damn JavaScript; spend Saturday afternoon on codecademy
  • briefly ponder whether six (6) hours of JS experience is enough to get by
  • proceed anyway because I’ve got this far and f*&k imposter syndrome, right? Right.
  • …?
  • profit!

h3jsr is now available from GitHub. I’m feeling pretty good about it.

Y tho

Right now my own applications for this package are nice data aggregation and pretty maps. That might seem basic, but it does seem to be all Uber are using it for themselves so far, and it’s solved some substantial business problems.

Performance

You’ll be able to do a fair bit with this package so long as you think ahead. The most important thing to remember is that every call to an h3jsr function involves casting data into a JS environment via JSON, and that eats time. Aim to feed as much data into one function call as possible - use lists, vectors or dataframes as input wherever you can; don’t try to iterate over individual geometries. Bear in mind that there’s an upper limit to what V8 can transfer in one hit.
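
To make that concrete, here’s the same conversion done both ways - the argument order follows the point_to_h3() calls in the demo below, and the only real difference is paying the R-to-V8 JSON transfer once versus once per geometry:

library(sf)
library(h3jsr)

# a small pile of points to index
pts <- st_as_sf(data.frame(lng = runif(500, 152.9, 153.1),
                           lat = runif(500, -27.6, -27.4)),
                coords = c('lng', 'lat'), crs = 4326)

# good: one call, one trip into the V8 session
addresses <- point_to_h3(pts, 9)

# avoid: 500 separate trips into the V8 session
addresses_slow <- lapply(seq_len(nrow(pts)), function(i) point_to_h3(pts[i, ], 9))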

Demo time

I did the bulk of the work on this package back in July, and then idly tinkered with it while failing to complete this post - my examples were boring! Luckily the Uber Open Summit 2018 happened not long ago and as part of it, h3-js dev Nick Rabinowitz ran a great live tutorial on Suitability Analysis using h3-js and Mapbox GL JS. Attempting a rebuild in R seems like a good way to demonstrate key functions.

library(httr)
library(jsonlite)
library(geojsonsf)
library(tidyverse)
library(ggspatial)
library(raster)
library(sf)
library(h3jsr)
options(stringsAsFactors = FALSE)

All the tutorial inputs are GitHub gists, so I can download and convert them to sf data frames like so:

Oakland crime reports, last 90 days. Source: data.oaklandnet.com

crime_90_days <- httr::GET('https://gist.githubusercontent.com/nrabinowitz/d3a5ca3e3e40727595dd137b65058c76/raw/f5ef0fed8972d04a27727ebb50e065265e2d853f/oakland_crime_90days.json') %>%
  httr::content() %>%
  fromJSON() %>%
  sf::st_as_sf(., coords = c('lng', 'lat'), crs = 4326) # JSON's always 4326

head(crime_90_days)
## Simple feature collection with 6 features and 1 field
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -122.2758 ymin: 37.75606 xmax: -122.1889 ymax: 37.81339
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##                   type                   geometry
## 1            VANDALISM POINT (-122.2655 37.81339)
## 2              ASSAULT  POINT (-122.2758 37.7969)
## 3        THEFT/LARCENY POINT (-122.2026 37.75606)
## 4              ROBBERY POINT (-122.2352 37.78423)
## 5 DISTURBING THE PEACE POINT (-122.1889 37.79133)
## 6            VANDALISM  POINT (-122.219 37.78628)

Oakland public school locations. Source: data.oaklandnet.com

public_schools <- httr::GET('https://gist.githubusercontent.com/nrabinowitz/d3a5ca3e3e40727595dd137b65058c76/raw/babf7357f15c99a1b2a507a33d332a4a87b7df8d/public_schools.json') %>%
  httr::content() %>%
  fromJSON() %>%
  sf::st_as_sf(., coords = c('lng', 'lat'), crs = 4326)

head(public_schools)
## Simple feature collection with 6 features and 1 field
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -122.2869 ymin: 37.74664 xmax: -122.1656 ymax: 37.813
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##         type                   geometry
## 1    Charter POINT (-122.1849 37.79886)
## 2    Charter  POINT (-122.225 37.77617)
## 3     Middle POINT (-122.1656 37.74664)
## 4       High   POINT (-122.2869 37.813)
## 5 Elementary   POINT (-122.23 37.77982)
## 6 Elementary POINT (-122.2371 37.80036)

BART station locations. Source: bart.gov. This is GeoJSON data, not straight JSON, so note how the import process is a little different.

bart_stations <- httr::GET('https://gist.githubusercontent.com/nrabinowitz/d3a5ca3e3e40727595dd137b65058c76/raw/8f1a3e30113472404feebc288e83688a6d5cf33d/bart.json') %>%
  httr::content() %>%
  geojson_sf()

head(bart_stations[, 1])
## Simple feature collection with 6 features and 1 field
## geometry type:  POINT
## dimension:      XYZ
## bbox:           xmin: -122.4475 ymin: 37.72158 xmax: -122.2686 ymax: 37.8528
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##                                  name                       geometry
## 1 12th St. Oakland City Center (12TH) POINT Z (-122.2715 37.80377 0)
## 2             16th St. Mission (16TH) POINT Z (-122.4197 37.76506 0)
## 3             19th St. Oakland (19TH) POINT Z (-122.2686 37.80835 0)
## 4             24th St. Mission (24TH) POINT Z (-122.4181 37.75247 0)
## 5                        Ashby (ASHB)  POINT Z (-122.2701 37.8528 0)
## 6                  Balboa Park (BALB) POINT Z (-122.4475 37.72158 0)

Travel times from Oakland to downtown SF by census tract. Source: movement.uber.com

sf_travel_times <- httr::GET('https://gist.githubusercontent.com/nrabinowitz/d3a5ca3e3e40727595dd137b65058c76/raw/657a9f3b64fedc718c3882cd4adc645ac0b4cfc5/oakland_travel_times.json') %>%
  httr::content() %>%
  geojson_sf()

head(sf_travel_times)
## Simple feature collection with 6 features and 3 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -122.3049 ymin: 37.74276 xmax: -122.1595 ymax: 37.84773
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##   MOVEMENT_ID                                DISPLAY_NAME travelTime
## 1          46   500 Chester Street, West Oakland, Oakland        708
## 2          47             9700 Birch Street, Cox, Oakland       1575
## 3          58        5600 Genoa Street, Santa Fe, Oakland       1015
## 4          98  500 10th Street, Downtown Oakland, Oakland        826
## 5          99 2400 19th Avenue, Highland Terrace, Oakland       1166
## 6         151  500 20th Street, Downtown Oakland, Oakland        908
##                         geometry
## 1 MULTIPOLYGON (((-122.304 37...
## 2 MULTIPOLYGON (((-122.1725 3...
## 3 MULTIPOLYGON (((-122.2779 3...
## 4 MULTIPOLYGON (((-122.2796 3...
## 5 MULTIPOLYGON (((-122.238 37...
## 6 MULTIPOLYGON (((-122.276 37...

Oakland points of interest. Source: uber.com/local

pois <- httr::GET('https://gist.githubusercontent.com/nrabinowitz/d3a5ca3e3e40727595dd137b65058c76/raw/ded89c2acef426fe3ee59b05096ed1baecf02090/oakland-poi.json') %>%
  httr::content()  %>%
  fromJSON() %>%
  sf::st_as_sf(., coords = c('lng', 'lat'), crs = 4326) %>%
  dplyr::filter(type %in% c('Cafes', 'Places to eat', 'Restaurant'))

head(pois)
## Simple feature collection with 6 features and 1 field
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -122.2763 ymin: 37.77091 xmax: -122.211 ymax: 37.83476
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##         type                   geometry
## 1 Restaurant  POINT (-122.2567 37.8281)
## 2 Restaurant POINT (-122.2632 37.83476)
## 3 Restaurant POINT (-122.2705 37.80706)
## 4 Restaurant  POINT (-122.211 37.77091)
## 5 Restaurant POINT (-122.2763 37.79486)
## 6 Restaurant POINT (-122.2437 37.81062)

That’s a lot of messy data:

cropper <- st_bbox(c('xmin' = -122.35, 'ymin' = 37.75,
                     'xmax' = -122.20, 'ymax' = 37.85), crs = st_crs(4326))
plot_these <- lapply(list(sf_travel_times, crime_90_days, public_schools,
                          bart_stations, pois), function(x) {
                            sf::st_crop(x, cropper)
                            })
ggplot() +
  layer_spatial(plot_these[[1]], aes(fill = travelTime), alpha = 0.9,
                show.legend = FALSE) +
  scale_fill_viridis_c() +
  layer_spatial(plot_these[[2]], col = 'red',     pch = 20) +
  layer_spatial(plot_these[[3]], col = 'cyan',    pch = 19) +
  layer_spatial(plot_these[[4]], col = 'yellow',  pch = 18, size = 2) + 
  layer_spatial(plot_these[[5]], col = 'magenta', pch = 17) +
  theme_minimal() +
  ggtitle('H3 Suitability analysis', subtitle = 'Unprocessed input data')

plot of chunk plot1

Time to make some sense of it.

We’ll take each of the raw data layers we’re bringing in and convert them to hexagon layers. Each layer will be a map of H3 index to some value normalised between 0 and 1 with this helper function:

normalise_layer <- function(layer = NULL, b0 = FALSE) {
  dplyr::filter(layer, !is.na(h3)) %>%
    dplyr::group_by(h3) %>%
    dplyr::summarise('weight' = sum(weight, na.rm = TRUE)) %>%
    dplyr::mutate(norm = if (b0) {
      scales::rescale(weight, to = c(0, 1), from =  c(0, max(weight, na.rm = TRUE)))
      } else { scales::rescale(weight, to = c(0, 1)) }) # from = range(x)
}

Analysis is being conducted at four of H3’s 16 resolution levels (7 to 10), so each data layer must be binned and normalised four times. This is where purrr can be handy.

For crime, the output layer is just a normalised count of incidents per hex.

crime_hexes <- point_to_h3(crime_90_days, seq(7, 10)) %>%
  purrr::map(., function(h3) {
   dat <- data.frame('h3' = h3, 'weight' = 1L, 
                     stringsAsFactors = FALSE) %>%
     normalise_layer()
  })
head(crime_hexes[['h3_resolution_7']])
## # A tibble: 6 x 3
##   h3              weight     norm
##   <chr>            <int>    <dbl>
## 1 872830802ffffff      5 0.000486
## 2 872830810ffffff   2060 1       
## 3 872830811ffffff    311 0.149   
## 4 872830812ffffff    575 0.278   
## 5 872830813ffffff    535 0.258   
## 6 872830814ffffff    145 0.0686

For schools, there’s a bit of buffering added so that addresses adjacent to those containing a school are given some weight.

school_hexes <- point_to_h3(public_schools, seq(7, 10)) %>%
  purrr::map(., function(h3) {
    # returns 7 addresses - input and neighbours 
    near_school <- get_kring(h3, 1) 
    near_wts <- c(1, rep(0.5, 6)) # surrounds are worth half
    # combine and normalise
    dat <- purrr::map_dfr(near_school, function(h3) {
      data.frame('h3' = h3, 'weight' = near_wts, stringsAsFactors = FALSE)
    }) %>%
      normalise_layer()
  })
head(school_hexes[[1]])
## # A tibble: 6 x 3
##   h3              weight   norm
##   <chr>            <dbl>  <dbl>
## 1 872830802ffffff    2.5 0.0556
## 2 872830806ffffff    1   0.0139
## 3 872830810ffffff   25.5 0.694 
## 4 872830811ffffff   14.5 0.389 
## 5 872830812ffffff   23.5 0.639 
## 6 872830813ffffff   22.5 0.611

For BART stations, the buffering is more sophisticated, with a smooth decay function implemented. This does a better job of preserving the area of influence around a station across different H3 resolutions.

km_to_radius <- function(km, res) {
  floor(km / res_length(res, units = 'km'))
}

bart_hexes <- point_to_h3(bart_stations, seq(7, 10)) %>%
  purrr::map2(., seq(7,10), function(h3, res) {
    d <- km_to_radius(1, res)
    near_bart <- get_kring_list(sort(unique(h3)), d)
    # weights are the same for every feature so just make a template
    near_wts <- purrr::map(near_bart[1], function(feature) {
      purrr::map2(feature, seq_along(feature), function(ring, step) {
        wt <- 1 - step * 1 / (d + 1)
        rep(wt, length(ring))
      })
    }) %>% unlist()
    purrr::map(near_bart, unlist) %>%
      purrr::map_dfr(., function(x) {
        data.frame('h3' = x, 'weight' = near_wts, stringsAsFactors = FALSE)
        }) %>%
      normalise_layer() 
      })
head(bart_hexes[[1]])
## # A tibble: 6 x 3
##   h3              weight  norm
##   <chr>            <dbl> <dbl>
## 1 872830810ffffff      0   0.5
## 2 872830813ffffff      0   0.5
## 3 872830815ffffff      0   0.5
## 4 872830828ffffff      0   0.5
## 5 87283082affffff      0   0.5
## 6 87283082cffffff      0   0.5

For travel time, we find all of the intersecting H3 addresses for each polygon in sf_travel_times before assigning them weights based on the travelTime attribute. The centre of a given address must fall within the polygon for it to be returned. Weights are inverse travel times, as lower travel times are better.

travel_hexes <- purrr::map(seq(7, 10), function(res) {
  dat <- polyfill(sf_travel_times, res, simple = FALSE) 
  dat <- purrr::map2_dfr(as.list(dat$travelTime), 
                         dat$h3_polyfillers, 
                         function(x, y) {
                           data.frame('h3' = y, 'weight' = 1/x)
    }) %>%
    normalise_layer(., TRUE)
})
head(travel_hexes[[1]])
## # A tibble: 6 x 3
##   h3               weight  norm
##   <chr>             <dbl> <dbl>
## 1 872830802ffffff 0.00151 0.802
## 2 872830810ffffff 0.00110 0.586
## 3 872830811ffffff 0.00188 1    
## 4 872830812ffffff 0.00127 0.673
## 5 872830813ffffff 0.00137 0.727
## 6 872830815ffffff 0.00129 0.686

Lastly, points of interest are tallied, just as crime was.

food_hexes <- point_to_h3(pois, seq(7, 10)) %>%
  purrr::map(., function(h3) {
   dat <- data.frame('h3' = h3, 'weight' = 1L, 
                     stringsAsFactors = FALSE) %>%
     normalise_layer()
  })
head(food_hexes[[1]])
## # A tibble: 6 x 3
##   h3              weight   norm
##   <chr>            <int>  <dbl>
## 1 872830810ffffff    851 1     
## 2 872830811ffffff     16 0.0176
## 3 872830812ffffff    407 0.478 
## 4 872830813ffffff    331 0.388 
## 5 872830814ffffff    173 0.202 
## 6 872830816ffffff     98 0.114

Phew! Now we can plug the data together to get some overall weights. For each resolution, the following code adds up the normalised weights of all the ‘good’ layers, and then subtracts crime.

# arrange data by resolution instead of theme
datnm <- c('crime', 'school', 'bart', 'travel', 'food')
dat <- list(crime_hexes, school_hexes, bart_hexes,
                  travel_hexes, food_hexes) %>%
  purrr::transpose() %>%
  purrr::map(., setNames, datnm)

final_surfaces <- purrr::map(dat, function(res) {
  # rename cols for nicer joins
  purrr::map2(res, datnm, function(x,y) {
    dplyr::rename_at(x, vars(norm), funs(paste0('n_', y))) %>%
      dplyr::select(-weight)
  }) %>%
    # condense inputs
    purrr::reduce(., full_join, by = 'h3') %>%
    replace(is.na(.), 0) %>%
    rowwise() %>%
    dplyr::mutate(weight = 
                    sum(c(n_school, n_bart, n_travel, n_food)) - n_crime) %>%
    ungroup() %>%
    dplyr::select(h3, weight) %>%
    normalise_layer() %>%
    h3_to_polygon(., simple = FALSE)
})
head(final_surfaces[[2]])
## Simple feature collection with 6 features and 4 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -122.3869 ymin: 37.79247 xmax: -122.3053 ymax: 37.8162
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##      weight       norm      h3_address h3_resolution
## 1 0.0000000 0.02495816 882830801bfffff             8
## 2 0.8024133 0.45593409 8828308021fffff             8
## 3 0.8024133 0.45593409 8828308023fffff             8
## 4 0.8024133 0.45593409 8828308025fffff             8
## 5 1.0000000 0.56205787 8828308027fffff             8
## 6 0.8024133 0.45593409 882830802bfffff             8
##                         geometry
## 1 POLYGON ((-122.3796 37.7924...
## 2 POLYGON ((-122.3202 37.7985...
## 3 POLYGON ((-122.3276 37.8044...
## 4 POLYGON ((-122.3098 37.8009...
## 5 POLYGON ((-122.3172 37.8068...
## 6 POLYGON ((-122.3306 37.7961...

All that and I still can’t give you a cool interactive map like the source Observable notebook, but my plot below matches up nicely. Winner!

cropper <- st_bbox(c('xmin' = -122.35, 'ymin' = 37.75,
                     'xmax' = -122.20, 'ymax' = 37.85), crs = st_crs(4326))
plot_that <- st_crop(do.call(rbind, final_surfaces), cropper)

ggplot() +
  layer_spatial(plot_that, aes(fill = norm), alpha = 0.9, col = NA) +
  facet_wrap(. ~ h3_resolution)  +
  scale_fill_gradientn(colors = c('#ffffD9', '#50BAC3', '#1A468A')) +
  scale_x_continuous(breaks = c(-122.32, -122.27, -122.22)) +
  ggtitle('H3 Suitability Analysis', 
          subtitle = 'Liveability at four resolutions') +
  labs(fill = 'Weight') +
  theme_minimal()

plot of chunk plot

It’s nice to confirm that h3jsr does what it oughta, but my major take-home from this is that I really need to learn me some geospatial JS. The code is much more concise than the R equivalent. In fact, the above was worse before I took the time to revamp a couple of my wrapper functions. The JS is also very fast, and the interactive maps are a real winner. I know I could probably do something here with mapview or better yet, mapdeck and Shiny, but that would be even more code (not to mention getting it to work properly on my current blog setup, which I’m not sure is possible…). So I’m very impressed.

Anyway, there it is, feel free to experiment with the package, and I welcome PRs that will speed it up or improve usability. It’s been a worthwhile learning exercise, and I’m using h3jsr fairly often at work. I’m pretty sure that an Rcpp-powered version of this that hooks into original-flavour H3 will be much, much faster, though, so I don’t think this package will go to CRAN.


So I’ve been going to this boxing/HIIT style gym for a while, which is fantastic because it turns out I love punching things and suffering. I almost have abs now! The gym is quite big on using personal analytics to track progress, so they offer the MyZone brand of fitness tracker, which is centred around the use of a chest strap. The strap monitors heart rate and uses it to calculate ‘MyZone Effort Points’ (MEPS). More suffering == more points, and there’s a big TV in the gym where MyZone users’ scores are visible for comparison. I’m well aware that the only person I’m really competing with is me-last-week, but I’m still curious about what kind of scores I get during my workouts.

I’ve worn a Fitbit for several years, is the thing, and I’m both averse to change and an inveterate cheapskate, so I didn’t really want to switch brands (or worse, wear both). So, how can I compare the data FitBit records for me when I’m working out to the data recorded by MyZone?

R, of course. And the magic of APIs. Read on for how.

First, I needed to see if I could access minute-by-minute heart rate data from FitBit, because I couldn’t calculate MEPS without that. I’m using a Charge 2 at present, and the FitBit mobile app provides graphs of heart-rate data during activities that are clearly high-resolution. That data’s gotta be somewhere and I wants it *grabby hands*

There are a few blog posts around about accessing FitBit data in R, but they seem to date quickly - access methods have not been very stable over time. Same goes for Stack Overflow and Fitbit’s own forums! There was also a package or two floating about, but they appear defunct. Distilling a few posts like this one and this one together with the official documentation got me where I needed to be, so current to July 2018, my procedure to get at my own intraday heart rate data is this:

  • Go to https://dev.fitbit.com/ and click on Manage > Register an App.
  • Log in with Fitbit credentials and follow the prompts. Name the ‘app’ something neutral like ‘my-data’, choose app type ‘Personal’ and throw in www.google.com or something similar wherever a URL is asked for (the only important one is the callback URL).

plot of chunk cap1

  • Once the app is created, click on the ‘OAuth 2.0 tutorial page’ link near the bottom of the screen:

plot of chunk cap2

  • Scroll down to the end of section 1 and click on the provided authorisation URL. A new tab will open, showing the FitBit authorisation interface:

plot of chunk cap3

  • I changed the auth period to 1 year so I don’t have to go through this too often, and ticked all options before clicking Allow, because I’ll probably play with the other endpoints at some point.
  • After clicking Allow, the browser tab redirects to the callback URL, but a whole lot of other stuff is now in the URL visible in the address bar. Copy the whole lot and go back to the OAuth 2.0 tutorial page. Paste that URL into the text box under the ‘2. Parse Response’ header. The access token will appear below the text box - it’s a long string of characters.
  • Save that token as an environment variable so it’s not stored unsecured. Clicking around on Windows, this is Control Panel > System > Advanced System Settings > Environment Variables… . I save mine as a User variable called ‘FITB_AUTH’. NB: You’ll then have to close and reopen RStudio if you do all this while it’s running (see the sketch below for an .Renviron alternative).
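
If you’d rather not click around in Control Panel, the same result can be had by dropping a FITB_AUTH=<token> line (no quotes) into your ~/.Renviron file and restarting R; Sys.getenv() then picks it up exactly as it does a Windows User variable:

# usethis::edit_r_environ() will open ~/.Renviron for you if you have usethis installed;
# add the line FITB_AUTH=<your token>, save, and restart R
token <- Sys.getenv('FITB_AUTH')
nchar(token) > 0   # TRUE once the variable has been picked up after a restart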

With all that in place, I can finally write some damn R code.

library(httr)
library(tidyverse)

Before starting to extract data, it’s a good idea to check that one’s token is definitely working. The FitBit API has a “Retrieve State of Tokens” endpoint for this:

check_state <- function(token = NULL) {
  POST(url = 'https://api.fitbit.com/1.1/oauth2/introspect',
       add_headers(Authorization = paste0('Bearer ', token)),
       body = paste0('token=', token),
       content_type('application/x-www-form-urlencoded'))
}

state <- check_state(token = Sys.getenv('FITB_AUTH'))

content(state)$active
## [1] TRUE

My token is active, so I can proceed.

My last workout was Monday night. I used the Charge 2’s manual logging feature, so I know exactly when my workout started and finished. A GET request to retrieve minute-by-minute heart rate data for a set period looks like:

get_workout <- function(date = NULL, start_time = NULL, end_time = NULL, 
                        token = Sys.getenv('FITB_AUTH')) {
  GET(url =
        paste0('https://api.fitbit.com/1/user/-/activities/heart/date/',
               date, '/1d/1min/time/', start_time, '/', end_time, '.json'),
      add_headers(Authorization = paste0("Bearer ", token)))
}

got_workout <- get_workout(date = '2018-07-30', 
                           start_time = '18:47', end_time = '19:28')

The actual data can be seen using content():

workout <- content(got_workout)

It does come in all list-ified, which can be a little tricky to sort out. purrr functions are real lifesavers here. I’m going to turn a couple of sub-lists into data frames below - FitBit’s own workout summary, and the actual heart rate by minute data I’m after:

# json-as-list to dataframe (for simple cases without nesting!)
jsonlist_to_df <- function(data = NULL) {
    purrr::transpose(data) %>%
    purrr::map(., unlist) %>%
    as_tibble(., stringsAsFactors = FALSE)
}

# summary
workout[['activities-heart']][[1]][['heartRateZones']] <- 
  jsonlist_to_df(workout[['activities-heart']][[1]][['heartRateZones']])

# the good stuff
workout[['activities-heart-intraday']][['dataset']] <-
  jsonlist_to_df(workout[['activities-heart-intraday']][['dataset']]) 

# also let's get time formatted properly
workout$`activities-heart-intraday`$dataset$time <- 
  as.POSIXlt(workout$`activities-heart-intraday`$dataset$time, format = '%H:%M:%S')
lubridate::date(workout$`activities-heart-intraday`$dataset$time) <- '2018-07-30'
lubridate::tz(workout$`activities-heart-intraday`$dataset$time) <- 'Australia/Brisbane'

# looks better now:
workout
## $`activities-heart`
## $`activities-heart`[[1]]
## $`activities-heart`[[1]]$customHeartRateZones
## list()
## 
## $`activities-heart`[[1]]$dateTime
## [1] "2018-07-30"
## 
## $`activities-heart`[[1]]$heartRateZones
## # A tibble: 4 x 5
##   caloriesOut   max   min minutes name        
##         <dbl> <int> <int>   <int> <chr>       
## 1        0       93    30       0 Out of Range
## 2       62.1    130    93      11 Fat Burn    
## 3      230.     158   130      30 Cardio      
## 4        9.30   220   158       1 Peak        
## 
## $`activities-heart`[[1]]$value
## [1] "136.67"
## 
## 
## 
## $`activities-heart-intraday`
## $`activities-heart-intraday`$dataset
## # A tibble: 42 x 2
##    time                value
##    <S3: POSIXlt>       <int>
##  1 2018-07-30 18:47:00    99
##  2 2018-07-30 18:48:00   113
##  3 2018-07-30 18:49:00   135
##  4 2018-07-30 18:50:00   146
##  5 2018-07-30 18:51:00   153
##  6 2018-07-30 18:52:00   152
##  7 2018-07-30 18:53:00   155
##  8 2018-07-30 18:54:00   160
##  9 2018-07-30 18:55:00   152
## 10 2018-07-30 18:56:00   151
## # ... with 32 more rows
## 
## $`activities-heart-intraday`$datasetInterval
## [1] 1
## 
## $`activities-heart-intraday`$datasetType
## [1] "minute"

Now, as I mentioned way back up top, MEPS are calculated minute-by-minute as a percentage of max heart rate. The formula used to calculate max heart rate is on the MyZone website:

meps_max <- function(age = NULL) { 207 - (0.7 * age) }

Which makes mine 183. Note that this can be less accurate for some people, but as a depressingly average soul I don’t get to complain. MEPS zones are defined using percentage ranges of max HR (e.g. 3 MEPS/min at 70-79% of max HR); the top zones all score 4 MEPS/min, so I’ve collapsed them into one row and can calculate the appropriate heart-rate ranges for myself like this:

# I <3 tribble
my_MEPS <- tribble(~MEPS, ~hr_range, ~hr_lo, ~hr_hi, 
                       1,  '50-59%',   0.50,   0.59,
                       2,  '60-69%',   0.60,   0.69,
                       3,  '70-79%',   0.70,   0.79,
                       4,    '>=80',   0.80,   1.00) %>%
  mutate(my_hr_low = floor(meps_max(34) * hr_lo),
         my_hr_hi  = ceiling(meps_max(34) * hr_hi))
my_MEPS
## # A tibble: 4 x 6
##    MEPS hr_range hr_lo hr_hi my_hr_low my_hr_hi
##   <dbl> <chr>    <dbl> <dbl>     <dbl>    <dbl>
## 1     1 50-59%     0.5  0.59        91      109
## 2     2 60-69%     0.6  0.69       109      127
## 3     3 70-79%     0.7  0.79       128      145
## 4     4 >=80       0.8  1          146      184

Using that data, calculating total MEPS goes like this:

mutate(workout$`activities-heart-intraday`$dataset,
       meps = case_when(value >= 146 ~ 4,
                        value >= 128 ~ 3,
                        value >= 109 ~ 2,
                        value >= 91  ~ 1,
                        TRUE ~ 0)) %>%
  summarise("Total MEPS" = sum(meps))
## # A tibble: 1 x 1
##   `Total MEPS`
##          <dbl>
## 1          130

Given that the maximum possible MEPS in a 42-minute workout is 168, this isn’t too bad. It’s also fairly consistent with past workouts. I kind of knew I’d done ok from how I had trouble lifting my arms after, but it’s nice to have the numbers to back that up :P


So there it is. Unfortunately this setup still isn’t ideal - httr’s OAuth 2.0 authorisation code flow doesn’t seem to quite work for FitBit and I’m not 100% sure why. The request functions above would be a bit simpler if I could get the ‘oauth dance’ to work, and I could also skip that whole ‘faffing around on dev.fitbit.com’ section. As it stands, I get to the oauth2.0_token() step and then it just sits in the browser while waiting for authentication. I’ve been poking away at this post on and off for literal months trying to crack it, but no dice. So whatever, perfect is the enemy of good and I know as soon as I post this, someone’s going to reply with an amazingly simple solution :P
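
For completeness, this is roughly the shape of the dance I keep failing at. The endpoint URLs are Fitbit’s documented OAuth 2.0 authorise/token addresses; the client key and secret come from the app registered on dev.fitbit.com, and the environment variable names holding them here are placeholders of my own. If you can make this run where I couldn’t, I’d love to know what I’m missing:

library(httr)

fitbit_endpoint <- oauth_endpoint(
  authorize = 'https://www.fitbit.com/oauth2/authorize',
  access    = 'https://api.fitbit.com/oauth2/token')

# client id/secret stored as env vars (variable names are mine, not Fitbit's)
fitbit_app <- oauth_app('my-data',
                        key    = Sys.getenv('FITB_CLIENT_ID'),
                        secret = Sys.getenv('FITB_CLIENT_SECRET'))

# this is the step that just sits in the browser for me
fitbit_token <- oauth2.0_token(fitbit_endpoint, fitbit_app,
                               scope = c('activity', 'heartrate'),
                               use_basic_auth = TRUE)

# if it worked, requests could then be signed with config(token = fitbit_token)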


Welp, I’ve written my first R package so I guess I oughta blog about it. dsmartr is a digital soils mapping package which implements the DSMART algorithm of Odgers et al. (2014). The idea is to take existing polygon-based soils mapping and correlate it with various dense gridded environmental datasets (covariates), allowing production of a more detailed soils map. ‘Soil map disaggregation’ is a common phrase used to describe the process. I’m going to demonstrate the package’s use in a series of future blog posts, but today I want to stay meta and talk about how I got to this point in the first place.

In mid-2016 I was handed a work project that had two goals: 1) Use DSMART to disaggregate a whole lot of 1:100,000 scale soils mapping along the Queensland coast, and 2) use the disaggregated outputs as a basis for mapping soils attributes and constraints to agriculture at a fine scale (30m pixels). If the process worked well enough, it could put property-scale soils mapping in the hands of local landowners without the need for a lot of extra field and lab-work. If the process didn’t work that well, it would be a clear demonstration that we can only push legacy soils data so far.

I came into this with fairly minimal Python and R skills - I’d done some 101-level courses and mucked around with basic scripting, and I had a strong background in point-and-click GIS software (oh, and soil science, of course!). That alone might have got me through to some acceptable products without too much fuss, but I’d been reading a lot about reproducible workflows and had a bee in my bonnet about making the project fully open-source and replicable. Point-and-click was out of the question, I had to up my scripting game.

My learning from this point on was frustration-driven and somewhat chaotic, but since it gave me a lot of small, clear goals to work on, I think it was actually a good way to learn deeply. A lot of my initial upskilling was centred around input data preparation, which gave me a solid grounding in the R packages responsible for spatial data and database interaction. Tasks included:

  • Connecting to our Oracle-based soils database to extract both spatial and attribute data (skills: prying non-standard connection details out of database admin staff (‘…Why can’t you just use Discoverer?’ ‘SO. MANY. REASONS.’); using DBI; writing SQL in R scripts)
  • Cleaning the input map data and arranging it in a format suitable for disaggregation (skills: cleaning spatial data; combining multiple adjacent soils map projects into one coherent dataset; rearranging and reformatting attributes)
  • Finding and downloading environmental covariates from internal and external repositories (skills: finding out about covariates from the literature, conversations with colleagues, strategic googling; using GDAL to extract portions of large (e.g. Australia-sized) rasters stored in online repositories; appeasing my &*%^ing workplace firewall)
  • Creating new covariates using some of the downloaded data and a wide range of existing algorithms (skills: Using open source GIS tools like GRASS and SAGA via R with raster, rgrass7, and good ol’ base::system2(); picking up enough OS knowledge to get all those programs to talk to each other).

Other skills I picked up in this stage related to using RStudio to properly document my work. I started using R projects to manage each sub-area of my project, as the data had to be processed in geographically distinct sections. I moved from .R scripts to .Rmds, generating low-level reporting for each processing step I took from raw data to final product. I also started adding tables and graphs to my Rmd reports with packages like DT, ggplot2, and mapview. The process of learning to write these reports was incredibly valuable as it forced me to work out a solid structure for my data analysis process - one that worked both for me and for the less-technical people who were funding my project and supervising my work.

As I was developing these processes, I started to look more closely at the DSMART code I was using. The DSMART algorithm was originally implemented in Python, and later ported to R (non-CRAN). I had started my project using the Python version because ‘Everybody Says Python Is Better’. In this case… nope. I switched to R after realising the Py code couldn’t handle the data volume I was throwing at it. Also, I’d spent a couple of weeks battling obscure dependencies and compatibility issues that had arisen as the package aged, and I was heartily sick of it. I could have persisted trying to learn Py for this work, but R was just far more accessible. Better docs, more tutorials, and easier setup. CRAN really is amazing.

Anyway, the existing R version was more stable, if slightly slower, but still didn’t quite meet my needs…so I started tinkering, and things kind of snowballed from there. The R package I was using had some bugs and RAM-consumption issues that needed attention. I started small, fixing the bugs and tweaking some output files, and before long I had alternate versions of the main package functions that could handle much bigger input datasets without falling over. This was a huge confidence booster, so when the sf package was released as the successor to sp, I felt capable of modifying the code to take advantage of the newer package. This led to substantial speed gains and more readable code, although unfortunately I couldn’t drop sp completely as raster is not sf-compatible. At around this point (late 2016), I was achieving good enough outputs with my pilot study area to present my work at the joint Aus/NZ soils conference in Christchurch.

After that I decided I should go ahead and package my functions rather than relying on standalone scripts. Colleagues in my office wanted to use my scripts, and getting them to run on multiple machines without my supervision was difficult. After having spent all that time learning to code and applying my fancy new skills, I was having to rush through disaggregating the rest of the soils maps on my list, and I still had to get a good process for attribute mapping off the ground. I didn’t need distractions! I also wanted a packaged version of my code to make it easy to demonstrate exactly which code I’d used for my project, and so I could cite it clearly.

Starting the packaging process was very easy with the help of GitHub and R’s packaging tools. Quiet shout-out to the GitHub for Windows desktop app, incidentally. Getting version control set up was the easy part, though - going from scripts to packaged functions involved a lot more code modification than I expected.

My functions used a lot of tidyverse-style piped code which doesn’t really work inside packages without substantial modification, and I’d gotten a bit caught up using fancy new functions where base R was perfectly fine. I love tidyverse for data prep and display, but I don’t fully grok programming tidyverse-style yet and I confess I’m not in a huge hurry to learn.
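
To illustrate the kind of modification I mean (a toy example, not actual dsmartr code): bare column names in piped dplyr code trigger ‘no visible binding for global variable’ notes from R CMD check, so inside a package you either fall back to base subsetting or make the data-masking explicit with the .data pronoun.

library(dplyr)

dat <- data.frame(id = 1:5, area = c(0, 2.5, 1.1, 0, 3.8))

# script style: fine interactively, noisy under R CMD check
script_style <- dat %>% filter(area > 0) %>% select(id, area)

# package-friendly options
base_style <- dat[dat$area > 0, c('id', 'area')]
tidy_style <- dat %>% filter(.data$area > 0) %>% select(.data$id, .data$area)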

I then learnt how to document my R code properly using roxygen2 and actually spent a lot of time on that - I’ve often been frustrated by other packages’ docs and couldn’t bear to be a hypocrite on that front. I also added quite a few functions to handle data preparation and post-processing evaluation. Draft version 0.0.0.9013 wound up being the version I used to finalise my disaggregation project in early November 2017, as I was over deadline and also quite burnt out after 18 months of very steep learning curve.

That’s probably the downside of learning the way I did - I got into this spiral of learning some new trick, immediately having to try it out, and then if it worked, having to re-run my code for half a dozen sub-project-areas so that all the outputs were consistent. This was time-consuming and stressful even though it did lead to stronger products, and it was difficult for me to draw a line and say ‘enough’. Eventually I admitted how exhausted I was getting and forced myself to wrap things up. At that point my outputs weren’t going to get any better, and I was basically running on Berocca and spite. D-minus work/life balance management, would not recommend.

I’ve since recharged enough to progress the package to something I’m content to publicise - the last task was to unit test as much as possible. The main functions are now just wrappers that handle things like on-disk file creation and parallel processing, whereas before they contained a lot of sub-functions that did the actual work of disaggregation. The wrappers still make life a lot easier for the end user, but the important parts of the process like polygon sampling are now separated, documented and covered by unit tests so it’s clear to everyone that they do what they ought to. I’m not actually sure how best to extend unit test coverage beyond where it is now (ideas welcome!), but I’m confident that the package works as it should.

Learning resources

Throughout this project I’ve been very reliant on free online learning resources, as not many people in my workplace do any kind of programming, and there was no funding for formal training. Resources I found invaluable include:

  • Twitter: Twitter is amazing for #rstats. Following the right people and hashtags kept me up to date with everything from Big Stuff like new R packages to useful little tips like ‘use file.path() instead of paste0()’. The only downside is managing the onslaught of new ideas; this is a very fast-moving space.
  • rweekly.org has become a fantastic digest of all things #rstats.
  • GIS-SE and Stack Overflow - I know SE sites have an intimidating reputation, but it’s still worth wading in, you just have to pay attention to the social norms there. They do exist for a reason, which others have mentioned but bears repeating: I solved at least half a dozen major problems I was having without even asking a question. The act of trying to formulate an acceptable question that wouldn’t get locked led me straight to the solution. Better than a rubber duck. I’d like to think I’ve given back a bit, too.
  • Edzer Pebesma’s vignettes for sf, and his blog posts on r-spatial were invaluable for the move from sp to sf, and to gaining a deeper understanding of how spatial data stuff really works. This is something that a lot of point-and-click GIS users don’t even realise they don’t understand - I’ve had some uncomfortable Dunning-Kruger moments this past year, but I’m better off for it.
  • GitHub issues pages - if your problem is not on SE, it’s probably here.
  • Hadley Wickham’s online books ‘R for Data Science’ and ‘R packages’ (especially the latter). I know, me and everyone else, but they’re popular for a reason.
  • Yihui Xie’s knitr documentation: https://yihui.name/knitr/options/#chunk_options may as well have been my homepage for while there.
  • Kieran Healy’s ‘Data Visualisation: A practical introduction’. This is just so well-written, it doesn’t matter that it’s not aimed at my field. It works well as a general intro to R and to best-practice data visualisation, as well as providing specific coding instruction.

Honourable mention for the upcoming ‘Geocomputation with R’ ebook, which came along a little too late for me, but is worth a read for anyone new to this space. It’ll save you a lot of time. The RStudio community forums also launched recently and they’re pretty cool.

Where to from here?

  • At some point I should consider a CRAN submission, but that can wait until the various versions of R-based DSMART code have been reconciled. I don’t think anyone wants multiple versions of the same idea on CRAN, and I’ve had a few idle chats with Nathan Odgers and Brendan Malone about combining our efforts. Herding soil scientists is worse than herding cats though, so don’t hold your breath :P
  • Oh gosh stars is coming! I might be able to move away from sp/raster later this year, which will be nice as dsmartr’s dependency list is quite long. Not going to bother until stars is on CRAN, though.
  • Train myself to type ‘dsmartr’ correctly the first time instead of having to constantly change it from dsamrtr *sigh*

Ok, I’ve rambled enough. Next time, how to DSMART.

My current project is about to produce a Giant Heap of data for end users to play with, and I’m concerned that it might be a bit overwhelming to digest. Even I’m having trouble trawling through it all to make sure everything is correct. A web app that allows the user to drill into that heap and just pull out what they need may be necessary…better learn how to build one, I guess!

I’ve done just about everything else for the project in R, so I figured I’d maintain consistency and learn Shiny. As a bit of a ‘Hello World’ project, I decided to try and replicate a small standalone app used by soil scientists to pre-process soil laboratory data.

Soil lab data is collected on a sample basis: you dig your hole, you grab ~200-500g of soil within a set of given depth ranges, you bag the samples up, and send them to the lab. Budget and time constraints generally mean that you don’t get to sample every depth interval in a profile, so you must attempt to pick representative depth ranges. Best practice is one sample per horizon and/or one every half a metre or so, if the horizon is thick. It’s also good to grab one at the surface, and one at top of the B horizon, as the most interesting things tend to happen there (and as a result, data from those parts of the profile are often used in classification systems).

The result is a huge store of soil data that only ‘exists’ for part of each profile. I might have pH values for 0-10cm, 20-30, 50-60, 80-90, and 110-120, but I only have data for those depth slices. This makes it difficult to compare profiles from different locations, and it makes environmental modelling almost impossible.

The standard solution is to use a mass-preserving spline to interpolate between the available data, and produce estimates of mean values for continuous depth sections down the profile. The idea entered the scientific literature with Bishop, McBratney and Laslett’s 1999 paper Modelling soil attribute depth functions with equal-area quadratic smoothing splines and became standard practice fairly quickly. In the mid-2000s the CSIRO-funded Australian Collaborative Land Evaluation Program (ACLEP) team released a standalone app to do the job, I suspect in response to too many homebrew implementations floating around. The app made certain that everyone doing splining would get the same results from a given dataset, and this was a big deal as the drive was on to produce unified national datasets like the Australian Soil Resource Information System (ASRIS) and, later, the Soil and Landscape Grid of Australia.

screencap of spline tool
Spline Tool v2.0 in action

The standalone app is still available from the ASRIS website, but ACLEP and ASRIS are sadly underloved these days and I don’t know how much longer they’ll be around. The app itself hasn’t been updated since ~2012 - the authors may have jinxed themselves by promising regular updates in the metadata :P.

Luckily, the core functionality of SplineTool has been replicated in R, with GSIF::mpspline. That means all I had to do was wrap that function up in a web-app interface that mimics the existing tool. ‘Hello World’, indeed.
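
If you haven’t used it, the underlying call is pleasingly small. A rough sketch follows - the column names are mine, and mpspline’s argument and output element names are as I remember GSIF’s docs, so treat this as indicative rather than gospel: build an aqp SoilProfileCollection from horizon-level lab data, then hand it to mpspline, which returns means over the standard GlobalSoilMap depth intervals alongside 1cm-resolution estimates.

library(aqp)
library(GSIF)

# one profile's worth of sparse pH data
horizons <- data.frame(
  SID   = 'site_01',
  upper = c(0, 20, 50, 80, 110),
  lower = c(10, 30, 60, 90, 120),
  pH    = c(5.6, 5.9, 6.4, 7.1, 7.8))

# promote to a SoilProfileCollection, then fit the mass-preserving spline
depths(horizons) <- SID ~ upper + lower
splined <- mpspline(horizons, var.name = 'pH')

# splined$var.std should hold means over 0-5, 5-15, 15-30, 30-60, 60-100 and 100-200 cm
splined$var.std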

The webapp is now online at https://obrl-soil.shinyapps.io/splineapp/, so check it out and let me know what you think. Hopefully it’s of use to people who can’t run the existing app, or don’t want to learn R just to get this one task done. It has all the original app features, except for RMSE and ‘singles reports’, which mpspline doesn’t produce. To make up for it, you can view outputs as well as inputs by site, save plots, and either download .csv outputs or an .rds containing the complete object output by mpspline.

screencap of splineapp
So fresh so clean

Read on if you care about how I got it working…

Process

I allowed myself a week to do this, and spent… probably a solid 24 hours of that on the app, mostly because I have no self control. At least half of that was dicking around with the UI styling, I must admit, but there was still a fairly steep learning curve to negotiate.

I went into this with intermediate R skills and pretty basic html/css - I’d played around with making websites as a teenager mumble years ago, and then did the first few modules of freecodecamp’s course back in March before getting distracted and wandering off. The basic knowledge of Bootstrap I picked up there really helped, though.

The official documentation and tutorials for Shiny are very good, so just working through them step by step got me most of the way there. For the rest, Stack Overflow generally came to the rescue. This question about users adding to a list of values helped me implement custom output depth ranges, and this one got me a ‘save plots’ option, which the original app didn’t have.
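
In case it helps anyone else, the ‘save plots’ mechanic reduces to a downloadHandler wrapped around ggsave. A minimal self-contained toy, assuming nothing about splineapp beyond ‘there is a ggplot to hand back’:

library(shiny)
library(ggplot2)

ui <- fluidPage(
  plotOutput('spline_plot'),
  downloadButton('save_plot', 'Save plot'))

server <- function(input, output) {
  # stand-in for whatever reactive builds the real plot
  the_plot <- reactive({
    ggplot(data.frame(x = 1:10, y = cumsum(rnorm(10))), aes(x, y)) + geom_line()
  })
  output$spline_plot <- renderPlot(the_plot())
  output$save_plot <- downloadHandler(
    filename = function() paste0('plot_', Sys.Date(), '.png'),
    content  = function(file) {
      ggsave(file, plot = the_plot(), width = 8, height = 6, dpi = 300)
    })
}

# shinyApp(ui, server)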

There’s still a few things I couldn’t manage to crack, notably the ability to handle more flexible inputs. I wanted to be able to get the user to identify the input columns appropriately, rather than relying on a strictly formatted input dataset. Being able to upload a file with multiple attribute columns and then pick which to spline would have been nice. Oh well, there’s always version 2.0… jinx!

The source code is on my GitHub; if you have any ideas for improvement I’d love to hear them.