In my last post, I put together a workflow for polygonising categorical raster data that gives nice-looking outputs. As mentioned though, they still need a bit of tidying-up before they can be used.

There are actually two problems with the data I generated. Most importantly, they’re incorrectly defined once exported from GRASS - while not corrupt, they aren’t valid under the Simple Features standard. This won’t mess with data display, but can potentially affect further analyses. The other issue is that the r.to.vect algorithm doesn’t give the most efficient outputs - it creates a lot of redundant vertices, by which I mean vertices that don’t significantly change the bearing of a polygon ring segment. These can slow down subsequent analysis, but more importantly they’re taking up disk space where I could be storing hilarious cat gifs. The internet can be…somewhat unhelpful about these issues, so let’s work through them in turn.

library(sp)
library(sf)
library(tidyverse)
library(ggspatial)
library(gridExtra)
library(emo)

options(stringsAsFactors = FALSE)

Correcting polygons

Google ‘fix polygon topology’ or similar, and you’ll get a whole lot about managing inter-polygon issues (e.g. unwanted gaps and overlaps) but not much about intra-polygon problems. It’s a tricky topic given that not all file formats follow the same standard, so maybe that’s fair enough. A great place to start is Edzer Pebesma’s blog post Tidying feature geometries with sf from earlier this year, plus all the linked materials. Unfortunately the lwgeom topology library doesn’t accompany sf on Windows yet, so I have to look elsewhere for help.

Lucky for me, some very clever folks from the 3D geoinformation research group at TU Delft have come up with a little command-line app that’ll correctly rebuild just about any polygon, no matter how badly borked. It’s called ‘prepair’. Click through to github and go through the readme and linked documents to find out how it works.

Installation is easy: just grab the appropriate binary off github and unzip it somewhere sensible. I’m going to make a really horrible corrupt polygon and use it to demo the app and why it’s important. R is a really easy place to make terrible spatial data:

bad_poly <- matrix(c(0.5, 0.5,
                     0.4, 5.2,
                     9.8, 1.1,
                     7.6, 8.5,
                     9.5, 8.0,
                     9.4, 6.1,
                     7.4, 0.7,
                     0.5, 0.5), ncol = 2, byrow = TRUE) %>%
  list() %>%
  st_polygon() %>%
  st_sfc() %>%
  st_sf()

ggplot() + 
  geom_spatial(data = as(bad_poly, 'Spatial'), alpha = 0.5) + 
  ggtitle("Seems legit", subtitle = '...but is it?') +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank()) +
  coord_equal()

plot of chunk badpoly

*shrugs* One polygon, looks like three triangles. What could go wrong? Let’s buffer it and see what happens.

ggplot() + 
  geom_spatial(data = as(bad_poly, 'Spatial'), alpha = 0.5) +
  geom_spatial(data = as(st_buffer(bad_poly, dist = 0.5), 'Spatial'),
               fill = '#277d8e', alpha = 0.3) +
  ggtitle('Oh heck', subtitle = 'oh dang') +
  theme(axis.title.x=element_blank(), axis.title.y=element_blank()) +
  coord_equal()

plot of chunk bpbuff

Well that didn’t work. Only part of the polygon got buffered! The middle bit is being ignored. Why is that… let’s use a little ggplot magic to break it down:

bps <- st_geometry(bad_poly) %>%
  unlist(., recursive = FALSE) %>%
  as.data.frame() %>%
  mutate(X1END = lead(X1), X2END = lead(X2))

ggplot() +
  geom_segment(data = bps, aes(x = X1, y = X2, xend = X1END, yend = X2END),
               arrow =  arrow(angle = 15, length = (unit(0.15, "inches")), type = 'closed'),
               na.rm = TRUE, col = 'grey35') + 
  geom_point(data = bps, aes(x = X1, y = X2), col = 'darkred') +
  geom_point(data = bps[1, ], aes(x = X1, y = X2), col = 'red', size = 2) +
  ggtitle('Bad Polygon', subtitle = '...exposed!!') +
  theme(axis.title.x = element_blank(), axis.title.y = element_blank()) +
  coord_equal()

plot of chunk bp_ggseg

See the missing dots where the lines cross? A valid geometry would (at least) have extra vertices there. The buffer algo doesn’t quite know what to do without them.
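
You don’t need a plot to catch this, either - sf’s GEOS bindings can report the problem directly (no lwgeom required), along these lines:

# GEOS can already tell us this geometry is broken, and roughly where
st_is_valid(bad_poly)
st_is_valid(bad_poly, reason = TRUE)
# expect FALSE, and a reason along the lines of "Self-intersection[...]"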

Funnily enough the first cleanup trick used by many spatial data practitioners is to buffer a polygon by distance = 0. The buffering algorithm breaks down and rebuilds polygons as a side effect, solving a lot of minor issues without much computational overhead. In this case though, it can’t catch what’s wrong with this polygon. If we look at the geometry before buffering:

st_as_text(st_geometry(bad_poly))
## [1] "POLYGON ((0.5 0.5, 0.4 5.2, 9.8 1.1, 7.6 8.5, 9.5 8, 9.4 6.1, 7.4 0.7, 0.5 0.5))"

and after:

st_as_text(st_geometry(st_buffer(bad_poly, dist = 0L)))
## [1] "MULTIPOLYGON (((0.5 0.5, 0.4 5.2, 7.86132971506106 1.94559023066486, 7.4 0.7, 0.5 0.5)), ((8.79730134932534 4.47271364317841, 7.6 8.5, 9.5 8, 9.4 6.1, 8.79730134932534 4.47271364317841)))"

we can see that st_buffer found the coordinates of those two ‘self-intersections’ and used them to rebuild the polygon in the best way it can (also by casting it as a multipolygon, which is the other problem with our Bad Polygon - it’s the wrong geometry type for a shape that complex). This means ignoring that middle triangle though, which is not what we want.

Incidentally, the area of the unbuffered polygon is 22.795, which is smaller than the 0-buffered one (25.6403538). This means both area calculations are wrong. The latter area must be from both of the multipolygon sub-areas seen above, so the former area must be only the bottom-left sub-area. Double dang!
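
For the record, those numbers come straight from st_area (the toy polygon has no CRS, so they’re plain map units rather than m²):

st_area(bad_poly)                        # 22.795 - the bottom-left sub-area only
st_area(st_buffer(bad_poly, dist = 0L))  # 25.64  - both multipolygon sub-areas
# neither matches the ~28.5 units the full shape should cover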

Prepair does a much better job:

bad_wkt <- st_as_text(st_geometry(bad_poly))

good_wkt <- system2('C:/prepair_win64/prepair.exe', 
                     args = c('--wkt', paste0('"', bad_wkt, '"')), 
                     stdout = TRUE)

What does Bad Polygon look like now?

good_wkt
## [1] "MULTIPOLYGON (((0.4 5.2,0.5 0.5,7.4 0.7,7.861329715061059 1.945590230664858,0.4 5.2)),((7.6 8.5,8.797301349325338 4.472713643178411,9.4 6.1,9.5 8.0,7.6 8.5)),((7.861329715061059 1.945590230664858,9.8 1.1,8.797301349325338 4.472713643178411,7.861329715061059 1.945590230664858)))"

Nice, this time the output is a multipolygon with the three components I expect.

plot of chunk resf

Last check, the area is now 28.4857075, as it should be.

So that’s cool! Looking back at the dataset I created in my last post though, QGIS says I have over 70 invalid geometries to repair, and prepair is a one-poly-at-a-time app. There is a version called pprepair that operates across whole datasets, but… it’s not available for Windows ;_;

On the bright side, the Bad Polygons produced by polygonising a raster per my last post are actually fixable using buffer-by-0:

cat_v_pretty <- read_sf(file.path(getwd(), 'cat_v_pretty.gpkg')) %>%
  mutate(ID   = 1:nrow(.))

cvp_b0 <- cat_v_pretty %>%
  # buffer expects projected coords
  st_transform(., 28355) %>% 
  st_buffer(., dist = 0L) %>% 
  st_union(., by_feature = TRUE) %>% 
  st_transform(., 4326)

I can demo this with the smallest error, cat_v_pretty[1716, ]:

plot of chunk p1716

Buffer-0 works here because the errors are relatively simple - the geometry may be invalid, but it isn’t corrupt the way my toy example above was. The OGC simple features polygon spec says that the exterior ring is supposed to be defined first, then any subsequent rings are interpreted as holes. The rings should also be stored as a list of matrices, one per ring. My invalid polygons are instead stored as single matrices, and the inner ring vertices may come first. My data plots well only because the sf plot method for polygons uses graphics::polypath, which has rules for figuring out fill areas that don’t consider the above. There’s no guarantee other functions in R (or other GIS software) will interpret these polygons the same way.
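
To make that ring convention concrete, here’s a toy example of a valid donut polygon - the exterior ring comes first and each ring is its own closed matrix:

# exterior ring first, then the hole; each ring is a separate closed matrix
outer_ring <- matrix(c(0, 0,  4, 0,  4, 4,  0, 4,  0, 0), ncol = 2, byrow = TRUE)
hole_ring  <- matrix(c(1, 1,  1, 3,  3, 3,  3, 1,  1, 1), ncol = 2, byrow = TRUE)
donut      <- st_polygon(list(outer_ring, hole_ring))
st_is_valid(st_sfc(donut))  # TRUE
length(donut)               # 2 rings, stored as a list of matrices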

The 0-buffered dataset passes QGIS’ topology checks. There are still the redundant vertices to deal with, though - you can see them on the diagonals above.

‘Lossless’ vertex removal

Lossless, or shape-preserving, simplification means different things to different people, weirdly enough. It’s not always clear what ‘redundant’ vertices are, for one thing, or how much shape change is allowable in a given dataset.

I’m operating in a pretty specific context: my polygons are generated by an algorithm that creates line segments in only 8 possible directions - or it should, but it struggled on the diagonals. The source data is also fine-scaled, so all my polygons are made of very short lines. This means that if I can get the bearing of each polygon segment, I can use it as a filter - and I don’t have to be very precise about that bearing.

Most simplification algorithms use a distance-based method of some kind, and while they’re efficient, they aren’t really capable of distinguishing vertex signal from noise in the way I need. I’m going to have to go it alone.
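
The workhorse for that bearing check is geosphere::bearingRhumb(), which takes lon/lat point pairs and returns a constant compass bearing in degrees - a couple of made-up points near the study area show the idea:

# due east along a parallel comes back as 90 degrees...
geosphere::bearingRhumb(c(145.20, -16.90), c(145.21, -16.90))
# ...and due north as 0
geosphere::bearingRhumb(c(145.20, -16.90), c(145.20, -16.89))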

Below is a function that will work on sfc_POLYGON objects, even the mildly invalid ones shown above:

library(geosphere)

clean_verts <- function(x = NULL, dir_tol = 0) {
  g <- st_geometry(x)
  
  message(paste0(Sys.time(), ': removing redundant vertices...'))
  clean_g <- lapply(g, function(geometry) {
    # get a list of polygon subgeometries, if present (they're x,y matrices)
    lapply(geometry, function(subgeometry) {
      # for each, convert to table of points
      data.frame(subgeometry) %>%
        # turn that into a table of line segments 
        mutate(LX1 = lead(X1), LX2 = lead(X2)) %>%
        filter(!is.na(LX1)) %>% 
        # remove duplicates (0-length segments)
        filter(!(X1 == LX1 & X2 == LX2)) %>%
        rowwise() %>%
        # get the direction of each segment (to the nearest degree)
        mutate(DIR = round(geosphere::bearingRhumb(c(X1, X2), c(LX1, LX2)), dir_tol)) %>%
        ungroup() %>%
        # tag segments that run in the same dir as the one previous
        mutate(DEL = ifelse(DIR == lag(DIR), 1, 0),  
               DEL = ifelse(is.na(DEL), 0, DEL)) %>%
        # and chuck them out
        filter(DEL == 0) %>% 
        # get original points
        select(X1, X2) %>%
        # repeat point 1 at end of table to close ring
        add_row(X1 = head(.$X1, 1), X2 = head(.$X2, 1)) %>%
        as.matrix(., ncol = 2, byrow = TRUE)
    })}) 
  message(paste0(Sys.time(), ': ...complete')) 
  # spatialise cleaned data
  clean_g_sf <- lapply(clean_g, function(geom) {
     st_multipolygon(list(geom)) 
  }) %>%
    st_sfc(crs = st_crs(g)) %>%
    st_sf()
  clean_g_sf
}

# test!
p1716_cv <- clean_verts(cat_v_pretty[1716, ])

Great success, this looks much better:

plot of chunk cv_plot

Wrapping it all up

So I’ve got the components of my cleaning routine sorted, now I just have to plug them together and make them work over a whole dataset. I’m going to run prepair across all my polygons too, just to be thorough:

library(readr)
clean_verts_n <- function(x = NULL, dir_tol = 0) {
  g <- st_geometry(x)
  
  # remove redundant vertices
  message(paste0(Sys.time(), ': removing redundant vertices...'))
  clean_g <- lapply(g, function(geometry) {
    lapply(geometry, function(subgeometry) {
      data.frame(subgeometry) %>%
        mutate(LX1 = lead(X1), LX2 = lead(X2)) %>%
        filter(!is.na(LX1)) %>%
        filter(!(X1 == LX1 & X2 == LX2)) %>%
        rowwise() %>%
        mutate(DIR = round(geosphere::bearingRhumb(c(X1, X2), c(LX1, LX2)), dir_tol)) %>%
        ungroup() %>%
        mutate(DEL = ifelse(DIR == lag(DIR), 1, 0),  
               DEL = ifelse(is.na(DEL), 0, DEL)) %>%
        filter(DEL == 0) %>% 
        select(X1, X2) %>%
        add_row(X1 = head(.$X1, 1), X2 = head(.$X2, 1)) %>%
        as.matrix(., ncol = 2, byrow = TRUE)
    })}) 
  message(paste0(Sys.time(), ': ...complete')) 
  
  # spatialise cleaned data
  clean_g_sf <- lapply(clean_g, function(geom) {
     st_multipolygon(list(geom)) 
  }) %>% 
    st_sfc()
  
  message(paste0(Sys.time(), ': Running prepair...'))
  
  # now fix internal topology with prepair - this returns list of wkts
  pb <- txtProgressBar(min = 0, max = length(clean_g_sf), style = 3)
  pr_g <- lapply(seq_along(clean_g_sf), function(row) {
    gt  <- st_as_text(clean_g_sf[[row]])
    
    # some wkts are too long to send to the command line >.>
    gpr <- if(nchar(gt) < 7500) {
      system2('C:/prepair_win64/prepair.exe', 
              args = c('--wkt', paste0('"', gt, '"')),
              stdout = TRUE)
    } else {
      write_file(gt, file.path(getwd(), 'temp.txt'))
      system2('C:/prepair_win64/prepair.exe', 
              args = c('-f', file.path(getwd(), 'temp.txt')),
              stdout = file.path(getwd(), 'temp_fixed.txt'))
      read_file(file.path(getwd(), 'temp_fixed.txt'))
    }
    setTxtProgressBar(pb, row)
    gpr
  })
  message(paste0(Sys.time(), ': ...complete'))
  close(pb)
  # turn list of wkts back into a geometry column and replace the input
  prsf <- st_as_sfc(pr_g)
  st_geometry(x) <- prsf
  st_crs(x) <- st_crs(g)
  x
}

cvp_cvpr <- clean_verts_n(cat_v_pretty)

This is pretty kludgey and slow, but it gets the job done. It handles around 200 polygons/minute on my test machines. I wanted to be able to run prepair selectively to speed things up, but it does sometimes return vertices with higher precision than the input data, and that can cause topology check failures later.
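
For the record, the selective version I abandoned would have looked something like this sketch - shell out to prepair only where GEOS flags a geometry as invalid:

# sketch only - not used above, because of the coordinate precision caveat
bad_rows  <- which(!st_is_valid(cat_v_pretty))
fixed_wkt <- lapply(bad_rows, function(i) {
  system2('C:/prepair_win64/prepair.exe',
          args = c('--wkt', paste0('"', st_as_text(st_geometry(cat_v_pretty)[[i]]), '"')),
          stdout = TRUE)
})
# long WKTs would still need prepair's -f file option, as in clean_verts_n above,
# and the repaired WKTs could then be swapped back in with st_as_sfc()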

plot of chunk cvp_pr

My final dataset looks identical to the input on-screen, passes its topology checks and is about a third smaller on disk (in GPKG format). I can now be confident that the data I’m generating is as robust and portable as possible.


In response to this R-sig-geo email, here’s a workflow for turning a categorical raster into a polygon layer, with appropriate smoothing. This is one of those tasks I’m surprised isn’t talked about more.

This requires a software environment set up according to my post on getting R and GRASS talking, so do that stuff first.

library(sp)
library(sf)
library(raster)
library(rgdal)
library(rgrass7)
library(tidyverse)
library(rosm)
library(ggspatial)
library(ggsn)
library(viridis)

options(stringsAsFactors = FALSE)

I’m going to use a random chunk of model output from a project I’m currently working on. The raster categories are predicted soil classes, and there’s a nodata area with an island in the file, so the data should be messy enough to provide a robust test of this method.

system2('gdalwarp', 
        args = c('-s_srs', 'EPSG:3577',
                 '-t_srs', 'EPSG:4326',
                 '-r', 'near', 
                 '-te', '1407921 -1884176 1436633 -1861473', 
                 '-te_srs', 'EPSG:3577',
                 paste0('"', srcfile, '"'),
                 '"C:/DATA/r2p/cat_clip.tif"'))
cr <- raster("C:/DATA/r2p/cat_clip.tif") 

plot of chunk plot1

Set up a new GRASS location:

initGRASS(gisBase  = "C:/OSGeo4W64/apps/grass/grass-7.2.1",
          gisDbase = getwd(),
          location = 'grassdata',
          mapset   = "PERMANENT", 
          override = TRUE)
## gisdbase    C:/DATA/r2p 
## location    grassdata 
## mapset      PERMANENT 
## rows        1 
## columns     1 
## north       1 
## south       0 
## west        0 
## east        1 
## nsres       1 
## ewres       1 
## projection  NA

The location’s crs must be matched to the input dataset before import. Use g.proj:

execGRASS("g.proj", 
          georef = file.path(getwd(), 'cat_clip.tif'),
          flags = c('t', 'c'))

And then the file can be imported with r.in.gdal. Lastly, the region settings need to be updated with g.region to match the raster (nrows, ncols, origin etc).

execGRASS('r.in.gdal',
          input  = file.path(getwd(), 'cat_clip.tif'),
          output = 'cat_clip',
          flags  = 'overwrite')

# to import a raster object already in R:
#writeRAST(cat_clip, 'cat_clip')

execGRASS('g.region', raster = 'cat_clip')

# let's see what we've done:
gmeta()
## gisdbase    C:/DATA/r2p 
## location    grassdata 
## mapset      PERMANENT 
## rows        919 
## columns     971 
## north       -16.76522 
## south       -16.99697 
## west        145.2097 
## east        145.4547 
## nsres       0.0002521752 
## ewres       0.000252385 
## projection  +proj=longlat +no_defs +a=6378137 +rf=298.257223563
## +towgs84=0.000,0.000,0.000

So far, so good. I’ll do a strict polygonisation with r.to.vect first.

execGRASS('r.to.vect',
          input  = 'cat_clip',
          output = 'cat_v_raw',
          type   = 'area',
          flags  = c('v', 'overwrite'))

# pull the results into R (rgrass7 only supports sp)
cat_v_raw <- readVECT('cat_v_raw') %>% 
  st_as_sf() %>%
  st_cast('POLYGON', warn = FALSE)

# backup
st_write(cat_v_raw, file.path(getwd(), 'cat_v_raw.gpkg'), quiet = TRUE)

Here’s a zoomed-in look:

plot of chunk zoom1

This is, of course, the only way to produce a vector dataset that fully represents the source data, but a perfect analog is not always desirable. Small pixels and large extents can result in massive polygon counts. Products like this perform poorly and are difficult to work with, particularly on tablets without a pointer or stylus. A better alternative is to simplify the output map somewhat before converting to vector.

Raster simplification

Before simplifying a raster, an appropriate minimum zone size must be chosen - this becomes the minimum acceptable polygon size. Choice of zone size is specific to whatever data you’re working with. Since I’m playing with a soils map, I’m going to be a good little soil scientist and pay attention to the Guidelines for Surveying Soil and Land Resources (McKenzie et al, 2008, the ‘Blue Book’).

According to these guidelines, soil map scale is determined from sample point density, using some well-established rules of thumb e.g. 8-16 points per sq km = ~1:50,000. This map was produced not by sampling in the real world, but by using a model algorithm that generates what are essentially ‘virtual sites’ at a rate (for this project) of ~15 sites per square kilometre. The algorithm has a floor of 15 sites per polygon for Reasons, so polygons smaller than 1 sq km are sampled at a higher density. I have examined some model sample point files and found a median sample rate of about 50 sites per polygon.

Referring to Table 3.1 of the Blue Book (p. 32), that median sample rate translates to a scale of ~1:10,000, and a minimum delineable area of 0.4 hectares. One pixel in this raster is approximately 0.075 hectares, so this means I should remove any contiguous areas of fewer than 6 pixels before vectorising.
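
The pixel-size figure is just a back-of-envelope conversion of the region resolution to metres (rough numbers, using ~111.32 km per degree and shrinking the east-west size by the cosine of the latitude):

ns_m     <- 0.0002521752 * 111320                        # ~28 m north-south
ew_m     <- 0.0002523850 * 111320 * cos(16.9 * pi / 180) # ~27 m east-west
pixel_ha <- ns_m * ew_m / 10000                          # ~0.075 ha per pixel
0.4 / pixel_ha                                           # ~5.3, so drop areas under 6 pixels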

There are plenty of other ways to pick a target zone size; do as you will. Now, to the actual smoothing:

For data in a projected crs, you can use the convenience wrapper r.reclass.area, but since I’m using lat-long I have to take the long way. This is just a chain of r.to.vect, v.clean and v.to.rast with particular settings locked in.

Note that many people seem to think applying a moving window filter to a raster (modal, in this case) is the only way to smooth it out. I disagree - that tends to blur out too much detail, and it’s especially harsh on small linear features. The process below respects the source data better.

# the first step (r.to.vect) was completed above
execGRASS('v.clean',
          input     = 'cat_v_raw',
          type      = 'area',
          output    = 'cat_v_smoothed',
          tool      = 'rmarea',
          threshold = 4000, # square meters
          flags     = c('overwrite'))

execGRASS( 'v.to.rast',
           input  = 'cat_v_smoothed',
           type   = 'area',
           output = 'cat_r_smoothed',
           use    = 'cat',
           flags  = 'overwrite')

## the above is equivalent to the below
#execGRASS('r.reclass.area',
#          input  = 'cat_clip',
#          output = 'cat_r_smoothed',
#          value  = 0.4,             
#          mode   = 'lesser',
#          method = 'rmarea',
#          flags  = c('overwrite'))

cat_v_smoothed <- readVECT('cat_v_smoothed') %>% 
  st_as_sf() %>%
  st_cast('POLYGON', warn = FALSE)

# backup
st_write(cat_v_smoothed, file.path(getwd(), 'cat_v_smoothed.gpkg'), quiet = TRUE)

The simplified vector map produced with v.clean above is usable, and we’ve dropped the number of polygons from 16225 to 3880, which is quite respectable (original raster data underneath for comparison):

plot of chunk zoom2

That said, you can get a much nicer looking output from re-vectorising the smoothed raster, using the ‘s’ flag.

execGRASS('r.to.vect',
          input  = 'cat_r_smoothed',
          output = 'cat_v_pretty',
          type   = 'area',
          flags  = c('v', 's', 'overwrite'))

cat_v_pretty <- readVECT('cat_v_pretty') %>% 
  st_as_sf() %>%
  st_cast('POLYGON', warn = FALSE)

# backup
st_write(cat_v_pretty, file.path(getwd(), 'cat_v_pretty.gpkg'), quiet = TRUE)

plot of chunk zoom3

The final product has 3600 features and looks like this:

plot of chunk plot2

Topology issues

Outside GRASS, the results of this workflow can have self-intersection problems, for instance:

screencap of topological problem

I feel like if this were easily fixed, a tool would exist already. A challenge for someone not-me!
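
If you’d rather count the problems from R than QGIS, sf’s GEOS-backed st_is_valid will do it - something like:

# how many polygons fail the validity check, and why
validity <- st_is_valid(cat_v_pretty, reason = TRUE)
sum(validity != "Valid Geometry", na.rm = TRUE)
head(validity[validity != "Valid Geometry"])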

This is a setup post about getting R and the OSGeo4W package to talk to each other on Windows by setting environment variables appropriately. R’s built-in spatial capabilities have been drastically enhanced over the last few years, but a lot of really useful tools remain external to R itself, and this is an easy way to get at lots of them. I’ve also been using R as a workflow recording and reporting tool as much as an actual data processing environment, so I find this setup extremely useful for reproducibility.

This post can’t be all things to all people, but it should work for most. It’s been tested on standard Win10 and 10.1, plus a few workplace-specific builds of Win7 that have had god-knows-what done to them by my IT people, bless their uncommunicative, control-freaky hearts. I’ll be updating this fairly regularly as requirements evolve (and as I learn more tricks!). I won’t be helping debug your setup in the comments, though; post any problems on stackexchange where they belong.


My usual practice for a new setup is

  • Install R to C:\R\*version*
  • Install RStudio
  • Install RTools (not essential, but handy for compiling some packages from source code)

then make sure both RTools and R are on the system Path, fairly early in the list. If they don’t appear there as a result of the install process, just paste in the following at the very start of the path, without altering what’s already there, e.g.

C:\RTools\bin;C:\RTools\mingw_32\bin;C:\R\R-3.3.3\bin;

Adjust R version number as needed.

Next, install the OSGeo4W package to its default location (mine is C:\OSGeo4W64). It’s easiest to run the automatic install first and then go back into setup under advanced mode to select any extras you want. After that, set all of the following environment variables:

In User variables:

  • GISRC = %USERPROFILE%\AppData\Roaming\GRASS7\rc
  • GRASS_ADDON_PATH = %USERPROFILE%\AppData\Roaming\GRASS7\addons

The first one tells the system where the GRASS rc file is. The rc file contains local settings for some GRASS startup variables, which you can read about here. This file won’t exist until you open GRASS for the first time and set up a database location, but you can still set GISRC ahead of time. The second one tells the system where any GRASS addon modules you choose to install are located.

In System variables:

  • OSGEO4W_ROOT = C:\OSGeo4W64
  • GDAL_DATA = %OSGEO4W_ROOT%\share\gdal
  • GDAL_DRIVER_PATH = %OSGEO4W_ROOT%\bin\gdalplugins
  • GEOTIFF_CSV = %OSGEO4W_ROOT%\share\gdal
  • GISBASE = %OSGEO4W_ROOT%\apps\grass\grass-7.2.0
  • GRASS_PYTHON = %OSGEO4W_ROOT%\bin\pythonw.exe
  • PROJ_LIB = %OSGEO4W_ROOT%\share\proj
  • PYTHONHOME = %OSGEO4W_ROOT%\apps\Python27
  • PYTHONPATH = %GISBASE%\etc\python
  • SAGA = C:\OSGeo4W64\apps\saga

These make your system aware of the versions of GDAL/OGR, GRASS, Python and SAGA that come with OSGeo4W. Adjust the version number in GISBASE as necessary. You may also need to use -ltr for some of the apps if you installed those instead (e.g. saga-ltr). Note that setting SAGA to %OSGEO4W_ROOT%\apps\saga won’t work.
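
Once they’re set, you can confirm from a fresh R session that the variables are visible with Sys.getenv():

# quick check that the variables above are visible to R (blank = not set yet)
Sys.getenv(c('GISRC', 'GRASS_ADDON_PATH', 'OSGEO4W_ROOT', 'GDAL_DATA',
             'GISBASE', 'PROJ_LIB', 'PYTHONHOME', 'SAGA'))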

Now edit the Path again - among the rest of what’s already there, add the following, in this order:

  • %SAGA%
  • %SAGA%\modules
  • %PYTHONHOME%
  • %OSGEO4W_ROOT%\bin
  • %OSGEO4W_ROOT%\apps\qgis
  • %GISBASE%\bin
  • %GISBASE%\scripts
  • %GISBASE%\lib

Appending to the end of the Path should work, but you may have to make some adjustments depending on what-all else you have installed on your machine.

You can test functionality in R with a few system2 calls, e.g.

system2('gdal_translate')
system2('saga_gui')
system2('qgis-browser')

Using the above method, you should be able to access any of the command-line tools listed when you open the OSGeo4W shell. The rgrass7 package should also initialise properly when loaded with library(). There is a wrapper package for SAGA (RSAGA) as well, but tbh I find it easier to just use system2('saga_cmd', args = c(...)). IMO it’s not much of a wrapper package if you can’t feed an R spatial object in directly…
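
For example, a slope calculation via SAGA’s command line might look something like the sketch below - I’m assuming the ta_morphometry library’s ‘Slope, Aspect, Curvature’ module here, and module numbers and parameter names can vary between SAGA versions, so check saga_cmd’s own help output first:

# illustrative only - run system2('saga_cmd', args = 'ta_morphometry') to see
# the modules and parameters your SAGA version actually provides
system2('saga_cmd', args = c('ta_morphometry', '0',
                             '-ELEVATION', 'C:/DATA/dem.sgrd',
                             '-SLOPE',     'C:/DATA/slope.sgrd',
                             '-ASPECT',    'C:/DATA/aspect.sgrd'))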

You can also get at the QGIS processing toolbox via the RQGIS package, although it’s generally faster and simpler to access the underlying software directly, where possible. For example, a system call to gdalwarp is around 400 times faster than RQGIS::run_qgis('gdalogr:warpreproject') on my machine (and a lot simpler to write), but QGIS-only tools like ‘Distance to nearest Hub’ are probably worth a look.

In response to this stackexchange question, here’s a workflow for getting vector outlines of non-NULL areas in rasters, using GRASS 7 to do the hard work and R to handle batching. This whole thing is really hacky, but there don’t seem to be any other solutions around at present.

Setup

I’m working with the OSGeo4W64 package, R 3.3.1 and RStudio in a Windows 10 environment.

Data prep

Get all your target rasters into one folder. Make sure that:

  • files are named in a logical sequence that does not start with a number
  • files are all in the same projection
  • files are spatially aligned, i.e. the pixels all sit on the same overarching grid

Files can be in any GDAL/OGR compatible format, although I’ve only tested this with .tif and .asc. I’m using a few chunks of Geoscience Australia’s 1” SRTM DEM-H today.
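
A quick way to check the alignment requirements before going further is raster::compareRaster(), which can test CRS, resolution and grid origin across the whole folder (a sketch, assuming the files load happily with raster()):

library(raster)
# TRUE if every raster shares the same crs, resolution and grid origin
tifs <- list.files('C:/footprint_test', pattern = '\\.tif$', full.names = TRUE)
compareRaster(lapply(tifs, raster), extent = FALSE, rowcol = FALSE,
              res = TRUE, orig = TRUE, stopiffalse = FALSE, showwarning = TRUE)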

R/GRASS interface

GRASS and R need a little help to speak to each other. Go to my blog post about setting environment variables and follow the instructions.

GRASS setup

Open GRASS and establish a new location with the same CRS as your target rasters. Don’t worry about region settings for now, and there’s no need to create any extra mapsets beyond ‘PERMANENT’. Open the PERMANENT mapset in your new location, and then just exit GRASS. This will update the file that GISRC points to.

R setup

Open RStudio and make sure the packages ‘sp’, ‘raster’, ‘rgdal’, ‘XML’, and ‘rgrass7’ are installed. I have them, and am ready to process.

Processing

library(sp)
library(raster)
library(rgdal)
library(rasterVis)
library(ggplot2)
library(XML)
library(rgrass7)

The console output will indicate a connection to the GRASS location you just created. Now, set your R working directory to your raster folder, and list the files available:

setwd('C:/footprint_test')
rasters <- list.files(getwd(), pattern = '\\.tif$', full.names = T)

They look like this:

plot of chunk rplots

Props to Andrew Tredennick for the ggplot method used above, although I’d be careful with very large rasters.

Next, import all the rasters you’ve listed into your GRASS location using r.in.gdal; they will be stored in GRASS’ native data format under the mapset PERMANENT.

grassnames_list <- list() 
for (item in rasters) {
  grass_name <- substr(basename(item), 1, nchar(basename(item)) - 4)
  execGRASS('r.in.gdal', 
            flags = 'overwrite', 
            parameters = list(input  = item,
                              output = grass_name))
  grassnames_list[[item]] <- grass_name
}

Now use g.region to set region extents to cover all the files you just imported.

execGRASS('g.region', parameters= list(raster=paste(grassnames_list, collapse=",")))

I’m using r.mapcalc to make data/nodata versions of each raster; it’s very powerful but not accessible via rgrass7 for some reason, so we’re just going to make a direct system call. This’ll only work if you provide instructions in a text file, so that needs to be generated first, using some string manipulation shenanigans.

exp_list <- list()
fpname_list <- list()
for (item in rasters) {
  grassname <- substr(basename(item), 1, nchar(basename(item)) -4)
  fp_name   <- paste0(grassname, '_foot')
  exp       <- paste0(fp_name, ' = if(isnull(', grassname, '), null() , 1)')
  
  exp_list[[item]]    <- exp
  fpname_list[[item]] <- fp_name
}

write(c(paste(exp_list, collapse = '\n'), 'end'),
      file = paste0(getwd(), '/rmc_instructions.txt'),
      ncolumns = 1)
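
For a couple of hypothetical DEM tiles, the instruction file ends up looking something like this - one map algebra expression per raster, with ‘end’ on the last line:

dem_tile_01_foot = if(isnull(dem_tile_01), null() , 1)
dem_tile_02_foot = if(isnull(dem_tile_02), null() , 1)
end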

Open the output *.txt if you want to see an example of r.mapcalc’s map algebra format. Run that:

rmc   <- 'C:/OSGeo4W64/apps/grass/grass-7.0.5/bin/r.mapcalc.exe'
calcs <- 'C:/footprint_test/rmc_instructions.txt'
system2(rmc, paste0('file=', calcs, ' --overwrite'))

Time to vectorise those footprints with r.to.vect:

vname_list <- list()
for (fp in fpname_list) {
  v_name <- paste0(fp, '_v')
  execGRASS('r.to.vect', 
            flags = 'overwrite',
            parameters = list(input  = fp,
                              output = v_name,
                              type   = 'area'))
  vname_list[[fp]] <- v_name
}

If you open your mapset now (or look at it in the QGIS browser, assuming you have the GRASS plugin activated), you’ll see your imported rasters, the footprints for each, and a vectorised version of the latter. Nothing left to do now but export with v.out.ogr. There’s a wide choice of export formats available, just demo-ing with shp:

outname_list <- list()
for (vn in vname_list) {
  out_name <- paste0(getwd(), '/', vn, '.shp')
  execGRASS('v.out.ogr', 
            flags = 'overwrite', 
            parameters = list(input  = vn,
                              output = out_name,
                              format = 'ESRI_Shapefile'))
  outname_list[[vn]] <- out_name
}

Output files will have multiple polygons if there was more than one ‘island’ of data in the input. If you want one multipolygon, use v.dissolve (this is why I had to set all those python environment variables). Note that you’ll see a lot of windows flashing up as v.dissolve runs - don’t panic. After that, use flag -m with v.out.ogr:

dissname_list <- list()
for (vn in vname_list) {
  diss_name <- paste0(vn, '_sp')
  execGRASS('v.dissolve', 
            flags = 'overwrite',
            parameters = list(input  = vn,
                              column = 'value',
                              output = diss_name))
  dissname_list[[vn]] <- diss_name
}

outname_list2 <- list()
for (dn in dissname_list) {
  out_name2 <- paste0(getwd(), '/', dn, '.shp')
  execGRASS('v.out.ogr', 
            flags = c('m', 'overwrite'), 
            parameters = list(input  = dn,
                              output = out_name2,
                              format = 'ESRI_Shapefile'))
  outname_list2[[dn]] <- out_name2
}

End result:

plot of chunk output

You can find all the files related to this post at https://github.com/obrl-soil/bits-n-pieces/footprint_test.

A lot of new soil mapping projects are being delivered as a stack of attribute surfaces rather than an interpreted choropleth map. This gives the end user freedom to mix and match components, generating custom products that suit their needs. That’s great if you know how, but I haven’t seen all that much documentation around, so. Here’s a quick workflow for one of the simplest possible products, a soil texture class map from particle size distribution rasters.

I’m demo-ing this using some modelled data I pinched from the Soil and Landscape Grid of Australia, which is a pretty boss product. To be specific, I used the gdal_translate method I blogged about here to extract a ~1 x 1 degree area from each of the 0-5cm clay, silt and sand datasets (ulx 148, uly -22).
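
For reference, each of those clips was a gdal_translate call along these lines - the source path below is a placeholder for wherever you’ve stashed the full SLGA GeoTIFF, and the same -projwin window gets reused for the silt and sand layers:

# illustrative clip of a 1 x 1 degree window (ulx 148, uly -22) from the full SLGA layer
system2('gdal_translate',
        args = c('-projwin', '148', '-22', '149', '-23',
                 '"C:/Users/obrl_soil/Downloads/SLGA_Clay_05.tif"',
                 '"C:/Users/obrl_soil/Downloads/Clay_05_1dd.tif"'))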

_but what does it mean?!_

I am also using the soiltexture package to do most of the heavy lifting, and boy am I ever glad I didn’t have to code all that myself.

Required packages and data sources:

library(sp)
library(rgdal)
library(raster)
library(soiltexture)
library(rasterVis)
library(RColorBrewer)
library(dplyr)

working_dir <- 'C:/Users/obrl_soil/Downloads'
clay_src <- paste0(working_dir, '/Clay_05_1dd.tif')
silt_src <- paste0(working_dir, '/Silt_05_1dd.tif')
sand_src <- paste0(working_dir, '/Sand_05_1dd.tif')
inputs <- c(clay_src, silt_src, sand_src)

Note the order of entry for inputs - it must stay consistent from this point on. Next thing to do is read the three rasters into a RasterStack object, and promote that to SpatialPixelsDataFrame.

SSC <- stack()

for (i in 1:length(inputs)) {
  rn  <- inputs[i]  
  r   <- raster(rn)
  SSC <- addLayer(SSC, r)
}

SSCP <- as(SSC, 'SpatialPixelsDataFrame')

Yeah, I used a loop in R code. Fight me.

I chose SpatialPixelsDataFrame (SPDF) over SpatialGridDataFrame (SGDF) for a couple of reasons. Firstly, the workflow fails for SGDF if there are any NA values in the input data, so coastal datasets would be a problem. I’m too much of a n00b to pin down why SPDF can handle it and SGDF can’t beyond ‘SPDF has @grid.index’ but there it is. The other issue is that SGDF is intended for strictly regular grids, and anything projected in spherical coordinates fails that test. All this data is in WGS84 (and no, it doesn’t matter if the extent is small!).

Now, the soiltexture package often looks for default column headings so some renaming at this point can make life easier. Also, I’m going to round off the PSD data, since it wanders out to 6+ decimal places. Laboratories don’t measure PSD that precisely - anything past ±0.01 is a fantasy, and the rounded data is just easier to read.

names(SSCP@data) <- c('CLAY', 'SILT', 'SAND')
SSCP@data <- round(SSCP@data, 2)
SSCP@data$raw_totals <- rowSums(SSCP@data[, 1:3])

Particle size data should add to about 100%. There’s some legitimate variation in lab results, partly due to the presence of organic matter, dissolvable materials etc, and partly due to inherent method imprecision, but a good result range is 100% ±5%. Maybe ±10% in some surface soils, thanks to elevated OM (OM% by volume). Probably not in central Queensland, though.
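
The QA flag you’ll see in the data summary below was built along these lines - the break points are my assumption, based on the ±5% rule of thumb above:

# flag totals outside the acceptable range (break points are an assumption)
SSCP@data$QA <- cut(SSCP@data$raw_totals,
                    breaks = c(-Inf, 95, 105, Inf),
                    labels = c('Too low', 'Just right!', 'Too high'))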

Anyway, modelled data can fail this acceptable-range test, because each fraction is predicted independently of the others. This sample isn’t too bad - the mean of total PSD is 97.11 and the standard deviation is 3.54, but the range is still 77.6 to 118.04 and the histogram looks like this:

plot of chunk histo

Orange = too low, blue = too high. There’s also a spatial pattern to the out of range values which bothers me a bit:

plot of chunk failmap

Glancing at the histogram and the satellite imagery, there’s particular underprediction on alluvial flats, idk why - those are usually the more heavily sampled areas, for accessibility reasons if nothing else. Whatever, all up, about a third of the pixels are out of range. So a) be cautious about modelled PSD data applications and interpretation and b) proportional normalisation of your PSD data should be a routine step unless you have a good reason not to do so. The following is the simplest kind of normalisation: fraction/total of fractions * 100.

SSCP_norm <- TT.normalise.sum(tri.data = SSCP@data, residuals = T)
colnames(SSCP_norm)[1:3] <- paste0(colnames(SSCP_norm)[1:3], "_n")
SSCP@data <- cbind(SSCP@data, round(SSCP_norm, 2)) 
rm(SSCP_norm)

str(SSCP@data)
## 'data.frame':	1440000 obs. of  9 variables:
##  $ CLAY      : num  10.5 12.3 15.1 12.9 10.6 ...
##  $ SILT      : num  7.84 8.85 9.36 8.63 9.29 8.99 8.73 9.36 9.56 8.64 ...
##  $ SAND      : num  86.3 80.8 75.2 79.5 83 ...
##  $ raw_totals: num  104.7 102 99.6 101 102.9 ...
##  $ QA        : Factor w/ 3 levels "Too low","Just right!",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ CLAY_n    : num  10 12.1 15.1 12.8 10.3 ...
##  $ SILT_n    : num  7.49 8.68 9.39 8.55 9.03 9.03 8.81 9.63 9.31 8.9 ...
##  $ SAND_n    : num  82.5 79.2 75.5 78.7 80.7 ...
##  $ residuals : num  4.69 2.01 -0.35 0.97 2.89 -0.46 -0.89 -2.76 2.74 -2.96 ...

Now the normalised data can be classified. ‘soiltexture’ comes with a bunch of predefined texture triangles. I’m going to use the Australian one to match the source data. Check the documentation for TT.points.in.classes for details but the upshot is that I’m pulling out a text vector of classes with length = n pixels and adding it as a new @data$column:

SSCP@data <- cbind(SSCP@data, 
                  "TEXCLASS" = TT.points.in.classes(tri.data  = SSCP@data[, c('CLAY_n', 'SILT_n',  'SAND_n')],
                                                    css.names = c('CLAY_n', 'SILT_n', 'SAND_n'),
                                                    class.sys = "AU2.TT", 
                                                    PiC.type  = "t",
                                                    collapse  = ', ')
                  )

For the speed-freaks, everything from data import to this point (excluding plots) took just over a minute on my little ol’ Surface Pro (Intel Core i5 1.7GHz processor, 4GB RAM, Win 10).

At this point some stuff has to be dealt with:

  • Some PSDs will fall on texture class boundaries and will be named e.g. “class1, class2”. I just want a quick map so I’m happy to accept class1 in all cases. Note: rounding the input data actually made this issue slightly worse, but I’m still not willing to leave what amounts to imaginary variation in place.
  • A small number of cells have failed to classify for unknown reasons (probably the fault of the AU2.TT triangle rules, there’s a comment to that effect in the vignette). They all seem to be ~80% sand and ~9% each of silt and clay.
  • An unintended consequence of this is that one of the texture classes is "" - and since it’s character data in a data frame, it’s been coerced to factor. R hates empty factor levels, and it’s not very informative anyway.

levels(SSCP@data$TEXCLASS)

## [1] ""        "C"       "C, CL"   "CL"      "L"       "L, CL"   "LS"
## [8] "SCL"     "SL"      "SL, L"   "SL, SCL"

SSCP@data$TEXCLASS_SIMPLE <- as.character(SSCP@data$TEXCLASS)
SSCP@data$TEXCLASS_SIMPLE[SSCP@data$TEXCLASS_SIMPLE == ""] <- "Not classified"
SSCP@data$TEXCLASS_SIMPLE <- as.factor(SSCP@data$TEXCLASS_SIMPLE)
SSCP@data$TEXCLASS_SIMPLE <- recode(SSCP@data$TEXCLASS_SIMPLE,
                                   "C, CL" = "C", "L, CL" = "L", "SL, SCL" = "SL", "SL, L" = "SL")
levels(SSCP@data$TEXCLASS_SIMPLE)

## [1] "C"              "CL"             "L"              "LS"
## [5] "Not classified" "SCL"            "SL"

Lastly, I want my final factor list in a sensible order so my map looks pretty.

SSCP@data$TEXCLASS_SIMPLE <- factor(SSCP@data$TEXCLASS_SIMPLE, 
                                   levels = c('C', 'CL', 'SCL', 'L', 'SL', 'LS', 'Not classified'))

Let’s take a look:

plot of chunk mappymap

Nooooooiice *pats self on back*

To export this as GeoTIFF you still have to make it numeric, but it’s easy enough to write a lookup table to accompany the image:

SSCP@data$TEXCLASS_SN <- as.numeric(SSCP@data$TEXCLASS_SIMPLE)
writeGDAL(SSCP['TEXCLASS_SN'], 
          paste0(working_dir, '/tcmap_148-22.tif'), 
          drivername = 'GTiff')

LUT <- data.frame(unique(SSCP@data[, c('TEXCLASS_SN', 'TEXCLASS_SIMPLE')]))        
write.table(LUT, 
            file = paste0(working_dir, '/tcmap_148-22_LUT.txt'), 
            sep = ", ",
            quote = FALSE,
            row.names = FALSE,
            col.names = TRUE)  

And that’s that. Next time, if I feel so inclined: what to do after you’ve classified all your depth slices.