Introduction

Quantifying the geographical extent, location and spatial dynamics of croplands, rural and urban settlements and different types of vegetation cover provides essential information for monitoring and managing human dominated (anthropic) ecosystems and landscapes. Large portions of Africa remain terra incognita in this context. The main reason for monitoring land cover is to assess where in a particular country or region of interest (ROI) significant impacts of humans on ecosystem services can be expected, and vice versa.

GeoSurvey is an application for collecting and analyzing land cover observations. High resolution satellite images and/or other aerial (e.g., aircraft or drone) imagery can be systematically and rapidly labeled by either trained image interpreters and/or by vetted crowds of Citizen Scientists. When done with care, these observations can result in large, well-structured, properly labeled, geospatial data sets that are suitable for machine learning and geostatistical predictions of land cover and in some instances for monitoring land use. When supplied with properly time-stamped imagery, GeoSurvey can also be used for monitoring ecosystem and landscape changes.

The detailed manual for conducting your own GeoSurveys is available at: GeoSurvey manual. The manual should definitely be consulted to obtain more information about how GeoSurvey can be used to carry out potentially high value surveys of remote areas. There is also a great slide deck available here, which illustrates different land cover and use labeling approaches. I’ll not cover those issues in this notebook and will assume that you already have well-designed GeoSurvey data and collocated spatial features in hand.

The main goal of this notebook is to illustrate starter code guidelines for predictive land cover mapping and the associated statistical small area estimates SAE for variables such as cropland area, building densities, settlement occurrences and woody vegetation cover that define the anthropic land cover types in a given country or any other ROI. I use Rwanda’s most recent GeoSurvey data and the associated gridded (raster) features to illustrate the general approach and the main data analysis steps. Rwanda, being a small country, is convenient for this illustration because the script will run reasonably fast and will hopefully not test your patience … too much. You can also try other African GeoSurvey datasets, which are openly available via the GeoSurvey workflow repository on OSF.

Data setup

To actually run this notebook, you will need to load the packages indicated in the chunk directly below. This allows you to assemble the Rwanda-wide GeoSurvey observations, link those to the spatial data and then model them using machine learning algorithms MLA and/or geostatistics. The notebook itself is versioned and maintained on my Github, and you can fork and modify it from there as you see fit.

# Package names
packages <- c("osfr", "downloader", "jsonlite", "rgdal", "sp", "raster", "leaflet", "DT", "htmlwidgets",
              "caret", "caretEnsemble", "mgcv", "MASS", "randomForest", "xgboost", "nnet",
              "klaR", "dplyr", "doParallel", "dismo", "arm")

# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
    install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

Data downloads

This chunk downloads the data, which are needed to run this example. The downloads contain the most recent GeoSurvey observations (labels), 48 raster covariates (features) and the administrative boundaries of Rwanda, which were sourced from GADM.

# Set working directory
dir.create("RW_GS20", showWarnings = F)
setwd("./RW_GS20")
dir.create("Results", showWarnings = F)

# Download GeoSurvey data
download("https://www.dropbox.com/s/oqao51hxxvc09ec/RW_geos_2019.csv.zip?raw=1", "RW_geos_2019.csv.zip", mode = "wb")
unzip("RW_geos_2019.csv.zip", overwrite = T)
geos <- read.table("RW_geos_2019.csv", header = T, sep = ",")

# Download GADM-L5 shapefile (courtesy of: http://www.gadm.org)
download("https://www.dropbox.com/s/fhusrzswk599crn/RWA_level5.zip?raw=1", "RWA_level5.zip", mode = "wb")
unzip("RWA_level5.zip", overwrite = T)
shape <- shapefile("gadm36_RWA_5.shp")

# Data raster stack download from OSF at: https://osf.io/xts2y/
osf_retrieve_file("xts2y") %>% osf_download(conflicts = "overwrite")
unzip("RW_250m_2020.zip", overwrite = T)
glist <- list.files(pattern="tif", full.names = T)
grids <- stack(glist)

# Download figures at: https://osf.io/yu8ts/
osf_retrieve_file("yu8ts") %>% osf_download(conflicts = "overwrite")
unzip("GeoSurvey_figs.zip", overwrite = T)

The figure below attempts to clarify the basic GeoSurvey observation and tagging approach with 4 archetypal (250 × 250 m) quadrats from Rwanda (i.e., the blue squares in the figure). The upper left portion of the figure shows all of the tagged buildings in a quadrat where buildings are present. The upper right of the figure shows an example of a cropland grid count where cropland occupies virtually the entire quadrat. The applied if-else rule is that if cropland is present, then ccount is set to a count between 1-16 depending on the amount of cropland cover that is present within the quadrat. If cropland is absent, then ccount is set to 0. The lower left shows a quadrat with dense woody vegetation (> 60% cover). The lower right is an example of a grassland quadrat in which buildings, cropland and dense woody vegetation cover are absent. Note that mixtures of these land cover archetypes occur frequently (e.g., buildings and/or dense woody vegetation and croplands), at this spatial scale and can be accounted for in the subsequent data analysis and prediction steps.

Figure 1: Examples of archetypal GeoSurvey quadrats observed over Rwanda.

Data assembly

The following chunks assemble the GeoSurvey observations from q = 23,199 (250 × 250 m) quadrats, which were collected by 6 experienced image interpreters between 2^nd April - 18^th July of 2019. It also calculates building counts and cropland grid proportions from the associated point location variables (bloc & cgrid). I’ll use the resulting cropland grid count (ccount) variable for small area cropland estimation later on in the notebook.

# Attach GADM-L5 administrative unit names from shape
coordinates(geos) <- ~lon+lat
projection(geos) <- projection(shape)
gadm <- geos %over% shape
geos <- as.data.frame(geos)
geos <- cbind(gadm[ ,c(4,6,8,10,12)], geos)
colnames(geos) <- c("region","district","sector","cell", "village","time","observer","id","lat","lon","BP","CP","TP","WP","bloc","cgrid")

# Coordinates and number of buildings per quadrat
bp <- geos[which(geos$BP == "Y"), ] ## identify quadrats with buildings
bp$bloc <- as.character(bp$bloc)

# Coordinates of the tagged building locations from quadrats with buildings
c <- fromJSON(bp$bloc[1])
bcoord <- do.call("rbind", c$feature$geometry$coordinates)
for(i in 2:nrow(bp)) {
  c <- fromJSON(bp$bloc[i])
  bcoord_temp <- do.call("rbind", c$feature$geometry$coordinates)
  bcoord <- rbind(bcoord, bcoord_temp)
}
bcoord <- as.data.frame(bcoord) ## vector of coordinates per quadrats with buildings
colnames(bcoord) <- c("lon","lat")

# Number of tagged building locations from quadrats with buildings
bcount <- rep(NA, nrow(bp))
for(i in 1:nrow(bp)) {
  t <- fromJSON(bp$bloc[i])
  bcount[i] <- nrow(t$features)
}
bcount ## vector of number of buildings per quadrats with buildings
ba <- geos[which(geos$BP == "N"), ]
ba$bcount <- 0
bp <- cbind(bp, bcount)
geos <- rbind(ba, bp)
geos <- geos[order(geos$time),] ## sort in original sample order

# Cropland grid count
cp <- geos[which(geos$CP == "Y"), ] ## identify quadrats with cropland
cp$cgrid <- as.character(cp$cgrid)

# Number of tagged grid locations (out of 16 possible grid nodes per quadrat)
ccount <- rep(NA, nrow(cp))
for(i in 1:nrow(cp)) {
  t <- fromJSON(cp$cgrid[i])
  ccount[i] <- nrow(t$features)
}
ccount ## cropland grid count
ca <- geos[which(geos$CP == "N"), ]
ca$ccount <- 0
cp <- cbind(cp, ccount)
geos <- rbind(ca, cp)
geos <- geos[order(geos$time), ] ## sort back into original random sample order

Grids

The processed Rwanda grid data (in the grids raster stack) were derived and reprojected from their primary open sources. You can also download the entire stack directly at RwaSIS grids. The short descriptions of the included rasters, and their sources are provided in the table immediately below.

These Rwanda-wide (actually Africa-wide) features will change over time and I will update them if and when needed. Also note that these are grouped by factor variables that designate the occurrence of cropland (CP) as a Jenny-type function f(CP) ~ (a, c, o, r, s) where:

a - anthropic variables
c - climatic variables
o - organismal and ecologically successional variables
r - relief / topographical variables
s - parent material and soil related variables

The main notion here is that the occurrence and distribution of croplands must always be associated with the distribution of humans and their built infrastructure, but also constrained or facilitated by changes in typically much slower environmental factors such as climate, ecological succession, topography, parent materials and soils. Note that all of these change and interact over space-time and should be, but are currently not adequately monitored across Africa.

This chunk reprojects the GeoSurvey data to the Lambert Azimuthal Equal Area (LAEA) grid of the AfSIS raster variables and then writes out the combined dataframe RW_GS_data.csv into your ./RW_GS20/Results directory, if you’d like to process those outputs in software other than R. It also generates a location map of where in Rwanda those 23k+ GeoSurvey observations were obtained. The spatially balanced sampling frame that was used to specify the GeoSurvey locations is available on Github.

# Project GeoSurvey coords to grid CRS
geos.proj <- as.data.frame(project(cbind(geos$lon, geos$lat), "+proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs"))
colnames(geos.proj) <- c("x","y")
geos <- cbind(geos, geos.proj)
coordinates(geos) <- ~x+y
projection(geos) <- projection(grids)

# Extract gridded variables at GeoSurvey locations
geosgrid <- extract(grids, geos)
gsdat <- as.data.frame(cbind(geos, geosgrid))
gsdat <- gsdat[ which(gsdat$ccount < 17), ]

# Write out data frame
write.csv(gsdat, "./RW_GS20/Results/RW_GS_data.csv", row.names = F)

# Plot GeoSurvey sample locations
w <- leaflet() %>%
  setView(lng = mean(gsdat$lon), lat = mean(gsdat$lat), zoom = 8) %>%
  addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
  addCircleMarkers(gsdat$lon, gsdat$lat, clusterOptions = markerClusterOptions())
saveWidget(w, 'RW_GS_sample_locs.html', selfcontained = T) ## save widget
w ## plot widget

Ensemble based learning and mapping with `caret` and `caretEnsemble`

The following chunks calibrate cropland presence/absence observations using different machine learning algorithms (MLAs) to various spatial feature inputs using the caret and caretEnsemble packages. The main idea is to train several potentially competing algorithms with k-fold cross-validation. At the end of the model training processes, the various models are ensembled (combined / stacked) on an independent validation dataset. As shown in the figure below there are four conceptual steps to run these workflows:

Label vetting (quality control) for a sub-sample of all of the GeoSurvey observations that were logged by GeoSurveyors. This step is generally used to assess classification error rates for the different GeoSurveyors, particularly for crowd-sourced GeoSurveys. It can also be used to establish a quality controlled validation dataset manually should that be deemed necessary.
Calibration and model stacking, which involve calibrating several potentially contrasting MLAs with cross-validation. I shall use a combination of generalized linear, bagging, boosting and neural network base models here.
Ensemble predictions are subsequently based on a stacked model that is applied to and tested on an independent validation dataset. This step provides robust statistical estimates of how the different models in the prediction stack should be weighted against one-another.
Model prediction results are subsequently placed back into the gridded feature stack for model refitting and/or updating. This step is also really useful for improving model performance over time with new data and/or for developing predictions at higher spatial resolution where needed.

Figure 2: GeoSurvey land cover prediction workflow.

In order to monitor changing ecosystems and landscapes the first three steps should be repeated over time; i.e., to facilitate the feedback-loop in step 4. To start the fitting processes the next chunk scrubs some of the extraneous objects in memory, removes any incomplete cases, sets-up labels and features, and creates a randomized (80:20%) partition between the training and validation dataframes.

rm(list=setdiff(ls(), c("gsdat","grids","glist"))) ## scrub extraneous objects in memory
gsdat <- gsdat[complete.cases(gsdat[ ,c(19:67)]),] ## removes incomplete cases

# Set calibration/validation set randomization seed
seed <- 12358
set.seed(seed)

# Split data into calibration and validation sets
gsIndex <- createDataPartition(gsdat$CP, p = 4/5, list = F, times = 1)
gs_cal <- gsdat[ gsIndex,]
gs_val <- gsdat[-gsIndex,]

# GeoSurvey calibration labels
labs <- c("CP") ## insert other presence/absence labels (BP, WP, TP) here!
lcal <- as.vector(t(gs_cal[labs]))

# Raster calibration features
fcal <- gs_cal[ ,19:38,42:67]

Note again that I am using previously vetted cropland presence/absence (CP) as an example. This produces probability maps of where in Rwanda croplands are likely to occur versus where they are unlikely to be present. You can also substitute other GeoSurvey variables as labels and specify those with the labs variable in the script-chunk above. While these additional variables are included in the RW_soil_data.csv data file, I shall leave those for you to explore. Presented next are some starter models, which I consider to be contrasting both in terms of the gridded features that are used, as well as the MLAs that are used in fitting the observational data to those.

Machine learning models

Click on the tabs to view individual models and their stacked predictions …

Spatial trend

This is a spatially smoothed generalized additive model applying the gam function on the CP observations at different sampling locations in Rwanda, based only on their distance to the fixed LAEA datum georeference. It is similar to kriging with cross-validation, but is simpler and much faster to compute in this context. I use the standard caret syntax here to illustrate the general specification, output and prediction steps.

# Select spatial covariates
gf_cpv <- gs_cal[ ,39:41] ## specifies features as DX, DY & DXY coordinate rasters

# Start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc) ## this parallizes

# Control setup for cross-validation
set.seed(seed)
tc <- trainControl(method = "cv", classProbs = T, 
                   summaryFunction = twoClassSummary, allowParallel = T)

# Model training
gm <- train(gf_cpv, lcal, 
            method = "gam",
            preProc = c("center","scale"), 
            family = "binomial",
            metric = "ROC",
            trControl = tc)

# Model outputs & predictions
gm.pred <- predict(grids, gm, type = "prob") ## spatial predictions
stopCluster(mc)
fname <- paste("./RW_GS20/Results/", labs, "_gm.rds", sep = "")
saveRDS(gm, fname)

Central places

Central places (sensu Central place theory) are influential variables for predicting of where croplands are likely to occur (or not, e.g. in, city centers, forest reserves or national parks). The model below focuses on central place variables such as distances to major and minor roads, urban & rural settlements, parks & reserves, cell towers & electricity networks among other largely anthropically controlled / infrastructure variables. Also note that I always save the current models as .Rds files. This allows us to load the models at a later stage to e.g., re-run various analyses and/or to integrate previously fitted models into new scripts and notebooks.

# Select central place covariates
gf_cpv <- gs_cal[ ,25:38] ## these are the anthropic "distance-to" central place features

# Start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Control setup for cross validation
set.seed(seed)
tc <- trainControl(method = "cv", classProbs = T,
                   summaryFunction = twoClassSummary, allowParallel = T)

# Model training
cp <- train(gf_cpv, lcal, 
            method = "glmStepAIC",
            family = "binomial",
            preProc = c("center","scale"), 
            trControl = tc,
            metric ="ROC")

# Model outputs & predictions
cp.pred <- predict(grids, cp, type = "prob") ## central place predictions
stopCluster(mc)
fname <- paste("./RW_GS20/Results/", labs, "_cp.rds", sep = "")
saveRDS(cp, fname)

Ensemble MLAs

This chunk fits 4 additional models that use the that use all of gridded calibration data with 10-fold cross-validation. You can learn more about how these algorithms work by following links at: MASS, randomForest, xgboost and nnet. You can use caretEnsemble instead of caret as long as the feature variables (grids in this case), and the trainControl methods are the same for each model in the caretList function. This shortens the script-length of this notebook but does not otherwise affect the overall caret functionality. Note however that the calculations take a bit of time to run on a normal 8-core, 16 Gb memory computer. This is not a big problem for a relatively small ROI like Rwanda, but it might be computationally challenging for larger countries like the DRC, Tanzania or Ethiopia. I fit these models with 10-fold cross-validation and default-tuning of the relevant hyperparameters.

# Start doParallel to parallelize model fitting
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Specify model training controls
tc <- trainControl(method = "cv", number = 10, classProbs = T,
                   summaryFunction = twoClassSummary, allowParallel = TRUE, savePredictions="final")

# Fit 4 additional calibration models using all of the gridded features
clist <- caretList(fcal, lcal,
                   trControl = tc,
                   tuneList = NULL,
                   methodList = c("glmStepAIC", "rf", "xgbTree", "nnet"),
                   preProcess = c("center","scale"),
                   metric = "ROC")

gl.pred <- predict(grids, clist$glmStepAIC, type = "prob")
rf.pred <- predict(grids, clist$rf, type = "prob")
xt.pred <- predict(grids, clist$xgbTree, type = "prob")
nn.pred <- predict(grids, clist$nnet, type = "prob")
stopCluster(mc)
fname <- paste("./RW_GS20/Results/", labs, "_clist.rds", sep = "")
saveRDS(clist, fname)

Model stacking

This chunk fits a model ensemble with the glmStepAIC function from the MASS library using the validation dataframe (gs_val). You could explore other meta-model options here, but I think that this approach provides a reasonable combination and weighting of the 6 models that were produced in the previous calibration / training steps. Again, the fitting is done with cross-validation.

# Stacking setup
preds <- stack(1-gm.pred, 1-cp.pred, 1-gl.pred, 1-rf.pred, 1-xt.pred, 1-nn.pred)
names(preds) <- c("gm","cp","gl","rf","xt","nn")

# Extract predictions on the validation set
coordinates(gs_val) <- ~x+y
projection(gs_val) <- projection(preds)
gspred <- extract(preds, gs_val) ## extracts the probabilities of each model in the stack
gspred <- as.data.frame(cbind(gs_val, gspred))

# Set validation labels and features
gs_val <- as.data.frame(gs_val)
lval <- as.vector(t(gs_val[labs])) ##  subset validation labels
fval <- gspred[ ,67:72] ## subset validation features

# Start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Control setup
set.seed(1385321)
tc <- trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = T, 
                   summaryFunction = twoClassSummary, allowParallel = T)

# Model training
st <- train(fval, lval,
            method = "glmStepAIC",
            family = "binomial",
            metric = "ROC",
            trControl = tc)

# Model outputs & predictions
st.pred <- predict(preds, st, type = "prob") ## stacked spatial predictions
stopCluster(mc)
fname <- paste("./RW_GS20/Results/", labs, "_st.rds", sep = "")
saveRDS(st, fname)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7899   0.2040   0.2144   0.2414   2.8659  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.3845     0.2216 -19.783   <2e-16 ***
## rf            7.1080     0.6996  10.160   <2e-16 ***
## xt            1.1850     0.6248   1.897   0.0579 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4166.9  on 4638  degrees of freedom
## Residual deviance: 1482.5  on 4636  degrees of freedom
## AIC: 1488.5
## 
## Number of Fisher Scoring iterations: 6

Stack prediction

A ROC curve provides information about a classification test’s performance. The closer the apex of the curve toward the upper left corner, the greater the discriminatory ability of the test (i.e., the true-positive rate (Sensitivity) is high and the false-positive rate (1 - Specificity) is low. I use it here to assess the classification performance of the stacked model (st) on the validation dataframe. This next chunk does the calculations and plots the ROC curve using the dismo package. Areas under the ROC curve AUC of 1 discriminate perfectly. AUC values of 0.5 are no better than a random guess.

# Plot validation set ROC curve
cp_pre <- predict(st, fval, type="prob")
cp_val <- cbind(lval, cp_pre)
cpp <- subset(cp_val, cp_val=="Y", select=c(Y))
cpa <- subset(cp_val, cp_val=="N", select=c(Y))
cp_eval <- evaluate(p=cpp[,1], a=cpa[,1]) ## calculate ROC on validation set

Figure 3: Cropland classification ROC curve for the validation set.

Classification

The ROC curve can then be used to threshold the spatial predictions to create a cropland mask map. Pixels above, or below the threshold are classified as either having cropland present, or absent. This is useful for additional prediction accuracy checks. It also provides a cropland mask that is useful for planning of any subsequent ground sampling campaigns that involve croplands as the main ROI (e.g., for predictive soil mapping and/or laying out agronomic management response trials). The next chunk sets the relevant ROC threshold based on the validation set kappa statistic and calculates the cropland mask and writes out the 8 prediction grids to a geotif file, which can be imported to a GIS of your choice. It also generates an overview map of the cropland occurrence probabilities.

# Generate feature mask
t <- threshold(cp_eval) ## calculate thresholds based on ROC
r <- matrix(c(0, t[,1], 0, t[,1], 1, 1), ncol=3, byrow = T) ## set threshold value <kappa>
mask <- reclassify(1-st.pred, r) ## reclassify the stacked predictions

# Write prediction grids
gspreds <- stack(preds, 1-st.pred, mask)
names(gspreds) <- c("gm","cp","gl","rf","xt","nn","st","mk")
fname <- paste("./RW_GS20/Results/","RW_", labs, "_preds_2020.tif", sep = "")
writeRaster(gspreds, filename=fname, datatype="FLT4S", options="INTERLEAVE=BAND", overwrite=T)

# Write output data frame
coordinates(gsdat) <- ~x+y
projection(gsdat) <- projection(grids)
gspre <- extract(gspreds, gsdat)
gsout <- as.data.frame(cbind(gsdat, gspre))
gsout$mzone <- ifelse(gsout$mk == 1, "Y", "N")
fname <- paste("./RW_GS20/Results/","RW_", labs, "_out.csv", sep = "")
write.csv(gsout, fname, row.names = F)

# Project st.pred to EPSG:3857
st <- 1-st.pred
# msk <- mask < 1
# stm <- mask(st, msk, maskvalue = TRUE)
stl <- projectRaster(st, crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")

# set color pallette
pal <- colorBin("Reds", domain = 0:1, na.color = "transparent") 

# render map
w <- leaflet() %>% 
  addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
  addRasterImage(stl, colors = pal, opacity = 0.5) %>%
  addLegend(pal = pal, values = values(stl), title = "Cropland prob.")
w ## plot widget

Small area estimates (SAE)

The term “small area” generally refers to small administrative areas such as e.g., regions, districts, sectors, cells and/or villages … in the Rwandan context. It may also refer to any other small domain within a given ROI. If, for example, a ground survey has been carried out for e.g., a whole a country, the sample size within any particular smaller sub-area may be too small to generate precise area estimates from the data. To deal with this problem, SAE uses additional data, such as GeoSurvey predictions, which will have much denser spatial coverage than e.g., a typical country-wide ground survey. This section of the notebook illustrates some of the classic approaches to this common estimation problem (e.g., see the article by Battese & Fuller, 1982).

# Calculates mean national-level model
m0 <- glm(cbind(ccount, 16-ccount) ~ 1, family = binomial, gsout)
summary(m0)

## 
## Call:
## glm(formula = cbind(ccount, 16 - ccount) ~ 1, family = binomial, 
##     data = gsout)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -6.070  -2.029   1.786   3.488   3.488  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.77126    0.00353   218.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 280156  on 23198  degrees of freedom
## Residual deviance: 280156  on 23198  degrees of freedom
## AIC: 312645
## 
## Number of Fisher Scoring iterations: 4

The above chunk calculates the overall cropland area of Rwanda as ~68% of its total land area of ~2.37 Mha. The standard 95% confidence interval of this estimate is ± 1%. The next chunk calculates post-stratified estimates using the GeoSurvey cropland occurrence predictions. It also produces map of the expected within pixel cropland proportions and areas.

# Post-stratified spatial model with stacked GeoSurvey cropland occurrence predictions
m1 <- glm(cbind(ccount, 16-ccount) ~ st, family = binomial, gsout)
m1.pred <- predict(gspreds, m1, type = "response") ## maps pixel-level predictions
m1.area <- m1.pred*6.25 ## calculates and maps cropland area (ha) estimates per pixel
anova(m0, m1) ## anova comparison

## Analysis of Deviance Table
## 
## Model 1: cbind(ccount, 16 - ccount) ~ 1
## Model 2: cbind(ccount, 16 - ccount) ~ st
##   Resid. Df Resid. Dev Df Deviance
## 1     23198     280156            
## 2     23197     117829  1   162327

While the national-level estimate of cropland proportion (m0) is not affected much, its precision as measured by the residual deviance comparison between the two models (m0 vs m1) is. You could also try to fit this by including the individual base model predictions to derive predictions similar to those of the stacked (st) model. This could introduce an alternative SAE weighting scheme for further consideration and testing. The main point here is to consider the main stacking options for the cropland area predictions.

# Post-stratified spatial model with base model GeoSurvey cropland occurrence predictions
m2 <- glm(cbind(ccount, 16-ccount) ~ gm+cp+gl+rf+xt+nn, family = binomial, gsout)
m2.pred <- predict(gspreds, m2, type = "response") ## maps pixel-level predictions
m2.area <- m1.pred*6.25 ## calculates and maps cropland area (ha) estimates per pixel
anova(m1, m2) ## anova comparison

## Analysis of Deviance Table
## 
## Model 1: cbind(ccount, 16 - ccount) ~ st
## Model 2: cbind(ccount, 16 - ccount) ~ gm + cp + gl + rf + xt + nn
##   Resid. Df Resid. Dev Df Deviance
## 1     23197     117829            
## 2     23192     115503  5   2326.5

The slightly lower deviance of the m2 model for area predictions indicates that it might be preferable. Note however that the difference is small enough and has not been validated. So, I would still go with the m1 prediction for this particular illustration, based on the parsimony principle.

The next chunk calculates small area estimates for the 30 national districts and 417 sectors located within districts that are reporting cropland area information to e.g., the Ministry of Agriculture (MINAGRI) in Rwanda. Again, the inclusion of the country-wide GeoSurvey cropland occurrence predictions in the respective small area models provide potentially useful information for sub-national planning purposes.

# Small area hierarchical random-effect estimates by 417 sectors in 30 districts
m3 <- glmer(cbind(ccount, 16-ccount) ~ 1 + (1|district/sector), family = binomial, gsout)
m4 <- glmer(cbind(ccount, 16-ccount) ~ st + (1|district/sector), family = binomial, gsout)
summary(m4)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: cbind(ccount, 16 - ccount) ~ st + (1 | district/sector)
##    Data: gsout
## 
##      AIC      BIC   logLik deviance df.resid 
## 144288.1 144320.3 -72140.1 144280.1    23195 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -12.164  -0.671   0.318   1.559  51.507 
## 
## Random effects:
##  Groups          Name        Variance Std.Dev.
##  sector:district (Intercept) 0.15369  0.3920  
##  district        (Intercept) 0.06672  0.2583  
## Number of obs: 23199, groups:  sector:district, 417; district, 30
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.42589    0.07037  -77.11   <2e-16 ***
## st           7.12183    0.05003  142.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##    (Intr)
## st -0.681

Again, the inclusion of the country-wide GeoSurvey cropland occurrence predictions (m4, above) in the respective small area models provide potentially useful information for sub-national planning.

anova(m3, m4) ## anova comparison

## Data: gsout
## Models:
## m3: cbind(ccount, 16 - ccount) ~ 1 + (1 | district/sector)
## m4: cbind(ccount, 16 - ccount) ~ st + (1 | district/sector)
##    npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)    
## m3    3 239192 239216 -119593   239186                        
## m4    4 144288 144320  -72140   144280 94906  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Additional interpretations

Building densities

In Rwanda agricultural activities are generally conducted in close proximity to buildings and rural settlements, though generally not in urban areas (e.g., the city of Kigali). The map below shows the relationship between modeled building densities (bcount) in relation to Rwanda’s main electrical grid infrastructure. This is a good depiction of where people reside in Rwanda. There is an additional R-script available on Github that does the relevant calculations, generates the spatial building density predictions including geotif files, and their uncertainties. Also, again note that this is the main anthropic variable driving cropland expansions and the associated land management interventions and environmental problems.

Figure 4: Predicted building counts (no. buildings ha^-1) relative to the main electrical grid in Rwanda (2020).

Land cover

This map shows the current AfSIS land cover classification for Rwanda (in 8 categories for the year 2020) and their associated area estimates. The map was derived using a combination of the occurrence (presence / absence) of buildings, croplands and dense (> 60%) woody vegetation cover using the ensemble prediction approaches described in the main notebook text. Note that the cropland region of interest area is highlighted by the red rectangles in the map legend table.

Figure 5: AfSIS GeoSurvey land cover classification and area estimates for Rwanda (LCC, 2020).

Takeaways

The main takeaways of this notebook are the following:

This notebook produces precise and accurate (>95%) spatial predictions of the observed occurrence (presence / absence) and the associated small cropland areas in Rwanda based on a completely reproducible, standard ensemble machine learning workflow (e.g., see Rocca, 2019).
The workflow can be flexibly extended to map other land cover variables (e.g., the distribution and density of buildings, dense woody vegetation cover, built-up or soil conservation infrastructure and crop type distributions, among others). It can also be rapidly extended to cover new geographies and ROIs with new observations, geodata and/or MLAs.
The workflow outputs can be used to improve the precision and accuracy of small land cover area estimates, and depending on the availability of properly time-stamped GeoSurveys and their associated remote sensing data, also for cropland area monitoring and change detection workflows, both operationally and at low cost compared to conventional land cover mapping practices.
The predictions in the introductory examples covered by this notebook will be updated as additional ground data for the occurrence of e.g., specific crops, cropping systems and soil / site surveys become available for Rwanda over the coming year via the ongoing RwaSIS project.

Any questions or comments about this notebook are most welcome via AfSIS. You can download the entire markdown document from my Github.

Workflows for predictive land cover mapping and small-area estimation

M.G. Walsh

23, March, 2022

Introduction

Data setup

Data downloads

Data assembly

Grids

Ensemble based learning and mapping with `caret` and `caretEnsemble`

Machine learning models

Spatial trend

Central places

Ensemble MLAs

Model stacking

Stack prediction

Classification

Small area estimates (SAE)

Additional interpretations

Building densities

Land cover

Takeaways

Workflows for predictive land cover mapping and small-area estimation

M.G. Walsh

23, March, 2022

Introduction

Data setup

Data downloads

Data assembly

Grids

Ensemble based learning and mapping with caret and caretEnsemble

Machine learning models

Spatial trend

Central places

Ensemble MLAs

Model stacking

Stack prediction

Classification

Small area estimates (SAE)

Additional interpretations

Building densities

Land cover

Takeaways

Ensemble based learning and mapping with `caret` and `caretEnsemble`