Introduction
This notebook provides practical guidelines for predictive soil mapping (PSM) using machine learning (ML). The specifics can, and should of course be, modified to fit the purpose of the specific predictive soil mapping tasks of interest. Our main intent here is to provide a reproducible, generalized mapping framework and some of the associated ML computing workflows in R. The actual data and workflows should also be readily transferable to other computing environments if needed. The notebook itself is maintained on Github, and you can fork and modify it from there as you see fit.
The first section of the notebook sets up georeferenced soil survey, remote sensing and GIS data of Rwanda for spatial analyses and predictions. We focus on Rwanda’s croplands, which are the primary Region of Interest (ROI) in this context and the target for various land management interventions of the RwaSIS project. Based on a recent high-resolution GeoSurvey (2019), croplands are currently estimated to occupy ~68% of Rwanda’s overall land area (of ~2.37 Mha). The gridded data offer several advantages including: Africa-wide coverage and consistency, and a robust analysis process that ensures high quality predictions for use in your soil mapping applications.
The subsequent sections illustrate prediction workflows, which use ensembles of machine learning algorithms (MLAs). See the various review articles at: Emsemble models, with various remote sensing and GIS training input and validation layers. There are also some really good free-and-open books (e.g. Lovelace et al, 2021 and Hengl & MacMillan, 2019) available if you’d like to get into the subject matter at greater depth.
While the Legacy soil data used in this example were not collected on a reproducible sampling frame, they may provide an initial indication of where specific soil problems or nutrient deficiencies/imbalances are prevalent in Rwanda. As new data become available over the course of the project, RwaSIS and its partners should update the relevant data sources and soil mapping workflows.
In the last section of the notebook, we produce initial (robust) uncertainty estimates of the soil property predictions that are generated by our ensemble model workflow via Quantile Regression. These estimates show where in Rwanda the current models fit the ground measurements reasonably well and where they do not. As new data are generated by RwaSIS (and others) these uncertainties are likely to decrease, potentially increasing the level of confidence in any resulting cropland management recommendations (e.g. for fertilizer, lime amendments and/or soil erosion/deposition predictions).
General data setup
To run this notebook, you will need to load the packages indicated in the chunk directly below. This allows you to assemble the different example dataframes and rasters providing a lot of options to generate spatial predictions via machine learning and/or geostatistical algorithms. We encourage you to explore the many options in this context!
# Package names
packages <- c("osfr", "rgdal", "sp", "raster", "quantreg", "leaflet", "DT", "devtools",
"caret", "caretEnsemble", "mgcv", "MASS", "randomForest", "xgboost", "nnet", "Cubist",
"plyr", "doParallel")
# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
# Load packages
invisible(lapply(packages, library, character.only = TRUE))
Data downloads
The following chunk then downloads the data that we have assembled, which are needed for running this particular example. Specifically, it assembles georeferenced field observations of soil measurements, e.g. (pH, soil carbon and soil nutrient composition) and links these to remote sensing and GIS images represented by the grids
feature stack. We focus on predicting the topsoil pH measurements here, but the approach is equally applicable to any given Georeferenced soil property or measurement under consideration.
# Set working directory
dir.create("PSM", showWarnings = F)
setwd("./PSM")
dir.create("Results", showWarnings = F)
# Download soil data
osf_retrieve_file("djcfz") %>% osf_download(conflicts = "overwrite")
unzip("RW_wetchem.zip", overwrite = T)
prof <- read.table("Profiles.csv", header = T, sep = ",")
samp <- read.table("Samples.csv", header = T, sep = ",")
geos <- merge(prof, samp, by="pid")
# Download and assemble raster stacks (https://osf.io/h24wd/)
osf_retrieve_file("h24wd") %>% osf_download(conflicts = "overwrite")
unzip("RW_250m_2020.zip", overwrite = T)
glist <- list.files(pattern="tif", full.names = T)
grids <- stack(glist)
# Download figures
osf_retrieve_file("42gry") %>% osf_download(conflicts = "overwrite")
unzip("figures.zip", overwrite = T)
The processed Rwanda remote sensing and GIS data (features, … grids
in the chunk above) were derived from and transformed from their primary open sources. You can also download the entire grids
raster stack directly at RwaSIS grids. The short descriptions of the included features, and their sources are provided in the table immediately below.
These Rwanda-wide (actually Africa-wide) features will change over time and we will update them if and when needed. Also note that these are grouped by factor variables that designate cropland soil condition (x) as a function of f(x) ~ (a,c,o,r,s) where:
- a - anthropic variables
- c - climatic variables
- o - organismal (primarily vegetative/land cover related)
- r - relief and erosion/deposition related terrain variables
- s - parent material and soil related variables
Data assembly
The next chunk then sets up the soil and site properties in a dataframe that will generate the the training and validation subsets for the different MLAs we shall apply to the spatial prediction of soil properties. The soil property data presented here are courtesy of CROPNUTS.
# Project legacy data coords to grid CRS
geos.proj <- as.data.frame(project(cbind(geos$lon, geos$lat), "+proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs"))
colnames(geos.proj) <- c("x","y")
geos <- cbind(geos, geos.proj)
coordinates(geos) <- ~x+y
projection(geos) <- projection(grids)
# Extract gridded variables at survey locations
geosgrid <- extract(grids, geos)
gsdat <- as.data.frame(cbind(geos, geosgrid))
# Randomize dataframe
set.seed(1235813)
gsdat <- gsdat[sample(1:nrow(gsdat)), ]
The following chunk then writes out the randomized dataframe RW_soil_data.csv
into your ./PSM/Results
directory, if you’d like to process those outputs in software other than R. It also generates a location map of where those soil samples were obtained.
# Write data frame
write.csv(gsdat, "./PSM/Results/RW_soil_data.csv", row.names = F)
# Soil sample locations
w <- leaflet() %>%
setView(lng = mean(gsdat$lon), lat = mean(gsdat$lat), zoom = 8) %>%
addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
addCircleMarkers(gsdat$lon, gsdat$lat, clusterOptions = markerClusterOptions())
w ## plot widget
Machine-learning-based predictive mapping with caret
and caretEnsemble
The following chunks predict topsoil soil pH values using contrasting machine learning algorithms (MLAs) with varying remote sensing and GIS (feature) inputs using the caret
and caretEnsemble
packages. The generalized MLA prediction workflow is shown in the figure below. This approach has won a lot of data science competitions e.g., at Kaggle. You may want to take a look there. They have some fantastic data science resources, courses and challenges openly available.
The main idea is to train a number of potentially contrasting models with k-fold cross-validation. At the end of the model training processes, the various models are ensembled (combined/stacked) on an independent validation dataset. When applied over time and space, this is a form of Reinforcement learning, which should produce increasingly accurate predictions as new field, lab data and MLAs are obtained and run.
The following chunk scrubs some of the objects in memory, removes incomplete cases (if there are any), sets-up labels and features, and creates a randomized partition between the training and validation dataframes. Everything is parallelized to facilitate efficient use of either local or cloud-based computing resources. Note that there are other options available for this (e.g. snowfall, among others.
rm(list=setdiff(ls(), c("gsdat","grids","glist"))) ## scrub extraneous objects in memory
gsdat <- gsdat[complete.cases(gsdat[ ,c(24:75)]),] ## removes incomplete cases
# set calibration/validation set randomization seed
seed <- 12358
set.seed(seed)
# split data into calibration and validation sets
gsIndex <- createDataPartition(gsdat$pH, p = 4/5, list = F, times = 1)
gs_cal <- gsdat[ gsIndex,]
gs_val <- gsdat[-gsIndex,]
# Soil calibration labels
labs <- c("pH") ## insert other labels (e.g. "C","N","P","K", "eCEC", ...) here!
lcal <- as.vector(t(gs_cal[labs]))
# raster calibration features
fcal <- gs_cal[c(24:43,47:72)]
Note that we are using topsoil (0-20 cm) pH as an example here. You can substitute other soil properties like (SOC, P, K, etc.) as labels, in the labs
variable. You may want to transform those labs
variables prior to model fitting, and perhaps also to tune the respective models to better represent the distributional and/or compositional attributes of the individual soil properties, should the need arise. Anyhow, those additional variables are also included in the RW_soil_data.csv
data file for you to try. Click on the blue tabs below to see how the individual models were trained with k-fold validation and subsequently stacked using the validation (unseen) data.
Spatial trend
This is a simple spatially smoothed generalized additive model applying the gam
function on the pH values at different sampling locations in Rwanda, based only on their georeference. It is similar to ordinary kriging with cross-validation but it is simpler and much faster to compute in this context.
# select locational covariates
gf_cpv <- gs_cal[,44:46]
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(seed)
tc <- trainControl(method = "cv", allowParallel = T)
# model training
gm <- train(gf_cpv, lcal,
method = "gam",
preProc = c("center","scale"),
metric = "RMSE",
trControl = tc)
gm.pred <- predict(grids, gm) ## spatial predictions
stopCluster(mc)
fname <- paste("./PSM/Results/", labs, "_gm.rds", sep = "")
saveRDS(gm, fname)
Central places
Central places are influential variables for places where human impacts occur. They are correlated with both extraction and deposition of soil nutrients and toxic elements, soil erosion and deposition, acidification and many other soil disturbance processes. The model below focuses on central place indicators such as distances to roads and settlements, surface water sources, cell towers, night lights, and electricity networks among other anthropic factors.
# select central place covariates
gf_cpv <- gs_cal[c(30:43,62)]
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(seed)
tc <- trainControl(method = "cv", allowParallel = T)
# model training
cp <- train(gf_cpv, lcal,
method = "glmStepAIC",
preProc = c("center","scale"),
trControl = tc,
metric = "RMSE")
cp.pred <- predict(grids, cp) ## spatial predictions
stopCluster(mc)
fname <- paste("./PSM/Results/", labs, "_cp.rds", sep = "")
saveRDS(cp, fname)
Ensemble MLAs
This chunk fits 4 additional models, which use of gridded calibration data with 10-fold cross-validation. You can learn more about how these algorithms work by following links at: MASS, randomForest, xgboost and Cubist.
You can use caretEnsemble
instead of caret
as long as the feature variables (grids
in this case), and the trainControl
methods are the same for each model in the caretList
function. This shortens the script-length of this notebook but does not otherwise affect the overall caret
functionality. Note however that the calculations take a bit of time to run on a normal 8-core, 16 Gb memory computer. This is not a big problem for a relatively small ROI like Rwanda, but it might be computationally challenging for larger countries like the DRC, Tanzania or Ethiopia. I fit these models with 10-fold cross-validation and default-tuning of the relevant hyperparameters.
# Start doParallel to parallelize model fitting
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# Specify model training controls
tc <- trainControl(method = "cv", number = 10, allowParallel = TRUE, savePredictions="final")
# Fit 4 additional calibration models using all of the gridded features
clist <- caretList(fcal, lcal,
trControl = tc,
tuneList = NULL,
methodList = c("glmStepAIC", "rf", "xgbTree", "cubist"),
preProcess = c("center","scale"),
metric = "RMSE")
gl.pred <- predict(grids, clist$glmStepAIC)
rf.pred <- predict(grids, clist$rf)
xt.pred <- predict(grids, clist$xgbTree)
cu.pred <- predict(grids, clist$cubist)
stopCluster(mc)
fname <- paste("./PSM/Results/", labs, "_clist.rds", sep = "")
saveRDS(clist, fname)
Stacked predictions
The main point here is not to evaluate a best individual model but rather to evaluate the combination of the 6 previously fitted models against a 20% hold-out validation dataset. This provides robust statistical estimates of how the different models should be weighted against one-another.
# Stacking setup
preds <- stack(gm.pred, cp.pred, gl.pred, rf.pred, xt.pred, cu.pred)
names(preds) <- c("gm","cp","gl","rf","xt","cu")
# Extract model predictions
coordinates(gs_val) <- ~x+y
projection(gs_val) <- projection(preds)
gspred <- extract(preds, gs_val)
gspred <- as.data.frame(cbind(gs_val, gspred))
# Stacking model validation labels and features
gs_val <- as.data.frame(gs_val)
lval <- as.vector(t(gs_val[labs]))
fval <- gspred[,74:79] ## subset validation features
This next chunk fits the stacked model with the glmStepAIC
function from the MASS
library using the validation dataframe. You could explore other options here, but I find that this provides a reasonable combination and weighting of the 6 models that were produced in the model training steps presented above.
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# model setup
set.seed(seed)
tc <- trainControl(method="repeatedcv", number=10, repeats=3, allowParallel=T)
st <- train(fval, lval,
method = "glmStepAIC",
trControl = tc,
metric = "RMSE")
st.pred <- predict(preds, st) ## spatial predictions
stopCluster(mc)
fname <- paste("./PSM/Results/", labs, "_st.rds", sep = "")
saveRDS(st, fname)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.61684 -0.38493 -0.05492 0.40116 2.53372
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3685 0.3593 -1.026 0.30569
## gl 0.2599 0.1053 2.467 0.01408 *
## rf 0.4942 0.1509 3.275 0.00115 **
## xt 0.3143 0.1431 2.197 0.02865 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.3817394)
##
## Null deviance: 269.16 on 378 degrees of freedom
## Residual deviance: 143.15 on 375 degrees of freedom
## AIC: 716.55
##
## Number of Fisher Scoring iterations: 2
Topsoil pH generally takes on values between 3 - 9 in Rwanda, and there are extensive areas of very acid soils in the country. The chunk below generates the prediction map of pH across the RwaSIS cropland ROI.
# Download cropland mask image (https://osf.io/b642y/)
# setwd("./PSM")
osf_retrieve_file("b642y") %>% osf_download(conflicts = "overwrite")
# Apply cropland mask
cpmask <- raster("RW_CP_mask.tif")
stm.pred <- mask(st.pred, cpmask, inverse=FALSE, maskvalue=NA, updatevalue=NA)
# Project stm.pred to EPSG:3857
stll <- projectRaster(stm.pred, crs="+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
# Set color pallette
pal <- colorBin("Reds", domain = 3:9, reverse = TRUE, na.color = "transparent")
# Render map
w <- leaflet() %>%
setView(lng = mean(gsdat$lon), lat = mean(gsdat$lat), zoom = 8) %>%
addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
addRasterImage(stll, colors = pal, opacity = 0.5) %>%
addLegend(pal = pal, values = values(stll), title = "Topsoil pH")
w ## plot widget
Geotif outputs
This chunk writes out the 7 prediction rasters that you can import as geotif files into a GIS system of your choice for visualizations, reports and/or further processing.
# Write prediction grids
gspreds <- stack(preds, st.pred)
names(gspreds) <- c("gm","cp","gl","rf","xt","cu","st")
fname <- paste("./PSM/Results/","RW_", labs, "_preds_2020.tif", sep = "")
writeRaster(gspreds, filename=fname, datatype="FLT4S", options="INTERLEAVE=BAND", overwrite=T)
Stacked prediction uncertainty estimates
There are many ways to quantify the uncertainty inherent in these predictions. We take a simple but quite robust approach here using quantile regression with (quantreg). We are mainly interested in the initial spread of the ROI-wide predictions (sensu, their 95% probable intervals). Once new data are collected, we shall be revising these and will also point out some additional (more spatially explicit) techniques.
coordinates(gsdat) <- ~x+y
projection(gsdat) <- projection(grids)
gspre <- extract(gspreds, gsdat)
gsout <- as.data.frame(cbind(gsdat, gspre))
stQ <- rq(pH~st, tau=c(0.05,0.5,0.95), data=gsout) ## quantile regression fit
par(pty="s", mar=c(4,4,1,1))
plot(pH~st, xlab="Ensemble pH prediction", ylab="Measured pH (water)", cex.lab=1.3,
xlim=c(3,9), ylim=c(3,9), gsout)
curve(stQ$coefficients[2]*x+stQ$coefficients[1], add=T, from=3, to=9, col="blue", lwd=2)
curve(stQ$coefficients[4]*x+stQ$coefficients[3], add=T, from=3, to=9, col="red", lwd=2)
curve(stQ$coefficients[6]*x+stQ$coefficients[5], add=T, from=3, to=9, col="blue", lwd=2)
abline(c(0,1), col="black")
## Call:
## rq(formula = pH ~ st, tau = c(0.05, 0.5, 0.95), data = gsout)
##
## Coefficients:
## tau= 0.05 tau= 0.50 tau= 0.95
## (Intercept) -1.207954 -0.8359303 -0.1689385
## st 1.107665 1.1407059 1.1394833
##
## Degrees of freedom: 1907 total; 1905 residual
Main takeaways
The machine learning workflow presented in this notebook produces precise and accurate predictions of topsoil pH values of Rwanda based on georeferenced legacy data. It can be flexibly updated as new data e.g., via the RwaSIS project become available over the coming year.
The workflow can be easily changed to model and predict the spatial distributions other important soil properties e.g., SOC, eCEC, P, K, Ca, Mg and S concentrations, among others. It can also be rapidly extended to support new geographical regions of interest and their respective soil mapping institutions, operationally and at low cost.
As a final thought about this notebook: should you come up with a better set of data-sources, analysis suggestions, approaches, algorithms or predictions around these data, we would certainly like to hear about them! Any questions or comments are most welcome via AFSIS.