Soil acidity can be thought of as occurring in two forms (or pools): as active or soil solution acidity, measured by pH in a 1:1 solution of dry soil in water, and as reserve acidity. The reserve (or exchangeable) acidity, measured by Hp (primarily meq H+ + Al2+ / 100 g soil), is held on soil colloids and controls the overall level of the solution acidity. It is also commonly many times higher than the solution acidity. Hp levels depend on several properties such as the amount and type of clay (mineralogy), the amount of organic matter, and the CaO, MgO, K2O, Al2O3 and Na2O oxide concentrations of the soil. Thus, two soils could have the same soil pH but completely different lime requirements.
To effectively raise the soil pH of acid soils over the long-term, both active and reserve acidities should be neutralized to a level that can be tolerated by the crop combination/rotation that is being grown. Soil testing labs and extension services typically determine the buffering capacity and the lime requirement by measuring or estimating reserve acidity; see e.g., the excellent introductory tutorial in Moorberg & Crouse (v.2). Though, soil pH is the trigger for determining when lime is needed for mineral soils, it is not used to determine how much lime is needed to increase pH to the recommended range for a specific crop. While Hp measurements are not difficult to do in a wet chemistry lab, they are time-consuming and hard to do well. Some of the associated buffering solutions also produce hazardous waste and need to be handled carefully.
The spectral signatures of soils, and materials generally, are defined by their reflectance or absorbance as a function of wavelength in the electromagnetic spectrum. Under controlled conditions, the signatures result from electronic transitions of atoms and vibrational stretching and bending of structural groups of atoms that form molecules or crystals. Mid-infrared (MIR) spectroscopy has been shown to provide highly repeatable, rapid and low cost measurements of many different soil properties in numerous studies. The amount of light absorbed by a soil sample is measured with minimal sample preparation (drying and fine grinding) across a wide range of wavelengths to provide a unique spectral signature. An individual measurement can be performed in about 30 seconds, in contrast to conventional soil tests, which are typically slow, labor-intensive, expensive and/or use harmful chemicals.
The main motivation for using Hp in this example is that AfSIS is currently involved in several new, large-area projects, which are focusing on acid soil management and cropland lime requirements. The other operational management aspect of this is that acidity problems should probably be diagnosed and remediated first in the areas where they occur, and before attempting to solve for e.g., fertilizer input and/or other soil related agronomic management issues.
This notebook is intended for self-learning. It does not go into the details of the Spectroscopy or the determining physical chemistry itself, but instead it focuses on Chemometric processes and the associated machine learning workflows that are needed to generate useful predictions from a population of spectral signatures (features) relative to their corresponding reference measurements (labels). The calibrations would need to be adjusted to the specific MIR spectrometer instrument that is being used. In this particular example, I shall use topsoil (0-20 cm) and co-located subsoil (20-50 cm) data that were collected as part of the AfSIS project which sampled the major Köppen-Geiger climate zones of Africa excluding deserts, urban and other non-photosynthetically active land areas.
To actually run the notebook, you will need to load the packages indicated in the chunk directly below. This allows you to assemble the wet chemistry and spectral dataframes providing a lot of options to generate spectral predictions of acidity relevant soil properties such as pH, EC, CEC, SOC, Ca:Mg, soil texture, and mineralogy among others. The notebook itself is maintained on my Github, and you can fork and modify it from there as you see fit.
# Package names
packages <- c("osfr", "caret", "caretEnsemble", "MASS", "pls", "glmnet", "randomForest", "xgboost", "Cubist", "quantreg", "leaflet", "htmlwidgets", "plyr", "dplyr", "doParallel")
# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
# Load packages
invisible(lapply(packages, library, character.only = TRUE))
The next chunk downloads the data needed for running this particular example. It assembles georeferenced soil measurements collected across Africa and links these to Bruker Alpha FT-IR spectra.
# Create a data folder in your current working directory
dir.create("soil_acidity", showWarnings = F)
setwd("./soil_acidity")
dir.create("Results", showWarnings = F)
# Data download
osf_retrieve_file("f4dpw") %>% osf_download(conflicts = "skip")
unzip("acidity_data.zip", overwrite = T)
## Warning in unzip("acidity_data.zip", overwrite = T): error 1 in extracting from
## zip file
prof <- read.table("profiles.csv", header=T, sep=",") ## sample locations and depths
cnls <- read.table("cnls.csv", header=T, sep=",") ## wet chemistry data from CNLS, Nairobi
cnls$xCEC <- cnls$m3Ca/200+cnls$m3Mg/120+cnls$m3K/390+cnls$Hp ## calculates extractable xCEC
text <- read.table("text.csv", header=T, sep=",") ## AfSIS LDPSA soil texture
oxid <- read.table("oxid.csv", header=T, sep=",") ## AfSIS XRF metal oxide data
spec <- read.table("alpha_spec.csv", header=T, sep=",") ## AfSIS spectral absorbance data
# Merge the wet chemistry reference dataframes
adata <- merge(prof, cnls, by = "ssid")
adata <- merge(adata, text, by = "ssid")
adata <- merge(adata, oxid, by = "ssid")
# Download figures
osf_retrieve_file("42gry") %>% osf_download(conflicts = "skip")
unzip("figures.zip", overwrite = T)
## Warning in unzip("figures.zip", overwrite = T): error 1 in extracting from zip
## file
The following chunk then writes out the dataframe
AfSIS_reference_data.csv
into your
./soil_acidity/Results
directory if you’d like to process
those outputs in software other than R. It also generates a location map
of where those soil samples were obtained.
# Write out reference data frame
write.csv(adata, "./soil_acidity/Results/AfSIS_reference_data.csv", row.names = F)
# Soil sample locations
w <- leaflet() %>%
setView(lng = mean(adata$lon), lat = mean(adata$lat), zoom = 4) %>%
addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
addCircleMarkers(adata$lon, adata$lat, clusterOptions = markerClusterOptions())
w ## plot widget
One potentially problematic factor for many MLAs is the “Curse of dimensionality” that is imposed by the high dimensionality of the spectral data: w = 2,541 individual wavebands in this particular case. Moreover, many of those wavebands are strongly correlated, particularly those proximal to one another on the spectrum. These types of data pose generalization and prediction challenges for MLAs that involve Bagging, Boosting, Bayesian and Deep learning and can lead to overfitting problems.
Other algorithms such as Partial least squares regression (PLS), Ridge regression and/or Lasso regression handle the high dimensional data and collinearities well in a linear context but are frequently not very good at predicting the common non-linear and/or threshold relationships between the spectral signatures and their reference measurements.
There are some ways of mitigating these trade-offs. The first is to reduce the dimensionality and collinearity of the spectral data. There are a number of pre-processing techniques that can be applied in that context including: Principal components analysis (PCA), Independent components analysis (ICA) as well as other signal processing techniques such as, Non-negative matrix factorization (NMF). I use PCA here, and the next chunk appends the spectra and their resulting spectral component scores to the wet chemistry reference data.
# Spectral principal components
spec.pca <- prcomp(spec[,2:2542], center=T, scale=T)
pcas <- predict(spec.pca, spec)
pcas <- pcas[,1:20] ## save the first 20 components, which explain ~95% of the total spectral variance
fname <- paste("./soil_acidity/Results/", "spec_pcas.rds", sep = "")
saveRDS(spec.pca, fname) ## saves spectral PCA model
# Merge & write files
spec <- cbind(spec, pcas)
adata <- merge(adata, spec, by = "ssid")
write.csv(adata, "./soil_acidity/Results/AfSIS_acidity_data.csv", row.names=F)
The other main option is of course to run both types of algorithms (the data regularization or selection-based and the data reduction-based), and then look at if they can be usefully combined. This is the approach that I will take here.
caret
and
caretEnsemble
The following chunks calibrate soil Hp values using different machine
learning algorithms (MLAs) to XRF and MIR spectral (feature) inputs
using the caretEnsemble
package. This general approach has won a lot of data science
competitions e.g., at Kaggle. You
may want to take a look there. They have some fantastic data science
resources, courses and challenges openly available. You can also check
out the AfSIS sponsored Africa
Soil Property Prediction Challenge (2014), from which the notebook
that is presented here was developed.
The main idea is to train a number of potentially competing models with k-fold cross-validation. At the end of the model training processes, the various models are ensembled (combined/stacked) on an independent validation dataset. When applied consistently over time and space, this is a form of Reinforcement learning, which should produce increasingly accurate predictions as new field, lab data and MLAs are obtained and run.
I’ll be using a machine learning approach, which is algorithmic, rather than data modeling based. You might want to take a look at Breiman (2001) to gauge the differences between the “two cultures” he describes. The figure above shows the basic workflow of the algorithmic approach that I apply here to predicting reserve acidity (Hp). The following chunk initially scrubs some of the extraneous objects in memory, sets-up labels and features, and creates a randomized partition between the training and validation dataframes.
rm(list=setdiff(ls(), c("adata"))) ## scrubs extraneous objects in memory
# Set randomization seed
seed <- 1235813
set.seed(seed)
# Randomize dataframe
adata <- adata[sample(1:nrow(adata)), ]
# Split data into calibration and validation sets
gsIndex <- createDataPartition(adata$Hp, p = 8/10, list=F, times = 1)
cal <- adata[ gsIndex,]
val <- adata[-gsIndex,]
# Set calibration labels
labs <- c("Hp") ## insert other labels ("pH", "EC", "eCEC", ...) here!
lcal <- as.vector(t(cal[labs]))
# Calibration features
fcal <- select(cal,26:2566,9) ## full Alpha FT-IR spectral signatures + pH
pcal <- select(cal,2567:2586,9) ## spectral principal components + pH
Note that I also include soil pH (in water) in the calibration features because it is routinely, quickly and cheaply measured in most soil spectral labs (and frequently even in the field). It is also inversely (though loosely) related to Hp. The general notion is that soils with a pH > 7 are unlikely to have any measurable reserve acidity (see the figure below).
par(pty="s", mar=c(4,4,1,1))
plot(Hp~pH, xlab="pH (water)", ylab="Hp (meq / 100 g soil)", cex.lab=1.3,
xlim=c(3,10), ylim=c(0,5), adata)
caretEnsemble
has 3 primary functions: caretList
,
caretEnsemble
and caretStack
.
caretList
is used to build lists of caret models on the
same training data, with the same model resampling parameters.
caretEnsemble
and caretStack
are used to
create ensemble models from such lists of individual caret
models. caretEnsemble
uses a generalized linear model glm to
create a simple weighted combination of the constituent models and
caretStack
uses a specific caret model to combine the
outputs from a choice of other component caret
models. The
main advantage of running caretEnsemble
is that the
intended individual caret
models can
be run all-at-once, rather than running each model separately …
abbreviating and hopefully clarifying the calibration code. All model
fitting processes can be (are) parallelized to facilitate efficient use
of either local or cloud-based computing resources. Note that there are
also other options available for this e.g., the snowfall
package, among others.
The next chunk fits 3 initial models that use the full spectral
signatures i.e., the w = 2,541 individual wavebands. These are
reasonably common linear Chemometric
algorithms and you can learn more about how they work in R from the
following links: pls
,
glmnet
and xgbLinear
.
I fit these with 10-fold cross-validation and default-tuning of the
relevant hyperparameters.
Also, note that this chunk can take up to 1 hour to run on a normal
computer with 8 cores and 16 Gb of RAM. There are just a lot of spectral
features to sort through.
# Start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# Specify model training controls
set.seed(seed)
tc <- trainControl(method = "cv", number = 10, allowParallel = TRUE, savePredictions="final")
# Fit the 3 initial calibration models
llist <- caretList(fcal, lcal,
trControl = tc,
tuneList = NULL,
methodList = c("pls", "glmnet", "xgbLinear"),
preProcess = c("center","scale"),
metric = "RMSE")
stopCluster(mc)
The next chunk fits 3 additional, tree/forest based MLAs that use the that use the spectral principal component scores (pc = 20) + pH. You can learn more about how they work in R from the following links at: randomForest, xgboost and Cubist. This actually runs quite quickly on a normal computer.
# Start doParallel to parallelize model fitting
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# Specify model training controls
tc <- trainControl(method = "cv", number = 10, allowParallel = TRUE, savePredictions="final")
# Fit 3 calibration models using the spectral principal component features
tlist <- caretList(pcal, lcal,
trControl = tc,
tuneList = NULL,
methodList = c("rf", "xgbTree", "cubist"),
preProcess = c("center","scale"),
metric = "RMSE")
stopCluster(mc)
caret
The main point here is not to evaluate a best individual model but rather to evaluate the combination of the previously fitted models against a 20% hold-out validation dataset. This provides robust statistical estimates of how the different models should be weighted against one-another. This chunk initially generates the individual model predictions for the validation dataframe.
# Validation features
fval <- select(val,26:2566,9) ## full Alpha FT-IR spectral signatures + pH
pval <- select(val,2567:2586,9) ## spectral principal components + pH
val$pls <- predict(llist$pls, fval)
val$glm <- predict(llist$glmnet, fval)
val$xbl <- predict(llist$xgbLinear, fval)
val$rfo <- predict(tlist$rf, pval)
val$xbt <- predict(tlist$xgbTree, pval)
val$cub <- predict(tlist$cubist, pval)
This next chunk fits the model ensemble with the
glmStepAIC
function from the MASS
library
using the validation dataframe. You could explore other
options here, but I think that this provides a reasonable combination
and weighting of the 6 models that were produced in the ensemble
training steps.
lval <- as.vector(t(val[labs]))
fval <- select(val,pls,glm,xbl,rfo,xbt,cub) ## subset validation models
# Start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# Model setup and fitting
set.seed(seed)
tc <- trainControl(method="repeatedcv", number=10, repeats=3, allowParallel=T)
st <- train(fval, lval,
method = "glmStepAIC",
trControl = tc,
metric = "RMSE")
val$st <- predict(st, val) ## stacked predictions
stopCluster(mc)
fname <- paste("./soil_acidity/Results/", labs, "_st.rds", sep = "")
saveRDS(st, fname)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.79034 -0.11686 -0.01604 0.08395 1.78569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01522 0.02321 -0.656 0.51235
## xbl -0.20406 0.07744 -2.635 0.00877 **
## rfo 0.28159 0.13339 2.111 0.03546 *
## xbt 0.18675 0.07468 2.501 0.01283 *
## cub 0.76458 0.08194 9.331 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.06610227)
##
## Null deviance: 90.944 on 368 degrees of freedom
## Residual deviance: 24.061 on 364 degrees of freedom
## AIC: 51.735
##
## Number of Fisher Scoring iterations: 2
As can be seen from summary output panel above, the
cubist
predictions outweigh the influence of all of the
other models in the st
ensemble predictions on the
validation set, with much smaller contributions from the
glmnet
and xgbLinear
models. This raises the
question if a simpler, more parsimonious model, which only uses
cubist
with the spectral principal components plus pH would
be a potential alternative to provide operational diagnostics for Hp.
This would certainly reduce computing times and would also lend itself
to automation. I explore this in the next section.
There are many ways to quantify the uncertainty inherent in these predictions. I take a simple but quite robust approach here using quantile regression with quantreg. I am mainly interested in the overall spread of the spectral ensemble predictions; that is, the 90% probable intervals for the entire dataset.
# All features
fall <- select(adata,26:2566,9) ## full Alpha FT-IR spectral signatures + pH
pall <- select(adata,2567:2586,9) ## spectral principal components + pH
adata$pls <- predict(llist$pls, fall)
adata$glm <- predict(llist$glmnet, fall)
adata$xbl <- predict(llist$xgbLinear, fall)
adata$rfo <- predict(tlist$rf, pall)
adata$xbt <- predict(tlist$xgbTree, pall)
adata$cub <- predict(tlist$cubist, pall)
adata$st <- predict(st, adata)
# Write out predictions
preds <- select(adata,1,11,2587:2593) ## change manually here if you are analyzing other labels (pH, xCEC etc)
write.csv(preds, "./soil_acidity/Results/Hp_preds_final.csv", row.names=F)
Run and plot the quantile regression estimates of Hp against the spectral predictions.
# Quantile regression fits
stQ <- rq(Hp~st, tau=c(0.05,0.5,0.95), data = adata) ## full stacked model
cuQ <- rq(Hp~cub, tau=c(0.05,0.5,0.95), data = adata) ## cubist model only
par(pty="s", mar=c(4,4,1,1))
plot(Hp~st, xlab="Stacked model Hp prediction", ylab="Measured Hp", cex.lab=1.3,
xlim=c(0,5), ylim=c(0,5), adata)
curve(stQ$coefficients[2]*x+stQ$coefficients[1], add=T, from=0, to=5, col="blue", lwd=2)
curve(stQ$coefficients[4]*x+stQ$coefficients[3], add=T, from=0, to=5, col="red", lwd=2)
curve(stQ$coefficients[6]*x+stQ$coefficients[5], add=T, from=0, to=5, col="blue", lwd=2)
abline(c(0,1), col="grey", lwd=1)
plot(Hp~cub, xlab="Cubist Hp prediction", ylab="Measured Hp", cex.lab=1.3,
xlim=c(0,5), ylim=c(0,5), adata)
curve(cuQ$coefficients[2]*x+cuQ$coefficients[1], add=T, from=0, to=5, col="blue", lwd=2)
curve(cuQ$coefficients[4]*x+cuQ$coefficients[3], add=T, from=0, to=5, col="red", lwd=2)
curve(cuQ$coefficients[6]*x+cuQ$coefficients[5], add=T, from=0, to=5, col="blue", lwd=2)
abline(c(0,1), col="grey", lwd=1)
Just by visual inspection the 2 plots are quite similar, and there is
actually a near perfect correlation between them. Note however, that
this is a somewhat unusual situation that would not be the case for most
other spectral soil property predictions. The general rule is that a
properly calibrated and validated ensemble model will not perform any
worse than the best constituent model in the stack … but sometimes much
better. This statement also describes the example used in the this
notebook with the st
ensemble having a slightly lower (but
negligible) validation RMSE than the cubist
model. The
cubist
model also has slightly wider 90% probable intervals
than the stacked model. If you have run all of the R chunks in this
notebook you can always check those contentions. Then, if all went well,
the prediction file to check should be at
./soil_acidity/Results/Hp_preds.csv
in your working
directory. You could also try to stack other MLAs on these data.
caretEnsemble
and caret
offer >180 of your
favorite MLA alternatives to the ones that are presented here … if you
are into this kind of thing, please explore.
Lime application is recommended by soil test if soil pH is below the optimum range for any of the crops in the cropping system or crop rotation. The minimum pH of the soil is determined by the crop type or crop species with the highest soil pH requirement. This is typically calculated for a rolling 3-year period. To adjust soil pH to a target pH value for a specific cropping system, one must know both the current soil pH and its Hp.
Most soil testing laboratories and extension services make lime recommendations based on calibrations of pH measured before and after the addition of a pH buffer solution (e.g., CaOH2 or the SMP buffer solution, which is commonly used in the US … among others). Soil-test-based lime recommendations (LR, in kg ha-1 CaCO3 equivalents, see our AfSIS schema for this in the figure below) are subsequently adjusted to account for incorporation depth, liming material CCE, lime type (e.g. calcitic vs dolomitic), lime moisture content, parent material / mineralogy and time since previous lime application. The associated calculations are pretty straightforward (see e.g., the excellent guidance provided by PSU extension about this). The AfSIS approach to purely soil-test-based recommendations is also pretty straightforward (and simplistic) currently … see the schema for liming topsoils to pH ~ 6.5 directly below.
While this is just another rule of thumb liming recommendation schema that farmers could take or leave on the basis of soil tests, there are of course both much more complex and more dynamic socio-economic and environmental interactions at play e.g., between the effects of applications of lime, nitrogen fertilizers, other nutrient remediation measures, existing cropping systems, farmer investment risk aversion, water-use and greenhouse gas emissions that should be considered. Solving for those additionalities will require systematic and spatially/temporally representative surveys and field trials to monitor any potential socio-economic constraints and also any potentially associated environmental side effects. Systematic agronomic field trials should also be used to (in)validate and challenge any of the inherent assumptions that are made in using Hp for making soil-test-based liming recommendations to farmers (sensu the Precautionary principle). However, that is also true for any other purely soil-test-based procedures that might be used for issuing such recommendations.
The main takeaways from this notebook are the following:
This notebook produces precise & accurate MIR spectral predictions of the reserve acidity of soils (which are commonly used for providing soil-test-based liming recommendations) based on a completely reproducible, standard ensemble machine learning workflow (e.g., see Rocca, 2019).
The workflow can be changed quickly to model and predict other soil acidity related variables (e.g., pH, EC, CEC, SOC, Ca:Mg and mineralogically determined soil properties, among others). It can also be rapidly extended to support new geographical regions of interest and/or to their respective soil testing laboratories, operationally, and at low cost.
The workflow’s predictions could be used reliably / operationally to improve the precision, reduce the costs of and the environmental footprints of soil acidity assessments in survey, mapping, experimental and monitoring applications that focus on soil acidity and nutrient management of croplands.
At the other end of the pH scale, the identical workflow can be followed to diagnose e.g., salinity and sodicity levels, soil aggregate instability and certain micro-nutrient deficiencies of cropland soils, which are also prevalent and perhaps greater than the soil acidity related problems in many parts of Africa. I’ll provide additional notebooks around these issues in the near future.
The main note of caution is that soil-test-based liming recommendations (among other agronomic recommendations) should always be contextualized to specific crop production environments. Simple soil tests are a really good start but are basically not enough to provide useful, reliable and complete evidence-based extension advice to farmers. Any such recommendations should also require replicable code and updatable data to test the underlying heuristics.
Computationally, the main predictive chunks presented in this notebook run fairly fast (+/-) and probably could be automated, such that when a (soil) MIR measurement is received in a laboratory, a reasonable Hp prediction and the associated soil-test-based liming recommendations could be issued in near-real time. Note that for other spectrometer setups the calibration / validation transfer steps would need to be taken into account. This is also a good reason to properly curate any physical soil reference samples.
Any questions or comments about this notebook are most welcome via AFSIS.