Introduction

Soil acidity can be thought of as occurring in two forms (or pools): as active or soil solution acidity, measured by pH in a 1:1 solution of dry soil in water, and as reserve acidity. The reserve (or exchangeable) acidity, measured by Hp (primarily meq H+ + Al2+ / 100 g soil), is held on soil colloids and controls the overall level of the solution acidity. It is also commonly many times higher than the solution acidity. Hp levels depend on several properties such as the amount and type of clay (mineralogy), the amount of organic matter, and the CaO, MgO, K2O, Al2O3 and Na2O oxide concentrations of the soil. Thus, two soils could have the same soil pH but completely different lime requirements.

To effectively raise the soil pH of acid soils over the long-term, both active and reserve acidities should be neutralized to a level that can be tolerated by the crop combination/rotation that is being grown. Soil testing labs and extension services typically determine the buffering capacity and the lime requirement by measuring or estimating reserve acidity; see e.g., the excellent introductory tutorial in Moorberg & Crouse (v.2). Though, soil pH is the trigger for determining when lime is needed for mineral soils, it is not used to determine how much lime is needed to increase pH to the recommended range for a specific crop. While Hp measurements are not difficult to do in a wet chemistry lab, they are time-consuming and hard to do well. Some of the associated buffering solutions also produce hazardous waste and need to be handled carefully.

The spectral signatures of soils, and materials generally, are defined by their reflectance or absorbance as a function of wavelength in the electromagnetic spectrum. Under controlled conditions, the signatures result from electronic transitions of atoms and vibrational stretching and bending of structural groups of atoms that form molecules or crystals. Mid-infrared (MIR) spectroscopy has been shown to provide highly repeatable, rapid and low cost measurements of many different soil properties in numerous studies. The amount of light absorbed by a soil sample is measured with minimal sample preparation (drying and fine grinding) across a wide range of wavelengths to provide a unique spectral signature. An individual measurement can be performed in about 30 seconds, in contrast to conventional soil tests, which are typically slow, labor-intensive, expensive and/or use harmful chemicals.

The main motivation for using Hp in this example is that AfSIS is currently involved in several new, large-area projects, which are focusing on acid soil management and cropland lime requirements. The other operational management aspect of this is that acidity problems should probably be diagnosed and remediated first in the areas where they occur, and before attempting to solve for e.g., fertilizer input and/or other soil related agronomic management issues.

This notebook is intended for self-learning. It does not go into the details of the Spectroscopy or the determining physical chemistry itself, but instead it focuses on Chemometric processes and the associated machine learning workflows that are needed to generate useful predictions from a population of spectral signatures (features) relative to their corresponding reference measurements (labels). The calibrations would need to be adjusted to the specific MIR spectrometer instrument that is being used. In this particular example, I shall use topsoil (0-20 cm) and co-located subsoil (20-50 cm) data that were collected as part of the AfSIS project which sampled the major Köppen-Geiger climate zones of Africa excluding deserts, urban and other non-photosynthetically active land areas.

General data setup

To actually run the notebook, you will need to load the packages indicated in the chunk directly below. This allows you to assemble the wet chemistry and spectral dataframes providing a lot of options to generate spectral predictions of acidity relevant soil properties such as pH, EC, CEC, SOC, Ca:Mg, soil texture, and mineralogy among others. The notebook itself is maintained on my Github, and you can fork and modify it from there as you see fit.

# Package names
packages <- c("osfr", "caret", "caretEnsemble", "MASS", "pls", "glmnet", "randomForest", "xgboost", "Cubist", "quantreg", "leaflet", "htmlwidgets", "plyr", "dplyr", "doParallel")

# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

The next chunk downloads the data needed for running this particular example. It assembles georeferenced soil measurements collected across Africa and links these to Bruker Alpha FT-IR spectra.

# Create a data folder in your current working directory
dir.create("soil_acidity", showWarnings = F)
setwd("./soil_acidity")
dir.create("Results", showWarnings = F)

# Data download
osf_retrieve_file("f4dpw") %>% osf_download(conflicts = "skip")
unzip("acidity_data.zip", overwrite = T)
## Warning in unzip("acidity_data.zip", overwrite = T): error 1 in extracting from
## zip file
prof <- read.table("profiles.csv", header=T, sep=",") ## sample locations and depths
cnls <- read.table("cnls.csv", header=T, sep=",") ## wet chemistry data from CNLS, Nairobi
cnls$xCEC <- cnls$m3Ca/200+cnls$m3Mg/120+cnls$m3K/390+cnls$Hp ## calculates extractable xCEC  
text <- read.table("text.csv", header=T, sep=",") ## AfSIS LDPSA soil texture
oxid <- read.table("oxid.csv", header=T, sep=",") ## AfSIS XRF metal oxide data
spec <- read.table("alpha_spec.csv", header=T, sep=",") ## AfSIS spectral absorbance data
  
# Merge the wet chemistry reference dataframes
adata <- merge(prof, cnls, by = "ssid")
adata <- merge(adata, text, by = "ssid")
adata <- merge(adata, oxid, by = "ssid")

# Download figures
osf_retrieve_file("42gry") %>% osf_download(conflicts = "skip")
unzip("figures.zip", overwrite = T)
## Warning in unzip("figures.zip", overwrite = T): error 1 in extracting from zip
## file

The following chunk then writes out the dataframe AfSIS_reference_data.csv into your ./soil_acidity/Results directory if you’d like to process those outputs in software other than R. It also generates a location map of where those soil samples were obtained.

# Write out reference data frame
write.csv(adata, "./soil_acidity/Results/AfSIS_reference_data.csv", row.names = F)

# Soil sample locations
w <- leaflet() %>%
  setView(lng = mean(adata$lon), lat = mean(adata$lat), zoom = 4) %>%
  addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
  addCircleMarkers(adata$lon, adata$lat, clusterOptions = markerClusterOptions())
w ## plot widget 

Spectral feature conversions

One potentially problematic factor for many MLAs is the “Curse of dimensionality” that is imposed by the high dimensionality of the spectral data: w = 2,541 individual wavebands in this particular case. Moreover, many of those wavebands are strongly correlated, particularly those proximal to one another on the spectrum. These types of data pose generalization and prediction challenges for MLAs that involve Bagging, Boosting, Bayesian and Deep learning and can lead to overfitting problems.

Other algorithms such as Partial least squares regression (PLS), Ridge regression and/or Lasso regression handle the high dimensional data and collinearities well in a linear context but are frequently not very good at predicting the common non-linear and/or threshold relationships between the spectral signatures and their reference measurements.

There are some ways of mitigating these trade-offs. The first is to reduce the dimensionality and collinearity of the spectral data. There are a number of pre-processing techniques that can be applied in that context including: Principal components analysis (PCA), Independent components analysis (ICA) as well as other signal processing techniques such as, Non-negative matrix factorization (NMF). I use PCA here, and the next chunk appends the spectra and their resulting spectral component scores to the wet chemistry reference data.

# Spectral principal components
spec.pca <- prcomp(spec[,2:2542], center=T, scale=T)
pcas <- predict(spec.pca, spec)
pcas <- pcas[,1:20] ## save the first 20 components, which explain ~95% of the total spectral variance
fname <- paste("./soil_acidity/Results/", "spec_pcas.rds", sep = "")
saveRDS(spec.pca, fname) ## saves spectral PCA model

# Merge & write files
spec <- cbind(spec, pcas)
adata <- merge(adata, spec, by = "ssid")
write.csv(adata, "./soil_acidity/Results/AfSIS_acidity_data.csv", row.names=F)

The other main option is of course to run both types of algorithms (the data regularization or selection-based and the data reduction-based), and then look at if they can be usefully combined. This is the approach that I will take here.

Model training with caret and caretEnsemble

The following chunks calibrate soil Hp values using different machine learning algorithms (MLAs) to XRF and MIR spectral (feature) inputs using the caretEnsemble package. This general approach has won a lot of data science competitions e.g., at Kaggle. You may want to take a look there. They have some fantastic data science resources, courses and challenges openly available. You can also check out the AfSIS sponsored Africa Soil Property Prediction Challenge (2014), from which the notebook that is presented here was developed.

The main idea is to train a number of potentially competing models with k-fold cross-validation. At the end of the model training processes, the various models are ensembled (combined/stacked) on an independent validation dataset. When applied consistently over time and space, this is a form of Reinforcement learning, which should produce increasingly accurate predictions as new field, lab data and MLAs are obtained and run.

**Figure 1:** MLA training, validation and prediction workflow.

Figure 1: MLA training, validation and prediction workflow.

I’ll be using a machine learning approach, which is algorithmic, rather than data modeling based. You might want to take a look at Breiman (2001) to gauge the differences between the “two cultures” he describes. The figure above shows the basic workflow of the algorithmic approach that I apply here to predicting reserve acidity (Hp). The following chunk initially scrubs some of the extraneous objects in memory, sets-up labels and features, and creates a randomized partition between the training and validation dataframes.

rm(list=setdiff(ls(), c("adata"))) ## scrubs extraneous objects in memory

# Set randomization seed
seed <- 1235813
set.seed(seed)

# Randomize dataframe
adata <- adata[sample(1:nrow(adata)), ]

# Split data into calibration and validation sets
gsIndex <- createDataPartition(adata$Hp, p = 8/10, list=F, times = 1)
cal <- adata[ gsIndex,]
val <- adata[-gsIndex,]

# Set calibration labels
labs <- c("Hp") ## insert other labels ("pH", "EC", "eCEC", ...) here!
lcal <- as.vector(t(cal[labs]))

# Calibration features
fcal <- select(cal,26:2566,9) ## full Alpha FT-IR spectral signatures + pH
pcal <- select(cal,2567:2586,9) ## spectral principal components + pH

Note that I also include soil pH (in water) in the calibration features because it is routinely, quickly and cheaply measured in most soil spectral labs (and frequently even in the field). It is also inversely (though loosely) related to Hp. The general notion is that soils with a pH > 7 are unlikely to have any measurable reserve acidity (see the figure below).

par(pty="s", mar=c(4,4,1,1))
plot(Hp~pH, xlab="pH (water)", ylab="Hp (meq / 100 g soil)", cex.lab=1.3, 
     xlim=c(3,10), ylim=c(0,5), adata)
**Figure 2:** Relationship between soil pH and reserve acidity (Hp).

Figure 2: Relationship between soil pH and reserve acidity (Hp).

caretEnsemble has 3 primary functions: caretList, caretEnsemble and caretStack. caretList is used to build lists of caret models on the same training data, with the same model resampling parameters. caretEnsemble and caretStack are used to create ensemble models from such lists of individual caret models. caretEnsemble uses a generalized linear model glm to create a simple weighted combination of the constituent models and caretStack uses a specific caret model to combine the outputs from a choice of other component caret models. The main advantage of running caretEnsemble is that the intended individual caret models can be run all-at-once, rather than running each model separately … abbreviating and hopefully clarifying the calibration code. All model fitting processes can be (are) parallelized to facilitate efficient use of either local or cloud-based computing resources. Note that there are also other options available for this e.g., the snowfall package, among others.

The next chunk fits 3 initial models that use the full spectral signatures i.e., the w = 2,541 individual wavebands. These are reasonably common linear Chemometric algorithms and you can learn more about how they work in R from the following links: pls, glmnet and xgbLinear. I fit these with 10-fold cross-validation and default-tuning of the relevant hyperparameters. Also, note that this chunk can take up to 1 hour to run on a normal computer with 8 cores and 16 Gb of RAM. There are just a lot of spectral features to sort through.

# Start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Specify model training controls
set.seed(seed)
tc <- trainControl(method = "cv", number = 10, allowParallel = TRUE, savePredictions="final")

# Fit the 3 initial calibration models
llist <- caretList(fcal, lcal,
                   trControl = tc,
                   tuneList = NULL,
                   methodList = c("pls", "glmnet", "xgbLinear"),
                   preProcess = c("center","scale"),
                   metric = "RMSE")
stopCluster(mc)

The next chunk fits 3 additional, tree/forest based MLAs that use the that use the spectral principal component scores (pc = 20) + pH. You can learn more about how they work in R from the following links at: randomForest, xgboost and Cubist. This actually runs quite quickly on a normal computer.

# Start doParallel to parallelize model fitting
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Specify model training controls
tc <- trainControl(method = "cv", number = 10, allowParallel = TRUE, savePredictions="final")

# Fit 3 calibration models using the spectral principal component features
tlist <- caretList(pcal, lcal,
                   trControl = tc,
                   tuneList = NULL,
                   methodList = c("rf", "xgbTree", "cubist"),
                   preProcess = c("center","scale"),
                   metric = "RMSE")
stopCluster(mc)

Model stacking and prediction with caret

The main point here is not to evaluate a best individual model but rather to evaluate the combination of the previously fitted models against a 20% hold-out validation dataset. This provides robust statistical estimates of how the different models should be weighted against one-another. This chunk initially generates the individual model predictions for the validation dataframe.

# Validation features
fval <- select(val,26:2566,9) ## full Alpha FT-IR spectral signatures + pH
pval <- select(val,2567:2586,9) ## spectral principal components + pH

val$pls <- predict(llist$pls, fval)
val$glm <- predict(llist$glmnet, fval)
val$xbl <- predict(llist$xgbLinear, fval)
val$rfo <- predict(tlist$rf, pval)
val$xbt <- predict(tlist$xgbTree, pval)
val$cub <- predict(tlist$cubist, pval)

This next chunk fits the model ensemble with the glmStepAIC function from the MASS library using the validation dataframe. You could explore other options here, but I think that this provides a reasonable combination and weighting of the 6 models that were produced in the ensemble training steps.

lval <- as.vector(t(val[labs]))
fval <- select(val,pls,glm,xbl,rfo,xbt,cub) ## subset validation models

# Start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Model setup and fitting
set.seed(seed)
tc <- trainControl(method="repeatedcv", number=10, repeats=3, allowParallel=T)

st <- train(fval, lval,
            method = "glmStepAIC",
            trControl = tc,
            metric = "RMSE")

val$st <- predict(st, val) ## stacked predictions
stopCluster(mc)
fname <- paste("./soil_acidity/Results/", labs, "_st.rds", sep = "")
saveRDS(st, fname)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.79034  -0.11686  -0.01604   0.08395   1.78569  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01522    0.02321  -0.656  0.51235    
## xbl         -0.20406    0.07744  -2.635  0.00877 ** 
## rfo          0.28159    0.13339   2.111  0.03546 *  
## xbt          0.18675    0.07468   2.501  0.01283 *  
## cub          0.76458    0.08194   9.331  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.06610227)
## 
##     Null deviance: 90.944  on 368  degrees of freedom
## Residual deviance: 24.061  on 364  degrees of freedom
## AIC: 51.735
## 
## Number of Fisher Scoring iterations: 2

As can be seen from summary output panel above, the cubist predictions outweigh the influence of all of the other models in the st ensemble predictions on the validation set, with much smaller contributions from the glmnet and xgbLinear models. This raises the question if a simpler, more parsimonious model, which only uses cubist with the spectral principal components plus pH would be a potential alternative to provide operational diagnostics for Hp. This would certainly reduce computing times and would also lend itself to automation. I explore this in the next section.

Ensemble prediction uncertainty estimates

There are many ways to quantify the uncertainty inherent in these predictions. I take a simple but quite robust approach here using quantile regression with quantreg. I am mainly interested in the overall spread of the spectral ensemble predictions; that is, the 90% probable intervals for the entire dataset.

# All features
fall <- select(adata,26:2566,9) ## full Alpha FT-IR spectral signatures + pH
pall <- select(adata,2567:2586,9) ## spectral principal components + pH

adata$pls <- predict(llist$pls, fall)
adata$glm <- predict(llist$glmnet, fall)
adata$xbl <- predict(llist$xgbLinear, fall)
adata$rfo <- predict(tlist$rf, pall)
adata$xbt <- predict(tlist$xgbTree, pall)
adata$cub <- predict(tlist$cubist, pall)
adata$st <- predict(st, adata)

# Write out predictions
preds <- select(adata,1,11,2587:2593) ## change manually here if you are analyzing other labels (pH, xCEC etc)
write.csv(preds, "./soil_acidity/Results/Hp_preds_final.csv", row.names=F)

Run and plot the quantile regression estimates of Hp against the spectral predictions.

# Quantile regression fits
stQ <- rq(Hp~st, tau=c(0.05,0.5,0.95), data = adata) ## full stacked model
cuQ <- rq(Hp~cub, tau=c(0.05,0.5,0.95), data = adata) ## cubist model only
par(pty="s", mar=c(4,4,1,1))
plot(Hp~st, xlab="Stacked model Hp prediction", ylab="Measured Hp", cex.lab=1.3, 
     xlim=c(0,5), ylim=c(0,5), adata)
curve(stQ$coefficients[2]*x+stQ$coefficients[1], add=T, from=0, to=5, col="blue", lwd=2)
curve(stQ$coefficients[4]*x+stQ$coefficients[3], add=T, from=0, to=5, col="red", lwd=2)
curve(stQ$coefficients[6]*x+stQ$coefficients[5], add=T, from=0, to=5, col="blue", lwd=2)
abline(c(0,1), col="grey", lwd=1)

plot(Hp~cub, xlab="Cubist Hp prediction", ylab="Measured Hp", cex.lab=1.3, 
     xlim=c(0,5), ylim=c(0,5), adata)
curve(cuQ$coefficients[2]*x+cuQ$coefficients[1], add=T, from=0, to=5, col="blue", lwd=2)
curve(cuQ$coefficients[4]*x+cuQ$coefficients[3], add=T, from=0, to=5, col="red", lwd=2)
curve(cuQ$coefficients[6]*x+cuQ$coefficients[5], add=T, from=0, to=5, col="blue", lwd=2)
abline(c(0,1), col="grey", lwd=1)
**Figure 3:** Quantile regression fits of modeled vs observed reserve acidity (Hp). The blue lines are the 5% and 95% quantile regression estimates. On the left are the stacked model `st` predictions. On the right are the `cubist` based predictions.**Figure 3:** Quantile regression fits of modeled vs observed reserve acidity (Hp). The blue lines are the 5% and 95% quantile regression estimates. On the left are the stacked model `st` predictions. On the right are the `cubist` based predictions.

Figure 3: Quantile regression fits of modeled vs observed reserve acidity (Hp). The blue lines are the 5% and 95% quantile regression estimates. On the left are the stacked model st predictions. On the right are the cubist based predictions.

Just by visual inspection the 2 plots are quite similar, and there is actually a near perfect correlation between them. Note however, that this is a somewhat unusual situation that would not be the case for most other spectral soil property predictions. The general rule is that a properly calibrated and validated ensemble model will not perform any worse than the best constituent model in the stack … but sometimes much better. This statement also describes the example used in the this notebook with the st ensemble having a slightly lower (but negligible) validation RMSE than the cubist model. The cubist model also has slightly wider 90% probable intervals than the stacked model. If you have run all of the R chunks in this notebook you can always check those contentions. Then, if all went well, the prediction file to check should be at ./soil_acidity/Results/Hp_preds.csv in your working directory. You could also try to stack other MLAs on these data. caretEnsemble and caret offer >180 of your favorite MLA alternatives to the ones that are presented here … if you are into this kind of thing, please explore.

Soil-test-based liming recommendations

Lime application is recommended by soil test if soil pH is below the optimum range for any of the crops in the cropping system or crop rotation. The minimum pH of the soil is determined by the crop type or crop species with the highest soil pH requirement. This is typically calculated for a rolling 3-year period. To adjust soil pH to a target pH value for a specific cropping system, one must know both the current soil pH and its Hp.

Most soil testing laboratories and extension services make lime recommendations based on calibrations of pH measured before and after the addition of a pH buffer solution (e.g., CaOH2 or the SMP buffer solution, which is commonly used in the US … among others). Soil-test-based lime recommendations (LR, in kg ha-1 CaCO3 equivalents, see our AfSIS schema for this in the figure below) are subsequently adjusted to account for incorporation depth, liming material CCE, lime type (e.g. calcitic vs dolomitic), lime moisture content, parent material / mineralogy and time since previous lime application. The associated calculations are pretty straightforward (see e.g., the excellent guidance provided by PSU extension about this). The AfSIS approach to purely soil-test-based recommendations is also pretty straightforward (and simplistic) currently … see the schema for liming topsoils to pH ~ 6.5 directly below.


**Figure 4:** AfSIS schema for field-crop, soil-test-based (topsoil, 0-20 cm) Aglime recommendations (LR, in kg ha^-1^ CCE).

Figure 4: AfSIS schema for field-crop, soil-test-based (topsoil, 0-20 cm) Aglime recommendations (LR, in kg ha-1 CCE).

While this is just another rule of thumb liming recommendation schema that farmers could take or leave on the basis of soil tests, there are of course both much more complex and more dynamic socio-economic and environmental interactions at play e.g., between the effects of applications of lime, nitrogen fertilizers, other nutrient remediation measures, existing cropping systems, farmer investment risk aversion, water-use and greenhouse gas emissions that should be considered. Solving for those additionalities will require systematic and spatially/temporally representative surveys and field trials to monitor any potential socio-economic constraints and also any potentially associated environmental side effects. Systematic agronomic field trials should also be used to (in)validate and challenge any of the inherent assumptions that are made in using Hp for making soil-test-based liming recommendations to farmers (sensu the Precautionary principle). However, that is also true for any other purely soil-test-based procedures that might be used for issuing such recommendations.

Takeaways

The main takeaways from this notebook are the following:

Any questions or comments about this notebook are most welcome via AFSIS.