Soil acidity can be thought of as occurring in two forms (or pools): as active or soil solution acidity, measured by pH in a 1:1 mixture of dry soil and water, and as exchangeable acidity. The exchangeable (or reserve) acidity, measured by Hp (meq H+ + Al3+ per 100 g soil), is held on soil colloids and controls the overall level of solution acidity; it is also commonly many times higher than the solution acidity. Hp levels depend on several soil properties, such as the amount and type of clay (mineralogy), the amount of organic matter, and the CaO, MgO, K2O, Al2O3 and Na2O oxide concentrations of the soil. Thus, two soils can have the same soil pH but completely different lime requirements.
To effectively remediate acid soils over the long term, both the active and exchangeable acidity should be neutralized to a level that can be tolerated by the crop combination/rotation being grown. Soil testing labs and extension services typically determine the buffering capacity and the lime requirement by measuring or estimating exchangeable acidity. Although soil pH is the trigger for determining when lime is needed on mineral soils, it is not used to determine how much lime is needed to raise pH to the recommended range for a specific crop.
The Lime Target Acidity Saturation (LiTAS) model (Merlos et al., 2023) estimates the excess exchangeable acidity (xHp) of soils using the following formula:
\[ \text{xHp } (\text{cmol(+) kg}^{-1}\ \text{soil}) = \frac{\text{Hp} - \left( \frac{\text{TAS}}{100} \right) \times \text{ECEC}}{a + \left( \frac{\text{TAS}}{100} \right) \times (b - a)} \]
where Hp is the measured exchangeable acidity (cmol(+) kg-1 soil), ECEC is the effective cation exchange capacity (the sum of exchangeable Ca, Mg, K and Na plus Hp, in cmol(+) kg-1 soil), TAS is the target acidity saturation (%) that the crop or rotation can tolerate, and a and b are empirical coefficients from Merlos et al. (2023), set to 0.6 and 0.92, respectively, later in this notebook.
To transform xHp from units of charge per unit soil mass to units of CaCO3 mass per unit area, the soil bulk density (BD, g cm-3), the lime incorporation depth (depth, cm), and the CaCO3 equivalent (CCE, g CaCO3 g-1 lime) of the applied lime are needed. Because 1 cmol(+) kg-1 of excess acidity corresponds to 0.5 g CaCO3 kg-1 soil, the lime application rate (LR) can be calculated as:
\[ \text{LR } (\text{t ha}^{-1}) = \begin{cases} 0 & \text{if } \text{xHp} < 0 \\ \frac{0.05 \times \text{xHp } \times \text{ BD } \times \text{ depth}}{\text{CCE}} & \text{if } \text{xHp} \geq 0 \end{cases} \]
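As a quick numerical illustration, the short chunk below applies the two equations above as written. The function name (litas_lime_rate) is ours, the a, b and TAS defaults match the values set later in this notebook, and the Hp, ECEC, BD, depth and CCE inputs are illustrative assumptions only.
# Apply the LiTAS equations above as written (illustrative inputs only)
litas_lime_rate <- function(Hp, ECEC, TAS = 10, a = 0.6, b = 0.92,
                            BD = 1.3, depth = 20, CCE = 0.9) {
  xHp <- (Hp - (TAS / 100) * ECEC) / (a + (TAS / 100) * (b - a)) # excess acidity, cmol(+)/kg
  LR <- ifelse(xHp < 0, 0, 0.05 * xHp * BD * depth / CCE)        # lime rate, t/ha
  list(xHp = xHp, LR = LR)
}
litas_lime_rate(Hp = 2.5, ECEC = 8) # xHp ~2.7 cmol(+)/kg, LR ~3.9 t/ha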
The LiTAS model aims to improve accuracy over previous liming models by leveraging strong empirical relationships, and it is robust to variations in soil properties and experimental conditions. However, many African countries lack the necessary, locally calibrated soil tests, and the high costs of those tests (generally 30-60 USD per sample for the lab analyses alone) make it impractical for most farmers to accurately assess and regularly monitor lime requirements. In contrast, the laboratory costs for MIR screening are less than 5 USD per sample.
The recommendation process for xHp remediation used in this notebook involves evaluating whether to take immediate remediation actions, perform measurements to assess xHp levels, or take no action. The framework leverages a multilevel model to predict xHp levels of soil samples based on their MIR spectral properties. The model provides prior distributions of xHp levels, which are then updated with MIR soil test data to form a posterior distribution. Different actions—remediation without measurement, doing nothing, and measuring first—are assessed using this posterior distribution.
The notebook is intended for self-learning and focuses on the machine learning and Bayesian workflows needed to generate useful predictions from a population of spectral signatures or wet chemistry properties (features) relative to their corresponding reference measurements (labels). In this example, we use topsoil (0-20 cm) and co-located subsoil (20-50 cm) data collected as part of the AfSIS project, which sampled the major Köppen-Geiger climate zones of Africa, excluding deserts, urban areas, and other non-photosynthetically active land areas.
The R Markdown document is maintained and versioned in the AfSIS OSF repository, from where it can be downloaded and modified. Please cite this work as: M.G. Walsh, J.V. Silva, and J. Chamberlin (2024). Excess soil acidity remediation decisions with MIR screening. https://doi.org/10.17605/OSF.IO/ZXBEF.
# Package names
packages <- c("osfr", "tidyverse", "ggExtra", "cowplot", "ggridges", "caret", "caretEnsemble",
"doParallel", "randomForest", "xgboost", "glmnet", "pROC", "brms", "DT")
# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
# Load packages
invisible(lapply(packages, library, character.only = TRUE))
# Download AfSIS reference data
osf_retrieve_node("s3auy") %>%
osf_ls_files(n_max = Inf) %>%
osf_download(path = "./data", conflicts = "overwrite")
refd <- read.csv("./data/AfSIS_ref_data.csv", header = TRUE, sep = ",")
spec <- read.csv("./data/alpha_spec.csv", header = TRUE, sep = ",")
ecec <- refd %>%
  select(country, site, ssid, depth, lon, lat, pH, Hp, m3Ca, m3Mg, m3K, m3Na, m3Al) %>%
  mutate(
    # Convert Mehlich-3 extractable cations from mg/kg to cmol(+)/kg
    # (divide by 10 x the equivalent weight of each element)
    Ca = m3Ca / 200,
    Mg = m3Mg / 122,
    K = m3K / 391,
    Na = m3Na / 230,
    Al = m3Al / 90,
    # ECEC as the sum of exchangeable bases plus exchangeable acidity (Hp)
    ecec = Ca + Mg + K + Na + Hp,
    Hp_sat = Hp / ecec, # acidity saturation of the ECEC
    depth = ifelse(depth == "Topsoil", "top", "sub")
  )
# Specify parameters
a <- 0.6
b <- 0.92
TAS <- 10 # target exch. acidity saturation (%)
# Calculate the lime factor (lf)
lf <- 1 / (a + (TAS / 100) * (b - a))
# Calculate the excess acidity (xHp in cmol(+)/kg)
ecec <- ecec %>%
mutate(
xHp = lf * (Hp - (TAS / 100) * ecec),
xHp_cut = ifelse(xHp < 0, "n", "e") # n = nil, e = potential excess
)
# Compute spectral principal components
alpha.pca <- spec %>%
select(2:2542) %>%
prcomp(center = TRUE, scale. = TRUE)
# Save PCA model
saveRDS(alpha.pca, "./results/alpha_pca.rds")
alpha.pca <- readRDS("./results/alpha_pca.rds")
# Predict principal components
pcas <- predict(alpha.pca, spec[, 2:2542])
# Merge to ecec data frame
pcas <- spec %>%
select(ssid) %>%
bind_cols(as.data.frame(pcas[, 1:20]))
ecec <- ecec %>%
right_join(pcas, by = "ssid")
# Write .csv for reuse
ecec <- ecec %>%
select(site, ssid, lon, lat, depth, pH, Hp, Ca, Mg, K, Na, Al, ecec,
xHp, xHp_cut, Hp_sat, PC1, PC2, PC3, PC4, PC5, PC6, PC7, PC8, PC9, PC10, PC11,
PC12, PC13, PC14, PC15, PC16, PC17, PC18, PC19, PC20)
write.csv(ecec, "./results/Ref_ecec.csv", row.names = FALSE)
str(ecec)
## 'data.frame': 1872 obs. of 36 variables:
## $ site : chr "Chinyanghuku" "Chinyanghuku" "Chinyanghuku" "Chinyanghuku" ...
## $ ssid : chr "icr005928" "icr005929" "icr005946" "icr005947" ...
## $ lon : num 36.1 36.1 36.1 36.1 36.1 ...
## $ lat : num -6.93 -6.93 -6.91 -6.91 -6.88 ...
## $ depth : chr "top" "sub" "top" "sub" ...
## $ pH : num 7.56 6.91 6.91 6.87 7.88 ...
## $ Hp : num 0.09 0.13 0.18 0.14 0.27 0.18 0.09 0.18 0.18 0.09 ...
## $ Ca : num 4.73 3.15 3.65 3.31 36.06 ...
## $ Mg : num 1.84 2.52 2.99 2.59 3.12 ...
## $ K : num 0.536 0.376 0.215 0.15 0.632 ...
## $ Na : num 0.156 0.177 0.177 0.241 0.148 ...
## $ Al : num 8.3 9.47 6.61 5.74 6.69 ...
## $ ecec : num 7.36 6.35 7.21 6.43 40.23 ...
## $ xHp : num -1.022 -0.799 -0.856 -0.795 -5.938 ...
## $ xHp_cut: chr "n" "n" "n" "n" ...
## $ Hp_sat : num 0.01223 0.02046 0.02496 0.02179 0.00671 ...
## $ PC1 : num 32.1 19.5 24 20.1 28.4 ...
## $ PC2 : num 21.34 11.13 13.25 9.36 16.06 ...
## $ PC3 : num -17.82 5.12 10.47 7.13 25.1 ...
## $ PC4 : num -2.7 -4.48 -3.22 -5.15 3.91 ...
## $ PC5 : num 8.9 9.72 10.77 12.63 8.43 ...
## $ PC6 : num -7.24 -1.46 -0.91 -3.29 5.43 ...
## $ PC7 : num 1.13 4.25 6.2 5.21 8.6 ...
## $ PC8 : num -3.6697 1.3521 2.4578 0.0122 -5.8869 ...
## $ PC9 : num 3.746 2.029 2.45 3.508 -0.962 ...
## $ PC10 : num -7.385 -4.847 -4.499 -5.24 0.937 ...
## $ PC11 : num 2.94 2.58 3.11 2.95 1.23 ...
## $ PC12 : num -2.251 -1.256 -0.901 -2.228 -1.753 ...
## $ PC13 : num 0.6142 0.8169 0.9331 0.177 0.0915 ...
## $ PC14 : num -2.32 -1.53 -2.04 -1.36 -1.7 ...
## $ PC15 : num 1.166 -1.516 -2.16 -1.42 0.229 ...
## $ PC16 : num -1.071 0.126 0.289 -0.809 -0.453 ...
## $ PC17 : num 0.313 -0.202 -0.533 -0.949 -1.367 ...
## $ PC18 : num -0.431 -0.672 -1.332 -1.184 0.441 ...
## $ PC19 : num 2.1 1.81 1.22 2.02 1.45 ...
## $ PC20 : num -1.266 -0.468 0.278 -0.452 0.832 ...
Mid-infrared (MIR) spectrometry provides highly repeatable, rapid, and low-cost measurements of many soil properties (Shepherd, Shepherd and Walsh, 2015). It measures the amount of light absorbed by a soil sample across a wide range of wavelengths, creating a unique spectral signature with minimal preparation (drying and fine grinding). An individual measurement takes about 30 seconds, unlike conventional soil tests, which are slow, labor-intensive, expensive, and/or use harmful chemicals.
We will be using a stacked generalization approach (Wolpert, 1992) in this section of the notebook. Stacked generalization is an ensemble learning technique that seeks to improve model predictions by combining the outputs of multiple models, potentially of different types, in a strategic manner. This is how it works:
Split the data: Divide the observations into representative training (calibration) and test (validation) sets.
Base Models: You start by training several different predictive models using a training dataset. These models can be of any type and are often diverse, such as decision trees, support vector machines, neural networks, etc. They are known as base (or level-0) models.
Hold-Out or Cross-Validation Predictions: Next, you use these base models to make predictions on a separate dataset, which can either be a hold-out validation set or generated through cross-validation. Essentially, you’re interested in capturing the predictions each model makes on data it hasn’t seen before.
Stacking: The predictions made by the base models are then used as input features to train a higher-level model, known as a meta-model or a level-1 model. The idea is that this meta-model learns how to best combine the predictions from the base models to make a final prediction.
Final Prediction: When you need to make predictions on new data, you first run the data through all the base models, take their predictions, and then input those predictions into the meta-model. The meta-model’s output is the final prediction.
The intuition behind stacking is that while individual models may have particular strengths and weaknesses, a meta-model may be able to learn the best way to combine their predictions to achieve better overall performance. Stacking has frequently been shown to outperform other ensemble methods in various tasks, given a careful choice of features, base models, and stacking approach (Kaggle AI report, 2023).
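As a toy illustration of steps 1-5 (not this notebook's pipeline, which follows below and uses caretList plus a brms multilevel meta-model), the sketch below stacks two simple linear base learners on simulated data; all names and values in it are made up for illustration.
# Toy stacking sketch on simulated data (illustration only)
set.seed(42)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 * d$x1 - d$x2 + rnorm(n)
idx <- sample(seq_len(n), size = 0.8 * n)             # 1. split the data
cal0 <- d[idx, ]
val0 <- d[-idx, ]
m1 <- lm(y ~ x1, data = cal0)                         # 2. base (level-0) models
m2 <- lm(y ~ x2, data = cal0)
# 3. base-model predictions (ideally out-of-fold or hold-out predictions)
z_cal <- data.frame(p1 = predict(m1, cal0), p2 = predict(m2, cal0), y = cal0$y)
meta <- lm(y ~ p1 + p2, data = z_cal)                 # 4. meta (level-1) model
z_val <- data.frame(p1 = predict(m1, val0), p2 = predict(m2, val0))
y_hat <- predict(meta, z_val)                         # 5. final predictions on new data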
# Select data
spec_pred <- ecec %>%
select(ssid, site, xHp, xHp_cut, PC1, PC2, PC3, PC4, PC5, PC6, PC7, PC8, PC9,
PC10, PC11, PC12, PC13, PC14, PC15, PC16, PC17, PC18, PC19, PC20) %>%
filter(!is.na(xHp))
# Split data into calibration and validation sets
gsIndex <- createDataPartition(spec_pred$xHp, p = 8/10, list=F, times = 1)
cal <- spec_pred[ gsIndex,]
val <- spec_pred[-gsIndex,]
# Start doParallel to parallelize model fitting
set.seed(5321)
mc <- makeCluster(detectCores() - 4)
registerDoParallel(mc)
# Specify model training controls
tc <- trainControl(method = "cv", number = 10, allowParallel = TRUE, savePredictions = "final")
formula <- xHp ~ PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10+PC11+
PC12+PC13+PC14+PC15+PC16+PC17+PC18+PC19+PC20
# Fit models
blist <- caretList(formula,
data = cal,
trControl = tc,
tuneList = NULL,
methodList = c("rf", "xgbTree", "cubist"),
metric = "RMSE")
# Save fitted model object
stopCluster(mc)
saveRDS(blist, "./results/Ref_xHp_base.rds")
blist <- readRDS("./results/Ref_xHp_base.rds")
# Generate predictions on validation set
preds <- as.data.frame(predict(blist, val))
# Bind to `val`
val <- val %>%
bind_cols(preds)
# Generate predictions on calibration set
preds <- as.data.frame(predict(blist, cal))
# Bind to `cal`
cal <- cal %>%
bind_cols(preds)
xHp_sta <- brm(data = cal,
family = gaussian(),
formula = xHp ~ rf + xgbTree + cubist + (1|site),
# prior = priors,
iter = 2500, warmup = 1000, chains = 4, cores = 8,
control = list(adapt_delta = .975, max_treedepth = 20),
seed = 1235813)
saveRDS(xHp_sta, "./results/Ref_xHp_stack.rds")
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: xHp ~ rf + xgbTree + cubist + (1 | site)
## Data: cal (Number of observations: 1480)
## Draws: 4 chains, each with iter = 2500; warmup = 1000; thin = 1;
## total post-warmup draws = 6000
##
## Multilevel Hyperparameters:
## ~site (Number of levels: 58)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.17 0.03 0.11 0.23 1.00 2058 3533
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 0.07 0.03 0.01 0.13 1.00 2714 3995
## rf 0.07 0.05 -0.03 0.16 1.00 4124 3963
## xgbTree 0.18 0.04 0.11 0.26 1.00 6144 4465
## cubist 0.79 0.05 0.70 0.88 1.00 4526 4558
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.68 0.01 0.66 0.71 1.00 5892 4497
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
# Generate predictions on validation set
preds <- predict(xHp_sta, newdata = val) %>%
as.data.frame() %>%
select(1:2)
val <- val %>%
bind_cols(preds)
base <- as.data.frame(predict(blist, spec_pred))
spec_pred <- spec_pred %>%
bind_cols(base)
stack <- predict(xHp_sta, spec_pred, probs = c(0.1, 0.9)) %>%
as.data.frame() %>%
select(1:4)
spec_pred <- spec_pred %>%
bind_cols(stack)
# Write .csv
write.csv(spec_pred, "./results/Ref_spec_preds.csv", row.names = FALSE)
## 'data.frame': 1848 obs. of 31 variables:
## $ ssid : chr "icr005928" "icr005929" "icr005946" "icr005947" ...
## $ site : chr "Chinyanghuku" "Chinyanghuku" "Chinyanghuku" "Chinyanghuku" ...
## $ xHp : num -1.022 -0.799 -0.856 -0.795 -5.938 ...
## $ xHp_cut : chr "n" "n" "n" "n" ...
## $ PC1 : num 32.1 19.5 24 20.1 28.4 ...
## $ PC2 : num 21.34 11.13 13.25 9.36 16.06 ...
## $ PC3 : num -17.82 5.12 10.47 7.13 25.1 ...
## $ PC4 : num -2.7 -4.48 -3.22 -5.15 3.91 ...
## $ PC5 : num 8.9 9.72 10.77 12.63 8.43 ...
## $ PC6 : num -7.24 -1.46 -0.91 -3.29 5.43 ...
## $ PC7 : num 1.13 4.25 6.2 5.21 8.6 ...
## $ PC8 : num -3.6697 1.3521 2.4578 0.0122 -5.8869 ...
## $ PC9 : num 3.746 2.029 2.45 3.508 -0.962 ...
## $ PC10 : num -7.385 -4.847 -4.499 -5.24 0.937 ...
## $ PC11 : num 2.94 2.58 3.11 2.95 1.23 ...
## $ PC12 : num -2.251 -1.256 -0.901 -2.228 -1.753 ...
## $ PC13 : num 0.6142 0.8169 0.9331 0.177 0.0915 ...
## $ PC14 : num -2.32 -1.53 -2.04 -1.36 -1.7 ...
## $ PC15 : num 1.166 -1.516 -2.16 -1.42 0.229 ...
## $ PC16 : num -1.071 0.126 0.289 -0.809 -0.453 ...
## $ PC17 : num 0.313 -0.202 -0.533 -0.949 -1.367 ...
## $ PC18 : num -0.431 -0.672 -1.332 -1.184 0.441 ...
## $ PC19 : num 2.1 1.81 1.22 2.02 1.45 ...
## $ PC20 : num -1.266 -0.468 0.278 -0.452 0.832 ...
## $ rf : num -1.02 -1.55 -1.55 -1.45 -4.48 ...
## $ xgbTree : num -0.61 -1.8 -1.34 -1.04 -5.22 ...
## $ cubist : num -0.751 -0.539 -0.991 -0.823 -5.399 ...
## $ Estimate : num -0.681 -0.776 -1.047 -0.847 -5.45 ...
## $ Est.Error: num 0.691 0.7 0.69 0.69 0.695 ...
## $ Q10 : num -1.56 -1.68 -1.94 -1.73 -6.34 ...
## $ Q90 : num 0.2154 0.1254 -0.1499 0.0414 -4.539 ...
# Plot reference xHp (x) against the stacked-model estimate (y)
p <- ggplot(spec_pred, aes(x = xHp, y = Estimate)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "solid", color = "red") +
  labs(
    x = "Reference xHp",
    y = "MIR predicted xHp"
  ) +
  theme_minimal() +
  coord_fixed()
# Add marginal density plots
p <- ggMarginal(p, type = "density", margins = "both", size = 5,
fill = "grey", color = "black")
# Extract the posterior samples for the random `site` intercepts
post_samp <- as_draws_df(xHp_sta)
ran_eff <- post_samp %>%
select(starts_with("r_site"))
# Convert the random effects to a long format
ran_eff_long <- ran_eff %>%
pivot_longer(cols = everything(),
names_to = "site",
values_to = "value")
# Clean up the site names
ran_eff_long <- ran_eff_long %>%
mutate(site = gsub("r_site\\[|,Intercept\\]", "", site))
p <- ggplot(ran_eff_long, aes(x = value, y = site, fill = site)) +
geom_density_ridges(alpha = 0.7, rel_min_height = 0.01, scale = 1.2) +
scale_y_discrete(expand = expansion(add = c(1.5, 1.5))) +
scale_x_continuous(limits = c(-1, 1)) +
labs(x = "Intercept Value",
y = "AfSIS Site") +
theme_minimal() +
theme(legend.position = "none",
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
axis.text.y = element_text(size = 10, angle = 0, hjust = 1))
When recommending acid soil remediation at a new location or site, some critical decisions must be made. The options are: (1) undertake costly remediation with agricultural lime to mitigate crop production risks, (2) do nothing, or (3) conduct a relatively inexpensive assessment to measure the risk and then decide whether to (a) remediate or (b) do nothing. This process can occur at the farm, community and national levels. Accurate risk estimates are essential for conducting these analyses. Precise local risk estimates enable the development of localized recommendations, allowing focused attention and effort on farms and communities at the highest risk. The main steps in the approach are:
Generate the Posterior Predictive Distribution: For each observation, generate a posterior predictive distribution of the excess exchangeable acidity (xHp) using posterior_predict() in brms.
Calculate Credible Interval: Construct the Bayesian credible interval (e.g., 80% CI) for model posterior predictions.
Decision heuristics: Do Nothing if the entire credible interval is below 0; Remediate if the entire credible interval is above 0; Measure if the credible interval includes 0.
This approach ensures cost-effective soil acidity remediation decisions by incorporating hierarchical Bayesian modeling to account for geographical and location-specific variations in excess soil acidity levels. The next chunk handles the heuristics.
spec_pred <- spec_pred %>%
  mutate(
    inrange = 0 >= Q10 & 0 <= Q90, # does the 80% CI include 0? Adjust as needed
    # Estimate is the stacked-model prediction of xHp in cmol(+)/kg
    xHp_rec = case_when(
      Estimate > 0 & inrange == FALSE ~ "remediate", # CI entirely above 0
      inrange == TRUE ~ "measure",                   # CI includes 0
      TRUE ~ "do_nothing"                            # CI entirely below 0
    )
  )
site_summaries <- spec_pred %>%
group_by(site) %>%
summarize(
total_locations = n(),
num_remediate = sum(xHp_rec == "remediate"),
num_measure = sum(xHp_rec == "measure"),
num_do_nothing = sum(xHp_rec == "do_nothing"),
prop_remediate = num_remediate / total_locations,
prop_measure = num_measure / total_locations,
prop_do_nothing = num_do_nothing / total_locations,
site_decision = case_when(
prop_remediate > 0.5 ~ "Remediate site",
prop_measure > 0.5 ~ "Measure",
TRUE ~ "Do nothing"
)
)
# Create a more concise summary column
site_summaries <- site_summaries %>%
mutate(
summary = paste0(
"Remediate: ", round(prop_remediate * 100, 1), "%, ",
"Measure: ", round(prop_measure * 100, 1), "%, ",
"Do Nothing: ", round(prop_do_nothing * 100, 1), "%"
)
) %>%
select(site, total_locations, summary, site_decision)
This notebook provides a comprehensive framework for diagnosing excess soil acidity using Mid-Infrared (MIR) screening techniques. The document guides users through the processes of data handling, analysis, and the decision-making necessary for effective soil remediation. The key takeaways are:
We demonstrate generalized stacking of three base learners in a Bayesian multilevel model to form posterior predictions of xHp given MIR measurements of a diverse set of African soils. The approach produces precise and accurate MIR spectral predictions of the excess exchangeable acidity of soils (which is commonly used for providing soil-test-based liming recommendations), based on a completely reproducible, standard ensemble machine learning workflow (e.g., see Rocca, 2019).
The script also incorporates a Bayesian recommendation approach to make data-driven soil remediation recommendations. This method allows for probabilistic modeling of soil acidity and provides robust, evidence-based guidance for selecting appropriate soil acidity remediation strategies.
The workflow can be quickly repurposed to model and predict other soil-acidity-related variables (e.g., pH, EC, CEC, SOC, Ca:Mg and mineralogically determined soil properties, among others). It can also be rapidly extended, operationally and at low cost, to new geographical regions of interest and their respective soil testing laboratories.
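For example, pointing the base learners at a different label is a small change; the sketch below assumes that the chosen target column (here pH) has been carried through into the calibration data frame, which the spec_pred selection above does not include by default.
# Sketch: reuse the spectral PCs to model a different label (here pH),
# assuming `cal` was built to include a pH column alongside PC1-PC20
formula_pH <- pH ~ PC1+PC2+PC3+PC4+PC5+PC6+PC7+PC8+PC9+PC10+PC11+
  PC12+PC13+PC14+PC15+PC16+PC17+PC18+PC19+PC20
blist_pH <- caretList(formula_pH, data = cal, trControl = tc,
                      methodList = c("rf", "xgbTree", "cubist"), metric = "RMSE")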
The workflow's predictions could be used reliably and operationally to improve the precision, and to reduce the costs and environmental footprints, of soil acidity assessments in survey, mapping, experimental and monitoring applications that focus on soil acidity and nutrient management of croplands.
At the other end of the pH scale, the same workflow can be followed to diagnose, for example, salinity and sodicity levels, soil aggregate instability, and certain micronutrient deficiencies of cropland soils, which are also prevalent in many parts of Africa and may pose even greater problems than soil acidity. We shall provide additional notebooks around these issues in the near future.
The main note of caution is that soil-test-based liming recommendations (among other agronomic recommendations) should always be contextualized to specific crop production environments. Simple soil tests are a good start, but they are not by themselves enough to provide useful, reliable and complete evidence-based extension advice to farmers. Any such recommendations should also be backed by replicable code and updatable data so that the underlying heuristics can be tested.
Computationally, the main predictive chunks presented in this notebook run fairly quickly and could probably be automated, such that when a soil MIR measurement is received in a laboratory, a reasonable xHp prediction and the associated soil-test-based liming recommendation could be issued in near-real time. Note that for other spectrometer setups, the calibration/validation transfer steps would need to be taken into account. This is also a good reason to properly curate any physical soil reference samples.
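As a rough sketch of what such automation might look like using the objects already created in this notebook (alpha.pca, blist and xHp_sta), the hypothetical helper below scores a single new spectrum; the function name, the placeholder site label, and the assumption that the new spectrum shares the training wavenumber grid are all illustrative.
# Sketch: score one new MIR spectrum with the fitted objects from this notebook.
# Assumes `new_spec` is a 1-row data frame with the same spectral columns used
# to fit `alpha.pca`; the function name and "new_site" label are placeholders.
predict_new_spectrum <- function(new_spec) {
  newd <- as.data.frame(predict(alpha.pca, new_spec)[, 1:20, drop = FALSE]) # project onto training PCs
  newd <- bind_cols(newd, as.data.frame(predict(blist, newd)))              # base-learner predictions
  newd$site <- "new_site"                                                   # site not seen in training
  predict(xHp_sta, newdata = newd, allow_new_levels = TRUE, probs = c(0.1, 0.9))
}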
The operational use of the MIR prediction approach discussed earlier is impractical in situations where the necessary data and analytical infrastructure are unavailable. In many African countries, a significant limitation is the scarcity of spectrometers, particularly MIR spectrometers, even in national plant and soil laboratories. However, this situation is rapidly improving with the emergence of cost-effective, robust, and highly accurate setups that enable spectral labs to be deployed in remote, off-grid, or even mobile environments.
As these new technologies become more accessible, it remains essential to base georeferenced lime requirement estimates for excess acidity on locally measured soil properties. In the meantime, this can be supported by the increasing availability of historical (legacy) wet chemistry data, which is standardized and freely accessible for many African countries (e.g., WoSIS).
The “stopgap” model we propose is designed to assess excess soil acidity (excess vs. nominal on the LiTAS scale) using soil pH (in water) and exchangeable Ca as covariates, which are commonly available in legacy databases. The model employs a partial pooling (multilevel) strategy that accounts for systematic differences between sites by partitioning variance into between-site and within-site components. This approach allows each site to have a distinct average outcome while also estimating the overall population average as:
\[ \text{logit}(P(y_i = 1)) = \beta_0 + \beta_1 \cdot \text{pH}_i + \beta_2 \cdot \text{Ca}_i + \beta_3 \cdot (\text{pH}_i \times \text{Ca}_i) + u_{\text{site}[i]} \]
where P(y_i = 1) is the probability that observation i falls in the nominal (rather than excess) acidity class, β0 is the population-level intercept, β1 and β2 are the coefficients for soil pH and exchangeable Ca, β3 is the coefficient for their interaction, and u_site[i] is the site-level random intercept for the site of observation i.
Similar to the MIR model, it can also be used to infer and update (spatial) remediation decisions that consider the uncertainties expressed in its posterior predictive distribution.
xHp_ptf <- brm(data = ecec,
family = bernoulli(),
formula = xHp_cut ~ pH * Ca + (1 | site),
# prior = priors,
iter = 4000, warmup = 1000, chains = 4, cores = 8,
control = list(adapt_delta = .975, max_treedepth = 20),
seed = 1235813)
saveRDS(xHp_ptf, "./results/Ref_xHp_ptf.rds")
## Family: bernoulli
## Links: mu = logit
## Formula: xHp_cut ~ pH * Ca + (1 | site)
## Data: ecec (Number of observations: 1848)
## Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
## total post-warmup draws = 12000
##
## Multilevel Hyperparameters:
## ~site (Number of levels: 58)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 3.70 0.48 2.86 4.76 1.00 2076 3733
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept -36.66 3.05 -42.85 -30.98 1.00 4745 7103
## pH 5.73 0.49 4.80 6.73 1.00 5419 7637
## Ca 4.68 0.51 3.64 5.65 1.00 4148 4767
## pH:Ca -0.51 0.07 -0.64 -0.36 1.00 3976 4266
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
xHp_pred <- ecec %>%
select(ssid, site, xHp, xHp_cut, pH, Ca) %>%
filter(!is.na(xHp))
# Draw from the posterior of the linear predictor (logit scale; transform = FALSE)
xHp_pred_samples <- posterior_linpred(xHp_ptf, newdata = xHp_pred, transform = FALSE)
# Summarize the posterior draws for each observation
post_summary <- apply(xHp_pred_samples, 2, function(x) {
c(Estimate = mean(x),
lower = quantile(x, 0.1),
upper = quantile(x, 0.9))
})
# Convert the summary statistics into a data frame
xHp_post <- as.data.frame(t(post_summary))
colnames(xHp_post) <- c("Estimate", "Q10", "Q90")
# Bind to data frame
xHp_pred <- xHp_pred %>%
bind_cols(xHp_post)
# Generate receiver-operator curve
roc <- roc(xHp_pred$xHp_cut, xHp_pred$Estimate)
auc <- auc(roc)
val_cut <- coords(roc, "best")
# Classify using ROC threshold
xHp_pred <- xHp_pred %>%
mutate(
xHp_class = ifelse(xHp_pred$Estimate > val_cut$threshold, "n", "e")
)
# Print confusion matrix
confusionMatrix(as.factor(xHp_pred$xHp_cut), as.factor(xHp_pred$xHp_class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction e n
## e 558 24
## n 82 1184
##
## Accuracy : 0.9426
## 95% CI : (0.931, 0.9528)
## No Information Rate : 0.6537
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8706
##
## Mcnemar's Test P-Value : 3.089e-08
##
## Sensitivity : 0.8719
## Specificity : 0.9801
## Pos Pred Value : 0.9588
## Neg Pred Value : 0.9352
## Prevalence : 0.3463
## Detection Rate : 0.3019
## Detection Prevalence : 0.3149
## Balanced Accuracy : 0.9260
##
## 'Positive' Class : e
##
# Extract the posterior samples for the random `site` intercepts
post_samp <- as_draws_df(xHp_ptf)
ran_eff <- post_samp %>%
select(starts_with("r_site"))
# Convert the random effects to a long format
ran_eff_long <- ran_eff %>%
pivot_longer(cols = everything(),
names_to = "site",
values_to = "value")
# Clean up the site names
ran_eff_long <- ran_eff_long %>%
mutate(site = gsub("r_site\\[|,Intercept\\]", "", site))
p <- ggplot(ran_eff_long, aes(x = value, y = site, fill = site)) +
geom_density_ridges(alpha = 0.7, rel_min_height = 0.01, scale = 1.2) +
scale_y_discrete(expand = expansion(add = c(1.5, 1.5))) +
scale_x_continuous(limits = c(-16, 16)) +
labs(x = "Intercept Value",
y = "AfSIS Site") +
theme_minimal() +
theme(legend.position = "none",
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
axis.text.y = element_text(size = 10, angle = 0, hjust = 1))
xHp_pred <- xHp_pred %>%
  mutate(
    inrange = 0 >= Q10 & 0 <= Q90, # does the 80% CI include 0? Adjust as needed
    # Estimate is on the logit scale for the nominal ("n") class, so values
    # below 0 indicate that excess acidity is the more likely class
    xHp_rec = case_when(
      Estimate < 0 & inrange == FALSE ~ "remediate",
      inrange == TRUE ~ "measure",
      TRUE ~ "do_nothing"
    )
  )
site_summaries <- xHp_pred %>%
group_by(site) %>%
summarize(
total_locations = n(),
num_remediate = sum(xHp_rec == "remediate"),
num_measure = sum(xHp_rec == "measure"),
num_do_nothing = sum(xHp_rec == "do_nothing"),
prop_remediate = num_remediate / total_locations,
prop_measure = num_measure / total_locations,
prop_do_nothing = num_do_nothing / total_locations,
site_decision = case_when(
prop_remediate > 0.5 ~ "Remediate site",
prop_measure > 0.5 ~ "Measure",
TRUE ~ "Do nothing"
)
)
# Create a more concise summary column
site_summaries <- site_summaries %>%
mutate(
summary = paste0(
"Remediate: ", round(prop_remediate * 100, 1), "%, ",
"Measure: ", round(prop_measure * 100, 1), "%, ",
"Do Nothing: ", round(prop_do_nothing * 100, 1), "%"
)
) %>%
select(site, total_locations, summary, site_decision)
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-apple-darwin20
## Running under: macOS Big Sur 11.7.10
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Denver
## tzcode source: internal
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] DT_0.33 brms_2.21.0 Rcpp_1.0.12
## [4] pROC_1.18.5 glmnet_4.1-8 Matrix_1.7-0
## [7] xgboost_1.7.7.1 randomForest_4.7-1.1 doParallel_1.0.17
## [10] iterators_1.0.14 foreach_1.5.2 caretEnsemble_2.0.3
## [13] caret_6.0-94 lattice_0.22-6 ggridges_0.5.6
## [16] cowplot_1.1.3 ggExtra_0.10.1 lubridate_1.9.3
## [19] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
## [22] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1
## [25] tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
## [28] osfr_0.2.9
##
## loaded via a namespace (and not attached):
## [1] tensorA_0.36.2.1 rstudioapi_0.16.0 jsonlite_1.8.8
## [4] shape_1.4.6.1 magrittr_2.0.3 farver_2.1.2
## [7] rmarkdown_2.27 fs_1.6.4 vctrs_0.6.5
## [10] memoise_2.0.1 htmltools_0.5.8.1 distributional_0.4.0
## [13] curl_5.2.1 sass_0.4.9 parallelly_1.37.1
## [16] StanHeaders_2.32.8 bslib_0.7.0 htmlwidgets_1.6.4
## [19] plyr_1.8.9 cachem_1.1.0 mime_0.12
## [22] lifecycle_1.0.4 pkgconfig_2.0.3 R6_2.5.1
## [25] fastmap_1.2.0 future_1.33.2 shiny_1.8.1.1
## [28] digest_0.6.35 colorspace_2.1-0 crosstalk_1.2.1
## [31] labeling_0.4.3 fansi_1.0.6 timechange_0.3.0
## [34] httr_1.4.7 abind_1.4-5 compiler_4.4.0
## [37] proxy_0.4-27 withr_3.0.0 backports_1.5.0
## [40] inline_0.3.19 highr_0.10 QuickJSR_1.1.3
## [43] pkgbuild_1.4.4 MASS_7.3-60.2 lava_1.8.0
## [46] loo_2.7.0 ModelMetrics_1.2.2.2 tools_4.4.0
## [49] httpuv_1.6.15 future.apply_1.11.2 nnet_7.3-19
## [52] glue_1.7.0 nlme_3.1-164 promises_1.3.0
## [55] grid_4.4.0 checkmate_2.3.1 Cubist_0.4.2.1
## [58] reshape2_1.4.4 generics_0.1.3 recipes_1.0.10
## [61] gtable_0.3.5 tzdb_0.4.0 class_7.3-22
## [64] data.table_1.15.4 hms_1.1.3 utf8_1.2.4
## [67] pillar_1.9.0 posterior_1.5.0 later_1.3.2
## [70] splines_4.4.0 survival_3.6-4 tidyselect_1.2.1
## [73] miniUI_0.1.1.1 pbapply_1.7-2 knitr_1.46
## [76] gridExtra_2.3 crul_1.4.2 stats4_4.4.0
## [79] xfun_0.44 bridgesampling_1.1-2 hardhat_1.3.1
## [82] timeDate_4032.109 matrixStats_1.3.0 rstan_2.32.6
## [85] stringi_1.8.4 yaml_2.3.8 evaluate_0.23
## [88] codetools_0.2-20 httpcode_0.3.0 cli_3.6.2
## [91] RcppParallel_5.1.7 rpart_4.1.23 xtable_1.8-4
## [94] munsell_0.5.1 jquerylib_0.1.4 globals_0.16.3
## [97] coda_0.19-4.1 rstantools_2.4.0 gower_1.0.1
## [100] bayesplot_1.11.1 Brobdingnag_1.2-9 listenv_0.9.1
## [103] mvtnorm_1.2-5 ipred_0.9-14 e1071_1.7-14
## [106] scales_1.3.0 prodlim_2023.08.28 rlang_1.1.3