1 Introduction

Soil aggregate stability refers to the ability of soil aggregates to resist disintegration when disruptive forces associated with e.g., tillage operations and erosion occur. Aggregate stability indicates how well soils can resist compaction, wind abrasion, rainfall detachment and atmospheric and/or overland transport. It is a dynamic soil physical chemistry property, which is important for water infiltration, retention and drainage, soil aeration, microbial activity, organic matter storage and stabilization and plant root growth, among others. When soil aggregates disintegrate e.g., during tillage operations or rainstorms, dispersed particles fill soil pore spaces causing plow layer compaction, hardpans and soil crusts to develop. Once such structures are formed, infiltration is reduced, which can result in increased runoff, erosion and reduced water availability for plant growth. Typically, the associated processes will occur in reinforcing feedback loops in landscapes. Changes in aggregate stability can be used as early indicators of soil degradation or recovery and to monitor the effects and impacts of land management interventions. However, conventional approaches that use techniques such as wet sieving and weighing, slake tests or hydrometer and pipetting methods are time consuming and labor intensive. This has largely precluded their operational use in landscape-level assessment and monitoring applications.


**Examples of extremely dispersive, sodic soils in Northern Tanzania.**

Examples of extremely dispersive, sodic soils in Northern Tanzania.

Laser diffraction particle size analysis (LDPSA) under different dispersion treatments in e.g., in air, water and/or sodium hexametaphosphate (calgon), with or without sonication can serve as indicators of soil particle size distributions and aggregate stability for environmental, ecological and engineering purposes. The big advantage of LDPSA is that it can be performed rapidly (in < 5 minutes) using small (< 5 g) quantities of soil. The method works by passing a soil sample through a laser beam, which scatters the incident light onto a Fourier lens. The lens focuses the scattered laser light onto a detector array using an inversion algorithm, and a particle size distribution is inferred from the collected data. Mie theory is applied to provide a compositionally-based distribution of particle sizes based on the correlation between the intensity and the angle of the light that is scattered. The approach generates compositional data (CoDa), which are data that sum up to a constant (e.g., 1 or 100%). Modeling CoDa requires log ratio transformations to retain this inherent closure constraint (see e.g., Greenacre, 2021).

The main goal of this notebook is to illustrate a workflow for assessing soil aggregate stability in landscapes. It is about using linear models to estimate experimental dispersion treatment effects and the uncertainty in those estimates. It does not go into the details of LDPSA data collections or the associated laboratory analysis procedures (for method references see e.g., Kerry et.al., (2009). Instead, it focuses on the associated statistical steps that are needed to generate potentially useful inferences and predictions for populations of LDPSA measurements in landscapes in this context, which take both the CoDa and the ordinal nature of the laboratory dispersal treatments into account.

2 Initial data setup

To actually run this notebook, you will need to load the packages indicated in the chunk directly below. The notebook itself is maintained on Github, and you can fork and modify it from there as you see fit.

# Package names
packages <- c("osfr", "tidyverse", "compositions", "leaflet", "ggtern", "ordinal", "metafor",
              "arm", "quantreg")

# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

The data needed for running this notebook can be obtained from the Africa Soil Information Service Open Science Framework repository here. The next chunk reads in the data, calculates the Isometric Log Ratio (ilr) transforms and the Mean Weight Diameter (mwd) for the soil particle size fractions in different dispersants (air, water or calgon) and sonication time treatments. These will be the LDPSA indicators that are used to model and interpret soil aggregate stability in this notebook.

# Create a data folder in your current working directory
dir.create("ldpsa", showWarnings = F)
setwd("./ldpsa")
dir.create("Results", showWarnings = F)

# LDPSA data download from https://osf.io/yr8ua/
osf_retrieve_file("yr8ua") %>% osf_download(conflicts="skip")
unzip("LDPSA_60.zip", overwrite = T)
ldps <- read.table("LDPSA_60.csv", header=T, sep=",")

# Main wrangling steps
fractions <- c("sand", "silt", "clay")
cdat <- as.data.frame(acomp(ldps[fractions], total=100))

# Isometric log ratio (ilr) transforms
bpart <- t(matrix(c( 1,-1,-1,
                     0, 1,-1), ncol=3, nrow=2, byrow=T))

idat <- as.data.frame(ilr(cdat, V=bpart))
names(idat) <- c("ilr1", "ilr2")
ldps <- cbind(ldps, idat)

# Mean Weight Diameter calculation (micrometers)
ldps$mwd <- exp(0.01*(ldps$clay*log(0.001) + ldps$silt*log(0.026) + ldps$sand*log(1.025)))*1000

# load and merge covariates
cvar <- read.table("LDPSA_covar.csv", header=T, sep=",")
oxid <- read.table("LDPSA_oxides.csv", header=T, sep=",")
ldps <- merge(ldps, cvar, by="ssid")
ldps <- merge(ldps, oxid, by="ssid")

# Display dataframe structure
glimpse(ldps)

2.1 Data dictionary

The data were sampled from 60, 10 × 10 km sentinel sites (or landscapes, Vagen et. al., 2015) in proportion to their occurrence in the major Köppen-Geiger climate zones of Africa excluding deserts, urban and other non-photosynthetically active land areas. Each sentinel site represents a 100 km2 area, each with a total of 160, georeferenced soil profile sampling locations within each sentinel site. A randomly selected ~10% subset of (1,891 / 19,200) soil samples were analyzed with LDPSA using 9 dispersion treatments described by:

  • dispersant, disp = in air, water or calgon, and
  • sonication time, stime = 0, 0.5, 2 or 4 minutes for samples that were dispersed in either water or calgon.

We shall be using only 3 laboratory treatments here to describe, model and interpret soil aggregate stability. Those treatments include:

  • dispersal in calgon (with 4 minutes of sonication, trt = c4). These are the housekeeping or (endogenous control) LDPSA measurements.
  • dispersal in water (without sonication, trt = w1), and
  • dispersal in air (without sonication, trt = a).

The data also contain the following main grouping factors for the random-effects and meta-regression analyses that are presented here:

  • sentinel site, site = 60
  • georeferenced soil profile id in site, pid = 955
  • soil sample id in profile, ssid = 1888

Finally, the data also contain the following covariate measurements that we shall use in subsequent mixed-effects modeling:

  • pH (in water)
  • electrical conductivity (ec, dS m-1)
  • extractable cation exchange capacity (ecec, cmolc kg-1)
  • soil organic carbon content (soc, g kg-1)
  • isometric log ratio of CaO + Na2O | Al2O3, K2O proportions (v1)
  • isometric log Al2O3 | K2O proportions (v2)

Note that the last two covariate measurements are oxide ratios, which are often used to describe chemical weathering or alteration patterns in soils (see e.g., Price and Velbel, 2003). You can find the code for doing the relevant calculations with X-Ray Fluorescence measurements at Github.

**Figure 1:** Compositional XRF measurements of metal oxides related to chemical weathering dynamics in soils in relation to compositional measurements of pure clay minerals (labeled crosses) and the Upper Continental Crust (UCC) standard.

Figure 1: Compositional XRF measurements of metal oxides related to chemical weathering dynamics in soils in relation to compositional measurements of pure clay minerals (labeled crosses) and the Upper Continental Crust (UCC) standard.

The ternary diagram in Figure 1 can be interpreted as potential soil chemical weathering sequences with an ultimate entropic end-state described by the Al2O3 archetype (in red). Soils in this category are geologically old and often problematic for agricultural management in terms of their mineralogy, nutrient reserves and physical properties.

2.2 Spatial distribution of soil samples

# Select soil sample  locations 
loc <- subset(ldps, trt == "c4", select=c(ssid, lat, lon))

# Plot sample locations
w <- leaflet() %>%
  setView(lng = mean(loc$lon), lat = mean(loc$lat), zoom = 4) %>%
  addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
  addCircleMarkers(loc$lon, loc$lat, clusterOptions = markerClusterOptions())
w ## plot widget 

2.3 Soil particle size fractions by treatment combinations

water <- subset(ldps, disp == "water", select=c(sand, clay, silt, stime, mwd)) 

wtex_son <- ggtern(
  data = water, aes(x = sand, y = clay, z = silt))+
  geom_point(aes(color = stime), size= 1, alpha = 0.5)+
  scale_color_continuous(low = "dodgerblue", high = "tomato")+
  labs(color = "Sonication time (minutes)")+
  theme_bw()+
  theme(legend.justification = c(0, 1), 
        legend.position      = c(0, 1))

wtex_mwd <- ggtern(
  data = water, aes(x = sand, y = clay, z = silt))+
  geom_point(aes(color = mwd), size = 1, alpha = 0.5)+
  scale_color_continuous(low="dodgerblue", high="tomato")+
  labs(color="MWD (microns)")+
  theme_bw()+
  theme(legend.justification = c(0, 1), 
        legend.position      = c(0, 1))

Note that an identical chunk is replicated for the samples that were dispersed in calgon. This is not included in the html here to keep the text concise. However, it is included in the actual markdown doc on Github. The chunks generate ternary plots using the ggtern package. The figures appear convincing in that as a whole they show that dispersion in both water and calgon work in the expected direction in response to the combination of the dispersant and the sonication treatments.

calgon <- subset(ldps, disp == "calgon", select=c(sand, clay, silt, stime, mwd))

ctex_son <- ggtern(
  data = calgon, aes(x = sand, y = clay, z = silt))+
  geom_point(aes(color = stime), size= 1, alpha = 0.5)+
  scale_color_continuous(low = "dodgerblue", high = "tomato")+
  labs(color = "Sonication time (minutes)")+
  theme_bw()+
  theme(legend.justification = c(0, 1), 
        legend.position      = c(0, 1))

ctex_mwd <- ggtern(
  data = calgon, aes(x = sand, y = clay, z = silt))+
  geom_point(aes(color = mwd), size = 1, alpha = 0.5)+
  scale_color_continuous(low="dodgerblue", high="tomato")+
  labs(color="MWD (microns)")+
  theme_bw()+
  theme(legend.justification = c(0, 1), 
        legend.position      = c(0, 1))
**Figure 2:** Ternary plots of the particle size distributions of LDPSA samples dispersed in either in water (top) or calgon (bottom). Color coded by mean weight diameter (`mwd`, left), and sonication time (`stime`, right).**Figure 2:** Ternary plots of the particle size distributions of LDPSA samples dispersed in either in water (top) or calgon (bottom). Color coded by mean weight diameter (`mwd`, left), and sonication time (`stime`, right).**Figure 2:** Ternary plots of the particle size distributions of LDPSA samples dispersed in either in water (top) or calgon (bottom). Color coded by mean weight diameter (`mwd`, left), and sonication time (`stime`, right).**Figure 2:** Ternary plots of the particle size distributions of LDPSA samples dispersed in either in water (top) or calgon (bottom). Color coded by mean weight diameter (`mwd`, left), and sonication time (`stime`, right).

Figure 2: Ternary plots of the particle size distributions of LDPSA samples dispersed in either in water (top) or calgon (bottom). Color coded by mean weight diameter (mwd, left), and sonication time (stime, right).

The next chunk produces cumulative empirical cumulative distributions of the variables of potential interest in differentiating between the main LDPSA dispersion treatments.

# select dispersion treatments
c4 <- subset(ldps, trt=="c4", select=c(sand, silt, clay, ilr1, ilr2, mwd)) 
w1 <- subset(ldps, trt=="w1", select=c(sand, silt, clay, ilr1, ilr2, mwd))
air <- subset(ldps, trt=="a", select=c(sand, silt, clay, ilr1, ilr2, mwd)) 

# Sand
par(mfrow=c(2,3), mar=c(4,2,2,2))
plot(ecdf(c4$sand), main="", xlab="Sand (%)", ylab="ECDF", xlim=c(0, 100),
     verticals=T, lty=1, lwd=1, cex.lab=1.3, col="tomato", do.points=F)
abline(0.5,0, lty=1, col="grey")
plot(ecdf(w1$sand), add=T, verticals=T, lty=1, lwd=1, col="dodgerblue", do.points=F)
plot(ecdf(air$sand), add=T, verticals=T, lty=1, lwd=1, col="dark green", do.points=F)

# Silt
plot(ecdf(c4$silt), main="", xlab="Silt (%)", ylab="", xlim=c(0, 100), verticals=T,
     lty=1, lwd=1, cex.lab=1.3, col="tomato", do.points=F)
abline(0.5,0, lty=1, col="grey")
plot(ecdf(w1$silt), add=T, verticals=T, lty=1, lwd=1, col="dodgerblue", do.points=F)
plot(ecdf(air$silt), add=T, verticals=T, lty=1, lwd=1, col="dark green", do.points=F)

# Clay
plot(ecdf(c4$clay), main="", xlab="Clay (%)", ylab="", xlim=c(0, 100), verticals=T,
     lty=1, lwd=1, cex.lab=1.3, col="tomato", do.points=F)
abline(0.5,0, lty=1, col="grey")
plot(ecdf(w1$clay), add=T, verticals=T, lty=1, lwd=1, col="dodgerblue", do.points=F)
plot(ecdf(air$clay), add=T, verticals=T, lty=1, lwd=1, col="dark green", do.points=F)

# Mean weight diameter
plot(ecdf(c4$mwd), main="", xlab="MWD (microns)", ylab="ECDF", xlim=c(0, 800),
     verticals=T, lty=1, lwd=1, cex.lab=1.3, col="tomato", do.points=F)
abline(0.5,0, lty=1, col="grey")
plot(ecdf(w1$mwd), add=T, verticals=T, lty=1, lwd=1, col="dodgerblue", do.points=F)
plot(ecdf(air$mwd), add=T, verticals=T, lty=1, lwd=1, col="dark green", do.points=F)

# ILR1 (sand | silt, clay)
plot(ecdf(c4$ilr1), main="", xlab="ILR1 (sand | silt, clay)", ylab="",
     xlim=c(-6, 6), verticals=T, lty=1, lwd=1, cex.lab=1.3, col="tomato", do.points=F)
abline(0.5,0, lty=1, col="grey")
plot(ecdf(w1$ilr1), add=T, verticals=T, lty=1, lwd=1, col="dodgerblue", do.points=F)
plot(ecdf(air$ilr1), add=T, verticals=T, lty=1, lwd=1, col="dark green", do.points=F)

# ILR2 (silt | clay)
plot(ecdf(c4$ilr2), main="", xlab="ILR2 (silt | clay)", ylab="",
     xlim=c(-4, 4), verticals=T, lty=1, lwd=1, cex.lab=1.3, col="tomato", do.points=F)
abline(0.5,0, lty=1, col="grey")
plot(ecdf(w1$ilr2), add=T, verticals=T, lty=1, lwd=1, col="dodgerblue", do.points=F)
plot(ecdf(air$ilr2), add=T, verticals=T, lty=1, lwd=1, col="dark green", do.points=F)
**Figure 3:** Empirical cumulative distributions of LDPSA treatment in calgon, water and air. Subsamples shown in red were dispersed in calgon with 4 minutes of sonication; subsamples in blue were dispersed in water without sonication; subsamples in green were dispersed in air.

Figure 3: Empirical cumulative distributions of LDPSA treatment in calgon, water and air. Subsamples shown in red were dispersed in calgon with 4 minutes of sonication; subsamples in blue were dispersed in water without sonication; subsamples in green were dispersed in air.

The main message conveyed by Figures 2 & 3 is that 4 minutes of sonication in calgon is quite effective in destroying both air-dry and water-dispersed soil aggregates. The two isometric log ratios of the particle size data (ilr1 & ilr2) and the mean weight diameter (mwd) graphs also demonstrate the ordinality of these treatments quite well. Subsamples analyzed in air were minimally dispersed; subsamples analyzed in water (without sonication) were moderately dispersed, and subsamples dispersed in calgon with 4 minutes of sonication (the endogenous controls) were maximally dispersed.

3 EDA with graphical interaction models

The purpose of this section of the notebook is to further clarify any existing relationships/dependencies, between our proposed dispersion/flocculation index and the main covariates that are used in this notebook. It is not intended as a formal causal analysis and is primarily to assist with variable selections in any subsequent statistical or predictive modeling tasks.

To run this section you will initially need to install the relevant packages. This is a bit of an art form depending on the version of R that you are currently running. Both the gRim and statnet packages that we shall be using for these analyses and their visualization have dependencies to other packages e.g., bioconductor, which can be a bit tricky depending on their versioning cycle on CRAN. The next chunks are what worked for us using our current R.Version(), but this might change and you may need to adjust accordingly depending on your setup.

# Install <bioconductor> first ... see: http://www.bioconductor.org/install/
# do this only once
# if (!requireNamespace("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")
# BiocManager::install(version = "3.13")

# note that this will query you in the console as to whether to update other associated packages ... (a/s/n). Reply with n if all of the listed packages are up to date.

# Then install and load gRbase, gRim and statnet 
packages <- c("gRbase", "gRim", "statnet")

# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

Similarly, the statnet package, which we will use for graphical model visualization currently may have some specific installation requirements that you can track on their Github.

We start with a saturated model and then fit the relevant interactions with a backward selection procedure.

# Select and transform variables
gvars <- del[ ,c(8,10:15)]
gvars$soc <- log(gvars$soc)
gvars$ec <- log(gvars$ec)
gvars$ecec <- log(gvars$ecec)

mm0 <- cmod(~.^., data = gvars)
mm1 <- backward(mm0, k=log(nrow(gvars)))
mmf <- ugList(terms(mm1), result="matrix")
net <- as.network(x = mmf, directed = FALSE, loops = FALSE, matrix.type = "adjacency")

This next chunk produces the associated interaction graph.

par(mar=c(0,0,0,0))
plot.network(net, vertex.col = "light grey", vertex.cex = 7, displaylabels = T, 
             label.pos = 5)
**Figure 5:** Interaction graph of covariate soil properties associated with the proposed dispersion/flocculation index (DFI).

Figure 5: Interaction graph of covariate soil properties associated with the proposed dispersion/flocculation index (DFI).

Under the given graphical interaction model assumptions, the corresponding maximal cliques (variable groupings) of the graph (shown in Figure 5) can be identified with:

getCliques(mmf)
## [[1]]
## [1] "ecec" "v1"   "ph"   "soc" 
## 
## [[2]]
## [1] "ecec" "v1"   "ph"   "v2"  
## 
## [[3]]
## [1] "ecec" "v1"   "dfi"  "soc" 
## 
## [[4]]
## [1] "ecec" "ec"   "dfi"

4 Landscape-level soil aggregate stabilities

4.1 Fixed-effects models

Fixed-effects regressions (similar to those used in Section 3) assume that there is no hierarchical structure in the data; i.e., instances where data points are not nested, grouped or correlated in higher order categories (e.g. soil profiles within landscapes). We initially model the values in air (Δ air) and in water (Δ water) to assess the population-level effects of the included covariates.

# Δ air
da0 <- lm(da~v1*v2, data = del)
da1 <- lm(da~ph+log(ec)+log(ecec)+log(soc), data = del)
da2 <- lm(da~ph+log(ec)+log(ecec)+log(soc)+v1*v2, data = del)
da_anova <- anova(da0, da1, da2)
del$da2_fit <- fitted(da2, del)

# Δ water
dw0 <- lm(dw~v1*v2, data = del)
dw1 <- lm(dw~ph+log(ec)+log(ecec)+log(soc), data = del)
dw2 <- lm(dw~ph+log(ec)+log(ecec)+log(soc)+v1*v2, data = del)
dw_anova <- anova(dw0, dw1, dw2)
del$dw2_fit <- fitted(dw2, del)

We can plot the selected model fits with:

par(pty="s", mar=c(4,4,1,1))
plot(dw~dw2_fit, xlim = c(-5,16), ylim = c(-5,16), 
     xlab = "Fitted Δ water",
     ylab = "Measured Δ water", cex.lab=1.3, data=del)
abline(c(0,1), col="tomato", lwd=2)

plot(da~da2_fit, xlim = c(-5,16), ylim = c(-5,16), 
     xlab = "Fitted Δ air",
     ylab = "Measured Δ air", cex.lab=1.3, data=del)
abline(c(0,1), col="tomato", lwd=2)
**Figure 6:** Fitted vs observed cumulative link index values in air or water relative to dispersion in the calgon with sonication (`c4`) treatment based on fixed effects models.**Figure 6:** Fitted vs observed cumulative link index values in air or water relative to dispersion in the calgon with sonication (`c4`) treatment based on fixed effects models.

Figure 6: Fitted vs observed cumulative link index values in air or water relative to dispersion in the calgon with sonication (c4) treatment based on fixed effects models.

4.2 Random and mixed-effects models

Fixed-effects models such as those shown in section 3.1 fail to recognize the intra-class correlations of clustering. Alternatively in random and mixed-effects (or multilevel) models the data being analysed are drawn from a hierarchy of different (sub)populations whose differences relate to that hierarchy. In this context, a mixed effects model that takes site (landscape) level differences can be as expressed as:

\[ \text{DFI}_{ij} = (\beta_0 + u_i) + \beta_1 \text{(dw2_fit)}_{ij} + \beta_2 \text{(da2_fit)}_{ij} + e_{ij} \\ u_i \sim N(0, \sigma^2_i), \ e_{ij} \sim N(0,\sigma^2_{ij}) \]

Note that the expression contains both fixed-effects fits (from the fixed-effects regression above), as well as two random-effects. The \(u_i\) term is a site-level random effect. It measures the difference between the average DFI value of the ith site and the average DFI value in the population of sites. The \(e_{ij}\) term is a profile-specific random (residual) effect. It expresses the difference between the jth profile DFI value from the average DFI of the ith site. We can calculate these differences and plot them with the following chunks.

# random effects models
dfi0 <- lmer(dfi~ 1 + (1|site), data = del)
dfi1 <- lmer(dfi~da2_fit+dw2_fit + (1|site), data = del)
dfi2 <- lmer(dfi~da2_fit*dw2_fit + (1|site), data = del)
mix_anova <- anova(dfi0, dfi1, dfi2)
del$dfit <- fitted(dfi1, del)
par(pty="s", mar=c(4,4,1,1))
plot(dfi~dfit, xlim=c(-11,6), ylim=c(-11,6), 
     xlab = expression(paste("Fitted DFI ",("Δc4"[water] - "Δc4"[air]))),
     ylab = expression(paste("Measured DFI ",("Δc4"[water] - "Δc4"[air]))), cex.lab=1.3,
     data = del)
abline(c(0,1), col="tomato")
**Figure 7:** Profile-level fitted vs observed DFI values. The blue lines represent the 95% quantile interval of the measured DFIs.

Figure 7: Profile-level fitted vs observed DFI values. The blue lines represent the 95% quantile interval of the measured DFIs.

site <- data.frame(ranef(dfi0))
site <- site[3:5]
names(site) <- c("site", "yi", "si")
site$vi <- site$si^2

# summarize random effects
ran <- rma(yi-1.5, vi, data = site)

We can also plot the site-level random intercepts with:

op <- par(cex=0.9, font=4)
par(mar=c(4,4,1,1))

forest(ran, header="Site", slab = site,
       xlab = expression(paste("Fitted DFI ",("Δc4"[water] - "Δc4"[air]))), digits = 1,
       addpred = TRUE)
**Figure 8:** Forest plot showing the site-level differences in modeled dispersion / flocculation index (DFI) values. Soils on sites with DFI > 0 will tend to flocculate in water. Soils on sites with DFI < 0 will tend to disperse.

Figure 8: Forest plot showing the site-level differences in modeled dispersion / flocculation index (DFI) values. Soils on sites with DFI > 0 will tend to flocculate in water. Soils on sites with DFI < 0 will tend to disperse.

5 Main takeaways

Any questions or comments are most welcome via AFSIS. The notebook itself is maintained on Github, and you can fork and modify it from there as you see fit.