Introduction
General data setup
Predicting maize grain Zn levels from soil properties with MLAs
Predicting maize grain Zn from soil and plant MIR spectra
Takeaways

Introduction

Micronutrients i.e., the essential vitamins and minerals, are vital for healthy child development as well as for the prevention of morbidity and disease in adults. With the exception of vitamin D, micronutrients are not produced in the body and must be obtained from the diet. Although people only need small amounts of micronutrients, consuming the recommended amounts is important, as micronutrient deficiencies (MNDs) can have severe consequences on both health and well-being. However, national surveys containing dietary micronutrient intake data are not routinely collected in most developing countries because they are expensive and time consuming to obtain, both in the field and the laboratory. Also, the databases that are needed to analyze dietary intakes should be geographically explicit to be effective, but they are often completely lacking and/or not up-to-date in Africa. Similarly, agronomic interventions that focus on plant micronutrients for improving cropland and livestock productivity are inadequately supported by spatial and temporal data.

Zinc (Zn) deficiency is among the 5 most widespread MNDs (including iron, vitamin A, iodine, folate, in addition to zinc) that are of global concern (Bailey et al., 2015). Diagnosing zinc deficiency in maize grain is the focus of this notebook. Zn is a trace element found in varying concentrations in all soils, plants and animals. It is essential for the normal growth of higher plants and animals including humans. While Zn is needed in only small quantities, if those amounts are inadequate, plants and/or animals will suffer from physiological stress brought about by the dysfunction of several enzyme systems and other metabolic functions in which Zn plays a key role (Wessels and Brown, 2012). In Malawi, Zn deficiency prevalence rates of ~60% in people have been reported based on a national survey of serum biomarker Zn concentrations (Likoswe et al., 2020). The prevalence may be even higher in rural areas (Siyame et al., 2013). The main causes of Zn deﬁciencies in food crops are soil-related including: total soil Zn content, redox potential and pH, mineralogy, the activity of soil microorganisms, presence of other nutrients, and climate conditions, see e.g., (Noulas et al., 2018).

Spectral signatures of soils, plants, animal tissues, and materials generally are defined by their reflectance or absorbance as a function of wavelength in the electromagnetic spectrum. Under laboratory conditions, the signatures result from electronic transitions of atoms and vibrational stretching and bending of structural groups of atoms that form molecules or crystals. Spectroscopy has been shown to provide highly repeatable, rapid and low cost measurements of many different soil, plant and animal tissue properties in numerous studies. The amount of light absorbed by a sample can be measured with minimal sample preparation (primarily drying and fine grinding) across a wide range of ultra-violet (UV), visible (VIS), near (NIR) and mid-infrared (MIR) wavelengths to provide a unique spectral signature of a soil or plant sample. An individual MIR measurement can be performed in about 30 seconds, in contrast to more conventional soil and plant tests, which are typically slow, labor-intensive, expensive and/or use hazardous chemicals, which have to be handled carefully.

In this notebook, we shall explore if Zn concentrations in maize grain can be reliably predicted from cheap, fast, non-destructive and non-hazardous MIR spectral data (features), using ICP-MS measurements as reference data. The analytical reference methods are described in (Gashu et al., 2021).

The two main questions posed are:

Can Zn concentrations in soils be used to predict Zn concentrations / deficiencies in maize grain? While there should be strong bio-kinetic links between soils and plants, the differential uptake and bioavailability of soil Zn in food crop tissues is complex and is not well quantified or predicted.
Can we use MIR spectra to predict Zn concentrations in maize grain reliably? If so, this would advance and enable localized Zn deficiency diagnostics and risk assessments that could potentially be carried out routinely, at low cost, across large geographical regions (ROI) and populations of interest.

The notebook is also intended for self-learning in R. It does not go into the details of the underlying spectroscopy, which is widely covered in the literature, but instead it focuses on Chemometrics and the associated computing workflows that are needed to generate useful predictions from a population of spectral signatures of (features) relative to their corresponding reference measurements (labels). Chemometric techniques are used extensively in analytical chemistry and metabolomics. In this particular example, we shall use maize grain Zn reference data, and MIR spectral (soil and crop) data that were collected as part of the ongoing GeoNutrition project in Malawi. The described workflows might also be transferable to diagnosing other MNDs in soils and crops and/or to other geographies via e.g., transfer learning and/or other approaches … tbd.

General data setup

To actually run the notebook, you will need to load the packages indicated in the chunk directly below. This allows you to assemble the reference wet chemistry and spectral data frames, providing a lot of options that generate spectral predictions of Zn concentrations in maize grain and soils. The notebook itself is maintained on Github, and you can fork and modify it from there as you see fit. Also, the html version of this will skip certain repetitive sections, but these are actually included in the markdown script for reference.

# Package names
packages <- c("osfr", "caret", "caretEnsemble", "MASS", "pls", "glmnet", "randomForest", "gbm", "xgboost", "Cubist", "quantreg", "leaflet", "htmlwidgets", "plyr", "dplyr", "doParallel")

# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

The next chunk downloads the data needed for running this example. It assembles the various soil and crop reference measurements collected across Malawi and links those to Bruker HTS-XT spectra. Note that this chunk is Mac or Linux specific and so the directory structures would need to be changed slightly to run on Windows machines.

# Create a data folder in your current working directory
dir.create("MW_Zn_data", showWarnings=F)
setwd("./MW_Zn_data")
dir.create("./Results")

# Data download
osf_retrieve_file("pdgk6") %>% osf_download(conflicts = "overwrite")

## # A tibble: 1 x 4
##   name           id                       local_path       meta            
##   <chr>          <chr>                    <chr>            <list>          
## 1 MW_Zn_data.zip 60d482a5e779a50089a09bf0 ./MW_Zn_data.zip <named list [3]>

unzip("MW_Zn_data.zip", overwrite = T)
wchem <- read.table("MW_Zn_data.csv", header=T, sep=",") 
sspec <- read.table("soil_MIR_spectra.csv", header=T, sep=",")
cspec <- read.table("plant_MIR_spectra.csv", header=T, sep=",")

# Merge the wet chemistry reference data frame with the corresponding spectra
sdata <- merge(wchem, sspec, by = "ssid") ## soil data frame
sdata <- sdata[complete.cases(sdata[ ,c(7:3594)]), ] ## removes incomplete cases
cdata <- merge(wchem, cspec, by = "csid") ## crop data frame
cdata <- cdata[complete.cases(cdata[ ,c(7:3594)]), ]

# Download figures
osf_retrieve_file("42gry") %>% osf_download(conflicts = "overwrite")

## # A tibble: 1 x 4
##   name        id                       local_path    meta            
##   <chr>       <chr>                    <chr>         <list>          
## 1 figures.zip 6045ff4dfbab5700d375b927 ./figures.zip <named list [3]>

unzip("figures.zip", overwrite = T)

The following chunk then writes out initial data frames MW_soil_Zn_data.csv and MW_crop_Zn_data.csv into your ./MW_Zn_data/Results directory if you’d like to process those outputs in software other than R. They are augmented with other variable and predictions in subsequent sections of the notebook. It also generates an overview map of where in Malawi those soil and plant samples were obtained. The spatial sampling frame that was used for collecting the samples is described in Gashu et al., (2021).

# Write out reference data frame
write.csv(sdata, "./MW_Zn_data/Results/MW_soil_Zn_data.csv", row.names = F)
write.csv(cdata, "./MW_Zn_data/Results/MW_crop_Zn_data.csv", row.names = F)

# Soil sample locations
w <- leaflet() %>%
  setView(lng = mean(wchem$lon), lat = mean(wchem$lat), zoom = 7) %>%
  addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
  addCircleMarkers(wchem$lon, wchem$lat, clusterOptions = markerClusterOptions())
w ## plot widget

Predicting maize grain Zn levels from soil properties with MLAs

This section introduces the main machine learning workflow of the notebook. It uses topsoil (0-20 cm) reference wet chemistry soil features to predict maize grain Zn concentrations. The heuristics that might be used for diagnosing Zn deficiencies in crop tissues with gold-standard reference criteria and measurements for both soils and crops are operationally constrained. Soils are more easily collected and measure in an appropriate laboratory setting over the course of any given year. Whereas, the collection and analysis of grain samples is restricted to very short time periods during or immediately after harvest. Both are expensive and time consuming. So, gold standard laboratory measurements would probably be quite hard to achieve in most operational and micronutrient monitoring settings in Africa, given the associated expense, time constraints and potential lab delays. Nonetheless, they provide a baseline to compare to the spectral models that are developed in the subsequent sections.

MLA training, validation and prediction workflow.

We will be using an algorithmic (ML), rather than a data modeling based approach (Breiman, 2001). See the figure directly above. To start the model fitting processes covered in this section, the next chunk scrubs some of the extraneous objects in memory, sets-up labels and features, and creates a randomized (80 / 20%) partition between the training and validation dataframes.

rm(list=setdiff(ls(), c("sdata", "cdata"))) ## scrubs extraneous objects in memory

# Set randomization seed
seed <- 1235813
set.seed(seed)

# Split data into calibration and validation sets
gsIndex <- createDataPartition(sdata$Zn, p = 8/10, list=F, times = 1)
cal <- sdata[ gsIndex,]
val <- sdata[-gsIndex,]

# Set calibration labels
labs <- c("Zn") ## maize grain label
lcal <- as.vector(t(cal[labs]))

# Calibration features
wcal <- cal[ ,7:17]

Model calibration with `caretEnsemble`

The soil features in this initial model are are wet chemistry based and include: pH, cation exchange capacity, soil organic carbon, metal oxalates and various soil Zn extractions see (Gashu et al., 2021). We calibrate 3 models using the caretEnsemble package with k-fold cross-validation and default-tuning of the relevant hyperparameters. Note that if you would like more control over the individual model fits, you can also use their respective tuneList arguments. Also note that all of the calculations can be (are) parallelized to make efficient use of either local, GPU or cloud-based computing resources.

# Start doParallel to parallelize model fitting
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Specify model training controls
tc <- trainControl(method = "cv", number = 10, allowParallel = TRUE, savePredictions="final")

# Fit 3 calibration models using the soil wet chemistry features
wlist <- caretList(wcal, lcal,
                   trControl = tc,
                   tuneList = NULL,
                   methodList = c("rf", "gbm", "cubist"),
                   preProcess = c("center","scale"),
                   metric = "RMSE")

stopCluster(mc)
fname <- paste("./MW_Zn_data/Results/", labs, "_wlist.rds", sep = "")
saveRDS(wlist, fname)

Model stacking with `caret`

The point here is not to evaluate a best individual model but rather to evaluate the combination of the previously fitted models against a 20% hold-out validation dataset. This provides robust statistical estimates of how the different models should be weighted against one-another. This next chunk fits the model ensemble with the glmStepAIC function from the MASS library using the validation dataframe. You should explore other options here, but the current stacking option provides a reasonable combination and weighting of the 3 models that were produced in the previous ensemble training steps.

# Individual model predictions on the validation set
val$rf <- predict(wlist$rf, val)
val$gb <- predict(wlist$gbm, val)
val$cu <- predict(wlist$cubist, val)

# Set labels and features
lval <- as.vector(t(val[labs]))
fval <- val[ ,3595:3597] ## validation features

# Start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Model setup and fitting
set.seed(seed)
tc <- trainControl(method="repeatedcv", number = 10, repeats = 3, allowParallel=T)

sw <- train(fval, lval,
            method = "glmStepAIC",
            trControl = tc,
            metric = "RMSE")

val$sw <- predict(sw, val) ## stacked predictions
stopCluster(mc)
fname <- paste("./MW_Zn_data/Results/", labs, "_sw.rds", sep = "")
saveRDS(sw, fname)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -11.5967   -2.5884   -0.2892    2.4692   13.1087  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -3.5176     3.4709  -1.013  0.31161   
## rf            0.5021     0.1592   3.154  0.00177 **
## cu            0.6571     0.2273   2.891  0.00411 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 15.60032)
## 
##     Null deviance: 5912.1  on 319  degrees of freedom
## Residual deviance: 4945.3  on 317  degrees of freedom
## AIC: 1792.2
## 
## Number of Fisher Scoring iterations: 2

Ensemble prediction uncertainty estimates

There are many ways to quantify the uncertainty inherent in these predictions. We take a simple but quite robust approach here using quantile regression (quantreg). The main interest is in the overall spread of the ensemble predictions; that is, the 90% prediction intervals for the entire dataset.

sdata$rf <- predict(wlist$rf, sdata)
sdata$gb <- predict(wlist$gbm, sdata)
sdata$cu <- predict(wlist$cubist, sdata)
sdata$sw <- predict(sw, sdata)

# Quantile regression fit on whole dataset
swQ <- rq(Zn~sw, tau=c(0.05,0.5,0.95), data = sdata)

## 
## Call: rq(formula = Zn ~ sw, tau = c(0.05, 0.5, 0.95), data = sdata)
## 
## tau: [1] 0.05
## 
## Coefficients:
##             Value     Std. Error t value   Pr(>|t|) 
## (Intercept) -10.87148   1.14578   -9.48829   0.00000
## sw            1.30571   0.06048   21.58805   0.00000
## 
## Call: rq(formula = Zn ~ sw, tau = c(0.05, 0.5, 0.95), data = sdata)
## 
## tau: [1] 0.5
## 
## Coefficients:
##             Value     Std. Error t value   Pr(>|t|) 
## (Intercept) -14.03015   0.75421  -18.60246   0.00000
## sw            1.65357   0.03587   46.10100   0.00000
## 
## Call: rq(formula = Zn ~ sw, tau = c(0.05, 0.5, 0.95), data = sdata)
## 
## tau: [1] 0.95
## 
## Coefficients:
##             Value     Std. Error t value   Pr(>|t|) 
## (Intercept) -16.17836   2.20727   -7.32956   0.00000
## sw            1.98099   0.09877   20.05669   0.00000

par(pty="s", mar=c(4,4,1,1))
plot(Zn~sw, xlab="Soil wet-chemistry ensemble grain Zn prediction (ppm)", ylab="Measured grain Zn (ppm)", cex.lab=1.2, xlim=c(10,50), ylim=c(10,50), sdata)
curve(swQ$coefficients[2]*x+swQ$coefficients[1], add=T, from=10, to=50, col="blue", lwd=2)
curve(swQ$coefficients[4]*x+swQ$coefficients[3], add=T, from=10, to=50, col="red", lwd=2)
curve(swQ$coefficients[6]*x+swQ$coefficients[5], add=T, from=10, to=50, col="blue", lwd=2)
abline(c(0,1), col="grey", lwd=1)

Relationship between predicted (based on soil wet chemistry properties) and measured maize grain zinc (Zn) levels. The blue lines are the 5% and 95% quantile regression estimates. The red line is the regression at the median,

The predictions are clearly offset, but we can adjust for that. Note that this does not distract from the notion that the wet chemistry soil properties do indeed appear to be quite useful for predicting maize grain Zn levels in Malawi, plus or minus some expected outlier noise. The next short chunk adjusts for the 50% quantile (median) bias and offset and is used to generate the associated figure below.

# Offset adjustment
sdata$sw50 <- swQ$coefficients[4]*sdata$sw+swQ$coefficients[3] ## 50% quantile adjustment

# Refit quantile regression
sw50Q <- rq(Zn~sw50, tau=c(0.05,0.5,0.95), data = sdata)

Relationship between bias and offset adjusted predictions from soil data and measured maize grain zinc (Zn) levels.

The takeaway from this section is that Zn levels in maize grain are reasonably well-predicted from “gold-standard”, wet chemistry soil properties. The residual differences may be due external, environmental factors at the time of sample collection, laboratory measurement variability and/or the inability of the MLAs that were used for calibration and validation to capture the underlying bio-kinetics. The randomForest model was a clear winner here, but it is unusual for a single MLA to outperform all other contrasting MLAs i.e., generalized boosting and cubist in this case.

Predicting maize grain Zn from soil and plant MIR spectra

One potentially problematic factor for many MLAs in chemometric applications is the “Curse of dimensionality”. In this particlar case the curse manifests in the high dimensionality of the spectral data: w = 3,575 individual wavebands for both the soil and plant spectra. Many of those wavebands are strongly correlated, particularly those proximal to one another on the NIR / MIR reflectance / absorbance spectrum. These types of data pose generalization and prediction challenges for MLAs that involve e.g., Bagging, Boosting, Bayesian and/or Deep learning approaches, and can lead to overfitting and its consequent model validation / generalization problems. Generally, the data would need to be dimensionally reduced and decorrelated to be reliably fit by these types of algorithms.

Other algorithms such as Partial least squares regression (PLS), Ridge regression and/or Lasso regression handle the high-dimensional data and collinearities reasonably well in a linear context but are frequently not very good at predicting the quite common non-linear and/or threshold relationships between the spectral signatures and their reference measurements. This section of the notebook covers workflows for both types of MLAs and looks at how those might be usefully combined (stacked).

Spectral feature conversions / pre-processing

There are some ways of mitigating MLA prediction trade-offs in the data wrangling and conversion steps of the workflow described here. The most common approach is to reduce the dimensionality and collinearity of the (spectral) data. There are a number of pre-processing techniques that can be applied in that context including: Principal components analysis (PCA), Independent components analysis (ICA) as well as other signal processing techniques such as Non-negative matrix factorization (NMF). We use PCA here … however, feel free to experiment. The next chunk appends the resulting principal component scores to the soil reference and spectral data frame, sdata.

# Soil spectral principal components
sspec.pca <- prcomp(sdata[ ,19:3594], center=T, scale=T) ## centered and scaled soil spectral PCAs
spcas <- predict(sspec.pca, sdata)
spcas <- spcas[ ,1:20] ## save the first 20 components, which explain ~95% of the total soil spectral variability
fname <- paste("./MW_Zn_data/Results/", "spec_spcas.rds", sep = "")
saveRDS(sspec.pca, fname) ## saves soil spectral PCA model

# Merge files
sdata <- cbind(sdata, spcas)
write.csv(sdata, "./MW_Zn_data/Results/MW_soil_Zn_data.csv", row.names = F)

Calibration and prediction with soil spectra

We initially calibrate 3 MLAs to the soil spectral principal components data with 10-fold cross-validation. Note that the calibration code pattern is similar to the wet-chemistry-based pattern that was generated above. We have just substituted the soil spectral principal components for the wet chemistry features but are running the same MLAs for comparison.

# Set randomization seed
seed <- 1235813
set.seed(seed)

# Split data into calibration and validation sets
gsIndex <- createDataPartition(sdata$Zn, p = 8/10, list=F, times = 1)
cal <- sdata[ gsIndex,]
val <- sdata[-gsIndex,]

# Set calibration labels
labs <- c("Zn") ## maize grain label
lcal <- as.vector(t(cal[labs]))

# Calibration features
scal <- cal[ ,3600:3619] ## soil spectral PCAs

# Start doParallel to parallelize model fitting
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Specify model training controls
tc <- trainControl(method = "cv", number = 10, allowParallel = TRUE, savePredictions="final")

# Fit 3 calibration models using the spectral principal component features
slist <- caretList(scal, lcal,
                   trControl = tc,
                   tuneList = NULL,
                   methodList = c("rf", "gbm", "cubist"),
                   preProcess = c("center","scale"),
                   metric = "RMSE")

stopCluster(mc)
fname <- paste("./MW_Zn_data/Results/", labs, "_slist.rds", sep = "")
saveRDS(slist, fname)

We ensemble the 3 models as before using the fitted slist models on the validation data frame …

# Individual model predictions on the validation set
val$rf_slist <- predict(slist$rf, val)
val$gb_slist <- predict(slist$gbm, val)
val$cu_slist <- predict(slist$cubist, val)

# Set labels and features
lval <- as.vector(t(val[labs]))
fval <- val[ ,3620:3622] ## validation features

# Start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Model setup and fitting
set.seed(seed)
tc <- trainControl(method="repeatedcv", number = 10, repeats = 3, allowParallel=T)

ss <- train(fval, lval,
            method = "glmStepAIC",
            trControl = tc,
            metric = "RMSE")

val$ss <- predict(ss, val) ## stacked predictions
stopCluster(mc)
fname <- paste("./MW_Zn_data/Results/", labs, "_sstack.rds", sep = "")
saveRDS(ss, fname)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -9.6490  -2.7355  -0.4478   2.6640  13.6802  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -9.3589     4.5879  -2.040  0.04219 * 
## rf_slist      0.6538     0.3544   1.845  0.06598 . 
## cu_slist      0.7711     0.2866   2.690  0.00752 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 15.9316)
## 
##     Null deviance: 5912.1  on 319  degrees of freedom
## Residual deviance: 5050.3  on 317  degrees of freedom
## AIC: 1799
## 
## Number of Fisher Scoring iterations: 2

This is the slist prediction plot below. Note that the intermediate chunks are not shown in this notebook; however, they are included in the markdown document should you wish to check.

Relationship between gain and offset adjusted spectral predictions and measured maize grain zinc (Zn) levels.

We can now also compare how the spectral predictions perform relative to the much more expensive and time consuming soil wet chemistry based predictions. The figure below shows that relationship.

Relationship between bias and offset adjusted wet chemistry and spectral predictions.

The takeaway from this section is that maize grain Zn levels are reliably predicted using soil MIR spectra and standard, quite fast MLAs. In this particular case there was actually no need to pursue much slower and/or computationally more challenging algorithms such as pls, glmnet and/or xgbLinear that are often used for regulariztion and variable selection in the analyses of high-dimensional, chemometric datasets. A simple PCA data reduction step of the raw MIR spectra appears to have been quite sufficient for both ensemble model calibration and validation.

Calibration and prediction with crop spectra

Again, we initially calibrate 3 MLAs to the soil spectral principal components data with 10-fold cross-validation. Note that the calibration and validation code patterns are similar to both the wet-chemistry based and the soil MIR spectral patterns that were generated above. We have just substituted the crop spectral principal components for the soil principal component features but are running the same MLAs for comparison. Because the script is very similar to that in the above sections, most of it is not shown in the notebook. But of course, you can find everything in the markdown document on Github, should you wish to explore more.

# Individual model predictions on the validation set
val$rf_clist <- predict(clist$rf, val)
val$gb_clist <- predict(clist$gbm, val)
val$cu_clist <- predict(clist$cubist, val)

# Set labels and features
lval <- as.vector(t(val[labs]))
fval <- val[ ,3615:3617] ## validation features

# Start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Model setup and fitting
set.seed(seed)
tc <- trainControl(method="repeatedcv", number = 10, repeats = 3, allowParallel=T)

sc <- train(fval, lval,
            method = "glmStepAIC",
            trControl = tc,
            metric = "RMSE")

val$sc <- predict(sc, val) ## stacked predictions
stopCluster(mc)
fname <- paste("./MW_Zn_data/Results/", labs, "_cstack.rds", sep = "")
saveRDS(sc, fname)

The initial grain Zn predictions using crop MIR spectra do not look as promising as those derived from the soil data. This is somewhat counterintuitive given that the spectra were measured on the grain itself. So, we will also try a variable regularization and selection approach here that fits 3 additional MLAs, pls, glmnet and xgbLinear, and will then re-ensemble all of the available individual predictions. Note that this next calibration step can take up to ~1 hour to run on a normal computer with 8 cores and 16 Gb of RAM … there are just a lot of spectral features to sort through. Again, you can find the script in the associated markdown document on Github.

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -11.4075   -2.7573   -0.1389    1.9735   16.2498  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.33426    3.81935  -0.611 0.541530    
## rf_clist    -0.51011    0.33486  -1.523 0.128679    
## gb_clist     0.64702    0.31913   2.027 0.043459 *  
## gl_rlist     0.82735    0.21540   3.841 0.000148 ***
## xl_rlist     0.14819    0.09496   1.560 0.119648    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 16.16485)
## 
##     Null deviance: 6091.3  on 319  degrees of freedom
## Residual deviance: 5091.9  on 315  degrees of freedom
## AIC: 1805.6
## 
## Number of Fisher Scoring iterations: 2

# All feature stack
fval <- select(cdata,19:3594) ## full spectral signatures
pval <- select(cdata,3595:3614) ## PCA scores

# Individual model predictions on the validation set
cdata$rf_clist <- predict(clist$rf, pval)
cdata$gb_clist <- predict(clist$gbm, pval)
cdata$cu_clist <- predict(clist$cubist, pval)
cdata$pl_rlist <- predict(rlist$pls, fval)
cdata$gl_rlist <- predict(rlist$glmnet, fval)
cdata$xl_rlist <- predict(rlist$xgbLinear, fval)
cdata$sc <- predict(sc, cdata)

# Quantile regression fit on whole dataset
scQ <- rq(Zn~sc, tau=c(0.05,0.5,0.95), data = cdata)
summary(scQ)

## 
## Call: rq(formula = Zn ~ sc, tau = c(0.05, 0.5, 0.95), data = cdata)
## 
## tau: [1] 0.05
## 
## Coefficients:
##             Value   Std. Error t value Pr(>|t|)
## (Intercept) 3.79252 2.06477    1.83678 0.06643 
## sc          0.52581 0.09445    5.56714 0.00000 
## 
## Call: rq(formula = Zn ~ sc, tau = c(0.05, 0.5, 0.95), data = cdata)
## 
## tau: [1] 0.5
## 
## Coefficients:
##             Value   Std. Error t value Pr(>|t|)
## (Intercept) 9.81017 1.63486    6.00063 0.00000 
## sc          0.53663 0.07538    7.11862 0.00000 
## 
## Call: rq(formula = Zn ~ sc, tau = c(0.05, 0.5, 0.95), data = cdata)
## 
## tau: [1] 0.95
## 
## Coefficients:
##             Value    Std. Error t value  Pr(>|t|)
## (Intercept) 19.59619  3.35017    5.84931  0.00000
## sc           0.45137  0.14737    3.06292  0.00223

Relationship between gain and offset adjusted spectral predictions and measured maize grain zinc (Zn) levels.

The takeaway from this section is that the measured Zn levels in maize grain are currently not well predicted by maize grain MIR spectra. We are not sure about the causes of this but it does not seem to be an underlying problem of the MLAs that were used. One possibility could be that VIS / NIR spectra might actually be more useful in this context, but this is certainly an open question. Another possibility is that the data are simply misaligned in the database. For the time being we essentially appear to be just fitting noise here. Useful dignostics in this context would be to cross-check the crop spectra against other elements e.g., grain phosphorus, calcium, magnesium, iron levels etc.

Takeaways

The main takeaways from this notebook are the following:

Maize grain Zn levels are well predicted from both soil wet-chemistry and MIR data with standard MLAs. The main benefit of using MIR in this case is to increase lab throughputs and to reduce the costs of obtaining risk factor data for micronutrient deficiency monitoring. This might advance localized Zn deficiency diagnostics and risk assessments that could potentially be carried out routinely, at low cost, and across large geographical regions and populations of interest.
Maize grain Zn levels could not be reliably predicted from the corresponding maize grain MIR spectra. This is a somewhat surprising and counterintuitive result that should be explored further, given the quite reasonable predictions that were obtained using the soil spectra. As a first check, we recommend that the data are examined again to ensure that the crop Zn labels are properly aligned with their corresponding MIR features in the database that was available to us.
There is scope within the GeoNutrition project to also test this approach for other crop mineral micronutrient levels (e.g., calcium, iron, iodine and selenium) that might of concern in particular regions of interest and/or in other crops and crop tissues. Additional surveys of the geographical distributions of Indicator plants, other than maize, might also be useful for determining where specific micronutrient deficiencies occur in cropland landscapes.
Computationally, all of the code presented in this notebook runs fairly fast and probably could be automated, such that when a (soil) MIR measurement is received in a laboratory, a maize grain Zn prediction could be issued in near-real time. Note that for other spectrometer setups the associated calibration / validation transfer steps would need to be taken into account. This is also a good reason to properly curate any physical soil reference samples.

Any questions or comments about this notebook are most welcome via AFSIS.

Chemometric workflows for predicting zinc levels in maize grain

M.G. Walsh, P.C. Nalivata, D. Gashu, S. Gameda, E.K. Towett, S.P. MacGrath, R.M. Lark and M.R. Broadley

Last compiled on 10, July, 2021

Introduction

General data setup

Predicting maize grain Zn levels from soil properties with MLAs

Model calibration with `caretEnsemble`

Model stacking with `caret`

Ensemble prediction uncertainty estimates

Predicting maize grain Zn from soil and plant MIR spectra

Spectral feature conversions / pre-processing

Calibration and prediction with soil spectra

Calibration and prediction with crop spectra

Takeaways

Chemometric workflows for predicting zinc levels in maize grain

M.G. Walsh, P.C. Nalivata, D. Gashu, S. Gameda, E.K. Towett, S.P. MacGrath, R.M. Lark and M.R. Broadley

Last compiled on 10, July, 2021

Introduction

General data setup

Predicting maize grain Zn levels from soil properties with MLAs

Model calibration with caretEnsemble

Model stacking with caret

Ensemble prediction uncertainty estimates

Predicting maize grain Zn from soil and plant MIR spectra

Spectral feature conversions / pre-processing

Calibration and prediction with soil spectra

Calibration and prediction with crop spectra

Takeaways

Model calibration with `caretEnsemble`

Model stacking with `caret`