This notebook develops an example of a spatial crop yield potential model. The approach combines a standard Production function (PF), which relates outputs to inputs, with a novel Site index (SI), which describes the quality of the production environment for growing a particular crop. The combined model can be expressed as y ~ f(PF, SI), and its explicit multilevel version can be used to gauge the potential productivity of croplands and to provide a comparative frame of reference for evaluating management options.
The SI is the spatial part, which we predict from remote sensing and GIS data with machine learning in R. We intend the approach to be generalizable to other crop productivity surveys and experiments in which both input/output relationships and variability in site quality are deemed important for crop management and for program impact monitoring and evaluation.
The example maize yield data we shall use were collected over two cropping seasons (2016 & 2017) by One Acre Fund (OAF) in Western Kenya as part of their annual program monitoring activities. This is a fairly large dataset with 5,987 georeferenced plots distributed over the two cropping seasons and across 333 Level-3 administrative areas (i.e., Locations). The data are split between plots belonging to farmers who were participating in the OAF maize program (treated) and those who were not (controls) at the time. Data on fertilizer use on each plot (Calcium Ammonium Nitrate & Diammonium Phosphate) and estimated plot size are also included.
Other than GPS coordinates, the original OAF dataset does not contain any data related to site quality for maize production. We will be using a raster stack with 46 layers covering anthropic, climate, organismal/vegetation, relief/topographical, and geological/soil factors. Though these data are regularly updated, the emphasis is on variables that are expected to change relatively slowly, e.g., landform and terrain, climate, soils, parent material, infrastructure, and long-term vegetation and reflectance averages, among others. The main reason for this is that these slow variables can become key controlling factors for faster variables such as crop yields, but over longer time scales. It is also opportune that most of the feature layers have been harmonized and are available for all of Africa.
The notebook is available and will be maintained on MGW's GitHub. To actually run it and to calculate and see the additional outputs, you will need to load the packages indicated in the chunk directly below. The authors of these packages are gratefully acknowledged for all of their hard work.
# package names
packages <- c("downloader", "rgdal", "sp", "raster", "quantreg", "arm", "leaflet", "htmlwidgets", "devtools", "caret", "mgcv", "MASS", "randomForest", "gbm", "nnet", "plyr", "doParallel", "dismo")
# install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(packages[!installed_packages])
}
# load packages
invisible(lapply(packages, library, character.only = TRUE))
This first section of the notebook sets up the survey data and links these to current remote sensing and GIS data for Kenya that are maintained by AfSIS (2020). The following chunk fetches the survey data as well as the remote sensing and GIS layers, which are represented by the `grids` raster stack.
# Data setup --------------------------------------------------------------
# Create a data folder in your current working directory
dir.create("OAF_data", showWarnings=F)
setwd("./OAF_data")
dir.create("./Results")
# download OAF yield data
download("https://www.dropbox.com/s/lhniws8prfvjw9r/OAF_yield_data.csv.zi?raw=1", "OAF_yield_data.csv.zip", mode = "wb")
unzip("OAF_yield_data.csv.zip", overwrite = T)
yield <- read.table("OAF_yield_data.csv", header = T, sep = ",")
yield$trt <- ifelse(yield$trt == 1, "oaf", "control")
# yield <- yield[!duplicated(yield), ] ## removes duplicates if needed
# download GADM-L3 shapefile (@ http://www.gadm.org)
download("https://www.dropbox.com/s/otspr9b9jtuyneh/KEN_adm3.zip?raw=1", "KEN_adm3.zip", mode = "wb")
unzip("KEN_adm3.zip", overwrite = T)
shape <- shapefile("KEN_adm3.shp")
# download raster stack
download("https://osf.io/4jvnu?raw=1", "KE_250m_2020.zip", mode = "wb")
unzip("KE_250m_2020.zip", overwrite = T)
glist <- list.files(pattern="tif", full.names = T)
grids <- stack(glist)
As the `yield` dataframe only covers portions of Western Kenya, we initially define the relevant Region of Interest (ROI) based on the min/max survey coordinates and crop the `grids` raster stack to that bounding box. This is mostly to reduce computing times.
# set Region of Interest grid extent
ext <- data.frame(lat = c(-1.2,-1.2,1.2,1.2), lon = c(33.9,35.5,33.9,35.5)) ## set ROI corner coordinates in degrees
coordinates(ext) <- ~ lon + lat
proj4string(ext) <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84")
ext <- spTransform(ext, CRS("+proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs"))
bb <- extent(ext)
grids <- crop(grids, bb)
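The chunk above hard-codes the corner coordinates. Since the ROI is meant to follow the min/max survey coordinates, you could equally derive it from the data; a minimal sketch, assuming the raw CSV's coordinate columns are named lon and lat (as the next chunk confirms):
# alternative: derive the ROI corners directly from the survey coordinates
ext <- expand.grid(lon = range(yield$lon), lat = range(yield$lat)) ## all 4 bounding-box corners
coordinates(ext) <- ~ lon + lat
proj4string(ext) <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84")
ext <- spTransform(ext, CRS("+proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs"))
grids <- crop(grids, extent(ext))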
This next chunk extracts the GIS and grid data to the `yield` dataframe and also does a bit of outlier cleaning. It generates a new dataframe called `gsdat`, which will be the main focus for fitting and validating the various machine learning models that are presented in subsequent sections.
# attach GADM-L3 admin unit names from shape
coordinates(yield) <- ~lon+lat
projection(yield) <- projection(shape)
gadm <- yield %over% shape
yield <- as.data.frame(yield)
yield <- cbind(gadm[ ,c(5,7,9)], yield)
colnames(yield) <- c("district","division","location","id","lat","lon","year","trt","can","dap","fsize","yield")
# project survey coords to grid CRS
yield.proj <- as.data.frame(project(cbind(yield$lon, yield$lat), "+proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs"))
colnames(yield.proj) <- c("x","y")
yield <- cbind(yield, yield.proj)
coordinates(yield) <- ~x+y
projection(yield) <- projection(grids)
# extract gridded variables at survey locations
yieldgrid <- extract(grids, yield)
gsdat <- as.data.frame(cbind(yield, yieldgrid))
# clean data
gsdat <- gsdat[complete.cases(gsdat[,c(1:3,13:44)]), ] ## removes incomplete cases
gsdat <- gsdat[which(gsdat$can < 100 & gsdat$dap < 100), ] ## removes outlier fertilizer treatments
gsdat <- gsdat[which(gsdat$fsize > 0), ] ## removes field size = 0
quant <- quantile(gsdat$yield, probs=c(0.025,0.975))
gsdat <- gsdat[which(gsdat$yield > quant[1]), ] ## removes low outlier yields (<2.5% quantile)
gsdat <- gsdat[which(gsdat$yield < quant[2]), ] ## removes high outlier yields (>97.5% quantile)
# write out clean dataframe
write.csv(gsdat, "./OAF_data/Results/OAF_gsdat.csv", row.names = F)
# yield survey location map widget
w <- leaflet() %>%
setView(lng = mean(gsdat$lon), lat = mean(gsdat$lat), zoom = 8) %>%
addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
addCircleMarkers(gsdat$lon, gsdat$lat, clusterOptions = markerClusterOptions())
saveWidget(w, 'OAF_yield_survey.html', selfcontained = T) ## save widget
w ## plot widget
We use quantile regression to fit a simple Cobb-Douglas type production function to the yield survey data … without any additional spatial features. Median regression (i.e., 50th quantile regression) is preferred to mean regression in this case because it is quite robust to outlying observations. The residuals of this function are then evaluated and classified into two groups of plots: group A, for which observed yields lie above the regression line, and group B, for which observed yields lie below it. This creates a new variable, `silab`, in the `gsdat` dataframe.
# this is a basic production function using quantile regression
si.rq <- rq(log(yield)~year+factor(trt)+log(dap+1)*log(can+1), tau = 0.5, data = gsdat)
gsdat$silab <- as.factor(ifelse(exp(predict(si.rq, gsdat)) > gsdat$yield, "B", "A")) ## classify the residuals
summary(si.rq)
##
## Call: rq(formula = log(yield) ~ year + factor(trt) + log(dap + 1) *
## log(can + 1), tau = 0.5, data = gsdat)
##
## tau: [1] 0.5
##
## Coefficients:
## Value Std. Error t value Pr(>|t|)
## (Intercept) 0.50997 0.03526 14.46249 0.00000
## year 0.20049 0.01598 12.54382 0.00000
## factor(trt)oaf 0.14166 0.01967 7.20057 0.00000
## log(dap + 1) 0.06416 0.01624 3.95019 0.00008
## log(can + 1) 0.15479 0.01941 7.97540 0.00000
## log(dap + 1):log(can + 1) -0.02296 0.00573 -4.00656 0.00006
As might be expected (and as shown in the boxplot below), there are substantial differences in observed maize yields between the two groups, which we shall refer to as Site Index labels; these differences are not attributable to differences between years, treatments (OAF treated vs. control), or the application of fertilizers. In the next sections we explore to what degree these differences may be attributable to the spatial characteristics of the production environment … in other words, the SI as indicated by `silab`.
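A minimal sketch of that boxplot from `gsdat` (the axis labels are our own):
# observed maize yields by Site Index label
par(pty="s", mar=c(4,4,1,1))
boxplot(yield ~ silab, data = gsdat, xlab = "Site Index label (silab)", ylab = "Maize yield (t/ha)", cex.lab = 1.4)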
The following chunks generate `silab` predictions using different machine learning algorithms (MLAs) with varying remote sensing and GIS (covariate) features. The main idea is to train a number of potentially contrasting models with k-fold cross-validation. At the end of the model training processes, the various models are ensembled (stacked) on an independent validation dataset. When consistently applied over time and space, this is a form of reinforcement learning, which should produce increasingly accurate predictions as new field and remote sensing data, or different MLAs, are obtained and run.
Note that you are not limited to only the models that we use. The `caret` package, which we rely on here, offers many compelling alternatives, and we encourage you to explore and experiment with those.
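To survey the alternatives, you can browse caret's model registry and look up the tuning parameters of any candidate method; a minimal sketch:
# list the model methods that caret currently supports
head(names(getModelInfo()))
# inspect the tuning parameters of a candidate alternative, e.g., regularized regression
modelLookup("glmnet")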
The following chunk scrubs some of the objects in memory and creates a randomized partition between the training and validation dataframes.
rm(list=setdiff(ls(), c("gsdat","grids","glist"))) ## scrubs extraneous objects in memory
# set calibration/validation set randomization seed
seed <- 12358
set.seed(seed)
# split data into calibration and validation sets
gsIndex <- createDataPartition(gsdat$silab, p = 4/5, list = F, times = 1)
gs_cal <- gsdat[ gsIndex,]
gs_val <- gsdat[-gsIndex,]
# Site index calibration labels
labs <- c("silab")
lcal <- as.vector(t(gs_cal[labs]))
# raster calibration features
fcal <- gs_cal[, c(13:31, 35:58)]
Note that in running the models below, everything is parallelized to facilitate efficient use of either local or cloud-based computing resources. There are also other options available for this (e.g., the foreach package, among others); see the sketch directly below.
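For instance, a purely illustrative foreach sketch of the same cluster pattern used in the chunks below (foreach is attached automatically with doParallel; the toy task is just a stand-in):
# toy example: run independent tasks across a local cluster with foreach
mc <- makeCluster(detectCores())
registerDoParallel(mc)
res <- foreach(i = 1:4, .combine = c) %dopar% sqrt(i) ## returns c(1, 1.41, 1.73, 2)
stopCluster(mc)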
This is a simple spatially smoothed generalized additive model applying the `gam` function to the SI labels at the different sampling locations in Western Kenya, based only on their georeference. It is similar to ordinary indicator kriging with cross-validation … but it is simpler and much faster to compute in this context.
# select georeference covariates
gf_cpv <- gs_cal[,32:34]
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", classProbs = T,
summaryFunction = twoClassSummary, allowParallel = T)
# model training
gm <- train(gf_cpv, lcal,
method = "gam",
preProc = c("center","scale"),
family = "binomial",
metric = "ROC",
trControl = tc)
# model outputs & predictions
summary(gm)
gm.pred <- predict(grids, gm, type = "prob") ## spatial predictions
stopCluster(mc)
fname <- paste("./OAF_data/Results/", labs, "_gm.rds", sep = "")
saveRDS(gm, fname)
Central place variables are influential wherever human impacts occur. They are correlated with both extraction and deposition of soil nutrients and toxic elements, soil erosion and deposition, acidification, and many other soil disturbance processes. The model below focuses on central place indicators such as distances to roads, settlements, surface water sources, cell towers, and electricity networks, among others.
# select central place covariates
gf_cpv <- gs_cal[,20:31]
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", classProbs = T,
summaryFunction = twoClassSummary, allowParallel = T)
# model training
gl1 <- train(gf_cpv, lcal,
method = "glmStepAIC",
family = "binomial",
preProc = c("center","scale"),
trControl = tc,
metric ="ROC")
# model outputs & predictions
gl1.pred <- predict(grids, gl1, type = "prob") ## spatial predictions
stopCluster(mc)
fname <- paste("./OAF_data/Results/", labs, "_gl1.rds", sep = "")
saveRDS(gl1, fname)
This model is very similar to the Central place model above, but it starts with all 46 spatial features and then backward-selects from those to generate a prediction via a generalized linear model. Alternatively (or additionally), you could also try regularized regression with, e.g., glmnet (a sketch follows the chunk below).
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", classProbs = T,
summaryFunction = twoClassSummary, allowParallel = T)
# model training
gl2 <- train(fcal, lcal,
method = "glmStepAIC",
family = "binomial",
preProc = c("center","scale"),
trControl = tc,
metric ="ROC")
# model outputs & predictions
gl2.pred <- predict(grids, gl2, type = "prob") ## spatial predictions
stopCluster(mc)
fname <- paste("./OAF_data/Results/", labs, "_gl2.rds", sep = "")
saveRDS(gl2, fname)
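As a sketch of the glmnet alternative mentioned above (the tuning grid values are illustrative, not tuned recommendations; this requires the glmnet package):
# alternative: elastic-net regularized regression via <glmnet>
mc <- makeCluster(detectCores())
registerDoParallel(mc)
tg <- expand.grid(alpha = c(0, 0.5, 1), lambda = 10^seq(-4, -1, by = 1)) ## illustrative tuning steps
gl3 <- train(fcal, lcal,
method = "glmnet",
preProc = c("center","scale"),
tuneGrid = tg,
trControl = tc,
metric = "ROC")
stopCluster(mc)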
The chunk below applies bagging using Breiman & Cutler's random forest algorithm with all of the feature data. A good, short reference article in this context is Barnard et al.
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", classProbs = T,
summaryFunction = twoClassSummary, allowParallel = T)
tg <- expand.grid(mtry = seq(1,5, by=1)) ## model tuning steps
# model training
rf <- train(fcal, lcal,
preProc = c("center","scale"),
method = "rf",
ntree = 501,
metric = "ROC",
tuneGrid = tg,
trControl = tc)
# model outputs & predictions
rf.pred <- predict(grids, rf, type = "prob") ## spatial predictions
stopCluster(mc)
fname <- paste("./OAF_data/Results/", labs, "_rf.rds", sep = "")
saveRDS(rf, fname)
This next chunk represents one of the boosting techniques that can be used for both regression and classification. It is similar to the `randomForest` model above, but instead of bagging (averaging over bootstrapped trees), boosting grows trees sequentially, with each new tree emphasizing the observations that the previous trees predicted poorly. There is a wide array of literature on these so-called “greedy algorithms”; very good descriptions are provided in Hastie et al. (2009).
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", classProbs = T, summaryFunction = twoClassSummary,
allowParallel = T)
## for initial <gbm> tuning guidelines see @ https://stats.stackexchange.com/questions/25748/what-are-some-useful-guidelines-for-gbm-parameters
tg <- expand.grid(interaction.depth = seq(2,5, by=1), shrinkage = 0.01, n.trees = seq(101,501, by=50),
n.minobsinnode = 50) ## model tuning steps
# model training
gb <- train(fcal, lcal,
method = "gbm",
preProc = c("center", "scale"),
trControl = tc,
tuneGrid = tg,
metric = "ROC")
# model outputs & predictions
gb.pred <- predict(grids, gb, type = "prob") ## spatial predictions
stopCluster(mc)
fname <- paste("./OAF_data/Results/", labs, "_gb.rds", sep = "")
saveRDS(gb, fname)
The last model we fit here is a simple feed-forward neural network with a single hidden layer, which combines sets of learned weights with the feature vector. Note that more complex multilayer network structures, such as deep learning models, could be applied, but we leave those options for you to explore.
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", classProbs = T,
summaryFunction = twoClassSummary, allowParallel = T)
tg <- expand.grid(size = seq(2,10, by=2), decay = c(0.001, 0.01, 0.1)) ## model tuning steps
# model training
nn <- train(fcal, lcal,
method = "nnet",
preProc = c("center","scale"),
tuneGrid = tg,
trControl = tc,
metric ="ROC")
# model outputs & predictions
nn.pred <- predict(grids, nn, type = "prob") ## spatial predictions
stopCluster(mc)
fname <- paste("./OAF_data/Results/", labs, "_nn.rds", sep = "")
saveRDS(nn, fname)
The main point here is not to select the best individual model but rather to evaluate the combination of the previously fitted models against a 20% hold-out validation dataset. This provides robust statistical estimates of how the different models should be weighted against one another in an ensemble. It also prevents most overfitting problems.
# Model stacking setup ----------------------------------------------------
preds <- stack(gm.pred, gl1.pred, gl2.pred, rf.pred, gb.pred, nn.pred)
names(preds) <- c("gm","gl1","gl2","rf","gb","nn")
# plot(preds, axes = F)
# extract model predictions
coordinates(gs_val) <- ~x+y
projection(gs_val) <- projection(preds)
gspred <- extract(preds, gs_val)
gspred <- as.data.frame(cbind(gs_val, gspred))
# stacking model validation labels and features
gs_val <- as.data.frame(gs_val)
lval <- as.vector(t(gs_val[labs]))
fval <- gspred[,60:65] ## subset validation features
The following chunk fits the model ensemble with the `glmStepAIC` function from the `MASS` library. You could explore other options here, but we find that this provides a reasonable combination and weighting of the 6 models that were produced in the individual model training steps.
# start doParallel to parallelize model fitting
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# control setup
set.seed(1385321)
tc <- trainControl(method = "cv", classProbs = T,
summaryFunction = twoClassSummary, allowParallel = T)
# model training
si <- train(fval, lval,
method = "glmStepAIC",
family = "binomial",
metric = "ROC",
trControl = tc)
# model outputs & predictions
si.pred <- predict(preds, si, type = "prob") ## spatial predictions
stopCluster(mc)
fname <- paste("./OAF_data/Results/", labs, "_si.rds", sep = "")
saveRDS(si, fname)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.01376 -1.01259 -0.07114 1.05821 1.80634
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.8304 0.3807 4.807 1.53e-06 ***
## gm -2.4022 0.8182 -2.936 0.003326 **
## gl1 2.1924 0.9245 2.371 0.017722 *
## rf -1.3214 0.3780 -3.496 0.000472 ***
## gb -2.0871 0.8549 -2.441 0.014633 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1574.8 on 1135 degrees of freedom
## Residual deviance: 1429.9 on 1131 degrees of freedom
## AIC: 1439.9
##
## Number of Fisher Scoring iterations: 4
SI theoretically takes on values between 0 and 1, just like a probability. The chunk below generates the prediction map of the SI across the OAF Area of Interest in Western Kenya.
# project si.pred to EPSG:3857
sill <- projectRaster(si.pred, crs="+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
# set color palette
pal <- colorBin("Greens", domain = 0:1)
# render map
w <- leaflet() %>%
addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
addRasterImage(sill, colors = pal, opacity = 0.5) %>%
addLegend(pal = pal, values = values(sill), title = "Site index")
w ## plot widget
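As with the survey location map, you can also save this widget to an HTML file (the filename here is just illustrative):
saveWidget(w, 'OAF_SI_prediction.html', selfcontained = T) ## save widget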
This next chunk provides some prediction checks on the ensemble model, based on both the validation and the complete dataframes. It also writes out all of the prediction grids to a GeoTIFF file, which you can import into a GIS of your choosing for display, query, and additional analyses.
# validation set receiver-operator characteristics ------------------------
cp_pre <- predict(si, fval, type="prob")
cp_val <- cbind(lval, cp_pre)
cpa <- subset(cp_val, lval=="A", select=c(A)) ## ensemble probabilities of A for observed A plots
cpb <- subset(cp_val, lval=="B", select=c(A)) ## ensemble probabilities of A for observed B plots
cp_eval <- evaluate(p=cpa[,1], a=cpb[,1]) ## calculate ROC's on test set (~0.71)
# plot(cp_eval, 'ROC') ## plot ROC curve
# generate feature mask ---------------------------------------------------
t <- threshold(cp_eval) ## calculate classification threshold based on validation set ROC
r <- matrix(c(0, t[,1], 0, t[,1], 1, 1), ncol=3, byrow = T) ## set threshold value <kappa>
mask <- reclassify(si.pred, r) ## reclassify stacked predictions
# write prediction grids to geotif ----------------------------------------
gspreds <- stack(preds, si.pred, mask)
names(gspreds) <- c("gm","gl1","gl2","rf","gb","nn","si","mk")
fname <- paste("./OAF_data/Results/","OAF_", labs, "_preds_2020.tif", sep = "")
writeRaster(gspreds, filename=fname, datatype="FLT4S", options="INTERLEAVE=BAND", overwrite=T)
# Site Index prediction check ---------------------------------------------
coordinates(gsdat) <- ~x+y
projection(gsdat) <- projection(grids)
gspre <- extract(gspreds, gsdat)
gsout <- as.data.frame(cbind(gsdat, gspre))
gsout$mzone <- as.factor(ifelse(gsout$mk == 1, "A", "B"))
confusionMatrix(gsout$mzone, gsout$silab) ## overall prediction accuracy stats
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B
## A 2362 797
## B 480 2043
##
## Accuracy : 0.7753
## 95% CI : (0.7642, 0.7861)
## No Information Rate : 0.5002
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5505
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8311
## Specificity : 0.7194
## Pos Pred Value : 0.7477
## Neg Pred Value : 0.8098
## Prevalence : 0.5002
## Detection Rate : 0.4157
## Detection Prevalence : 0.5560
## Balanced Accuracy : 0.7752
##
## 'Positive' Class : A
##
We use a multilevel model to combine the predicted SIs with the previously presented aggregate production function, using the `arm` package. In this particular example we specify a random intercept model in which the model intercepts, i.e., the efficiency parameters of the combined model, are allowed to vary by `year` and administrative `location`.
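Written out schematically (in our notation), the random intercept model fitted in the next chunk, for plot $i$ in year $j$ and location $k$, is:

$$\log(y_{ijk}) = \beta_0 + u_j + v_k + \beta_1\,\mathrm{trt}_i + \beta_2\,\mathrm{logit}(SI_i) + \beta_3\,\mathrm{trt}_i\,\mathrm{logit}(SI_i) + f(\mathrm{can}_i,\mathrm{dap}_i) + \varepsilon_{ijk}$$

where $\mathrm{logit}(SI) = \log(SI/(1-SI))$, $u_j \sim N(0,\sigma^2_{year})$ and $v_k \sim N(0,\sigma^2_{location})$ are the random intercepts for year and location, and $f(\mathrm{can},\mathrm{dap})$ collects the log-fertilizer main effects and their interaction.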
yld.lme <- lmer(log(yield)~factor(trt)*log(si/(1-si))+log(can+1)*log(dap+1)+(1|year)+(1|location), data = gsout)
gsout$yldf <- exp(fitted(yld.lme)) ## back-transform fitted values to t/ha
summary(yld.lme) ## mixed model yield estimate results
## Linear mixed model fit by REML ['lmerMod']
## Formula: log(yield) ~ factor(trt) * log(si/(1 - si)) + log(can + 1) *
## log(dap + 1) + (1 | year) + (1 | location)
## Data: gsout
##
## REML criterion at convergence: 4753.9
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.9302 -0.5510 0.0789 0.6418 3.8937
##
## Random effects:
## Groups Name Variance Std.Dev.
## location (Intercept) 0.03146 0.1774
## year (Intercept) 0.01496 0.1223
## Residual 0.12318 0.3510
## Number of obs: 5682, groups: location, 333; year, 2
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 0.678748 0.089457 7.587
## factor(trt)oaf 0.156138 0.011612 13.447
## log(si/(1 - si)) 0.491290 0.011551 42.532
## log(can + 1) 0.091452 0.011559 7.912
## log(dap + 1) 0.037891 0.008783 4.314
## factor(trt)oaf:log(si/(1 - si)) -0.035383 0.011700 -3.024
## log(can + 1):log(dap + 1) -0.009537 0.003485 -2.737
##
## Correlation of Fixed Effects:
## (Intr) fctr() l(/(-s lg(c+1) lg(d+1) f():-s
## factr(trt)f 0.020
## lg(s/(1-s)) 0.010 0.041
## log(can+1) -0.138 -0.231 -0.055
## log(dap+1) -0.184 -0.080 -0.056 0.348
## f():(/(1-s) -0.026 -0.067 -0.547 0.087 0.073
## lg(+1):(+1) 0.149 0.022 0.040 -0.865 -0.646 -0.052
There are many ways to quantify the uncertainty inherent in these predictions. We take a simple but quite robust approach using quantile regression with the quantreg package. The plot below shows the spread of the ROI-wide predictions (i.e., their 95% prediction intervals).
stQ <- rq(yield~yldf, tau=c(0.025,0.5,0.975), data=gsout) ## quantile regression fit
print(stQ)
## Call:
## rq(formula = yield ~ yldf, tau = c(0.025, 0.5, 0.975), data = gsout)
##
## Coefficients:
## tau= 0.025 tau= 0.500 tau= 0.975
## (Intercept) -0.4208390 0.002452232 1.740640
## yldf 0.6182988 1.027190821 1.219315
##
## Degrees of freedom: 5682 total; 5680 residual
par(pty="s", mar=c(4,4,1,1))
plot(yield~yldf, data=gsout, xlab="Yield potential (t/ha)", ylab="Measured yield (t/ha)", cex.lab=1.4,
xlim=c(0,8), ylim=c(0,8))
curve(stQ$coefficients[2]*x+stQ$coefficients[1], add=T, from=0, to=8, col="blue", lwd=2)
curve(stQ$coefficients[4]*x+stQ$coefficients[3], add=T, from=0, to=8, col="red", lwd=2)
curve(stQ$coefficients[6]*x+stQ$coefficients[5], add=T, from=0, to=8, col="blue", lwd=2)
abline(c(0,1), col="grey", lwd=1)
We can now write out the complete output dataframe, including all of the predictions for reproducibility, reference and reuse.
# Write output data frame -------------------------------------------------
write.csv(gsout, "./OAF_data/Results/OAF_gsout.csv", row.names = F)