Introduction
Irrigation plays a prominent role in shaping the cropland landscapes of Africa, a continent known for its diverse ecosystems, climatic variations, and farming challenges. With a rapidly growing population and unpredictable weather patterns exacerbated by climate change, the importance of irrigation in Africa cannot be overstated.
Importance of irrigation
Irrigation will be essential for improving food security in Africa. Rainfed agriculture remains susceptible to droughts, which lead to crop failures, food shortages, and produce of poor nutritional quality. By providing controlled water supplies to crops, irrigation mitigates the risks associated with climatic uncertainties. Consistent water access allows for multiple cropping seasons, improved crop yields, improved nutrient use efficiencies, and a diversified range of high(er) value crops.
Irrigation contributes to economic growth and poverty alleviation. Agriculture remains the mainstay of African economies, employing a large share of the workforce, much of it made up of smallholder farmers. Improved irrigation systems and practices can increase agricultural productivity, potentially leading to high-quality surplus produce that can be sold in local and international markets. This generates farmer income and stimulates economic growth at various levels, from rural communities to national economies.
Irrigation can play a crucial role in environmental and ecological sustainability. Sustainable irrigation practices can lead to efficient water usage, preventing over-extraction of groundwater and reducing the strain on natural water sources. Properly managed irrigation can minimize soil erosion and degradation, preserving soil quality for future generations. Additionally, irrigation opens opportunities for controlled water management, enabling the cultivation of crops that are generally well-suited to local conditions, but which may not thrive under rainfed agriculture.
However, it is important to note that while irrigation offers numerous benefits, it should be managed carefully. Improper irrigation practices can lead to water pollution, waterlogging, soil salinization, aquifer depletion, and other forms of environmental degradation. It is crucial to implement evidence-based, sustainable irrigation practices that embrace modern information technology and knowledge sharing to ensure the responsible use of water resources.
Objectives of this notebook
The main objective of this notebook is to introduce R code for labeling, exploring, and discovering the spatial distribution of smallholder irrigation in Uganda. This markdown notebook is maintained and updated on OSF here; you can fork and alter it from there for your own reference and use.
Data setup
To actually run this notebook, you will need to install and load the R-packages indicated in the chunk directly below.
# Package names
packages <- c("osfr", "htmlwidgets", "leaflet", "DT", "rgdal", "raster", "doParallel", "caret",
"caretEnsemble", "randomForest", "xgboost", "glmnet", "pROC", "caTools")
# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
utils::install.packages(packages[!installed_packages])
}
# Load packages
invisible(lapply(packages, library, character.only = TRUE))
Load field survey data
The data we shall be using were collected by Government of Uganda field staff in 2023. The next chunk loads the irrigation and crop survey data from your working directory. We also include observations regarding the distribution of commonly irrigated crops; modeling the distribution of irrigated crops is the subject of a separate (companion) notebook. Note that at the time of writing these data were not publicly available. We will try to convince the powers that be and the data owners that making the data publicly available is a useful thing to do that adds to their value.
plots <- read.table("./data/UG_irrigated_plots.csv", header = T, sep = ",")
plots$irrigated <- ifelse(plots$ir1 == 'yes' & plots$ws1 == 'yes', 'a', 'b') ## a = present, b = absent
plots$irrigated <- as.factor(plots$irrigated)
crops <- read.table("./data/UG_irrigated_crops.csv", header = T, sep = ",")
crops$onion.1 <- NULL ## removes the duplicated onion column
irdat <- merge(plots, crops, by="sid")
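Before mapping, it is worth running a few quick sanity checks on the merged dataframe. This small chunk is an optional convenience and assumes the merged columns shown above.
# Quick sanity checks on the merged survey data (optional)
dim(irdat) ## number of matched plot records and variables
table(irdat$irrigated) ## class balance: a = present, b = absent
summary(irdat[ , c("lon", "lat")]) ## coordinate ranges should fall within Uganda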
An overview map of where the irrigation observations were collected in Uganda is generated in the next chunk. You can click and zoom into the individual locations that have been recorded thus far. Locations where irrigation is present are shown in blue, and locations where irrigation is absent are shown in red.
col <- ifelse(irdat$irrigated == 'a', "blue", "red")
# Sample locations
w <- leaflet() %>%
setView(lng = mean(irdat$lon), lat = mean(irdat$lat), zoom = 7) %>%
addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
addCircleMarkers(irdat$lon, irdat$lat,
color = col,
stroke = FALSE,
fillOpacity = 0.8,
radius = 5,
clusterOptions = markerClusterOptions(maxClusterRadius = 30))
saveWidget(w, 'UG_irrigation_locs.html', selfcontained = T) ## save leaflet map
w ## plot widget
Load and extract raster features
You can download all of the raster files that are needed for running the next chunks from our OSF repository here. If you use that option, ensure that all of the files are put into the ./grids directory.
dir.create("grids", showWarnings = FALSE)
dir.create("figures", showWarnings = FALSE)
dir.create("learners", showWarnings = FALSE)
dir.create("maps", showWarnings = FALSE)
dir.create("data", showWarnings = FALSE)
dir.create("sae", showWarnings = FALSE)
# alternatively you can also use **osfr** for the download
# osf_retrieve_node("szw7q") %>%
# osf_ls_files() %>%
# osf_download(path = "./grids", conflicts = "overwrite")
glist <- list.files(path="./grids", pattern="tif", full.names = TRUE)
grids <- stack(glist)
# Survey variable selection and projection
vars <- c('sid', 'lon', 'lat', 's2', 's3', 'ir1', 'ws1', 'irrigated', 'tomato',
'cabbage', 'eggplant', 'greens', 'peppers', 'pulses', 'onion', 'nuts',
'banana', 'melon', 'passion', 'taro', 'potato', 'beans', 'cane', 'sunflower',
'papaya', 'pineapple', 'rice')
irdat <- irdat[vars] ## select variables
ir.proj <- project(cbind(irdat$lon, irdat$lat),
                   "+proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs")
colnames(ir.proj) <- c('x','y')
irdat <- cbind(irdat, ir.proj)
coordinates(irdat) <- ~x+y
projection(irdat) <- projection(grids)
# Extract gridded variables at survey locations
irgrid <- extract(grids, irdat)
irdat <- as.data.frame(cbind(irdat, irgrid))
# Write out `irdat` dataframe for reuse
write.csv(irdat, "./data/UG_irrigation_all.csv", row.names = FALSE)
The pre-processed Uganda raster data (in the grids raster stack) we will be using were derived and projected (to CRS = +proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs) from their primary open sources. Short descriptions of the 34 included rasters and links to their original sources are shown in the table below.
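For a quick check that everything loaded as expected, you can inspect the stack directly. This is an optional addition rather than part of the modeling workflow.
# Inspect the raster stack (optional)
nlayers(grids) ## should return 34
names(grids) ## feature (layer) names
projection(grids) ## should match the laea CRS given above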
We will also be using the most recent GeoSurvey-based land cover map of Uganda. Note that the areas highlighted by the red boxes in the legend of Figure 1 are of primary interest here because they identify the main cropland cover types in the country, which define our overall Region of Interest (ROI) in Uganda. Any areas falling outside of the ROI are masked out of subsequent spatial predictions. You can find out how the constituent layers of this map were created by downloading one of the AfSIS land cover classification notebooks from our OSF repository here. It is also available on the AfSIS website at https://africasoils.info.
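As an illustration of the masking step, a cropland mask can be applied to any of the prediction rasters produced later in this notebook with raster::mask. Note that UG_cropmask.tif is a hypothetical file name used only for this sketch; substitute the actual GeoSurvey-derived cropland layer.
# Hypothetical ROI masking sketch; `UG_cropmask.tif` is an assumed file name
# in which cropland cells = 1 and all other cells = NA
roi <- raster("./grids/UG_cropmask.tif")
sta.masked <- mask(sta.map, roi) ## `sta.map` is generated in the Spatial predictions section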
Predicting irrigation
We are using a stacked generalization approach (Wolpert, 1992) in this section of the notebook. Stacked Generalization, often referred to as stacking, is an ensemble learning technique that seeks to improve model predictions by combining the outputs of multiple models, potentially of different types, in a strategic manner. This is how it works:
Split the Data: Divide the data into representative training (calibration) and test (validation) sets.
Base Models: You start by training several different predictive models using your training dataset. These models can be of any type and are often diverse, such as decision trees, support vector machines, neural networks, etc. They are known as base (or level-0) models.
Hold-Out or Cross-Validation Predictions: Next, you use these base models to make predictions on a separate dataset, which can either be a hold-out validation set or generated through cross-validation. Essentially, you’re interested in capturing the predictions each model makes on data it hasn’t seen before.
Stacking: The predictions made by the base models are then used as input features to train a higher-level model, known as a meta-model or a level-1 model. The idea is that this meta-model learns how to best combine the predictions from the base models to make a final prediction.
Final Prediction: When you need to make predictions on new data, you first run the data through all the base models, take their predictions, and then input those predictions into the meta-model. The meta-model’s output is the final prediction.
The intuition behind stacking is that while individual models may have particular strengths and weaknesses, a meta-model may be able to learn the best way to combine their predictions, to achieve better overall performance. Stacking has frequently been shown to outperform other ensemble methods in various tasks, given a careful choice of features, base-models, and stacking approach (Kaggle AI report, 2023).
Data split
This first chunk sets an 80/20 calibration/validation partition. It is always best to do this as early as possible to avoid any data leakage. The createDataPartition function in the caret package can be used to generate randomized splits of the data. If the argument to this function is a factor, random sampling occurs within each class and preserves the overall class distribution of the data. We partition on the irrigated variable here.
# Set calibration/validation set randomization seed
seed <- 12358
set.seed(seed)
irdat <- irdat[sample(1:nrow(irdat)), ] # randomize observations
irdat <- na.omit(irdat)
# Split data into calibration and validation sets
irIndex <- createDataPartition(irdat$irrigated, p = 4/5, list = F, times = 1)
ir_cal <- irdat[ irIndex,]
ir_val <- irdat[-irIndex,]
# Select calibration labels and gridded features
lcal <- ir_cal[ , 8] ## column 8 = the `irrigated` factor
fcal <- ir_cal[ , c(28:41, 43:61)] ## gridded feature columns extracted from `grids`
# Select validation labels and gridded features
lval <- ir_val[ , 8]
fval <- ir_val[ , c(28:41, 43:61)]
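Because createDataPartition samples within each class, the class proportions of the two sets should be nearly identical; the optional check below confirms this.
# Confirm that the split preserved the class distribution (optional)
prop.table(table(lcal)) ## calibration set class proportions
prop.table(table(lval)) ## validation set class proportions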
Base-model training
This chunk fits 3 presence/absence base-models using the gridded calibration data with 10-fold cross-validation. You can learn more about how these machine learning algorithms (MLAs) work by looking at: randomForest, xgboost, and glmnet. The 3 base-models represent some of the main approaches to ML from tabular data (Kaggle AI report, 2023), i.e., bootstrap aggregation, gradient boosting, and L1/L2 regularization. You can use caretEnsemble instead of caret as long as the feature variables (the grids in this case) and the trainControl methods are the same for each model in the caretList function. This shortens the length of this notebook but does not otherwise affect overall caret functionality. Note, however, that the calculations take a bit of time to run on a normal 8-core, 32 GB memory computer, even when engaging all cores. We initially fit the models with default tuning of the relevant hyperparameters. Should the need arise, we can always do more fine-tuning later.
# Start doParallel
seed <- 12358
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# Specify model training controls
tc <- trainControl(method = "cv", number = 10, classProbs = TRUE,
summaryFunction = twoClassSummary, allowParallel = TRUE,
savePredictions = "final")
# Train 3 base-learners using the calibration set
blist <- caretList(fcal, lcal,
trControl = tc,
tuneList = NULL,
methodList = c("rf", "xgbTree", "glmnet"),
metric = "ROC")
# Generate predictions for the validation set
preds <- as.data.frame(predict(blist, newdata = fval))
# Save fitted model objects
stopCluster(mc)
fname <- paste("./learners/", "base_blist.rds", sep = "")
saveRDS(blist, fname)
The next chunk generates relative variable importance values of the 3 base-learner models for the calibration set.
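A minimal sketch of that chunk, assuming the standard caret::varImp interface on the fitted caretList models, is:
# Relative variable importance for the 3 base-learners (a minimal sketch)
rf_imp <- varImp(blist$rf, scale = TRUE)
xgb_imp <- varImp(blist$xgbTree, scale = TRUE)
glm_imp <- varImp(blist$glmnet, scale = TRUE)
plot(rf_imp, top = 20, main = "rf") ## repeat for xgb_imp and glm_imp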
Model stacking and validation
# Start doParallel
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)
# Specify model training controls
tc <- trainControl(method = "cv", number = 10, classProbs = TRUE, allowParallel = TRUE,
summaryFunction = twoClassSummary, savePredictions = "final")
# Stack the 3 base classifiers using the validation set
glm_stack <- train(preds, lval,
                   trControl = tc,
                   method = "glmnet",
                   family = "binomial",
                   metric = "ROC")
# Save fitted model
stopCluster(mc)
fname <- paste("./learners/", "glm_stack.rds", sep = "")
saveRDS(glm_stack, fname)
# Generate predictions for the validation set
stack <- as.vector(predict(glm_stack, newdata = preds, type = "prob"))
preds <- cbind(preds, stack)
preds <- preds[c('rf', 'xgbTree', 'glmnet', 'a')]
names(preds)[names(preds) == 'a'] <- 'stacked'
This generates the areas under the receiver operating characteristic curves (AUC) for the validation set …
# calculate ROC curves for the validation set
# rf model
roc_rf <- roc(lval, preds$rf)
auc_rf <- auc(roc_rf)
# xgbTree model
roc_xgbTree <- roc(lval, preds$xgbTree)
auc_xgbTree <- auc(roc_xgbTree)
# glmnet model
roc_glmnet <- roc(lval, preds$glmnet)
auc_glmnet <- auc(roc_glmnet)
# stacked model
roc_stacked <- roc(lval, preds$stacked)
auc_stacked <- auc(roc_stacked)
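To compare the learners side by side, the four AUC values can be collected into a small summary table; this chunk is an optional addition.
# Summarize AUC values across models (optional)
data.frame(model = c("rf", "xgbTree", "glmnet", "stacked"),
           AUC = c(auc_rf, auc_xgbTree, auc_glmnet, auc_stacked))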
We can then use the stacked-model ROC to derive a classification threshold from the stacked model predictions and generate a confusion matrix for the validation set.
# generate confusion matrix for the validation set
val_cut <- coords(roc_stacked, "best") ## optimal threshold from the ROC curve
preds$ir_class <- as.factor(ifelse(preds$stacked > val_cut$threshold, 'a', 'b')) ## threshold ≈ 0.59 here
confusionMatrix(lval, preds$ir_class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction a b
## a 2019 398
## b 329 1541
##
## Accuracy : 0.8304
## 95% CI : (0.8188, 0.8415)
## No Information Rate : 0.5477
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6567
##
## Mcnemar's Test P-Value : 0.01167
##
## Sensitivity : 0.8599
## Specificity : 0.7947
## Pos Pred Value : 0.8353
## Neg Pred Value : 0.8241
## Prevalence : 0.5477
## Detection Rate : 0.4710
## Detection Prevalence : 0.5638
## Balanced Accuracy : 0.8273
##
## 'Positive' Class : a
##
Spatial predictions
# Generate spatial predictions
rfo.map <- predict(grids, blist$rf, type = "prob")
xgb.map <- predict(grids, blist$xgbTree, type = "prob")
glm.map <- predict(grids, blist$glmnet, type = "prob")
spreds <- stack(rfo.map, xgb.map, glm.map)
names(spreds) <- c("rf","xgbTree", "glmnet")
sta.map <- predict(spreds, glm_stack, type = "prob")
# Write out irrigation prediction maps
spreds <- stack(rfo.map, xgb.map, glm.map, sta.map)
fname <- paste("./maps/", "UG_irrigation_preds.tif", sep = "")
writeRaster(spreds, filename = fname, datatype = "FLT4S", options = "INTERLEAVE=BAND", overwrite = TRUE)
writeRaster(sta.map, filename = "./maps/stacked_pred.tif", datatype = "FLT4S", overwrite = TRUE)
The stacked-model irrigation prediction maps for Uganda are shown in Figure 3, following a few GIS cosmetics in GRASS. Superficially these look similar, but there are some differences in the receiver operating characteristics between them. At the moment the rf model dominates the weighting in the stacked model, with a small contribution from the xgbTree model. Both the xgbTree and glmnet models could benefit from additional tuning during base-learner training, which might provide small overall performance improvements. While this may be computationally worthwhile in the context of e.g., a Kaggle competition, it is not deemed necessary here, given the already very high AUC performance of the stacked model on the validation set.
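To check which base-learners carry the most weight in the stack, the meta-model coefficients can be inspected at the selected regularization strength; a minimal sketch:
# Inspect the glmnet meta-model weights (a minimal sketch)
coef(glm_stack$finalModel, s = glm_stack$bestTune$lambda)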