1 Introduction

Irrigation plays a prominent role in shaping the cropland landscapes of Africa, a continent known for its diverse ecosystems, climatic variations, and farming challenges. With a rapidly growing population and unpredictable weather patterns exacerbated by climate change, the importance of irrigation in Africa cannot be overstated.

1.1 Importance of irrigation

  • Irrigation is essential for improving food security in Africa. Rainfed agriculture remains susceptible to droughts, which lead to crop failures, food shortages, and produce of poor nutritional quality. By providing controlled water supplies to crops, irrigation mitigates the risks associated with climatic uncertainties. Consistent water access allows for multiple cropping seasons, improved crop yields, improved nutrient use efficiencies, and a diversified range of high(er) value crops.

  • Irrigation contributes to economic growth and poverty alleviation. Agriculture remains the mainstay of African economies, employing a large share of the workforce, much of it smallholder farmers. Improved irrigation systems and practices can increase agricultural productivity, potentially leading to high quality surplus produce that may be sold in local and international markets. This generates farmer income and stimulates economic growth at various levels, from rural communities to national economies.

  • Irrigation can play a crucial role in environmental and ecological sustainability. Sustainable irrigation practices can lead to efficient water usage, preventing over-extraction of groundwater and reducing the strain on natural water sources. Properly managed irrigation can minimize soil erosion and degradation, preserving soil quality for future generations. Additionally, irrigation opens opportunities for controlled water management, enabling the cultivation of crops that are generally well-suited to local conditions, but which may not thrive under rainfed agriculture.

However, it is important to note that while irrigation offers numerous benefits, it should be managed carefully. Improper irrigation practices can lead to water pollution, waterlogging, soil salinization, aquifer depletion, and other forms of environmental degradation. It is crucial to implement evidence-based, sustainable irrigation practices that embrace modern information technology and knowledge sharing to ensure the responsible use of water resources.

1.2 Key information gaps

There are several information gaps which, if closed, would contribute significantly to the agricultural research and development agendas in Uganda. Here are some of them:

  • Data-driven identification of existing and suitable irrigation areas: There’s a need for advanced data analytics to identify and monitor areas in Uganda where irrigation currently exists and/or would be beneficial. This involves analyzing a wide range of earth science data (e.g., field and stakeholder surveys, remote sensing and GIS data) to pinpoint locations where irrigation systems could improve agricultural production and sustainability.

  • Environmental impact assessments of irrigation practices: Given the concerns about the potential negative environmental impacts of irrigation, such as waterlogging, salinization, and biodiversity loss, there is a research gap in developing spatially explicit models that assess the environmental footprint of different irrigation practices. This could include, e.g., studies on water use efficiency, soil condition, and long-term ecological impacts.

  • Irrigation and economic growth analyses: While irrigation can contribute to economic growth, detailed data-driven studies are needed to quantify this potential impact. This involves analyzing how improvements in irrigation systems affect agricultural productivity, market prices, farmer incomes, and overall economic development at the local and national levels.

  • Predictive analytics for climate change impacts: Due to unpredictable weather patterns because of climate change, there is a gap in predictive models that forecast the impacts of these changes on irrigation needs and water availability. Developing models that can predict future climatic conditions and their implications on irrigation requirements would be valuable.

  • Policy impact studies: There is a need for research and information on the impact of different irrigation policies and programs. This includes assessing the effectiveness of current policies and exploring data-driven approaches to decision making and policy formulation.

Closing these gaps would provide a broad scope for data and evidence-driven research for development that could significantly contribute to improving irrigation practices, enhancing food security, supporting economic viability and social equity, and ensuring environmental sustainability in Uganda.

1.3 Objectives of this notebook

The main objectives of this notebook are to introduce R code for labeling, exploration and discovery of the spatial distribution of smallholder irrigation in Uganda. This markdown notebook is maintained and updated on OSF here, and you can fork and alter it from there for your reference and use.

2 Data setup

To actually run this notebook, you will need to install and load the R packages indicated in the chunk directly below.

# Package names
packages <- c("osfr", "htmlwidgets", "leaflet", "DT", "rgdal", "raster", "doParallel", "caret", 
              "caretEnsemble", "randomForest", "xgboost", "glmnet", "pROC", "caTools")

# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(!installed_packages)) {
    utils::install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

2.1 Load field survey data

The data we shall be using were collected by Government of Uganda field staff in 2023. The next chunk loads the irrigation and crop survey data from your working directory. We also include observations regarding the distribution of commonly irrigated crops. Modeling the distribution of irrigated crops is the subject of a separate (companion) notebook. Note that at the time of the writing of this notebook these data were not publicly available. We will try to convince the powers that be and the data owners that making the data publicly available is a useful thing to do that adds to their value.

plots <- read.table("./data/UG_irrigated_plots.csv", header = T, sep = ",")
plots$irrigated <- ifelse(plots$ir1 == 'yes' & plots$ws1 == 'yes', 'a', 'b') ## a = present, b = absent
plots$irrigated <- as.factor(plots$irrigated)
crops <- read.table("./data/UG_irrigated_crops.csv", header = T, sep = ",")
crops$onion.1 <- NULL ## removes the duplicated onion column
irdat <- merge(plots, crops, by="sid")

An overview map of where the irrigation observations were collected in Uganda is generated in the next chunk. You can click and zoom into the individual locations that have been recorded thus far. Locations where irrigation is present are shown in blue, and locations where irrigation is absent are shown in red.

col <- ifelse(irdat$irrigated == 'a', "blue", "red")

# Sample locations
w <- leaflet() %>%
  setView(lng = mean(irdat$lon), lat = mean(irdat$lat), zoom = 7) %>%
  addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
  addCircleMarkers(irdat$lon, irdat$lat,
                   color = col,
                   stroke = FALSE,
                   fillOpacity = 0.8,
                   radius = 5,
                   clusterOptions = markerClusterOptions(maxClusterRadius = 30))

saveWidget(w, 'UG_irrigation_locs.html', selfcontained = T) ## save leaflet map
w ## plot widget 

2.2 Load and extract raster features

You can download all of the raster files that are needed for running the next chunks from our OSF repository here. If you use that option ensure that all of the files are put into the ./grids directory.

dir.create("grids", showWarnings = FALSE)
dir.create("figures", showWarnings = FALSE)
dir.create("learners", showWarnings = FALSE)
dir.create("maps", showWarnings = FALSE)
dir.create("data", showWarnings = FALSE)
dir.create("sae", showWarnings = FALSE)

# alternatively you can also use **osfr** for the download
# osf_retrieve_node("szw7q") %>%
#  osf_ls_files() %>%
#  osf_download(path = "./grids", conflicts = "overwrite")

glist <- list.files(path="./grids", pattern="tif", full.names = TRUE)
grids <- stack(glist)

# Survey variable selection and projection
vars <- c('sid', 'lon', 'lat', 's2', 's3', 'ir1', 'ws1', 'irrigated', 'tomato',
          'cabbage', 'eggplant', 'greens', 'peppers', 'pulses', 'onion', 'nuts',
          'banana', 'melon', 'passion', 'taro', 'potato', 'beans', 'cane', 'sunflower',
          'papaya', 'pineapple', 'rice')
irdat <- irdat[vars] ## select variables

ir.proj <- project(cbind(irdat$lon, irdat$lat),
                   "+proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs")
colnames(ir.proj) <- c('x','y')
irdat <- cbind(irdat, ir.proj)
coordinates(irdat) <- ~x+y
projection(irdat) <- projection(grids)

# Extract gridded variables at survey locations
irgrid <- extract(grids, irdat)
irdat <- as.data.frame(cbind(irdat, irgrid))

# Write out `irdat` dataframe for reuse
write.csv(irdat, "./data/UG_irrigation_all.csv", row.names = FALSE)

The pre-processed Uganda raster data (in the grids raster stack) we will be using were derived and projected (to CRS = +proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs) from their primary open sources. Short descriptions of the 34 included rasters and links to their original sources are shown in the table below.


We will also be using the most recent GeoSurvey-based land cover map of Uganda. Note that the areas highlighted by the red boxes in the legend of Figure 1 are of primary interest here because they identify the main cropland cover types in the country, which define our overall Region of Interest (ROI) in Uganda. Any areas falling outside of the ROI are masked out of subsequent spatial predictions. You can find out how the constituent layers of this map were created by downloading one of the AfSIS land cover classification notebooks from our OSF repository here. It is also on the AfSIS website at https://africasoils.info.

**Figure 1.** GeoSurvey-based land cover map and area estimates for Uganda (2020).

3 Predicting irrigation

We are using a stacked generalization approach (Wolpert, 1992) in this section of the notebook. Stacked Generalization, often referred to as stacking, is an ensemble learning technique that seeks to improve model predictions by combining the outputs of multiple models, potentially of different types, in a strategic manner. This is how it works:

  • Data Split: Partition the data into representative training (calibration) and test (validation) sets.

  • Base Models: You start by training several different predictive models using your training dataset. These models can be of any type and are often diverse, such as decision trees, support vector machines, neural networks, etc. They are known as base (or level-0) models.

  • Hold-Out or Cross-Validation Predictions: Next, you use these base models to make predictions on a separate dataset, which can either be a hold-out validation set or generated through cross-validation. Essentially, you’re interested in capturing the predictions each model makes on data it hasn’t seen before.

  • Stacking: The predictions made by the base models are then used as input features to train a higher-level model, known as a meta-model or a level-1 model. The idea is that this meta-model learns how to best combine the predictions from the base models to make a final prediction.

  • Final Prediction: When you need to make predictions on new data, you first run the data through all the base models, take their predictions, and then input those predictions into the meta-model. The meta-model’s output is the final prediction.

The intuition behind stacking is that while individual models may have particular strengths and weaknesses, a meta-model may be able to learn the best way to combine their predictions, to achieve better overall performance. Stacking has frequently been shown to outperform other ensemble methods in various tasks, given a careful choice of features, base-models, and stacking approach (Kaggle AI report, 2023).
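The steps above can be illustrated with a minimal, self-contained sketch on synthetic data, using base R only. The notebook itself uses caret and caretEnsemble; all variable and model names here are illustrative.

```r
## Minimal stacking sketch on synthetic presence/absence data (base R only)
set.seed(1)
n <- 400
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x1 - x2))
d <- data.frame(x1, x2, y)

## 1. Split the data into calibration and validation sets
cal <- d[1:300, ]; val <- d[301:400, ]

## 2. Train two (deliberately simple) base models on the calibration set
m1 <- glm(y ~ x1, data = cal, family = binomial)
m2 <- glm(y ~ x2, data = cal, family = binomial)

## 3. Base-model predictions on data the base models have not seen
p <- data.frame(p1 = predict(m1, newdata = val, type = "response"),
                p2 = predict(m2, newdata = val, type = "response"))

## 4. Stacking: a level-1 (meta) model learns to combine the base predictions
meta <- glm(val$y ~ p1 + p2, data = p, family = binomial)

## 5. Final prediction: meta-model applied to the base-model outputs
p$stacked <- predict(meta, newdata = p, type = "response")
```

In practice (and in this notebook) the base models are more diverse, the predictions are generated with cross-validation, and the meta-model is regularized, but the control flow is the same.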

3.1 Data split

This first chunk sets up an 80/20 calibration/validation partition. It is always best to do this as early as possible to avoid any data leakage. The createDataPartition function in the caret package can be used to generate randomized splits of the data. If the first argument to this function is a factor, random sampling occurs within each class and preserves the overall class distribution of the data. We partition on the irrigated variable here.

# Set calibration/validation set randomization seed
seed <- 12358
set.seed(seed)
irdat <- irdat[sample(1:nrow(irdat)), ] # randomize observations
irdat <- na.omit(irdat)

# Split data into calibration and validation sets
irIndex <- createDataPartition(irdat$irrigated, p = 4/5, list = F, times = 1)
ir_cal <- irdat[ irIndex,]
ir_val <- irdat[-irIndex,]

# Select calibration labels and gridded features
lcal <- ir_cal[ ,8]
fcal <- ir_cal[ ,c(28:41, 43:61)]

# Select validation labels and gridded features
lval <- ir_val[ ,8]
fval <- ir_val[ ,c(28:41, 43:61)]

3.2 Base-model training

This chunk fits 3 presence/absence base-models using the gridded calibration data with 10-fold cross-validation. Learn more about how these MLAs work by looking at: randomForest, xgboost, and glmnet. The 3 base-models represent some of the main approaches to ML from tabular data (Kaggle AI report, 2023), i.e., bootstrap aggregation, gradient boosting, and L1/L2 regularization. You can use caretEnsemble instead of caret as long as the feature variables (the grids in this case) and the trainControl methods are the same for each model in the caretList function. This shortens the length of this notebook but does not otherwise affect overall caret functionality. Note however that the calculations take a bit of time to run on a normal 8-core, 32 GB memory computer, even when engaging all cores. We initially fit the models with default tuning of the relevant hyperparameters. Should the need arise, we can always do more fine-tuning later.

# Start doParallel
seed <- 12358
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Specify model training controls
tc <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                   summaryFunction = twoClassSummary, allowParallel = TRUE,
                   savePredictions = "final")

# Train 3 base-learners using the calibration set
blist <- caretList(fcal, lcal,
                   trControl = tc,
                   tuneList = NULL,
                   methodList = c("rf", "xgbTree", "glmnet"),
                   metric = "ROC")

# Generate predictions for the validation set
preds <- as.data.frame(predict(blist, newdata = fval))

# Save fitted model objects
stopCluster(mc)
fname <- paste("./learners/", "base_blist.rds", sep = "")
saveRDS(blist, fname)

The next chunk generates relative variable importance values of the 3 base-learner models for the calibration set.

**Figure 2.** Relative variable (feature) importance plots of the 3 base-learner models for the calibration set. Top 10 features only.

3.3 Model stacking and validation

# Start doParallel
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Specify model training controls
tc <- trainControl(method = "cv", number = 10, classProbs = TRUE, allowParallel = TRUE,
                   summaryFunction = twoClassSummary, savePredictions = "final")

# Stack the 3 base classifiers using the validation set
glm_stack <- train(preds, lval,
                   trControl = tc,
                   method = "glmnet",
                   family = "binomial",
                   metric = "ROC")

# Save fitted model
stopCluster(mc)
fname <- paste("./learners/", "glm_stack.rds", sep = "")
saveRDS(glm_stack, fname)

# Generate predictions for the validation set
stack <- as.vector(predict(glm_stack, newdata = preds, type = "prob"))
preds <- cbind(preds, stack)
preds <- preds[c('rf', 'xgbTree', 'glmnet', 'a')]
names(preds)[names(preds) == 'a'] <- 'stacked'

This generates the areas under the receiver operator curves (AUC) for the validation set …

# calculate ROC curves for the validation set
# rf model
roc_rf <- roc(lval, preds$rf)
auc_rf <- auc(roc_rf)

# xgbTree model
roc_xgbTree <- roc(lval, preds$xgbTree)
auc_xgbTree <- auc(roc_xgbTree)

# glmnet model
roc_glmnet <- roc(lval, preds$glmnet)
auc_glmnet <- auc(roc_glmnet)

# stacked model
roc_stacked <- roc(lval, preds$stacked)
auc_stacked <- auc(roc_stacked)

We can then use the stacked-model ROC to derive an optimal classification threshold, and use it to generate a confusion matrix for the validation set.

# generate confusion matrix for the validation set
val_cut <- coords(roc_stacked, "best")
preds$ir_class <- as.factor(ifelse(preds$stacked > val_cut$threshold, 'a', 'b')) ## best threshold ~0.59
confusionMatrix(lval, preds$ir_class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    a    b
##          a 2019  398
##          b  329 1541
##                                           
##                Accuracy : 0.8304          
##                  95% CI : (0.8188, 0.8415)
##     No Information Rate : 0.5477          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6567          
##                                           
##  Mcnemar's Test P-Value : 0.01167         
##                                           
##             Sensitivity : 0.8599          
##             Specificity : 0.7947          
##          Pos Pred Value : 0.8353          
##          Neg Pred Value : 0.8241          
##              Prevalence : 0.5477          
##          Detection Rate : 0.4710          
##    Detection Prevalence : 0.5638          
##       Balanced Accuracy : 0.8273          
##                                           
##        'Positive' Class : a               
## 

3.4 Spatial predictions

# Generate spatial predictions
rfo.map <- predict(grids, blist$rf, type = "prob")
xgb.map <- predict(grids, blist$xgbTree, type = "prob")
glm.map <- predict(grids, blist$glmnet, type = "prob")
spreds <- stack(rfo.map, xgb.map, glm.map)
names(spreds) <- c("rf","xgbTree", "glmnet")
sta.map <- predict(spreds, glm_stack, type = "prob")

# Write out irrigation prediction maps
spreds <- stack(rfo.map, xgb.map, glm.map, sta.map)
fname <- paste("./maps/", "UG_irrigation_preds.tif", sep = "")
writeRaster(spreds, filename = fname, datatype = "FLT4S", options = "INTERLEAVE=BAND", overwrite = TRUE)
writeRaster(sta.map, filename = "./maps/stacked_pred.tif", datatype = "FLT4S", overwrite = TRUE)

**Figure 3.** Irrigation prediction maps for the croplands of Uganda (2023).

Figure 3. Irrigation prediction maps for the croplands of Uganda (2023).

The stacked-model irrigation prediction maps for Uganda are shown in Figure 3, following a few GIS cosmetics in GRASS. Superficially these look similar, but there are some differences in the receiver operator characteristics between them. At the moment the rf model dominates the weighting in the stacked model, with a small contribution from the xgbTree model. Both the xgbTree and the glmnet models could benefit from additional tuning during base-learner training. This might provide small overall performance improvements. While this may be computationally worthwhile in the context of e.g., a Kaggle competition, it is not deemed necessary here, given the already very high AUC performance of the stacked model on the validation set.

4 Main takeaways

  1. The field data that were used are remarkable in their geographic scope, the sheer number of georeferenced field observations (n > 23,000), the short time period in which they were collected at the beginning of 2023, and the relatively low cost at which the field survey was completed. The observations were deliberately oversampled: field enumerators were encouraged to actively seek out irrigated locations, which were initially thought to be rare, comprising less than 15% of the overall cropland region of interest. Hence, for the purpose of calculating irrigation prevalence, on either an ROI-wide or a small-area basis, the data need to be post-adjusted. We show how this can be done in a companion notebook which you can download from the project’s Open Science Framework (OSF) repository here, and display in your browser.

  2. The stacked model fits the data well and validates on unseen observations, as evidenced by its Area Under the Receiver Operator Curve (AUC = 0.91) statistic. There is no evidence of either under- or over-fitting, which was controlled for with both k-fold cross-validation and the use of a separate validation set. Nonetheless, the underlying 3 base-learners would probably benefit from additional model parameter tuning. This may also influence the final weighting of the base-learners in the stacked model. One might also consider using different model stacking algorithms. A forthcoming notebook will explore the use of Bayesian multilevel regression on a Discrete Global Grid in this context. Should those results prove promising, we will include them in an Appendix to this notebook.

  3. The most important spatial predictors, apart from geographical coordinates derived from the MERIT DEM, include:

    • Compound Topographic Index (CTI) derived from the MERIT DEM
    • Predicted Building Density (PBD) from the most recent Uganda GeoSurvey (2020)
    • Mean Annual Temperature Range (BIO7) from the CHELSA climatology
    • Average MODIS Night Time Land Surface Temperature (LSTN)
    • Average MODIS Reflectance Values (MB1 & MB2)
    • Distance to Black Marble Nightlights (DNLT)
    • Distance to Internationally Protected Areas (DIPA) derived from Protected Planet shapefiles
    • Distance to Open Water Sources (DOWS) derived from the Global Surface Water Explorer

    You may want to experiment with this more restricted feature subset, and/or include additional raster features of your choosing, to see if the overall validation set predictions can be further improved over the current ones. Feature selection and transformation is probably the single most effective path to improved models.
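As a sketch of the prevalence post-adjustment mentioned in point 1 above: when both the sample prevalence (inflated by oversampling) and the population prevalence are known or can be estimated, a standard prior-correction can be applied to the predicted odds. The function name and the prevalence values below are illustrative assumptions, not taken from the companion notebook.

```r
## Hypothetical prior-correction sketch: re-scale probabilities from a model
## trained on oversampled presence data to an assumed population prevalence.
adjust_prevalence <- function(prob, sample_prev, pop_prev) {
  odds <- prob / (1 - prob)                     # predicted odds under oversampling
  adj  <- odds * (pop_prev / (1 - pop_prev)) /
                 (sample_prev / (1 - sample_prev))  # re-weight by the prevalence odds ratio
  adj / (1 + adj)                               # back to a probability
}

## e.g., a 0.59 score under ~55% sample prevalence and an assumed 15% population prevalence
adjust_prevalence(0.59, 0.55, 0.15) ## ~0.17
```

This simple correction applies to logistic-type scores; the companion notebook treats the problem more formally.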

Session info

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] caTools_1.18.2      pROC_1.18.0         glmnet_4.1-2       
##  [4] Matrix_1.5-1        xgboost_1.4.1.1     randomForest_4.6-14
##  [7] caretEnsemble_2.0.1 caret_6.0-89        lattice_0.20-44    
## [10] ggplot2_3.3.5       doParallel_1.0.16   iterators_1.0.13   
## [13] foreach_1.5.1       raster_3.4-13       rgdal_1.5-27       
## [16] sp_1.4-5            DT_0.19             leaflet_2.0.4.1    
## [19] htmlwidgets_1.6.2   osfr_0.2.9         
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-152            bitops_1.0-7            fs_1.6.2               
##  [4] lubridate_1.7.10        httr_1.4.2              tools_4.1.1            
##  [7] bslib_0.3.0             utf8_1.2.2              R6_2.5.1               
## [10] rpart_4.1-15            colorspace_2.0-2        nnet_7.3-16            
## [13] withr_2.4.2             tidyselect_1.2.0        gridExtra_2.3          
## [16] curl_4.3.2              compiler_4.1.1          cli_3.6.1              
## [19] sass_0.4.6              scales_1.2.1            proxy_0.4-26           
## [22] pbapply_1.5-0           stringr_1.4.0           digest_0.6.33          
## [25] rmarkdown_2.11          pkgconfig_2.0.3         htmltools_0.5.7        
## [28] parallelly_1.28.1       highr_0.9               fastmap_1.1.1          
## [31] rlang_1.1.2             rstudioapi_0.13         httpcode_0.3.0         
## [34] shape_1.4.6             jquerylib_0.1.4         generics_0.1.0         
## [37] jsonlite_1.7.2          crosstalk_1.1.1         dplyr_1.1.2            
## [40] ModelMetrics_1.2.2.2    magrittr_2.0.3          Rcpp_1.0.7             
## [43] munsell_0.5.0           fansi_0.5.0             lifecycle_1.0.3        
## [46] stringi_1.7.12          yaml_2.2.1              MASS_7.3-54            
## [49] plyr_1.8.6              recipes_0.1.17          grid_4.1.1             
## [52] listenv_0.8.0           splines_4.1.1           knitr_1.36             
## [55] pillar_1.9.0            future.apply_1.8.1      reshape2_1.4.4         
## [58] codetools_0.2-18        stats4_4.1.1            crul_1.1.0             
## [61] glue_1.6.2              evaluate_0.14           leaflet.providers_1.9.0
## [64] data.table_1.14.2       vctrs_0.6.2             gtable_0.3.0           
## [67] purrr_0.3.4             future_1.22.1           cachem_1.0.6           
## [70] xfun_0.39               gower_0.2.2             prodlim_2019.11.13     
## [73] e1071_1.7-9             class_7.3-19            survival_3.2-11        
## [76] timeDate_3043.102       tibble_3.2.1            memoise_2.0.0          
## [79] lava_1.6.10             globals_0.14.0          ellipsis_0.3.2         
## [82] ipred_0.9-12