1 Background

1.1 Discrete Global Grids

The main problem with using administrative divisions (ADs) such as districts, counties, wards and parishes for sampling, reporting and communicating survey statistics is that their boundaries can change due to redistricting and gerrymandering. ADs also differ in size, shape and complexity from country to country, and sometimes even within a country, which is not conducive to international comparisons of land survey results. Hence, land surveyors use Discrete Global Grids (DGGs) and the associated geocodes for mapping and land assessment. The advantages are:

  • Accuracy and precision: DGGs allow surveyors to precisely and accurately pinpoint survey areas and sampling locations anywhere on the earth’s surface. This is essential for creating detailed maps, area estimates and monitoring products at high resolution over large land areas.

  • Standardization: Using a DGG system provides a standardized method of measurement that can be universally understood and applied. Unlike latitude and longitude lines, which converge at the poles, DGGs provide uniform coverage and cell size across the globe. This uniformity is particularly useful for most geostatistical analyses, as it removes some of the distortions associated with spherical coordinates.

  • Scalability: DGGs can be designed at different resolutions, allowing for both broad global analyses and detailed local studies. Scalability is essential for applications ranging from global climate and economic models to local planning efforts.

  • Data sharing: With standardized geocodes, surveyors can easily share and compare data with other professionals, e.g., urban and rural planners, engineers, foresters, rangeland and wildlife managers, climate modelers, national statistical bureaus, and international development agencies (e.g., FAO), among many others. This should improve consistency in planning and development projects.

  • Efficiency in field work: Using geocodes allows surveyors to efficiently plan, prioritize and execute field work. They can quickly locate and assess specific points of access to a site, saving time and resources.

  • Historical data integration: Geocodes enable surveyors to integrate historical data with new data. Fortunately, if the historical data are georeferenced, standardized DGGs can be retrofitted to them. This is important in areas where land use has changed over time, or where historical maps need to be compared with current ones. For environmental studies and land use planning, grid coordinates help in accurately assessing land features and their changes over time, aiding in sustainable development and conservation efforts.

There are a number of ISO compliant DGGs in world-wide use, e.g., Google’s Open Location Codes (Plus Codes), the Military Grid Reference System (MGRS), and What3Words, among others. Because our work focuses on Africa, we generally use a custom Lambert Azimuthal Equal Area based grid (i.e., CRS = +proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs) that corresponds to the Coordinate Reference System used for all of the standard AfSIS raster stacks. Note that while this works well across Africa, it is not really suitable for use in other parts of the world without reprojection. This arguably renders it a Discrete “non-Global” Grid … so perhaps just the term Discrete Grid (DG) suffices here.
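
As an aside, reprojecting geographic (longitude/latitude) coordinates into this custom LAEA CRS is straightforward with the sf package (loaded in the Setup section below); the two points used here are hypothetical examples.

# hypothetical example: project lon/lat points into the custom LAEA CRS
laea <- "+proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs"
pts <- data.frame(lon = c(32.5, 31.7), lat = c(1.9, -0.4))
pts_sf <- sf::st_as_sf(pts, coords = c("lon", "lat"), crs = 4326) # WGS84 lon/lat
pts_laea <- sf::st_transform(pts_sf, crs = laea)
sf::st_coordinates(pts_laea) # projected x, y in meters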

1.2 Multilevel regression and poststratification (MRP)

Multilevel Regression and Poststratification (Gelman and Hill, 2017) combines the strengths of multilevel regression modeling with the technique of poststratification to provide more accurate predictions for subgroups within a broader population or Region of Interest (ROI). It does so in two steps:

In the first step, a multilevel model is fit to the survey data. This model captures both individual-level responses and the variation across different DGG geocodes. Such models are particularly useful when data from smaller groups are sparse, because they allow for “borrowing strength” from the larger dataset by modeling shared patterns across groups. Essentially, information from larger, more representative areas informs predictions about smaller, less observed subgroups. A typical multilevel model with one level of grouping, as commonly used in land survey contexts, is:

\[ y_{ij} = x_{ij}\beta + z_{ij}u_i + \epsilon_{ij} \]

Where:

  • \(y_{ij}\) is the outcome for the jth observation in the ith subdivision (geocode) of a DG.
  • \(x_{ij}\) is a design matrix of fixed-effect covariates associated with \(y_{ij}\).
  • \(\beta\) is a vector of ROI-level (fixed effects) parameters.
  • \(z_{ij}\) is a design matrix for the random effects associated with \(y_{ij}\).
  • \(u_i\) is a vector of random effects for the ith small area or geocode.
  • \(\epsilon_{ij}\) is an independent error term for the jth unit in the ith small area or geocode.
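
In R model syntax, a model of this form with a single grouping level can be sketched as shown below. The data frame dat and the variables y, x and geocode are hypothetical placeholders; the same formula structure is used with brms in Section 4.

# minimal sketch of a 1-level multilevel (random intercept) model
# y, x, geocode and dat are hypothetical placeholders
fit <- brm(y ~ x + (1 | geocode), data = dat, family = gaussian())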

After fitting a suitable MLM, poststratification involves several steps to ensure that the survey results are adjusted to be more representative of the overall Region of Interest (ROI). Here’s how this process typically works:

  • Generate predictions for each level of the DG: Based on the specified DG resolution (preferably in equal area units), the ROI is divided into various nested geocodes. The fitted Multilevel Model is used to predict the outcomes for each (nested) level of geocode. If direct data is not available for a specific geocode, the model ‘borrows strength’ from the overall data to make an informed estimate.

  • Weighting the predictions: ROI specific marginals are used to understand the actual proportion of each geocode in the ROI. The predictions for each geocode are then weighted according to their proportions in the ROI. This step adjusts the predictions to account for over- or under-sampling in the survey data relative to their actual occurrence in the ROI.

  • Aggregating the adjusted estimates: After weighting the predictions for each geocode, these are combined to produce an overall estimate. This aggregate estimate is more representative of the target ROI, as it accounts for geocode variations. The final aggregated results reflect not just the raw survey data but are adjusted to mirror the structure and characteristics of the entire ROI (a minimal sketch of the weighting and aggregation steps follows below).
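
Assuming a hypothetical data frame cell_pred with one row per geocode, a model-based prediction pred, and each geocode's known proportion prop of the ROI, the weighting and aggregation steps reduce to:

# weight each geocode prediction by its ROI proportion and aggregate
# (cell_pred, pred and prop are hypothetical placeholders)
roi_estimate <- sum(cell_pred$pred * cell_pred$prop) / sum(cell_pred$prop)

# equivalently
roi_estimate <- weighted.mean(cell_pred$pred, cell_pred$prop)

Section 7 applies these steps to the actual survey data.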

2 Introduction

This notebook examines the use of Multilevel Regression and Poststratification (MRP) in situations when adjustments are needed to correct for over- or under-representation of certain places in a specific ROI (sensu population). MRP combines multilevel modeling with poststratification adjustments, enhancing predictions in smaller geographical areas and improving the reproducibility, comprehensiveness and interpretation of land survey results. We use a Bayesian multilevel regression approach, because it provides a fuller picture of uncertainty. Unlike more traditional point estimates, the posterior distributions that are generated provide a range of plausible values for new observations, highlighting the uncertainty inherent in any predictions.

In land resource survey analysis, reliance on administrative divisions (e.g., districts, counties, wards, and parishes, as in Uganda) or census tracts is challenging due to their changing boundaries and variable land areas. A solution to these issues lies in the use of geocoded Discrete Grids, which offer more precision, standardization, and efficiency than traditional land delineations. Furthermore, they facilitate historical data integration and the use of prior domain knowledge. This notebook explores how a Discrete Grid-based approach can reveal scale-dependent variations in survey results and enhance their precision and accuracy for mapping and monitoring.

This document serves as a supplementary resource to the spatial irrigation prediction notebook for Uganda, accessible on the Open Science Framework at https://osf.io/kxsqm. The results from that analysis confirm that the nation-wide smallholder irrigation survey data from 2023 were over-sampled and require adjustment to correct for this bias. This R Markdown document is maintained and versioned in the same repository here, from which it can be downloaded and modified. Please cite this work appropriately, as it is released under a CC-BY-4.0 International Public License.

3 Setup

To actually run this notebook, you will need to install and load the R-packages indicated in the chunk directly below.

# Package names
packages <- c("tidyverse", "leaflet", "htmlwidgets", "brms", "ggthemes", "DT", "sf")

# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
    utils::install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

The data we will be using was commissioned by Columbia World Projects (CWP) under a project titled “Using Data to Catalyze Energy Investments”, and implemented in partnership with the United Nations Development Fund and the Government of Uganda. The actual survey was conducted in 2023 by Government of Uganda field staff and covers the croplands of Uganda with over 23,000 georeferenced survey locations. At each location, the presence/absence of irrigation was recorded, based on field observations of irrigation infrastructure and the availability of suitable water sources. The next chunk loads the 2023 survey data. We apologize that this part of the data is not publicly available yet; we will update the document with the appropriate download link once it is.

dir.create("sae")
survey <- read.table("./sae/mrpdat.csv", header = T, sep = ",")
survey$ir_mask <- ifelse(survey$stacked >= 0.635, 1, 0)
# survey$X <- NULL

The first thing to do is to specify geocodes (GIDs) at different scales of the DGG. The next chunk sets these up in hierarchically nested levels. Note that the resulting GIDs are geocodes on one of many possible DGGs, but they are readily converted to any other rectangular, equal-area DGG.

# specify vector of resolution values (in meters) for each DGG level
resolution <- c(100000, 50000, 25000, 12500)

# initialize an empty list to store the results
gids <- list()

# loop over each resolution value
for (i in 1:length(resolution)) {
    res.pixel <- resolution[i]
    xgid <- ceiling(abs(survey$x)/res.pixel)
    ygid <- ceiling(abs(survey$y)/res.pixel)
    gidx <- ifelse(survey$x < 0, paste("W", xgid, sep=""), paste("E", xgid, sep=""))
    gidy <- ifelse(survey$y < 0, paste("S", ygid, sep=""), paste("N", ygid, sep=""))
    gids[[i]] <- paste(gidx, gidy, sep="")
}

# append GIDs (geocodes) and write out `survey` dataframe for reuse
gids <- data.frame(gids)
names(gids) <- c("L1", "L2", "L3", "L4")
survey <- cbind(gids, survey)
write.csv(survey, "./sae/saedat.csv")

This is the appended survey dataframe that we shall be using for the analyses covered by this notebook. The key variables are the GIDs L1-L4, the presence/absence of irrigation recorded by the (dependent) irrigated variable, and the ir_mask covariable, which is based on the stacked irrigation probability map shown in Figure 1 (below). The companion notebook, showing how the irrigation probabilities were derived, is available in the project’s OSF repository here. Simply download the .html and display it in your browser.

## 'data.frame':    23256 obs. of  14 variables:
##  $ L1       : chr  "E14S4" "E15S4" "E14S5" "E14S6" ...
##  $ L2       : chr  "E28S7" "E29S8" "E28S10" "E27S12" ...
##  $ L3       : chr  "E56S14" "E58S16" "E55S19" "E53S24" ...
##  $ L4       : chr  "E111S27" "E115S32" "E110S38" "E105S48" ...
##  $ sid      : chr  "f9339663" "0715e73f" "95466ecf" "cbd87542" ...
##  $ lon      : num  32.5 32.9 32.3 31.7 31.8 ...
##  $ lat      : num  1.909 1.342 0.644 -0.398 1.668 ...
##  $ x        : num  1383772 1432848 1364871 1302278 1306064 ...
##  $ y        : num  -330705 -392958 -471584 -588239 -358720 ...
##  $ irrigated: int  0 1 1 1 0 1 0 0 1 1 ...
##  $ s2       : num  12 30 10 15 30 10 10 100 15 7 ...
##  $ s3       : num  18 50 15 50 300 10 10 100 20 19 ...
##  $ stacked  : num  0.0579 0.895 0.8967 0.8963 0.132 ...
##  $ ir_mask  : num  0 1 1 1 0 1 0 1 1 1 ...

The specified GID levels L1, L2, L3, and L4 respectively are 10,000, 2,500, 625, and 156.25 km2 in size. While somewhat arbitrary, we have found that these levels describe the scale dependence inherent in large area survey data reasonably well. By scale dependence we mean that geographic phenomena can appear differently at various spatial scales. For example, a river’s course might seem relatively straight when viewed from a high-altitude satellite but is highly meandering when observed at ground level. You can always change the number and/or the resolution of those levels, if you wish.
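
These areas follow directly from the resolution vector specified above (in meters):

# cell areas (km2) implied by the specified resolutions
(resolution / 1000)^2
# 10000.00  2500.00  625.00  156.25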


**Figure 1.** Irrigation prediction maps for the croplands of Uganda (2023). See the notebook at: https://osf.io/kxsqm


4 Multilevel models

The next chunks fit and compare two Bayesian multilevel regression models using the brms package (Bürkner et al., 2023).

4.1 Random means model

The structure of the first model with 4 nested GID levels is as follows:

\[ Pr(y_{ijklm} = 1) = logit^{-1} (\mu + L_{i} + L_{ij} + L_{ijk} + L_{ijkl} + \epsilon_{ijklm}) \]

Where:

  • \(y_{ijklm}\), the response variable, indicates the presence (1) or absence (0) of irrigation at a field location.
  • \(\mu\) is the mean irrigation prevalence for the Region of Interest.
  • \(L_{i}\), \(L_{ij}\), \(L_{ijk}\) and \(L_{ijkl}\) are estimates of the nested random effects at the 4 specified GID levels.
  • \(\epsilon_{ijklm}\) is an independent error term.

This is a random means model, which can be fit with:

# candidate priors (not used in the brm() call below); note that "b", not
# "beta", is the brms class for population-level slopes
priors = c(prior(normal(-3, 2), class = Intercept),
           prior(normal(7, 3), class = b),
           prior(cauchy(0, 2), class = sd, group = L1),
           prior(cauchy(0, 1), class = sd, group = L2),
           prior(cauchy(0, 1), class = sd, group = L3),
           prior(cauchy(0, 1), class = sd, group = L4))

# fit random intercept model
sur0 <- brm(data = survey,
            family = bernoulli,
            formula = irrigated ~ 1 + (1 | L1/L2/L3/L4),
            # prior = priors,
            iter = 4000, warmup = 1000, chains = 4, cores = 6,
            control = list(adapt_delta = .975, max_treedepth = 20),
            seed = 1235813)

# saveRDS(sur0, "./sae/dgg_survey.rds") # save model object
##  Family: bernoulli 
##   Links: mu = logit 
## Formula: irrigated ~ 1 + (1 | L1/L2/L3/L4) 
##    Data: survey (Number of observations: 23256) 
##   Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
##          total post-warmup draws = 12000
## 
## Group-Level Effects: 
## ~L1 (Number of levels: 32) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     1.78      0.29     1.30     2.43 1.00     3739     5755
## 
## ~L1:L2 (Number of levels: 103) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.51      0.12     0.26     0.76 1.00     1871     2076
## 
## ~L1:L2:L3 (Number of levels: 323) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.59      0.09     0.42     0.75 1.00     1449     2263
## 
## ~L1:L2:L3:L4 (Number of levels: 993) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     1.01      0.05     0.92     1.11 1.00     3232     6298
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     0.22      0.33    -0.41     0.89 1.00     2340     4012
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

From the model summary table for the sur0 model, the mean irrigation prevalence estimate for the cropland ROI covered by the 2023 survey is circa 55.5%, with a 95% credible interval of 39.9 – 70.9%. This figure is implausibly high, but it is also not unexpected. During the 2023 field campaign, enumerators were instructed to seek out and oversample irrigated fields. This is standard land survey practice when conducting comparative studies: if the survey involves comparing different (land use or other) subgroups, especially when one or more of those subgroups are thought to be rare, oversampling can provide enough data for meaningful comparisons and classifications. In these situations poststratification is needed to adjust the survey results.
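
These percentages are the inverse-logit of the population-level intercept and its credible limits; assuming the fitted sur0 object is still in memory, they can be reproduced with:

# back-transform the sur0 intercept from the log-odds to the probability scale
plogis(fixef(sur0)["Intercept", c("Estimate", "Q2.5", "Q97.5")])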

4.2 Mixed effects model

The structure of the second model we will test here is:

\[ Pr(y_{ijklm} = 1) = logit^{-1} ( \beta_0 + \beta_1x_{ijklm} + L_{i} + L_{ij} + L_{ijk} + L_{ijkl} + \epsilon_{ijklm}) \]

Where:

  • \(y_{ijklm}\), the response variable, indicates the presence (1) or absence (0) of irrigation at a location.
  • \(\beta_0\) and \(\beta_1\) are the coefficients for the intercept and the fixed effect, respectively.
  • \(x_{ijklm}\) represents the fixed effect of the irrigation mask classifier shown in Figure 1.
  • \(L_{i}\), \(L_{ij}\), \(L_{ijk}\) and \(L_{ijkl}\) are estimates of the nested random effects at the 4 specified GID levels.
  • \(\epsilon_{ijklm}\) is an independent error term.

It is fit with:

# candidate priors (not used in the brm() call below); "b" is the brms class
# for the population-level ir_mask slope
priors = c(prior(normal(-3, 2), class = Intercept),
           prior(normal(7, 3), class = b),
           prior(cauchy(0, 2), class = sd, group = L1),
           prior(cauchy(0, 1), class = sd, group = L2),
           prior(cauchy(0, 1), class = sd, group = L3),
           prior(cauchy(0, 1), class = sd, group = L4))

# fit random intercept model
sur1 <- brm(data = survey,
            family = bernoulli,
            formula = irrigated ~ ir_mask + (1 | L1/L2/L3/L4),
            # prior = priors,
            iter = 4000, warmup = 1000, chains = 4, cores = 6,
            control = list(adapt_delta = .975, max_treedepth = 20),
            seed = 1235813)

# saveRDS(sur1, "./sae/sur1_dgg_survey.rds") # save model object
##  Family: bernoulli 
##   Links: mu = logit 
## Formula: irrigated ~ ir_mask + (1 | L1/L2/L3/L4) 
##    Data: survey (Number of observations: 23256) 
##   Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
##          total post-warmup draws = 12000
## 
## Group-Level Effects: 
## ~L1 (Number of levels: 32) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.48      0.10     0.31     0.70 1.00     4012     6378
## 
## ~L1:L2 (Number of levels: 103) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.20      0.08     0.03     0.35 1.00     1287     1704
## 
## ~L1:L2:L3 (Number of levels: 323) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.09      0.06     0.00     0.22 1.00     1851     2848
## 
## ~L1:L2:L3:L4 (Number of levels: 993) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.52      0.04     0.44     0.61 1.00     4186     6813
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    -2.21      0.11    -2.42    -2.00 1.00     4273     5962
## ir_mask       5.01      0.06     4.89     5.13 1.00    12267     9250
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

The model summary table shows a large fixed effect of the ir_mask classifier. The mean ROI-wide prevalence of irrigation is circa 94.3% where the ir_mask variable indicates the presence of irrigation (i.e., the blue area in Figure 1), versus 9.9% where it indicates its absence. The intercepts vary across GID levels and their geocodes; we discuss the associated scale-dependent random variation in Section 5 below.
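
As a quick check, the quoted prevalences can be recovered from the population-level coefficients in the summary above (a sketch, assuming sur1 is in memory):

# back-transform the sur1 fixed effects to prevalence estimates
b <- fixef(sur1)[, "Estimate"]
plogis(b["Intercept"] + b["ir_mask"]) # prevalence where ir_mask indicates irrigation (~94%)
plogis(b["Intercept"])                # prevalence where it does not (~10%)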

Before we do that, the model_weights function in the brms package can be used for model comparison and selection, as well as for Bayesian Model Averaging (Hinne et al., 2020). In practical terms, after fitting multiple models with brms, you can use model_weights to help decide which model(s) to use for inference or prediction; the weights provide a quantitative measure to guide this decision.

model_wgts <- model_weights(sur0, sur1, weights = "stacking")
model_wgts
##       sur0       sur1 
## 0.02111347 0.97888653

The high weight of the sur1 model indicates that it provides a much better fit to the data than the sur0 alternative. So, in the remainder of this document we will focus on using the sur1 model for generating predictions and understanding the uncertainty associated with those.

5 Posteriors

Posterior distributions are a fundamental concept in Bayesian statistics, serving as the basis for making predictions and for understanding the uncertainty in those predictions. The posterior predictive distribution is a probability distribution that represents what we expect for new, unseen data points, given the data we have already observed. In Bayesian terms, it combines our prior beliefs (the prior distribution) and the evidence from the observed data (the likelihood) to yield the posterior distribution. The posterior predictive distribution takes this one step further by integrating over all possible parameter values, weighted by how likely each value is according to the posterior distribution. Its purposes and uses are:

  • Making predictions: It is used to make predictions about future observations. By considering the range of plausible values for the model parameters, the posterior predictive distribution can provide predictions that incorporate uncertainty.

  • Model checking: It helps in assessing the model’s fit. By comparing the predictions of the posterior predictive distribution to actual observed data, we can check how well the model captures the underlying process.

  • Understanding uncertainty: It provides a fuller picture of uncertainty. Unlike point estimates, this distribution gives a range of plausible values for new observations, highlighting the uncertainty inherent in any predictions.

In essence, posterior predictive distributions allow us to make informed predictions and understand the limits of predictions in a Bayesian framework. The next chunk draws 3,000 random samples from the posterior distribution of the sur1 model parameters.

# draw posterior parameter samples (3,000 in this case)
sur1_draws <- as.data.frame(as_draws(sur1))

# function to simplify the sur1_draws column names
simplify_variable_name <- function(name) {
  
  # split the name by periods or underscores
  parts <- unlist(strsplit(name, split = "[._]"))
  
  # define the repetitive parts to be removed
  common_parts <- c("X1", "r", "L1", "L2", "L3", "L4", "Intercept")
  
  # remove the common parts and keep the unique parts
  unique_parts <- parts[!parts %in% common_parts]
  
  # join the unique parts back together
  simplified_name <- paste(unique_parts, collapse = "-")
  
  return(simplified_name)
}

variable_names <- colnames(sur1_draws)
simplified_names <- sapply(variable_names, simplify_variable_name)

# Rename the columns in your dataframe
colnames(sur1_draws) <- simplified_names
names(sur1_draws)[c(1:6)] <- c('Intercept', 'ir_mask', 'L1', 'L2', 'L3', 'L4')

# reshape data to long format
sur1_draws <- sur1_draws[ ,c(1:6)]
sur1_draws_long <- gather(sur1_draws, key = "GID_level", value = "value", 
                          Intercept, ir_mask, L1, L2, L3, L4)

This is what the posterior parameter distributions for the sur1 model look like …

# plot the posterior parameter distributions
ggplot(sur1_draws_long, aes(x = value, fill = GID_level)) +
  geom_density(alpha = 0.8) +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Parameter values", y = "Density", fill = "Parameter") +
  theme_few()
**Figure 2:** Posterior distributions of the main model parameters over the range of variation at both the population and GID levels.


Figure 2 illustrates the variation in the sur1 model’s parameters, aiding in understanding how much of the spatial variability in the data is attributable to each nested level of GID in the DG hierarchy. It is noteworthy that the L1-L4 random effects parameters do not exhibit a systematic ordering along the x-axis, which is unusual in our experience. According to Tobler’s Law, one would typically expect to see an ordered pattern for nested random effects, where L1, representing a 10,000 km2 grid cell, should exhibit more variability and uncertainty than L4, representing a 156 km2 cell. We suspect this irregularity may be an artifact of the oversampling in the survey. In Section 7, we explore whether poststratification can help attenuate this effect.

As an aside, Tobler’s Law, often summarized as ‘everything is related to everything else, but near things are more related than distant things’, was introduced by geographer Waldo Tobler (1970). This principle is foundational in spatial analysis, suggesting that the degree of interdependence between spatial entities or phenomena decreases with increasing distance. It emphasizes the importance of considering both the method of sampling and the aggregation of data in spatial studies. Furthermore, it underscores the significance of the scale of analysis, demonstrating how variations in scale (such as aggregating data at different levels on a DG) can influence the observed relationships in the data.

6 Prediction

As previously mentioned, brms models are also used for prediction, and there are three main methods to consider:

  • predict() provides point estimates along with uncertainty measures, such as credible intervals. It aligns well with the expectations for future surveys.

  • posterior_predict() is useful for exploring a range of potential outcomes based on the observed data and the model.

  • The posterior_epred() function assists in determining the expected value of the response variable for each observation by integrating over the posterior distribution of the parameters, excluding residual variability.

In this particular case we use posterior_linpred(), which computes posterior draws of the linear predictor, that is, before applying the inverse link function or other transformations. Our choice was based on the method’s effectiveness in representing the uncertainty of survey outcomes.
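
For reference, the three functions can be compared on the same new data; the sketch below, run on a few rows of the existing survey data frame, is illustrative only.

# illustrative comparison of the three prediction functions (small subset)
nd <- head(survey, 3)
pp <- posterior_predict(sur1, newdata = nd, ndraws = 50)  # 0/1 outcome draws
ep <- posterior_epred(sur1, newdata = nd, ndraws = 50)    # expected probabilities
lp <- posterior_linpred(sur1, newdata = nd, ndraws = 50)  # draws on the logit scale
# for this bernoulli model, posterior_linpred(..., transform = TRUE) returns the
# same quantity as posterior_epred()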

# setup new dataframe
gid <- survey %>% 
  group_by(L1, L2, L3, L4) %>%
  summarize(ir_mask = mean(ir_mask), n_survey = n())
gid <- as.data.frame(gid)

# generate posterior predictions
pred <- as.data.frame(posterior_linpred(sur1, newdata = gid, ndraws = 100))

# calculate median and credible intervals of the posterior distribution
quantiles <- pred %>%
  reframe(across(everything(), ~quantile(.x, probs = c(0.05, 0.5, 0.95), na.rm = TRUE)))

# transpose quantiles dataframe to long and append to gid
quant_long <- t(quantiles)
colnames(quant_long) <- c("Q5", "median", "Q95")
survey_pred <- cbind(gid, quant_long)

# write out the prediction file
write.csv(survey_pred, "./sae/survey_pred.csv", row.names = FALSE)

7 Poststratification

Poststratification is a statistical technique used in survey analysis to adjust the sample to better represent an ROI. The main steps are:

  • Identifying subgroups: The first step in poststratification is identifying relevant subgroups in the population. In our case the subgroups are based on the geocodes from a DG. The geocodes cover the entire ROI and were applied to the survey data based on their georeference.

  • Obtaining ROI wide proportions: Then, you obtain the known proportions of the geocodes in the overall ROI. This information usually comes from agricultural census data or other reliable studies. Unfortunately, those typically do not exist in Africa, at least not at a suitable spatial resolution.

  • Weighting: Now, for the crucial part: you compare the proportion of observations per geocode (i) in the overall ROI with the proportion of observations in the sample, i.e., as: \(wgt_i = \frac {p_i}{s_i}\).

  • Adjusted estimates: Finally, the weights are used to adjust the survey sample estimates. For example, if you’re calculating a mean or a proportion based on your survey data, you’d do this calculation using the weighted data rather than the raw data.

This method improves the representativeness of your survey results, making them more generalizable to the ROI. There are some problems with this, however. The first is that we don’t actually know the proportion of observations in the ROI (p). The second occurs when some of the geocodes, possibly many, are missing from the survey sample (s). In that case, a combination of redefining geocodes, statistical modeling, incorporating external data, and acknowledging the limitations of your study can help in making adjustments in future surveys.

7.1 Spatially balanced sampling of the ROI

We use a spatially balanced sample of our specified ROI to calculate the “population” proportions that are needed to calculate the weights for adjusting the survey sample estimates. This addresses the first problem (above). There is a separate notebook, which details the calculations. You can download it from the OSF repository here. The rendered .html document, which you can read in your browser, is also available here. Among other things, the notebook produces a posterior prediction table that we will use to calculate the weights needed to poststratify and adjust the survey results. It is available at: https://osf.io/65fke. Once you have downloaded it, place it in your ./sae subdirectory.

# read the table
roi_prop <- read.csv("./sae/sam0_post_pred.csv", header = TRUE, sep = ",")
roi_prop <- roi_prop[, -c(6,8)]
colnames(roi_prop) <- c("L1", "L2", "L3", "L4", "n_roi", "ir_mask")
roi_prop$ir_mask <- exp(roi_prop$ir_mask) / (exp(roi_prop$ir_mask) + 1) # back-transform from log-odds to probability

7.2 Weighting survey results

# poststratification table setup
post_strat_tab <- left_join(roi_prop, survey_pred, by = "L4")
post_strat_tab <- post_strat_tab[, c(1:6, 11)]
colnames(post_strat_tab) <- c("L1", "L2", "L3", "L4", "n_roi", "ir_mask", "n_survey")
post_strat_tab[is.na(post_strat_tab)] <- 1 # NAs are set to 1 for missing survey geocodes

# calculate weights (proportional to p_i / s_i; the constant scaling factor
# cancels when the weighted averages are computed below)
post_strat_tab$wgt <- post_strat_tab$n_roi / post_strat_tab$n_survey

Note that replacing the 147 NAs with 1’s is a small hack that allows us to calculate the complete set of weights for all of the geocodes that might/should have been covered by the survey. Their influence on the poststratification results is small, however. The missing geocodes also point to where some follow-up survey work might be conducted in the future.

7.3 Generating posterior irrigation prevalence predictions

The roi_prop table contains 1,140 geocodes. By comparison, the survey_pred table generated in Section 6 contains only 993 geocodes. This brings us to the second problem: performing poststratification with missing geocodes. It is very common for some of the cells in a poststratification table to be empty, because they are geographically remote and difficult to get to, they do not contain variables that are readily available, or they were not included in the sampling plan for some reason. The general solution is to generate new posterior predictions, i.e., with the additional geocodes from the post_strat_tab table, as follows:

# regenerate posterior predictions which include the missing geocodes from the `roi_prop` table
pred <- as.data.frame(posterior_linpred(sur1, newdata = post_strat_tab, ndraws = 100, allow_new_levels = TRUE))

# calculate median and credible intervals of the posterior distribution
quantiles <- pred %>%
  reframe(across(everything(), ~quantile(.x, probs = c(0.05, 0.5, 0.95), na.rm = TRUE)))

# transpose quantiles dataframe to long and append to the post_strat_tab dataframe
quant_long <- t(quantiles)
colnames(quant_long) <- c("Q5", "median", "Q95")
post_strat_tab <- cbind(post_strat_tab, quant_long)

# write out `post_strat_tab`
write.csv(post_strat_tab, "./sae/post_strat_tab.csv", row.names = FALSE)

This is the structure of the poststratification table with which we can generate an adjusted estimate of smallholder irrigation prevalence in Uganda. You can download the post_strat_tab table for reference and reuse here.

## 'data.frame':    1140 obs. of  11 variables:
##  $ L1      : chr  "E11S6" "E11S6" "E11S6" "E11S6" ...
##  $ L2      : chr  "E22S11" "E22S11" "E22S11" "E22S11" ...
##  $ L3      : chr  "E44S22" "E44S22" "E44S22" "E44S22" ...
##  $ L4      : chr  "E87S43" "E87S44" "E88S43" "E88S44" ...
##  $ n_roi   : int  3 9 2 28 6 9 8 17 7 31 ...
##  $ ir_mask : num  0.00174 0.0024 0.0025 0.00183 0.00466 ...
##  $ n_survey: num  2 29 3 31 7 8 4 13 5 29 ...
##  $ wgt     : num  1.5 0.31 0.667 0.903 0.857 ...
##  $ Q5      : num  -4 -4.16 -3.97 -4.02 -3.83 ...
##  $ median  : num  -2.91 -3.26 -2.86 -3.02 -2.85 ...
##  $ Q95     : num  -1.75 -2.3 -1.9 -2.26 -1.93 ...

The last thing to be done is to calculate the ROI wide weighted irrigation prevalence and its 95% credible intervals with:

Q50 <- sum(post_strat_tab$median * post_strat_tab$wgt) / sum(post_strat_tab$wgt)
prev50 <- round(exp(Q50)/ (exp(Q50) + 1) * 100)

Q5 <- sum(post_strat_tab$Q5 * post_strat_tab$wgt) / sum(post_strat_tab$wgt)
prev5 <- round(exp(Q5)/ (exp(Q5) + 1) * 100)

Q95 <- sum(post_strat_tab$Q95 * post_strat_tab$wgt) / sum(post_strat_tab$wgt)
prev95 <- round(exp(Q95)/ (exp(Q95) + 1) * 100)

So, by our best current estimate, the median prevalence of irrigation in Uganda’s croplands is 27%, with a 95% credible interval of 13 – 47%. This figure is much closer to the survey results one might obtain without oversampling. One could of course also calculate this for specific small(er) cropland areas in Uganda, but we shall leave that for you to explore.
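
The same numbers can be obtained more compactly with weighted.mean() and plogis(), which are equivalent to the explicit sums and the exp() / (1 + exp()) transform used above:

# compact equivalent of the weighted prevalence calculation above
round(plogis(weighted.mean(post_strat_tab$median, post_strat_tab$wgt)) * 100) # ~27
round(plogis(weighted.mean(post_strat_tab$Q5, post_strat_tab$wgt)) * 100)     # ~13
round(plogis(weighted.mean(post_strat_tab$Q95, post_strat_tab$wgt)) * 100)    # ~47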

7.4 Reverse geocoding and mapping

This section is a bit of an aside. It is about reverse geocoding and using the resulting coordinates for mapping the posterior irrigation prevalence predictions. We do this at the L4 grid level, but any other level might be handled in a similar manner. The next chunk sets this up based on the post_strat_tab dataframe. We also include a shapefile of Uganda’s national boundary.

# load L4 posterior prevalence predictions
L4_prev <- post_strat_tab
L4_prev <- L4_prev[, c(4,10)]
colnames(L4_prev) <- c("L4", "odds")
L4_prev$prev_perc <- (exp(L4_prev$odds) / (exp(L4_prev$odds) + 1)) * 100

# read the Uganda country boundary shapefile (courtesy https://gadm.org)
UG_boundary <- st_read("./sae/UG_boundary.shp")
## Reading layer `UG_boundary' from data source 
##   `/Users/MWalsh/Dropbox/My Mac (Markuss-MacBook-Pro-3.local)/Documents/Code/QSEL/sae/UG_boundary.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 3 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 1065555 ymin: -710798.1 xmax: 1664877 ymax: -68832.23
## Projected CRS: +proj=laea +lat_0=5 +lon_0=20 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs
# extract the bounding box
bbox <- st_bbox(UG_boundary)

We then calculate the reverse geocodes at GID L4 to provide their center coordinates as follows …

# stringsplit L4 geocode
split <- L4_prev %>%
  mutate(
    EW = str_extract(L4, "^[A-Z]"),
    x = as.numeric(gsub("[^0-9]+", "", str_extract(L4, "^[A-Z][0-9]+"))),
    NS = str_extract(gsub("^[A-Z][0-9]+", "", L4), "^[A-Z]"),
    y = as.numeric(gsub("[^0-9]", "", gsub("^[A-Z][0-9]+[A-Z]", "", L4)))
  )

# reverse geocode center coordinates of L4 GIDs
res.pixel <- 12500

# a GID index n covers the band ((n-1)*res, n*res], so the cell center lies
# at (n - 0.5)*res from the CRS origin, negated for the W and S hemispheres
L4_prev <- split %>%
  mutate(
    x = ifelse(EW == "E", (x - 0.5) * res.pixel, -(x - 0.5) * res.pixel),
    y = ifelse(NS == "N", (y - 0.5) * res.pixel, -(y - 0.5) * res.pixel)
  )

We use the sf package to construct the L4 grid for Uganda and join the posterior prevalence predictions to it.

# convert L4_prev to a `sf` object
L4_prev <- st_as_sf(L4_prev, coords = c("x", "y"), crs = st_crs(UG_boundary))

# Create a grid
grid <- st_make_grid(UG_boundary, cellsize = c(res.pixel, res.pixel))
clipped_grid <- st_intersection(grid, UG_boundary)
grid_sf <- st_sf(geometry = clipped_grid)

# Join point data to the grid
L4_join <- st_join(grid_sf, L4_prev, join = st_intersects)

The next step is to plot the posterior irrigation prevalence prediction grid. Note that one could similarly map the credible intervals or other L4 variables, but we shall leave those for you to explore.

# Define colors in RGB (hexadecimal format)
color_start <- "#B40426"
color_mid <- "#DDDDDD"
color_end <- "#3B4CC0"

# plotting with tiles and superimposing country borders
L4_grid_plot <- ggplot() +
  geom_sf(data = L4_join, aes(fill = prev_perc)) +
  geom_sf(data = UG_boundary, color = "black", fill = NA) +
  scale_fill_gradientn(colors = c(color_start, color_mid, color_end)) +
  theme_minimal() +
  coord_sf(xlim = c(bbox$xmin, bbox$xmax), ylim = c(bbox$ymin, bbox$ymax), datum = NA) +
  labs(fill = "Prevalence (%)")

L4_grid_plot
**Figure 3.** Posterior irrigation prevalence prediction map of the croplands of Uganda (2023).


# save the plot
ggsave("./sae/L4_grid_plot.png", plot = L4_grid_plot, width = 10, height = 8, dpi = 300)

# Export the grid as a shapefile for reference and reuse
# st_write(grid_sf, './sae/grid_shapefile.shp')

8 Use cases

Multilevel Regression and Poststratification (MRP) can be an invaluable tool in making data-driven management decisions and policies based on survey data. In the context of agriculture it can combine information about environmental conditions, crop requirements, agricultural energy needs, food security, and socio-economic factors. Here are some examples of how MRP can be applied in irrigated cropping systems:

  • Monitoring change: Agricultural monitoring is essential for several reasons, especially in the context of changing environmental conditions, growing populations and increasing demand for food security. MRP can analyze complex datasets that could include climate models, crop responses to various climatic conditions, regional environmental changes, and specific irrigation practices, among others.

  • Planning energy integration: Irrigation requires energy. This is of central interest in the ongoing CWP project titled “Using Data to Catalyze Energy Investments”. MRP can help in determining suitable locations for e.g., mini-grids and agrivoltaic systems by analyzing data on solar radiation, crop types, and water availability. This helps in identifying areas where the combination of solar energy generation and irrigated agriculture would be most effective.

  • Guiding policy decisions: Data-driven insights from MRP can inform policymakers about the most effective strategies for irrigation systems, helping in the allocation of resources, subsidies, and support for farmers adopting these practices.

  • Assessing long-term sustainability: MRP can aid in understanding the long-term impacts of irrigation systems on soil health, ground and surface water resources, GHG emissions, crop choice, crop productivity, and farm economics, which are all essential for sustainable agricultural decision making.

  • Enhancing fertilizer use efficiency: MRP can be used to analyze data on soil composition, crop types and water usage to recommend the most effective fertilizer types and application rates. This would help to ensure that crops receive the necessary nutrients without the risk of over-fertilization, which can lead to environmental issues like water pollution.

  • Adapting to climate change: In areas with fluctuating climatic conditions, MRP can be used to model how changes in rainfall, temperature, and solar radiation affect irrigation needs, supporting dynamic, in-season adjustment of farming practices.

These examples show that MRP offers significant advantages in optimizing agricultural practices, particularly in areas of irrigation management, sustainable energy integration, and fertilizer utilization, thereby potentially enhancing both productivity and environmental sustainability.

Session info

## R version 4.3.1 (2023-06-16)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Big Sur 11.7.10
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Denver
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] sp_2.1-0          sf_1.0-14         DT_0.30           ggthemes_4.2.4   
##  [5] brms_2.20.4       Rcpp_1.0.11       htmlwidgets_1.6.2 leaflet_2.2.0    
##  [9] lubridate_1.9.3   forcats_1.0.0     stringr_1.5.0     dplyr_1.1.3      
## [13] purrr_1.0.2       readr_2.1.4       tidyr_1.3.0       tibble_3.2.1     
## [17] ggplot2_3.4.3     tidyverse_2.0.0  
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.1.3                gridExtra_2.3            inline_0.3.19           
##   [4] rlang_1.1.1              magrittr_2.0.3           e1071_1.7-13            
##   [7] matrixStats_1.0.0        compiler_4.3.1           loo_2.6.0               
##  [10] systemfonts_1.0.4        callr_3.7.3              vctrs_0.6.3             
##  [13] reshape2_1.4.4           pkgconfig_2.0.3          crayon_1.5.2            
##  [16] fastmap_1.1.1            backports_1.4.1          ellipsis_0.3.2          
##  [19] labeling_0.4.3           utf8_1.2.3               threejs_0.3.3           
##  [22] promises_1.2.1           rmarkdown_2.25           markdown_1.9            
##  [25] tzdb_0.4.0               ps_1.7.5                 ragg_1.2.5              
##  [28] xfun_0.40                cachem_1.0.8             jsonlite_1.8.7          
##  [31] later_1.3.1              parallel_4.3.1           prettyunits_1.2.0       
##  [34] R6_2.5.1                 dygraphs_1.1.1.6         RColorBrewer_1.1-3      
##  [37] bslib_0.5.1              stringi_1.7.12           StanHeaders_2.26.28     
##  [40] jquerylib_0.1.4          rstan_2.32.3             knitr_1.44              
##  [43] zoo_1.8-12               base64enc_0.1-3          leaflet.providers_1.13.0
##  [46] bayesplot_1.10.0         httpuv_1.6.11            Matrix_1.5-4.1          
##  [49] igraph_1.5.1             timechange_0.2.0         tidyselect_1.2.0        
##  [52] rstudioapi_0.15.0        abind_1.4-5              yaml_2.3.7              
##  [55] codetools_0.2-19         miniUI_0.1.1.1           curl_5.1.0              
##  [58] processx_3.8.2           pkgbuild_1.4.2           lattice_0.21-8          
##  [61] plyr_1.8.9               shiny_1.7.5.1            withr_2.5.1             
##  [64] bridgesampling_1.1-2     posterior_1.5.0          coda_0.19-4             
##  [67] evaluate_0.22            units_0.8-4              proxy_0.4-27            
##  [70] RcppParallel_5.1.7       xts_0.13.1               pillar_1.9.0            
##  [73] tensorA_0.36.2           KernSmooth_2.23-21       checkmate_2.3.0         
##  [76] stats4_4.3.1             shinyjs_2.1.0            distributional_0.3.2    
##  [79] generics_0.1.3           hms_1.1.3                rstantools_2.3.1.1      
##  [82] munsell_0.5.0            scales_1.2.1             gtools_3.9.4            
##  [85] xtable_1.8-4             class_7.3-22             glue_1.6.2              
##  [88] tools_4.3.1              shinystan_2.6.0          colourpicker_1.3.0      
##  [91] mvtnorm_1.2-3            grid_4.3.1               QuickJSR_1.0.7          
##  [94] crosstalk_1.2.0          colorspace_2.1-0         nlme_3.1-162            
##  [97] cli_3.6.1                textshaping_0.3.6        fansi_1.0.5             
## [100] Brobdingnag_1.2-9        V8_4.4.0                 gtable_0.3.4            
## [103] sass_0.4.7               digest_0.6.33            classInt_0.4-10         
## [106] farver_2.1.1             htmltools_0.5.6          lifecycle_1.0.3         
## [109] mime_0.12                shinythemes_1.2.0