The main problem with using administrative divisions (AD) such as districts, counties, wards and parishes, for sampling, reporting and communicating survey statistics are that their boundaries can change due redistricting and gerrymandering. ADs also differ in size, shape and complexity from country to country, and sometimes even within a country, which is not conducive to international comparisions of land survey results. Hence, land surveyors use Discrete Global Grids (DGGs) and the associated geocodes for mapping and land assessment. The advantages are:
Accuracy and precision: DGGs allow surveyors to precisely and accurately pinpoint survey areas and sampling locations anywhere on the earth’s surface. This is essential for creating detailed maps, area estimates and monitoring products at high resolution over large land areas.
Standardization: Using a DGG system provides a standardized method of measurement that can be universally understood and applied. Unlike latitude and longitude lines, which converge at the poles, DGGs provide uniform coverage and cell size across the globe. This uniformity is particularly useful for most geostatistical analyses, as it removes some of the distortions associated with spherical coordinates.
Scalability: DGGs can be designed at different resolutions, allowing for both broad global analyses and detailed local studies. Scalability is essential for applications ranging from global climate and economic models to local planning efforts.
Data sharing: With standardized geocodes, surveyors can easily share and compare data with other professionals e.g., urban and rural planners, engineers, foresters, rangeland and wildlife managers, climate modelers. national statistical bureaus, and international development agencies (e.g. FAO), among many others. This should improve consistency in planning and development projects.
Efficiency in field work: Using geocodes allows surveyors to efficiently plan, prioritize and execute field work. They can quickly locate and assess specific points of access to a site, saving time and resources.
Historical data integration: Geocodes enable surveyors to integrate historical data with new data. Fortunately, if the historical data are georeferenced, standardized DGGs can be retrofitted to them. This is important in areas where land use has changed over time, or where historical maps need to be compared with current ones. For environmental studies and land use planning, grid coordinates help in accurately assessing land features and their changes over time, aiding in sustainable development and conservation efforts.
There are a number of ISO compliant DGGs in world-wide use e.g., Google’s Open Location Codes (Plus Codes), the Military Grid Reference System (MGRS), and What3Words, among others. Because of our focus of work in Africa, we generally use a custom Lambert Azimuthal Equal Area based grid (i.e., CRS = +proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs) that corresponds to the Coordinate Reference System that is used for all of the standard AfSIS raster stacks. Note that while this works well accross Africa, it is not really suitable for use in other parts of the world without reprojection. This arguably renders it a Discrete “non-Global” Grid … so perhaps just the term Discrete Grid (DG) suffices here.
Multilevel Regression and Poststratification (Gelman and Hill (2017)) combines the strengths of multilevel regression modeling with the technique of poststratification to provide more accurate predictions for subgroups within a broader population or Region of Interest (ROI). It does so in 2 steps:
In the first step, a multilevel model is fit to the survey data. This model captures both individual-level responses and the variations across different DGG geocodes. These models are particularly useful when data from smaller groups is sparse because they allow for “borrowing strength” from the larger dataset by modeling shared patterns across groups. Essentially, information from larger, more representative areas inform predictions about smaller, less observed subgroups. A typical multilevel model with 1-level of grouping, as typically used in land survey contexts is:
\[ \ y_{ij} = x_{ij}\beta_{i} + z_{ij}u_i + \epsilon_{ij} \ \]
Where:
After fitting a suitable MLM, poststratification involves several steps to ensure that the survey results are adjusted to be more representative of the overall Region of Interest (ROI). Here’s how this process typically works:
Generate predictions for each level of the DG: Based on the specified DG resolution (preferably in equal area units), the ROI is divided into various nested geocodes. The fitted Multilevel Model is used to predict the outcomes for each (nested) level of geocode. If direct data is not available for a specific geocode, the model ‘borrows strength’ from the overall data to make an informed estimate.
Weighting the predictions: ROI specific marginals are used to understand the actual proportion of each geocode in the ROI. The predictions for each geocode are then weighted according to their proportions in the ROI. This step adjusts the predictions to account for over- or under-sampling in the survey data relative to their actual occurrence in the ROI.
Aggregating the adjusted estimates: After weighting the predictions for each geocode, these are combined to produce an overall estimate. This aggregate estimate is more representative of the target ROI, as it accounts for geocode variations. The final aggregated results reflect not just the raw survey data but are adjusted to mirror the structure and characteristics of the entire ROI.
This notebook examines the use of Multilevel Regression and Poststratification (MRP) in situations when adjustments are needed to correct for over- or under-representation of certain places in a specific ROI (sensu population). MRP combines multilevel modeling with poststratification adjustments, enhancing predictions in smaller geographical areas and improving the reproducibility, comprehensiveness and interpretation of land survey results. We use a Bayesian multilevel regression approach, because it provides a fuller picture of uncertainty. Unlike more traditional point estimates, the posterior distributions that are generated provide a range of plausible values for new observations, highlighting the uncertainty inherent in any predictions.
In land resource survey analysis, reliance on administrative divisions (e.g., districts, counties, wards, and parishes, as in Uganda) or census tracts is challenging due to their changing boundaries and variable land areas. A solution to these issues lies in the use of geocoded Discrete Grids, which offer more precision, standardization, and efficiency than traditional land delineations. Furthermore, they facilitate historical data integration and the use of prior domain knowledge. This notebook explores how a Discrete Grid-based approach can reveal scale-dependent variations in survey results and enhance their precision and accuracy for mapping and monitoring.
This document serves as a supplementary resource to the spatial irrigation prediction notebook for Uganda, accessible on the Open Science Framework, at https://osf.io/kxsqm. The results from that analysis confirm that the nation-wide smallholder irrigation survey data from 2023 were over-sampled and require adjustment to correct for this bias. This R markdown document is maintained and versioned on the same repository here, from where it can be downloaded and modified. Please ensure to cite this work appropriately, as it is protected under a CC-By-4.0 International Public License.
To actually run this notebook, you will need to install and load the R-packages indicated in the chunk directly below.
# Package names
packages <- c("tidyverse", "leaflet", "htmlwidgets", "brms", "ggthemes", "DT", "sf")
# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
utils::install.packages(packages[!installed_packages])
}
# Load packages
invisible(lapply(packages, library, character.only = TRUE))
The data we will be using was commissioned by Columbia World Projects (CWP) under a project titled “Using Data to Catalyze Energy Investments”, and implemented in partnership with the United Nations Development Fund and the Government of Uganda. The actual survey was conducted in 2023 by Government of Uganda field staff and covers the croplands of Uganda with over 23,000 georeferenced survey locations. At each location, the presence/absence of irrigation was recorded. The recordings were based on field observations of irrigation infrastructure and the availability of suitable water sources. The next chunk loads the 2023 survey data. We apologize that this part of the data is not publicly available yet, and we will update thedocument with the appropriate download link, once it is.
dir.create("sae")
survey <- read.table("./sae/mrpdat.csv", header = T, sep = ",")
survey$ir_mask <- ifelse(survey$stacked >= 0.635, 1, 0)
# survey$X <- NULL
The first thing to do is to specify geocodes (GIDs) at different scales of the DGG. The next chunk sets these up in hierarchically nested levels. Note that the resulting GIDs are a geocode on one of many possible DGGs but are readily converted to any other rectangular, equal area DGG.
# specify vector of resolution values (in meters) for each DGG level
resolution <- c(100000, 50000, 25000, 12500)
# initialize an empty list to store the results
gids <- list()
# loop over each resolution value
for (i in 1:length(resolution)) {
res.pixel <- resolution[i]
xgid <- ceiling(abs(survey$x)/res.pixel)
ygid <- ceiling(abs(survey$y)/res.pixel)
gidx <- ifelse(survey$x < 0, paste("W", xgid, sep=""), paste("E", xgid, sep=""))
gidy <- ifelse(survey$y < 0, paste("S", ygid, sep=""), paste("N", ygid, sep=""))
gids[[i]] <- paste(gidx, gidy, sep="")
}
# append GIDs (geocodes) and write out `survey` dataframe for reuse
gids <- data.frame(gids)
names(gids) <- c("L1", "L2", "L3", "L4")
survey <- cbind(gids, survey)
write.csv(survey, "./sae/saedat.csv")
This is the appended survey
dataframe that we shall be
using for the analyses covered by this notebook. The key variables are
the GIDs L1-L4
, the presence/absence of irrigation recorded
by the (dependent) irrigated
variable, and the
ir_mask
covariable, which is based on the
stacked
irrigation probability map shown in Figure 1
(below). The companion notebook, showing how the irrigation
probabilities were derived, is available in the project’s OSF repository
here. Simply download the .html and
display it in your browser.
## 'data.frame': 23256 obs. of 14 variables:
## $ L1 : chr "E14S4" "E15S4" "E14S5" "E14S6" ...
## $ L2 : chr "E28S7" "E29S8" "E28S10" "E27S12" ...
## $ L3 : chr "E56S14" "E58S16" "E55S19" "E53S24" ...
## $ L4 : chr "E111S27" "E115S32" "E110S38" "E105S48" ...
## $ sid : chr "f9339663" "0715e73f" "95466ecf" "cbd87542" ...
## $ lon : num 32.5 32.9 32.3 31.7 31.8 ...
## $ lat : num 1.909 1.342 0.644 -0.398 1.668 ...
## $ x : num 1383772 1432848 1364871 1302278 1306064 ...
## $ y : num -330705 -392958 -471584 -588239 -358720 ...
## $ irrigated: int 0 1 1 1 0 1 0 0 1 1 ...
## $ s2 : num 12 30 10 15 30 10 10 100 15 7 ...
## $ s3 : num 18 50 15 50 300 10 10 100 20 19 ...
## $ stacked : num 0.0579 0.895 0.8967 0.8963 0.132 ...
## $ ir_mask : num 0 1 1 1 0 1 0 1 1 1 ...
The specified GID levels L1, L2, L3, and L4 respectively are 10,000, 2,500, 625, and 156.25 km2 in size. While somewhat arbitrary, we have found that these levels describe the scale dependence inherent in large area survey data reasonably well. By scale dependence we mean that geographic phenomena can appear differently at various spatial scales. For example, a river’s course might seem relatively straight when viewed from a high-altitude satellite but is highly meandering when observed at ground level. You can always change the number and/or the resolution of those levels, if you wish.
The next chunks fit and compare two Bayesian multilevel regression models using the bmrs package (Bürkner et al., 2023).
The structure of the first model with 4 nested GID levels is as follows:
\[ Pr(y_{ijklm} = 1) = logit^{-1} (\mu + L_{i} + L_{ij} + L_{ijk} + L_{ijkl} + \epsilon_{ijklm}) \]
Where:
This is a random means model, which can be fit with:
priors = c(prior(normal(-3, 2), class = Intercept),
prior(normal(7, 3), class = beta),
prior(cauchy(0, 2), class = sd, group = L1),
prior(cauchy(0, 1), class = sd, group = L2),
prior(cauchy(0, 1), class = sd, group = L3),
prior(cauchy(0, 1), class = sd, group = L4))
# fit random intercept model
sur0 <- brm(data = survey,
family = bernoulli,
formula = irrigated ~ 1 + (1 | L1/L2/L3/L4),
# prior = priors,
iter = 4000, warmup = 1000, chains = 4, cores = 6,
control = list(adapt_delta = .975, max_treedepth = 20),
seed = 1235813)
# saveRDS(sur0, "./sae/dgg_survey.rds") # save model object
## Family: bernoulli
## Links: mu = logit
## Formula: irrigated ~ 1 + (1 | L1/L2/L3/L4)
## Data: survey (Number of observations: 23256)
## Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
## total post-warmup draws = 12000
##
## Group-Level Effects:
## ~L1 (Number of levels: 32)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 1.78 0.29 1.30 2.43 1.00 3739 5755
##
## ~L1:L2 (Number of levels: 103)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.51 0.12 0.26 0.76 1.00 1871 2076
##
## ~L1:L2:L3 (Number of levels: 323)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.59 0.09 0.42 0.75 1.00 1449 2263
##
## ~L1:L2:L3:L4 (Number of levels: 993)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 1.01 0.05 0.92 1.11 1.00 3232 6298
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 0.22 0.33 -0.41 0.89 1.00 2340 4012
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
From the model summary table for the sur0
model, the
mean survey irrigation prevalence estimate is circa 55.5% with a 95%
credible interval of (39.9 – 70.9%) of the overall cropland ROI that was
covered by the 2023 survey. This figure is quite high and not plausible,
but is also not unexpected. During the 2023 field campaign, enumerators
were instructed to seek out and oversample irrigated fields. This is
standard land survey practice when conducting comparative studies. If
the survey involves comparing different (land use or other) subgroups,
especially when one or more of those subgroups are thought to be rare,
oversampling can provide enough data for meaningful comparisons and
classifications. In these situations poststratification is needed to
adjust survey results.
The structure of the second model we will test here is:
\[ Pr(y_{ijklm} = 1) = logit^{-1} ( \beta_0 + \beta_1x_{ijklm} + L_{i} + L_{ij} + L_{ijk} + L_{ijkl} + \epsilon_{ijklm}) \]
Where:
It is fit with:
priors = c(prior(normal(-3, 2), class = Intercept),
prior(normal(7, 3), class = beta),
prior(cauchy(0, 2), class = sd, group = L1),
prior(cauchy(0, 1), class = sd, group = L2),
prior(cauchy(0, 1), class = sd, group = L3),
prior(cauchy(0, 1), class = sd, group = L4))
# fit random intercept model
sur1 <- brm(data = survey,
family = bernoulli,
formula = irrigated ~ ir_mask + (1 | L1/L2/L3/L4),
# prior = priors,
iter = 4000, warmup = 1000, chains = 4, cores = 6,
control = list(adapt_delta = .975, max_treedepth = 20),
seed = 1235813)
# saveRDS(sur1, "./sae/sur1_dgg_survey.rds") # save model object
## Family: bernoulli
## Links: mu = logit
## Formula: irrigated ~ ir_mask + (1 | L1/L2/L3/L4)
## Data: survey (Number of observations: 23256)
## Draws: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
## total post-warmup draws = 12000
##
## Group-Level Effects:
## ~L1 (Number of levels: 32)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.48 0.10 0.31 0.70 1.00 4012 6378
##
## ~L1:L2 (Number of levels: 103)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.20 0.08 0.03 0.35 1.00 1287 1704
##
## ~L1:L2:L3 (Number of levels: 323)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.09 0.06 0.00 0.22 1.00 1851 2848
##
## ~L1:L2:L3:L4 (Number of levels: 993)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.52 0.04 0.44 0.61 1.00 4186 6813
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept -2.21 0.11 -2.42 -2.00 1.00 4273 5962
## ir_mask 5.01 0.06 4.89 5.13 1.00 12267 9250
##
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
The model summary table shows a large fixed effect of the
ir_mask
classifier. The mean ROI wide prevalence of
irrigation is circa 94.3% when the ir_mask
variable
indicates the presence of irrigation (i.e., the blue area in Figure 1)
versus 9.9% when it indicates its absence. The intercepts will vary at
different GID levels and their geocodes, and we will discuss the
associated scale-dependent random variation in Section 5 below.
Before we do that, the model_weights
function in the
brms package can used for model comparison and
selection, as well for Bayesian Model Averaging (Hinne
et al., 2020). In practical terms, after fitting multiple models
using brms, you can use model_weights
to
help decide which model(s) to use for inference or prediction. The
weights provide a quantitative measure to guide this decision.
model_wgts <- model_weights(sur0, sur1, weights = "stacking")
model_wgts
## sur0 sur1
## 0.02111347 0.97888653
The high weight of the sur1
model indicates that it
provides a much better fit to the data than the sur0
alternative. So, in the remainder of this document we will focus on
using the sur1
model for generating predictions and
understanding the uncertainty associated with those.
Posterior distributions are a fundamental concept in Bayesian statistics, serving as the basis for making predictions and understanding the uncertainty in the predictions. The posterior predictive distribution is a probability distribution that represents what we expect for new, unseen data points, given the data we have already observed. In Bayesian terms, it combines our prior beliefs (the prior distribution) and the evidence from the observed data (the likelihood) to yield the posterior distribution. The posterior predictive distribution takes this one step further by integrating all possible parameter values, weighted by how likely each parameter value is according to the posterior distribution. Its purposes and uses are for:
Making predictions: It is used to make predictions about future observations. By considering the range of plausible values for the model parameters, the posterior predictive distribution can provide predictions that incorporate uncertainty.
Model checking: It helps in assessing the model’s fit. By comparing the predictions of the posterior predictive distribution to actual observed data, we can check how well the model captures the underlying process.
Understanding uncertainty: It provides a fuller picture of uncertainty. Unlike point estimates, this distribution gives a range of plausible values for new observations, highlighting the uncertainty inherent in any predictions.
In essence, posterior predictive distributions allow us to make
informed predictions and understand the limits of predictions in a
Bayesian framework. The next chunk draws 3,000 random samples from the
posterior distribution of the sur1
model parameters.
# draw posterior parameter samples (3,000) in this case)
sur1_draws <- as.data.frame(as_draws(sur1))
# function to simplify the sam0_draws column names
simplify_variable_name <- function(name) {
# split the name by periods or underscores
parts <- unlist(strsplit(name, split = "[._]"))
# define the repetitive parts to be removed
common_parts <- c("X1", "r", "L1", "L2", "L3", "L4", "Intercept")
# remove the common parts and keep the unique parts
unique_parts <- parts[!parts %in% common_parts]
# join the unique parts back together
simplified_name <- paste(unique_parts, collapse = "-")
return(simplified_name)
}
variable_names <- colnames(sur1_draws)
simplified_names <- sapply(variable_names, simplify_variable_name)
# Rename the columns in your dataframe
colnames(sur1_draws) <- simplified_names
names(sur1_draws)[c(1:6)] <- c('Intercept', 'ir_mask', 'L1', 'L2', 'L3', 'L4')
# reshape data to long format
sur1_draws <- sur1_draws[ ,c(1:6)]
sur1_draws_long <- gather(sur1_draws, key = "GID_level", value = "value",
Intercept, ir_mask, L1, L2, L3, L4)
This is what the posterior parameter distributions for the
sur1
model look like …
# plot the posterior parameter distributions
ggplot(sur1_draws_long, aes(x = value, fill = GID_level)) +
geom_density(alpha = 0.8) +
scale_fill_brewer(palette = "Set2") +
labs(x = "Parameter values", y = "Density", fill = "Parameter") +
theme_few()
Figure 2 illustrates the variation in the sur1
model’s
parameters, aiding in understanding how much of the spatial variability
in the data is attributable to each nested level of GID in the DG
hierarchy. It is noteworthy that the L1-L4
random effects
parameters do not exhibit a systematic ordering along the x-axis, which
is unusual in our experience. According to Tobler’s Law, one would
typically expect to see an ordered pattern for nested random effects,
where L1
, representing a 10,000 km2 grid cell,
should exhibit more variability and uncertainty than L4
,
representing a 156 km2 cell. We suspect this irregularity may
be an artifact of the oversampling in the survey. In Section 7, we
explore whether poststratification can help attenuate this effect.
As an aside, Tobler’s Law, often summarized as ’everything is related to everything else, but near things are more related than distant things, was introduced by geographer Waldo Tobler (1970). This principle is foundational in spatial analysis, suggesting that the degree of interdependence between spatial entities or phenomena decreases with increasing distance. It emphasizes the importance of considering both the method of sampling and the aggregation of data in spatial studies. Furthermore, it underscores the significance of the scale of analysis, demonstrating how variations in scale (such as aggregating data at different levels on a DG) can influence the observed relationships in the data.
As previously mentioned, brms models are also used for prediction, and there are three main methods to consider:
predict()
provides point estimates along with
uncertainty measures, such as credible intervals. It aligns well with
the expectations for future surveys
posterior_predict()
is useful for exploring a range
of potential outcomes based on the observed data and the model.
The posterior_epred()
function assists in
determining the expected value of the response variable for each
observation by integrating over the posterior distribution of the
parameters, excluding residual variability.
In this particular case, we use posterior_predict()
,
specifically posterior_linpred
, which computes posterior
draws of the linear predictor, that is before applying any link
functions or other transformations. Our choice was based on the method’s
effectiveness in representing the uncertainty of survey outcomes.
# setup new dataframe
gid <- survey %>%
group_by(L1, L2, L3, L4) %>%
summarize(ir_mask = mean(ir_mask), n_survey = n())
gid <- as.data.frame(gid)
# generate posterior predictions
pred <- as.data.frame(posterior_linpred(sur1, newdata = gid, ndraws = 100))
# calculate median and credible intervals of the posterior distribution
quantiles <- pred %>%
reframe(across(everything(), ~quantile(.x, probs = c(0.05, 0.5, 0.95), na.rm = TRUE)))
# transpose quantiles dataframe to long and append to gid
quant_long <- t(quantiles)
colnames(quant_long) <- c("Q5", "median", "Q95")
survey_pred <- cbind(gid, quant_long)
# write out the prediction file
write.csv(survey_pred, "./sae/survey_pred.csv", row.names = FALSE)
Poststratification is a statistical technique used in survey analysis to adjust the sample to better represent an ROI. The main steps are:
Identifying subgroups: The first step in poststratification is identifying relevant subgroups in the population. In our case the subgroups are based on the geocodes from a DG. The geocodes cover the entire ROI and were applied to the survey data based on their georeference.
Obtaining ROI wide proportions: Then, you obtain the known proportions of the geocodes in the overall ROI. This information usually comes from agricultural census data or other reliable studies. Unfortunately, those typically do not exit in Africa, at least not at a suitable spatial resolution.
Weighting: Now, for the crucial part - you compare proportion of observations per geocode (i) in the overall ROI with the proportion of observations in the sample i.e., as: \(wgt_i = \frac {p_i}{s_i}\).
Adjusted estimates: Finally, the weights are used to adjust the survey sample estimates. For example, if you’re calculating a mean or a proportion based on your survey data, you’d do this calculation using the weighted data rather than the raw data.
This method improves the representativeness of your survey results, making them more generalizable to the ROI. There are some small problems with this however. The first problem is that we don’t actually know the proportion of observations in the ROI (p). The second problem occurs when some of the geocodes, possibly many, are missing from the survey sample (s). In that case, a combination of either redefining geocodes, using statistical methods, incorporating external data, and acknowledging limitations in your study can help in making adjustments in future surveys.
We use a spatially balanced sample of our specified ROI to calculate the “population” proportions that are needed to calculate the weights for adjusting the survey sample estimates. This addresses the first problem (above). There is a separate notebook, which details the calculations. You can download it from the OSF repository here. The rendered .html document, which you can read in browser, is also available here. Among other things the notebook produces a posterior prediction table that we will use to calculate the weights that are needed to poststratify and adjust the survey results. It is available at: https://osf.io/65fke. Once you have downloaded it, place it into your ./sae subdirectory.
# read the table
roi_prop <- read.csv("./sae/sam0_post_pred.csv", header = TRUE, sep = ",")
roi_prop <- roi_prop[, -c(6,8)]
colnames(roi_prop) <- c("L1", "L2", "L3", "L4", "n_roi", "ir_mask")
roi_prop$ir_mask <- exp(roi_prop$ir_mask) / (exp(roi_prop$ir_mask) + 1)
# poststratification table setup
post_strat_tab <- left_join(roi_prop, survey_pred, by = "L4")
post_strat_tab <- post_strat_tab[, c(1:6, 11)]
colnames(post_strat_tab) <- c("L1", "L2", "L3", "L4", "n_roi", "ir_mask", "n_survey")
post_strat_tab[is.na(post_strat_tab)] <- 1 # NAs are set to 1 for missing survey geocodes
# calculate weights
post_strat_tab$wgt <- (post_strat_tab$n_roi / nrow(post_strat_tab)) / (post_strat_tab$n_survey / nrow(post_strat_tab))
Note that replacing the 147 NAs with 1’s is a small hack that allows us to calculate the complete set of weights for all of the geocodes that might/should have been covered by the survey. Their influence on the poststratification results is small however. The missing geocodes also point to where some followup survey work might be conducted in the future.
The roi_prop
table contains table contains 1140
geocodes. By comparison the survey_pred
table that was
generated in Section 6 only contains 993 geocodes. This brings us to the
second problem … of performing poststratification with
missing geocodes. It is very common that some of the cells in a poststratification
table are empty, because they are geographically remote and
difficult to get to, they do not contain variable that are readily
available, or they were not included in the sampling plan for some
reason. The general solution is to generate new posterior predictions
i.e., with the additional geocodes from the post_strat_tab
table as follows:
# regenerate posterior predictions which include the missing geocodes from the `roi_prop` table
pred <- as.data.frame(posterior_linpred(sur1, newdata = post_strat_tab, ndraws = 100, allow_new_levels = TRUE))
# calculate median and credible intervals of the posterior distribution
quantiles <- pred %>%
reframe(across(everything(), ~quantile(.x, probs = c(0.05, 0.5, 0.95), na.rm = TRUE)))
# transpose quantiles dataframe to long and append to the roi_pred dataframe
quant_long <- t(quantiles)
colnames(quant_long) <- c("Q5", "median", "Q95")
post_strat_tab <- cbind(post_strat_tab, quant_long)
# write out `post_strat_tab`
write.csv(post_strat_tab, "./sae/post_strat_tab.csv", row.names = FALSE)
This is the structure of the poststratification table with which we
can generate an adjusted estimate of the prevalence of smallholder
irrigation prevalence in Uganda. You can download the
post_strat_tab
table for reference and reuse here.
## 'data.frame': 1140 obs. of 11 variables:
## $ L1 : chr "E11S6" "E11S6" "E11S6" "E11S6" ...
## $ L2 : chr "E22S11" "E22S11" "E22S11" "E22S11" ...
## $ L3 : chr "E44S22" "E44S22" "E44S22" "E44S22" ...
## $ L4 : chr "E87S43" "E87S44" "E88S43" "E88S44" ...
## $ n_roi : int 3 9 2 28 6 9 8 17 7 31 ...
## $ ir_mask : num 0.00174 0.0024 0.0025 0.00183 0.00466 ...
## $ n_survey: num 2 29 3 31 7 8 4 13 5 29 ...
## $ wgt : num 1.5 0.31 0.667 0.903 0.857 ...
## $ Q5 : num -4 -4.16 -3.97 -4.02 -3.83 ...
## $ median : num -2.91 -3.26 -2.86 -3.02 -2.85 ...
## $ Q95 : num -1.75 -2.3 -1.9 -2.26 -1.93 ...
The last thing to be done is to calculate the ROI wide weighted irrigation prevalence and its 95% credible intervals with:
Q50 <- sum(post_strat_tab$median * post_strat_tab$wgt) / sum(post_strat_tab$wgt)
prev50 <- round(exp(Q50)/ (exp(Q50) + 1) * 100)
Q5 <- sum(post_strat_tab$Q5 * post_strat_tab$wgt) / sum(post_strat_tab$wgt)
prev5 <- round(exp(Q5)/ (exp(Q5) + 1) * 100)
Q95 <- sum(post_strat_tab$Q95 * post_strat_tab$wgt) / sum(post_strat_tab$wgt)
prev95 <- round(exp(Q95)/ (exp(Q95) + 1) * 100)
So, by our best current estimate the median prevalence of irrigation in Uganda’s croplands is 27%, with a 95% credible interval of: 13 – 47%. This figure is now much closer to survey results one might obtain when not oversampling. One could of course also calculate this for specific small(er) cropland areas in Uganda, but we shall leave that for you to explore.
This section is a bit of an aside. It is about reverse geocoding and
using the resulting coordinates for mapping the posterior irrigation
prevalence predictions. We do this at the L4 grid level, but any other
level might be handled in a similar manner. The next chunk sets this up
based on the post_strat_tab
dataframe. We also include a
shapefile of Uganda’s national boundary.
# load L4 posterior prevalence predictions
L4_prev <- post_strat_tab
L4_prev <- L4_prev[, c(4,10)]
colnames(L4_prev) <- c("L4", "odds")
L4_prev$prev_perc <- (exp(L4_prev$odds) / (exp(L4_prev$odds) + 1)) * 100
# read the Uganda country boundary shapefile (courtesy https://gadm.org)
UG_boundary <- st_read("./sae/UG_boundary.shp")
## Reading layer `UG_boundary' from data source
## `/Users/MWalsh/Dropbox/My Mac (Markuss-MacBook-Pro-3.local)/Documents/Code/QSEL/sae/UG_boundary.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 1 feature and 3 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 1065555 ymin: -710798.1 xmax: 1664877 ymax: -68832.23
## Projected CRS: +proj=laea +lat_0=5 +lon_0=20 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_defs
# extract the bounding box
bbox <- st_bbox(UG_boundary)
We then calculate the reverse geocodes at GID L4 to provide their center coordinates as follows …
# stringsplit L4 geocode
split <- L4_prev %>%
mutate(
EW = str_extract(L4, "^[A-Z]"),
x = as.numeric(gsub("[^0-9]+", "", str_extract(L4, "^[A-Z][0-9]+"))),
NS = str_extract(gsub("^[A-Z][0-9]+", "", L4), "^[A-Z]"),
y = as.numeric(gsub("[^0-9]", "", gsub("^[A-Z][0-9]+[A-Z]", "", L4)))
)
# reverse geocode center coordinates of L4 GIDs
res.pixel <- 12500
L4_prev <- split %>%
mutate(
x = ifelse(EW == "E", (x*res.pixel) - (0.5 * res.pixel), "0"),
y = ifelse(NS == "S", (-1*y * res.pixel) - (0.5 * res.pixel), "0")
)
We use the sf package to construct the L4 grid for Uganda and join the posterior prevalence predictions to it.
# convert L4_prev to a `sf` object
L4_prev <- st_as_sf(L4_prev, coords = c("x", "y"), crs = st_crs(UG_boundary))
# Create a grid
grid <- st_make_grid(UG_boundary, cellsize = c(res.pixel, res.pixel))
clipped_grid <- st_intersection(grid, UG_boundary)
grid_sf <- st_sf(geometry = st_sfc(clipped_grid))
# Join point data to the grid
L4_join <- st_join(grid_sf, L4_prev, join = st_intersects)
The next step is to plot the posterior irrigation prevalence prediction grid. Note that one could similarly map the credible intervals or other L4 variables, but we shall leave those for you to explore.
# Define colors in RGB (hexadecimal format)
color_start <- "#B40426"
color_mid <- "#DDDDDD"
color_end <- "#3B4CC0"
# plotting with tiles and superimposing country borders
L4_grid_plot <- ggplot() +
geom_sf(data = L4_join, aes(fill = prev_perc)) +
geom_sf(data = UG_boundary, color = "black", fill = NA) +
scale_fill_gradientn(colors = c(color_start, color_mid, color_end)) +
theme_minimal() +
coord_sf(xlim = c(bbox$xmin, bbox$xmax), ylim = c(bbox$ymin, bbox$ymax), datum = NA) +
labs(fill = "Prevalence (%)")
L4_grid_plot
# save the plot
ggsave("./sae/L4_grid_plot.png", plot = L4_grid_plot, width = 10, height = 8, dpi = 300)
# Export the grid as a shapefile for reference and reuse
# st_write(grid_sf, './sae/grid_shapefile.shp')
Multilevel Regression and Poststratification (MRP) can be an invaluable tool in making data-driven management decisions and policies based on survey data. In the context of agriculture it can combine information about environmental conditions, crop requirements, agricultural energy needs, food security, and socio-economic factors. Here are some examples of how MRP can be applied in irrigated cropping systems:
Monitoring change: Agricultural monitoring is essential for several reasons, especially in the context of changing environmental conditions, growing populations and increasing demand for food security. MRP can analyze complex datasets that could include climate models, crop responses to various climatic conditions, regional environmental changes, and specific irrigation practices, among others.
Planning energy integration: Irrigation requires energy. This is of central interest in the ongoing CWP project titled “Using Data to Catalyze Energy Investments”. MRP can help in determining suitable locations for e.g., mini-grids and agrivoltaic systems by analyzing data on solar radiation, crop types, and water availability. This helps in identifying areas where the combination of solar energy generation and irrigated agriculture would be most effective.
Guiding policy decisions: Data-driven insights from MRP can inform policymakers about the most effective strategies for irrigation systems, helping in the allocation of resources, subsidies, and support for farmers adopting these practices.
Assessing long-term sustainability: MRP can aid in understanding the long-term impacts of irrigation systems on soil health, ground and surface water resources, GHG emissions, crop choice, crop productivity, and farm economics, which are all essential for sustainable agricultural decision making.
Enhancing fertilizer use efficiency: MRP can be used analyze data on soil composition, crop types and water usage to recommend the most effective fertilizer types and application rates. This would help to ensure that crops receive the necessary nutrients without the risk of over-fertilization, which can lead to environmental issues like water pollution.
Adapting to climate change: In areas with fluctuating climatic conditions, MRP can be used to model how changes in rainfall, temperature, and solar radiation affect irrigation needs for dynamic, including in-season adjustment of farming practices.
These examples show that MRP offers significant advantages in optimizing agricultural practices, particularly in areas of irrigation management, sustainable energy integration, and fertilizer utilization, thereby potentially enhancing both productivity and environmental sustainability.
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Big Sur 11.7.10
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Denver
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] sp_2.1-0 sf_1.0-14 DT_0.30 ggthemes_4.2.4
## [5] brms_2.20.4 Rcpp_1.0.11 htmlwidgets_1.6.2 leaflet_2.2.0
## [9] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3
## [13] purrr_1.0.2 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
## [17] ggplot2_3.4.3 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] DBI_1.1.3 gridExtra_2.3 inline_0.3.19
## [4] rlang_1.1.1 magrittr_2.0.3 e1071_1.7-13
## [7] matrixStats_1.0.0 compiler_4.3.1 loo_2.6.0
## [10] systemfonts_1.0.4 callr_3.7.3 vctrs_0.6.3
## [13] reshape2_1.4.4 pkgconfig_2.0.3 crayon_1.5.2
## [16] fastmap_1.1.1 backports_1.4.1 ellipsis_0.3.2
## [19] labeling_0.4.3 utf8_1.2.3 threejs_0.3.3
## [22] promises_1.2.1 rmarkdown_2.25 markdown_1.9
## [25] tzdb_0.4.0 ps_1.7.5 ragg_1.2.5
## [28] xfun_0.40 cachem_1.0.8 jsonlite_1.8.7
## [31] later_1.3.1 parallel_4.3.1 prettyunits_1.2.0
## [34] R6_2.5.1 dygraphs_1.1.1.6 RColorBrewer_1.1-3
## [37] bslib_0.5.1 stringi_1.7.12 StanHeaders_2.26.28
## [40] jquerylib_0.1.4 rstan_2.32.3 knitr_1.44
## [43] zoo_1.8-12 base64enc_0.1-3 leaflet.providers_1.13.0
## [46] bayesplot_1.10.0 httpuv_1.6.11 Matrix_1.5-4.1
## [49] igraph_1.5.1 timechange_0.2.0 tidyselect_1.2.0
## [52] rstudioapi_0.15.0 abind_1.4-5 yaml_2.3.7
## [55] codetools_0.2-19 miniUI_0.1.1.1 curl_5.1.0
## [58] processx_3.8.2 pkgbuild_1.4.2 lattice_0.21-8
## [61] plyr_1.8.9 shiny_1.7.5.1 withr_2.5.1
## [64] bridgesampling_1.1-2 posterior_1.5.0 coda_0.19-4
## [67] evaluate_0.22 units_0.8-4 proxy_0.4-27
## [70] RcppParallel_5.1.7 xts_0.13.1 pillar_1.9.0
## [73] tensorA_0.36.2 KernSmooth_2.23-21 checkmate_2.3.0
## [76] stats4_4.3.1 shinyjs_2.1.0 distributional_0.3.2
## [79] generics_0.1.3 hms_1.1.3 rstantools_2.3.1.1
## [82] munsell_0.5.0 scales_1.2.1 gtools_3.9.4
## [85] xtable_1.8-4 class_7.3-22 glue_1.6.2
## [88] tools_4.3.1 shinystan_2.6.0 colourpicker_1.3.0
## [91] mvtnorm_1.2-3 grid_4.3.1 QuickJSR_1.0.7
## [94] crosstalk_1.2.0 colorspace_2.1-0 nlme_3.1-162
## [97] cli_3.6.1 textshaping_0.3.6 fansi_1.0.5
## [100] Brobdingnag_1.2-9 V8_4.4.0 gtable_0.3.4
## [103] sass_0.4.7 digest_0.6.33 classInt_0.4-10
## [106] farver_2.1.1 htmltools_0.5.6 lifecycle_1.0.3
## [109] mime_0.12 shinythemes_1.2.0