1 Introduction

This notebook develops a Bayesian approach to mapping, and potentially monitoring, building densities at local to national and international scales. I use Uganda’s most recent GeoSurvey (2020) data and the associated raster features to illustrate the mapping workflow. This R markdown document is maintained and versioned on the Uganda OSF repository here, from where it can be downloaded and modified. Please cite this work appropriately, as it is released under a CC-BY-4.0 International Public License.

1.1 Challenges for predictive mapping of buildings

An accurate record and mapping of buildings is important for a range of applications, from population estimation and urban planning to humanitarian response and environmental science. Apart from their numerous other uses (see Jochem and Tatem, 2021), accurate maps of building densities specifically are useful for understanding human impacts on the natural environment. For example, the data and models can help with estimating energy needs and GHG emissions in an area, pressure on protected areas and wildlife due to urbanization, or urban expansion into agricultural land. However, mapping and regularly monitoring buildings across a country or a whole continent presents challenges. These center primarily on the availability of low-cost, accurate and regularly updated data for predictive model training and testing, and on the computing resources needed to run and update the associated models on a regular basis.

In contrast, the data and models presented here can be produced quite quickly, at low cost (depending on image sources), and run and updated on a standard computer, albeit with some computational contortions (see Section 4 on predictions). A key advantage of the model workflow presented here is that it is open, reproducible, reusable and explicit about the uncertainty of the spatial predictions that are made.

There are volunteered building data providers, such as OpenStreetMap (OSM), that produce high-quality building footprint coverage, either through manual digitizing from imagery or by incorporating existing open building datasets. This is a time-consuming process, which may not fit into project workflows where the primary focus is not on building and infrastructure mapping per se. There are also very high quality, AI-assisted building datasets, such as Open Buildings, that are publicly available. Open Buildings, among others, is based on a deep-learning model that produces building footprints. I recommend reading the full report for a discussion of some of the issues associated with the footprints that have been produced so far (see: Sirko et al., 2021). Notably, both OSM and Open Buildings provide only building presence data (in the form of footprints) but no data about definitive absences. Modeling presence-only data, for example estimating changes in building densities when only presences are recorded, presents several challenges:

  • Prediction uncertainty: Without information on where buildings are not present, it is difficult to capture the full scope of change, which involves both new presences and new absences. This leads to uncertainty in predictions and to biases in statistical as well as machine learning models.

  • Sample selection bias: Presence-only data often come from sources, such as satellite imagery or reports, that are typically not sampled in a spatially balanced manner. This can bias models towards areas that have been more frequently observed or reported, skewing results.

  • Temporal changes and trends: Modeling over time requires assumptions about the temporal stability of the factors influencing building presence and building densities. Changes in urban and rural development planning, natural disasters, or economic factors can drastically alter building distributions, and without absence data these changes are hard to quantify accurately.

  • Complexity of baseline establishment: Establishing a reliable building density baseline is challenging because it needs to accurately reflect the initial state from which changes are measured. Any errors or biases in the baseline can propagate through the entire monitoring period.

  • Environmental and contextual factors: The presence of buildings is influenced by a wide variety of factors such as terrain, zoning laws, establishment of protected areas, economic conditions, and cultural preferences. Accurately modeling the effects of these factors is difficult without data on where buildings are absent.

1.2 GeoSurvey

GeoSurvey is an application for rapidly collecting and analyzing Land Use / Land Cover (LULC) observations. High resolution satellite images and/or other aerial (e.g., aircraft or drone) imagery can be systematically and rapidly labeled by trained image interpreters and/or by vetted crowds of Citizen Scientists. When done with care, these observations can result in large, well-structured, properly labeled geospatial data sets that are suitable for data mining, machine learning and geostatistical predictions. When supplied with properly time-stamped imagery, GeoSurvey can also be used for monitoring LULC changes. Figure 1 shows some examples of labeled GeoSurvey quadrats.

**Figure 1.** Examples of GeoSurvey-based land cover, land use tagging in 250 × 250 m quadrats.


The top two screenshots in Figure 1 show tagged buildings (in white) and cropland grid counts (in red). The bottom two screenshots differ in the amount of woody vegetation cover. It would certainly be difficult to drive a Land Rover through the one on the left, so it is classified as having dense woody vegetation cover. Neither of the bottom screenshots shows any recent, visible anthropogenic impact in the form of buildings or cropping. In this notebook I focus on modeling the building tags shown in the upper-left part of Figure 1, against the background examples shown in the other parts of the figure.

A detailed manual for conducting your own GeoSurveys is available at: GeoSurvey manual. Consult the manual for more information about how GeoSurvey can be used to carry out potentially high-value surveys of remote areas. There is also a slide deck available here, which illustrates different land cover and land use labeling approaches. You can also try other African GeoSurvey datasets, which are openly available via the GeoSurvey workflow repository at OSF (here).

Figure 2 shows the most recent (2020) GeoSurvey-based land cover map of Uganda. Note that the areas highlighted by the red box in the legend are of primary interest, because they delineate the areas of Uganda with buildings, which constitute the primary Region of Interest (ROI) examined in this notebook.

**Figure 2.** GeoSurvey-based land cover map and area estimates for Uganda (2020).


2 Setup

To run this notebook, you will need to install and load the R packages indicated in the chunk directly below.

# Package names
packages <- c("osfr", "tidyverse", "jsonlite", "rgdal", "sf", "raster", 
              "brms", "ggridges", "pROC")

# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
    utils::install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

2.1 Data

dir.create("geosurvey", showWarnings = FALSE)

# download GeoSurvey data from OSF
osf_retrieve_node("ft749") %>%
osf_ls_files(n_max = Inf) %>%
osf_download(path = "./geosurvey", conflicts = "overwrite")

# download raster data from OSF
#osf_retrieve_node("jc4qs") %>%
#osf_ls_files(n_max = Inf) %>%
#osf_download(path = "./geosurvey", conflicts = "overwrite")

# load GeoSurvey data
geos <- read.table("./geosurvey/UG_GS_2020.csv", header = T, sep = ",")

# load covariate grids
glist <- list.files(path="./geosurvey/", pattern="tif", full.names = TRUE)
grids <- stack(glist)

2.2 ETL

This section extracts, transforms and loads the observations from 30,030 (250 × 250 m) GeoSurvey quadrats, distributed on a spatially balanced sampling frame (Grafström and Schelin, 2014) covering all of Uganda’s land area. The quadrats were labeled by 6 experienced image interpreters between 26 March and 24 April 2020. The script calculates building counts from the associated point location JSON variables (bloc). I’ll use the resulting building counts (bcount) as the dependent variable in this notebook. The script also extracts the center-point locations (bcoord) of all 150,000+ buildings that were tagged. I will not be using those in the analyses presented here, but they may be of interest for other types of analyses.

# Coordinates and number of buildings per quadrat
bp <- geos[which(geos$bp == "Y"), ] ## identify quadrats with buildings
bp$bloc <- as.character(bp$bloc)

# coordinates of tagged building locations from quadrats with buildings
bjson <- fromJSON(bp$bloc[1])
bcoord <- do.call("rbind", bjson$feature$geometry$coordinates)
for(i in 2:nrow(bp)) {
  bjson <- fromJSON(bp$bloc[i])
  bcoord_temp <- do.call("rbind", bjson$feature$geometry$coordinates)
  bcoord <- rbind(bcoord, bcoord_temp)
}
bcoord <- as.data.frame(bcoord) ## coordinates of all tagged buildings
colnames(bcoord) <- c("lon","lat")

# number of tagged building locations per quadrat with buildings
bcount <- rep(NA, nrow(bp))
for(i in 1:nrow(bp)) {
  bjson <- fromJSON(bp$bloc[i])
  bcount[i] <- nrow(bjson$features)
}
bcount ## vector of building counts per quadrat with buildings
ba <- geos[which(geos$bp == "N"), ]
ba$bcount <- 0
bp <- cbind(bp, bcount)
geos <- rbind(ba, bp)
geos <- geos[order(geos$today),] ## sort in original sample order

The next chunk extracts the raster features at the GeoSurvey quadrat locations.

geos <- geos[ ,-c(10:11)]

# project GeoSurvey coords to grid CRS
geos.proj <- as.data.frame(project(cbind(geos$lon, geos$lat), "+proj=laea +ellps=WGS84
                                   +lon_0=20 +lat_0=5 +units=m +no_defs"))
colnames(geos.proj) <- c("x","y")
geos <- cbind(geos, geos.proj)
coordinates(geos) <- ~x+y
projection(geos) <- projection(grids)

# extract gridded variables at GeoSurvey locations
geosgrid <- extract(grids, geos)
gsdat <- as.data.frame(cbind(geos, geosgrid))

The next chunk specifies geocodes (GIDs) at different levels of resolution (in meters) of a discrete grid (DG), on the Lambert Azimuthal Equal Area coordinate reference system of the raster layers. The GIDs are set up in hierarchically nested levels (L1-L4).

xy <- gsdat[ ,c(15:16)]

# specify vector of resolution values (in meters) for each DG level
resolution <- c(100000, 50000, 25000, 12500)

# initialize an empty list to store the results
gids <- list()

# loop over each resolution value
for (i in 1:length(resolution)) {
    res.pixel <- resolution[i]
    xgid <- ceiling(abs(xy$x)/res.pixel)
    ygid <- ceiling(abs(xy$y)/res.pixel)
    gidx <- ifelse(xy$x < 0, paste("W", xgid, sep=""), paste("E", xgid, sep=""))
    gidy <- ifelse(xy$y < 0, paste("S", ygid, sep=""), paste("N", ygid, sep=""))
    gids[[i]] <- paste(gidx, gidy, sep="")
}

# append GIDs to `gsdat`
gids <- data.frame(gids)
names(gids) <- c("L1", "L2", "L3", "L4")
gsdat <- cbind(gids, gsdat)
gsdat <- rename(gsdat, BD20 = BD20.PERMANENT, BP20 = BP20.PERMANENT,
                GOB = GOB_bcount.PERMANENT, OSM = OSM_bcount.PERMANENT)

# select quadrats within the ROI only
gsdat <- filter(gsdat, BP20 > 0)

This is the dressed up version of the GeoSurvey data I shall be using in the subsequent analyses.
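The structure listing below can be reproduced with a one-line call (assuming the `gsdat` object assembled in the previous chunks):

str(gsdat)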

## 'data.frame':    15994 obs. of  20 variables:
##  $ L1      : chr  "E15S3" "E16S5" "E12S5" "E13S3" ...
##  $ L2      : chr  "E29S6" "E31S10" "E24S9" "E25S6" ...
##  $ L3      : chr  "E57S12" "E61S20" "E48S17" "E49S11" ...
##  $ L4      : chr  "E113S23" "E122S39" "E96S33" "E98S22" ...
##  $ today   : chr  "2020-03-26 22:28:00.330925+00:00" "2020-03-26 22:31:40.561058+00:00" "2020-03-26 22:35:27.245094+00:00" "2020-03-26 22:38:05.852700+00:00" ...
##  $ surveyor: chr  "Markus-W" "Markus-W" "Markus-W" "Markus-W" ...
##  $ id      : int  18162505 18148857 18168178 18170717 18166235 18168655 18166644 18148179 18152984 18168089 ...
##  $ lat     : num  2.349 0.584 1.232 2.466 0.151 ...
##  $ lon     : num  32.7 33.7 30.8 30.9 30.6 ...
##  $ bp      : chr  "N" "Y" "Y" "Y" ...
##  $ cp      : chr  "Y" "Y" "Y" "Y" ...
##  $ wp      : chr  "N" "N" "Y" "N" ...
##  $ cs      : chr  "N" "N" "N" "N" ...
##  $ bcount  : num  0 10 20 14 7 0 14 2 0 0 ...
##  $ BD20    : num  2.04 5.14 16.09 18.64 13.76 ...
##  $ BP20    : num  0.422 0.88 0.956 0.906 0.968 ...
##  $ GOB     : num  0 9 15 14 3 1 10 7 0 0 ...
##  $ OSM     : num  0 0 5 14 0 0 0 5 0 0 ...
##  $ x       : num  1405375 1518375 1197875 1215125 1176125 ...
##  $ y       : num  -281375 -475875 -408625 -271375 -528875 ...

3 Models

The next chunks fit 4 Bayesian regression models using the brms package (Bürkner et al., 2023). I include 3 of the 4 spatial covariates (features) shown in Figure 3 in the model setups.

**Figure 3.** Building count covariates (per 6.25 ha pixel) with Level-4 grid overlays in Uganda (2020).


  • BD20 – is a stacked spatial prediction of the GeoSurvey bcount data from 2020 that is based on a number of remote sensing and GIS data layers and six ML models. The R code and the associated raster layers are available at the Uganda OSF repository here.

  • BP20 – is a stacked spatial prediction of the GeoSurvey building presence / absence data from 2020. It is also based on ML models that are trained and tested on remote sensing and GIS data. The code and data are available at the Uganda OSF repository here.

  • GOB – is a rasterized version of building counts derived from Open Buildings data (~18 million predicted footprints for Uganda). There is a handy Colab notebook available, which you can run online to obtain predicted building footprints for individual countries. I used the footprint centerpoints to derive the counts shown in Figure 3 (see the sketch after this list).

  • OSM – is the most recent OpenStreetMap data for Uganda, which contains digitized building footprints. You can download those from GeoFabrik, where they are updated on a regular basis. Notably, the area of Uganda covered by OSM footprints is still quite sparse, particularly in rural parts of the country, so I do not use the data here. However, other countries in Africa have denser OSM coverage that may be useful for training models in those places.
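
The GOB layer in Figure 3 was derived by rasterizing footprint centerpoints onto the 250 m grid. The chunk below is a minimal sketch of that step; the centroid file name (ob_centroids.csv) and its lon/lat column names are assumptions, and the exact preprocessing of the Open Buildings download may differ.

# rasterize Open Buildings footprint centerpoints to per-pixel counts (sketch)
ob <- read.table("./geosurvey/ob_centroids.csv", header = TRUE, sep = ",") ## hypothetical centroid file
ob.proj <- as.data.frame(project(cbind(ob$lon, ob$lat),
                                 "+proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs"))
colnames(ob.proj) <- c("x", "y")
bp20 <- raster("./geosurvey/BP20.tif") ## 250 m (6.25 ha) template grid
gob_count <- rasterize(as.matrix(ob.proj), bp20, fun = "count", background = 0)
writeRaster(gob_count, "./geosurvey/GOB.tif", format = "GTiff", overwrite = TRUE)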

3.1 Fixed effects model

# specify priors
priors = c(prior(flat(), class = "b"),
           prior(student_t(3, -2.3, 2.5), class = "Intercept"),
           prior(student_t(3, 0, 2.5), class = "sd"))

# fit model
fe0 <- brm(data = gsdat,
           family = gaussian(),
           formula = bcount ~ BP20 + BD20 + GOB,
           # prior = priors,
           iter = 2500, warmup = 500, chains = 4, cores = 8,
           control = list(adapt_delta = .975, max_treedepth = 20),
           seed = 1235813)

saveRDS(fe0, "./geosurvey/fe0_brms_bcount.rds") # save model object
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: bcount ~ BP20 + BD20 + GOB 
##    Data: gsdat (Number of observations: 15994) 
##   Draws: 4 chains, each with iter = 2500; warmup = 500; thin = 1;
##          total post-warmup draws = 8000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    -0.82      0.14    -1.09    -0.54 1.00     5221     5220
## BP20          3.47      0.22     3.03     3.89 1.00     4869     4647
## BD20          0.36      0.01     0.34     0.38 1.00     4398     4631
## GOB           0.45      0.01     0.44     0.46 1.00     5045     4637
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     8.85      0.05     8.75     8.94 1.00     7594     5391
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

3.2 Thin plate spline model

# specify priors
priors = c(prior(flat(), class = "b"),
           prior(student_t(3, -2.3, 2.5), class = "Intercept"),
           prior(student_t(3, 0, 2.5), class = "sd"))

# fit model
tp0 <- brm(data = gsdat,
           family = gaussian(),
           formula = bcount ~ BP20 + BD20 + GOB + s(x,y),
           # prior = priors,
           iter = 2500, warmup = 500, chains = 4, cores = 8,
           control = list(adapt_delta = .975, max_treedepth = 20),
           seed = 1235813)

saveRDS(tp0, "./geosurvey/tp0_brms_bcount.rds") # save model object
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: bcount ~ BP20 + BD20 + GOB + s(x, y) 
##    Data: gsdat (Number of observations: 15994) 
##   Draws: 4 chains, each with iter = 2500; warmup = 500; thin = 1;
##          total post-warmup draws = 8000
## 
## Smooth Terms: 
##            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sds(sxy_1)     6.67      1.13     4.78     9.25 1.00     1655     3472
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    -1.45      0.15    -1.74    -1.16 1.00     9641     6071
## BP20          4.38      0.23     3.92     4.84 1.00     9164     5525
## BD20          0.36      0.01     0.34     0.38 1.00     8485     6179
## GOB           0.45      0.01     0.44     0.47 1.00     9032     6044
## sxy_1        -8.62      4.59   -17.66     0.51 1.00     4678     5354
## sxy_2       -15.23      3.74   -22.59    -7.98 1.00     4777     5175
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     8.76      0.05     8.66     8.85 1.00    18606     6046
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

3.3 Multilevel model

# specify priors
priors = c(prior(flat(), class = "b"),
           prior(student_t(3, -2.3, 2.5), class = "Intercept"),
           prior(student_t(3, 0, 2.5), class = "sd"))

# fit random intercept model
ml0 <- brm(data = gsdat,
           family = gaussian(),
           formula = bcount ~ BP20 + BD20 + GOB + (1 | L4),
           # prior = priors,
           iter = 2500, warmup = 500, chains = 4, cores = 8,
           control = list(adapt_delta = .975, max_treedepth = 20),
           seed = 1235813)

saveRDS(ml0, "./geosurvey/ml0_brms_bcount.rds") # save model object
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: bcount ~ BP20 + BD20 + GOB + (1 | L4) 
##    Data: gsdat (Number of observations: 15994) 
##   Draws: 4 chains, each with iter = 2500; warmup = 500; thin = 1;
##          total post-warmup draws = 8000
## 
## Group-Level Effects: 
## ~L4 (Number of levels: 1241) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     2.06      0.11     1.84     2.26 1.00     3072     4900
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    -0.95      0.16    -1.26    -0.65 1.00    10320     6808
## BP20          3.77      0.23     3.33     4.23 1.00    11270     6470
## BD20          0.35      0.01     0.34     0.37 1.00    14334     6273
## GOB           0.46      0.01     0.45     0.47 1.00    14784     6813
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     8.62      0.05     8.52     8.72 1.00     9535     6012
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

3.4 Combined model

As the name implies, this model combines the predictors of the previous three. At this stage we do not know whether this is a sensible thing to do, given the increased complexity of the model. I test for differences in prediction performance in Section 3.6.

st0 <- brm(
  formula = bcount ~ BP20 + BD20 + GOB + s(x,y) + (1 | L4),
  data = gsdat,
  family = gaussian(),
  iter = 2500, warmup = 500, chains = 4, cores = 8,
  control = list(adapt_delta = .975, max_treedepth = 20),
  seed = 1235813)

saveRDS(st0, "./geosurvey/st0_brms_bcount.rds") # save model object
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: bcount ~ BP20 + BD20 + GOB + s(x, y) + (1 | L4) 
##    Data: gsdat (Number of observations: 15994) 
##   Draws: 4 chains, each with iter = 2500; warmup = 500; thin = 1;
##          total post-warmup draws = 8000
## 
## Smooth Terms: 
##            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sds(sxy_1)     6.17      1.17     4.19     8.79 1.00     2698     4429
## 
## Group-Level Effects: 
## ~L4 (Number of levels: 1241) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     1.66      0.11     1.44     1.87 1.00     3463     5425
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept    -1.44      0.16    -1.74    -1.13 1.00    13330     7012
## BP20          4.37      0.24     3.91     4.82 1.00    16934     5782
## BD20          0.36      0.01     0.34     0.37 1.00    16725     6349
## GOB           0.46      0.01     0.45     0.47 1.00    17617     6384
## sxy_1       -10.85      5.17   -20.79    -0.37 1.00     7694     6297
## sxy_2       -14.17      4.24   -22.62    -6.15 1.00     6943     6601
## 
## Family Specific Parameters: 
##       Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma     8.61      0.05     8.51     8.71 1.00    11344     5791
## 
## Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).

3.5 Posterior predictions

# clear up memory
rm(list = setdiff(ls(), c("gsdat")))
gc()
# load previously saved models
fe0 <- readRDS(file = "./geosurvey/fe0_brms_bcount.rds")
tp0 <- readRDS(file = "./geosurvey/tp0_brms_bcount.rds")
ml0 <- readRDS(file = "./geosurvey/ml0_brms_bcount.rds")
st0 <- readRDS(file = "./geosurvey/st0_brms_bcount.rds")
# extract posterior predictions
fe <- predict(fe0)
tp <- predict(tp0)
ml <- predict(ml0)
st <- predict(st0)

# convert to data frames
fe <- as.data.frame(fe)
tp <- as.data.frame(tp)
ml <- as.data.frame(ml)
st <- as.data.frame(st)

# combine data frames
comb_pred <- bind_cols(fe[,1], tp[,1], ml[,1], st[,1])
names(comb_pred) <- c("fe", "tp", "ml", "st")
gsdat <- cbind(gsdat, st)
write.csv(gsdat, "./geosurvey/st0_bcount_preds.csv", row.names = F)
# convert to long format
comb_long <- comb_pred %>%
  pivot_longer(
    cols = everything(),
    names_to = "Model",
    values_to = "Value"
  )
ggplot(comb_long, aes(x = Value, y = Model, fill = Model)) +
  geom_density_ridges() +
  theme_ridges() +
  labs(x = "Building count", y = "Model") +
  xlim(-5, 50)
**Figure 4.** Posterior distributions for 4 models of GeoSurvey building count data. `fe` is the fixed effects model, `ml` is the multilevel model, `st` is the combined model, and `tp` is the thin plate spline model.


3.6 Model comparisons

Figure 4 shows that the posterior predictive distributions of the 4 models are quite similar. This raises the question of which individual model, or which combination of models, to use for generating spatial predictions. I compare the 4 models using leave-one-out (LOO) cross-validation. LOO is a technique in which a model is fit to all data points except one, and the omitted point is then used to test the model; this process is repeated for each data point in the dataset. In practice, brms approximates LOO with Pareto-smoothed importance sampling (PSIS-LOO), so the models are not actually refit for every observation. The aim is to assess each model’s predictive performance based on its Expected Log Predictive Density (elpd). Higher values of elpd indicate better predictive performance.

fe_loo <- loo(fe0)
tp_loo <- loo(tp0)
ml_loo <- loo(ml0)
st_loo <- loo(st0)

loo_compare(fe_loo, tp_loo, ml_loo, st_loo)
##     elpd_diff se_diff
## st0    0.0       0.0 
## ml0  -57.4      19.2 
## tp0  -92.3      31.7 
## fe0 -238.5      42.6

The next comparison is done using the Widely Applicable Information Criterion (WAIC). WAIC estimates the out-of-sample expectation of the log predictive density, which is a measure of how well the model predicts new data. WAIC also includes a correction for overfitting.

fe_waic <- waic(fe0)
tp_waic <- waic(tp0)
ml_waic <- waic(ml0)
st_waic <- waic(st0)

loo_compare(fe_waic, tp_waic, ml_waic, st_waic)
##     elpd_diff se_diff
## st0    0.0       0.0 
## ml0  -67.0      21.8 
## tp0  -84.0      31.3 
## fe0 -232.3      41.3

Both comparisons show that the combined model (st0) substantially outperforms the predictions generated by the 3 alternative models considered here. Bayesian Model Averaging (BMA) or other forms of model stacking are therefore not indicated. However, additional validation procedures should be considered before model deployment. This would require additional GeoSurvey data collections from spatially representative quadrats.
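
Should model stacking nevertheless be of interest, approximate LOO-based stacking weights can be obtained directly from the fitted brms objects. A minimal sketch (note that this recomputes LOO internally, so it can take a while):

# LOO-based stacking weights for the 4 candidate models (sketch)
loo_model_weights(fe0, tp0, ml0, st0)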

3.7 Building density classification

In this section I derive a building density classification based on Receiver Operating Characteristic (ROC) curves of the st0 model predictions. Categorizing continuous variables allows for easier comparison and area measurement of groups. For example, building density levels can be divided into distinct categories to compare environmental impacts, or for planning agricultural intensification and monitoring local energy investments. The categories are also useful for pre- or post-stratifying field surveys.

# define the threshold values and their directions (">" or "<")
thresholds <- list(
  list(value = 50, direction = ">"),
  list(value = 10, direction = ">"),
  list(value = 0, direction = ">")
)

# initialize lists to store results
roc_results <- list()
auc_results <- list()
cut_results <- list()

# loop through the thresholds
for (i in seq_along(thresholds)) {
    threshold_value <- thresholds[[i]]$value
    threshold_direction <- thresholds[[i]]$direction

    # Check the direction and create a binary variable accordingly
    if (threshold_direction == ">") {
        binary_variable <- ifelse(gsdat$bcount > threshold_value, 1, 0)
    } else if (threshold_direction == "<") {
        binary_variable <- ifelse(gsdat$bcount < threshold_value, 1, 0)
    } else {
        stop("Invalid threshold direction: must be '>' or '<'")
    }

    # Diagnostic print: Check the proportion of 1s and 0s
    print(table(binary_variable))

    # Calculate ROC curve
    roc_curve <- roc(binary_variable, gsdat$Estimate)
    roc_results[[i]] <- roc_curve

    # Calculate AUC
    auc_results[[i]] <- auc(roc_curve)

    # Calculate best cut-off
    cut_results[[i]] <- coords(roc_curve, "best")
}
## binary_variable
##     0     1 
## 15617   377 
## binary_variable
##     0     1 
## 11591  4403 
## binary_variable
##     0     1 
##  5470 10524
auc_values <- sapply(auc_results, function(x) {
  unname(x[[1]])
})
auc <- data.frame(id = seq_along(auc_values), AUC = auc_values)

cut <- do.call(rbind, lapply(seq_along(cut_results), function(i) {
  df <- cut_results[[i]]
  df$id <- i
  return(df)
}))

dcut <- merge(auc, cut, by = "id")
write.csv(dcut, "./geosurvey/dcut.csv", row.names = F)
dcut
##   id       AUC threshold specificity sensitivity
## 1  1 0.9841075 25.470218   0.9485176   0.9469496
## 2  2 0.9495252  8.864114   0.8702442   0.8850784
## 3  3 0.9590707  2.993283   0.8987203   0.8979475
gsdat$dclass <- cut(gsdat$Estimate, 
                  breaks = c(-Inf, dcut[3,3], dcut[2,3], dcut[1,3], Inf), 
                  labels = c("absent", "sparse", "medium", "dense"),
                  include.lowest = TRUE)

write.csv(gsdat, "./geosurvey/st0_bcount_preds.csv", row.names = F)

4 Predictive mapping

While spatial prediction with the models that have been developed is conceptually straightforward, computing the predictive maps is quite challenging, at least on standard desktop computers. The code provided below has been optimized to predict directly from saved R model objects (i.e., .rds files). The main trick is to partition the (new) data into small chunks (batches), generate predictions for each batch, and write them out to disk. At the end, the batches are reassembled to generate the predictive maps. It also helps to process the batches in parallel, using multiple computer cores and careful memory allocation. The next parts of this notebook show how this is done, starting with the setup of the spatial data features (GIS layers) and ending with writing out GeoTIFF files of the prediction estimates and their credible intervals, which can be imported into any GIS for post-processing and display.

4.1 Setup

# package names
packages <- c("foreach", "doParallel")

# install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
    utils::install.packages(packages[!installed_packages])
}

# load packages
invisible(lapply(packages, library, character.only = TRUE))
# load a previously saved brms model. Substitute a different model object here ...
model <- readRDS(file = "./geosurvey/st0_brms_bcount.rds")
# prepare new data from rasters
bp20_raster <- raster("./geosurvey/BP20.tif")
bd20_raster <- raster("./geosurvey/BD20.tif")
gob_raster <- raster("./geosurvey/GOB.tif")
osm_raster <- raster("./geosurvey/OSM.tif")

# extract values from the raster
bp20_values <- getValues(bp20_raster)
bd20_values <- getValues(bd20_raster)
gob_values <- getValues(gob_raster)
osm_values <- getValues(osm_raster)

# Extract coordinates from one of the rasters (all rasters share the same grid)
coords <- xyFromCell(bp20_raster, 1:ncell(bp20_raster))

# combine the coordinates and values into a data frame
new_data <- data.frame(x = coords[, 1],  # X coordinates
                       y = coords[, 2],  # Y coordinates
                       BP20 = bp20_values,
                       BD20 = bd20_values,
                       GOB = gob_values)

# remove NAs if necessary (e.g., remove rows with NAs)
new_data <- na.omit(new_data)
xy <- new_data[ ,c(1:2)]

# specify vector of resolution values (in meters) for each DG level
resolution <- c(12500)

# initialize an empty list to store the results
gids <- list()

# loop over each resolution value
for (i in 1:length(resolution)) {
    res.pixel <- resolution[i]
    xgid <- ceiling(abs(xy$x)/res.pixel)
    ygid <- ceiling(abs(xy$y)/res.pixel)
    gidx <- ifelse(xy$x < 0, paste("W", xgid, sep=""), paste("E", xgid, sep=""))
    gidy <- ifelse(xy$y < 0, paste("S", ygid, sep=""), paste("N", ygid, sep=""))
    gids[[i]] <- paste(gidx, gidy, sep="")
}

# append GIDs to `new_data`
gids <- data.frame(gids)
names(gids) <- c("L4")
new_data <- cbind(gids, new_data)
new_data <- na.omit(new_data)

# select only grid cells where buildings might occur
new_data <- filter(new_data, BP20 > 0)
new_data <- new_data %>%
  mutate(across(where(is.character), as.factor))

# clear up memory
rm(list = setdiff(ls(), c("model", "new_data")))
gc()

4.2 Generate predictions

Generating the predictions for > 1.6M pixels and their 95% credible intervals takes about 30 minutes per model to run, using 8 cores on a 16-core MacBook Pro with 32 GB of RAM.

# register number of cores
no_cores <- detectCores() - 8
registerDoParallel(cores=no_cores)

# Define a directory for temporary files (make sure it exists or create it)
temp_dir <- "./geosurvey/temp_files"
dir.create(temp_dir, showWarnings = FALSE)

# Determine batch size (number of rows per batch)
batch_size <- 10000 # adjust based on your system's capability

# Calculate the total number of batches
total_rows <- nrow(new_data)
total_batches <- ceiling(total_rows / batch_size)

# Process in batches using parallel foreach loop
foreach(i = 1:total_batches, .packages = c("dplyr")) %dopar% {
  # Define the row indices for the current batch
  start_row <- (i - 1) * batch_size + 1
  end_row <- min(i * batch_size, total_rows)
  
  # Subset data for the current batch
  tmp_data <- new_data[start_row:end_row, ]
  
  # Process the data (e.g., make predictions)
  pred_data <- as.data.frame(predict(model, newdata = tmp_data, allow_new_levels = TRUE))
  combined_data <- cbind(tmp_data, pred_data)
  
  # Write results of the current batch to a file
  batch_file <- file.path(temp_dir, paste0("batch_", i, ".rds"))
  saveRDS(combined_data, file = batch_file)
}
# read all the batch files and combine them into the final dataset
file_list <- list.files(temp_dir, full.names = TRUE)
final_results <- do.call("rbind", lapply(file_list, readRDS))

# apply the density classification thresholds
dcut <- read.table("./geosurvey/dcut.csv", header = TRUE, sep = ",")

final_results$dclass <- as.numeric(cut(final_results$Estimate, 
                  breaks = c(-Inf, dcut[3,3], dcut[2,3], dcut[1,3], Inf), 
                  labels = c("absent", "sparse", "medium", "dense"),
                  include.lowest = TRUE))

# write out prediction file for reuse
write.csv(final_results, "./geosurvey/st0_bcount_spreds.csv", row.names = F)

4.3 Export raster files

# convert final_results to SpatialPointsDataFrame
sp_preds <- final_results
coordinates(sp_preds) <- ~ x+y

# read raster template
bp <- raster("./geosurvey/BP20.tif")

# create new raster with estimated values
brms_est <- raster(bp)  # Use bp20_raster's properties as a template
brms_est <- rasterize(sp_preds, brms_est, field = "Estimate") 

# create new raster with error estimates
brms_error <- raster(bp)
brms_error <- rasterize(sp_preds, brms_error, field = "Est.Error") 

# create new raster with dclass categories
brms_dclass <- raster(bp)
brms_dclass <- rasterize(sp_preds, brms_dclass, field = "dclass") 

# export
spreds <- brick(brms_est, brms_error, brms_dclass)
fname <- paste("./geosurvey/", "st0_bcount_preds.tif", sep = "") # change name here
writeRaster(spreds, filename = fname, format = "GTiff", options = "INTERLEAVE=BAND", overwrite = TRUE)
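
As noted in Section 3.7, the density classes also lend themselves to simple area accounting. A minimal sketch, assuming the `brms_dclass` raster from the chunk above and the 6.25 ha (250 × 250 m) grid cells:

# tabulate areas by predicted building density class (each cell covers 6.25 ha)
class_freq <- as.data.frame(freq(brms_dclass, useNA = "no"))
class_freq$dclass <- c("absent", "sparse", "medium", "dense")[class_freq$value]
class_freq$area_ha <- class_freq$count * 6.25
class_freq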

Following a few GIS cosmetics, this is what the 2 main prediction maps look like.


**Figure 5.** Predicted building densities and density classes in Uganda.


5 Takeaways

  • This notebook demonstrates that generating high-quality building density prediction maps at a very low cost with a fast turnaround is feasible. In approximately two months, we can cover an entire country the size of Uganda, including the modeling aspects. This approach enables a wide range of stakeholders to participate in sampling, mapping, and monitoring human habitats. Besides tagging buildings, GeoSurvey has also been used for broader land use and land cover mapping, typically collecting multiple labels in a single survey. You can explore and utilize other national survey datasets from Africa that I have contributed to in the OSF repository here.

  • A key aspect of the workflow presented here is the use of open-access legacy data, such as Google’s Open Buildings footprint predictions and OpenStreetMap building footprints, combined with new observations from satellite or drone imagery collected on a spatially representative sampling frame. The essence of Bayesian modeling lies in using prior information and in updating, blending, stacking, testing and integrating multiple sources of evidence while explicitly handling uncertainty.

  • In this case, non-time-stamped Google VHR imagery from around 2016 to 2020 was used to tag buildings within 6.25 ha (250 × 250 m) pixels. However, any high-resolution, cloud-free, time-stamped imagery should be suitable for tagging or digitizing building footprints. You can specify the imagery that is used in the app (see GeoSurvey). The choice of sampling 6.25 ha resolution quadrats is somewhat arbitrary but aligns with many open and globally available image time series and raster layers. It is also important to note that higher resolution does not necessarily mean higher accuracy or ease of (re)use in practical applications.

  • Conceptually, fitting, testing, and predicting with the models is straightforward. However, these steps require considerable computational power and careful memory management on standard computers. As shown in this notebook, which was tailored for use on standard computers, the entire workflow can be completed in about 1.5 hours on a MacBook Pro. The code can also be run using GPUs or in cloud environments like AWS, which could be crucial for large-scale projects like predicting building densities across all of Africa.

6 Session info

## R version 4.3.1 (2023-06-16)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Big Sur 11.7.10
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Denver
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] pROC_1.18.4     ggridges_0.5.4  brms_2.20.4     Rcpp_1.0.11    
##  [5] raster_3.6-23   sf_1.0-14       rgdal_1.6-7     sp_2.1-0       
##  [9] jsonlite_1.8.7  lubridate_1.9.3 forcats_1.0.0   stringr_1.5.0  
## [13] dplyr_1.1.3     purrr_1.0.2     readr_2.1.4     tidyr_1.3.0    
## [17] tibble_3.2.1    ggplot2_3.4.3   tidyverse_2.0.0 osfr_0.2.9     
## 
## loaded via a namespace (and not attached):
##   [1] tensorA_0.36.2       rstudioapi_0.15.0    magrittr_2.0.3      
##   [4] farver_2.1.1         rmarkdown_2.25       fs_1.6.3            
##   [7] vctrs_0.6.3          memoise_2.0.1        base64enc_0.1-3     
##  [10] terra_1.7-46         htmltools_0.5.6      distributional_0.3.2
##  [13] curl_5.1.0           StanHeaders_2.26.28  sass_0.4.7          
##  [16] KernSmooth_2.23-21   bslib_0.5.1          htmlwidgets_1.6.2   
##  [19] plyr_1.8.9           zoo_1.8-12           cachem_1.0.8        
##  [22] igraph_1.5.1         mime_0.12            lifecycle_1.0.3     
##  [25] pkgconfig_2.0.3      colourpicker_1.3.0   Matrix_1.5-4.1      
##  [28] R6_2.5.1             fastmap_1.1.1        shiny_1.7.5.1       
##  [31] digest_0.6.33        colorspace_2.1-0     ps_1.7.5            
##  [34] crosstalk_1.2.0      labeling_0.4.3       urltools_1.7.3      
##  [37] fansi_1.0.5          timechange_0.2.0     mgcv_1.8-42         
##  [40] httr_1.4.7           abind_1.4-5          compiler_4.3.1      
##  [43] proxy_0.4-27         withr_2.5.1          backports_1.4.1     
##  [46] inline_0.3.19        shinystan_2.6.0      DBI_1.1.3           
##  [49] QuickJSR_1.0.7       pkgbuild_1.4.2       classInt_0.4-10     
##  [52] gtools_3.9.4         loo_2.6.0            tools_4.3.1         
##  [55] units_0.8-4          httpuv_1.6.11        threejs_0.3.3       
##  [58] glue_1.6.2           callr_3.7.3          nlme_3.1-162        
##  [61] promises_1.2.1       grid_4.3.1           checkmate_2.3.0     
##  [64] reshape2_1.4.4       generics_0.1.3       gtable_0.3.4        
##  [67] tzdb_0.4.0           class_7.3-22         hms_1.1.3           
##  [70] utf8_1.2.3           pillar_1.9.0         markdown_1.9        
##  [73] posterior_1.5.0      later_1.3.1          splines_4.3.1       
##  [76] lattice_0.21-8       tidyselect_1.2.0     miniUI_0.1.1.1      
##  [79] knitr_1.44           gridExtra_2.3        V8_4.4.0            
##  [82] crul_1.4.0           stats4_4.3.1         xfun_0.40           
##  [85] bridgesampling_1.1-2 matrixStats_1.0.0    DT_0.30             
##  [88] rstan_2.32.3         stringi_1.7.12       yaml_2.3.7          
##  [91] evaluate_0.22        codetools_0.2-19     httpcode_0.3.0      
##  [94] cli_3.6.1            RcppParallel_5.1.7   shinythemes_1.2.0   
##  [97] xtable_1.8-4         munsell_0.5.0        processx_3.8.2      
## [100] jquerylib_0.1.4      triebeard_0.4.1      coda_0.19-4         
## [103] parallel_4.3.1       rstantools_2.3.1.1   ellipsis_0.3.2      
## [106] prettyunits_1.2.0    dygraphs_1.1.1.6     bayesplot_1.10.0    
## [109] Brobdingnag_1.2-9    mvtnorm_1.2-3        scales_1.2.1        
## [112] xts_0.13.1           e1071_1.7-13         crayon_1.5.2        
## [115] rlang_1.1.1          shinyjs_2.1.0