1 Introduction

Quantifying the geographical extent, location and spatial dynamics of croplands, rural and urban settlements and different types of vegetation cover provides essential information for monitoring and managing human-dominated ecosystems and landscapes. Large portions of Africa remain a virtual “terra incognita” in this context. The main reason for monitoring land cover is to assess where, within a particular country or region of interest (ROI), significant human impacts on ecosystem services can be expected across different land cover classes, and vice versa.

The main goal of this notebook is to illustrate improved starter code for predictive land cover mapping with multilabel data (also see the previous workflow examples here). We use Malawi’s legacy GeoSurvey data from 2018 and the associated raster features to illustrate the general approach and the main data analysis steps. Land cover data present a special class of labeling problems for data mining, machine learning and various statistical applications because they frequently contain multiple labels, which may be interdependent. The figure below illustrates the structure of multilabel data relative to binary and multiclass data types and classification tasks.


**Figure 1:** Main differences between binary, multiclass and multilabel data types.

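A minimal, hypothetical sketch of this distinction in R (toy values, not the Malawi data): a binary or multiclass target is a single factor, whereas a multilabel target carries one presence/absence column per label, so several labels can be true for the same observation.

# Toy targets for six observations (hypothetical values)
binary     <- factor(c("present", "absent", "present", "absent", "absent", "present"))
multiclass <- factor(c("building", "cropland", "woodland", "cropland", "woodland", "building"))

# Multilabel: one non-exclusive presence/absence column per label, as in the
# GeoSurvey data used below
multilabel <- data.frame(BP = c(1, 0, 1, 0, 0, 1),  ## buildings present?
                         CP = c(1, 1, 0, 1, 0, 1),  ## cropland present?
                         WP = c(0, 0, 1, 0, 1, 0))  ## dense woody cover present?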

GeoSurvey is an application for rapidly collecting and analyzing land cover observations. High-resolution satellite images and/or other aerial (e.g., aircraft or drone) imagery can be systematically and rapidly labeled by trained image interpreters and/or by vetted crowds of Citizen Scientists. When done with care, these observations can result in large, well-structured, properly labeled, geospatial data sets that are suitable for data mining, machine learning and geostatistical predictions of land cover and, in some instances, for monitoring land use. When supplied with properly time-stamped imagery, GeoSurvey can also be used for monitoring ecosystem and landscape changes. Figure 2 shows some labeled examples from Malawi.

**Figure 2:** Examples of tagged multilabel GeoSurvey observations over Malawi (2018).


The detailed manual for conducting your own GeoSurveys is available at: GeoSurvey manual. The manual should definitely be consulted for more information about how GeoSurvey can be used to carry out potentially high-value surveys of remote areas. There is also a great slide deck available here, which illustrates different land cover and land use labeling approaches. I’ll not cover those issues in this notebook and will assume that you already have well-designed GeoSurvey data and well-collocated spatial features in hand.

2 Data setup

To run this notebook, you will need to load the packages indicated in the chunk below. This allows you to model and predict the Malawi GeoSurvey observations using machine learning algorithms (MLAs) and/or geostatistics. The notebook itself is versioned and maintained on GitHub, and you can fork and modify it from there as you see fit.

# Package names
packages <- c("tidyverse", "rgdal", "sp", "raster", "leaflet", "htmlwidgets", "caret", "caretEnsemble",
              "doParallel", "pROC")

# Install packages
installed_packages <- packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
    install.packages(packages[!installed_packages])
}

# Load packages
invisible(lapply(packages, library, character.only = TRUE))

2.1 Data downloads

This section sets up the data that are needed to run this example: the most recent (2018) GeoSurvey data and the respective raster features. Download the survey data to your working directory from https://osf.io/myzct and the raster maps from https://osf.io/dcsgm.
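If you prefer to script the downloads, the following sketch should work; appending “/download” to an OSF short link is an assumption that generally holds for OSF file links, so check the URLs in a browser if the calls fail.

# Scripted downloads (sketch; verify the OSF "/download" redirects)
download.file("https://osf.io/myzct/download", "MW_gsdat18.csv", mode = "wb")
download.file("https://osf.io/dcsgm/download", "MW_250m_2018.zip", mode = "wb")
unzip("MW_250m_2018.zip", overwrite = TRUE)

The next chunk then loads the data from your working directory and links the GeoSurvey observations with the rasters.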

# GeoSurvey data
geos <- read.table("MW_gsdat18.csv", header = T, sep = ",")

# Recode the 1/0 presence/absence labels to the factor levels caret expects
# ("a" = present, "b" = absent)
geos$BP <- as.factor(ifelse(geos$BP == 1, "a", "b")) ## buildings
geos$CP <- as.factor(ifelse(geos$CP == 1, "a", "b")) ## croplands
geos$WP <- as.factor(ifelse(geos$WP == 1, "a", "b")) ## dense woody vegetation cover

# Raster feature data
# unzip("MW_250m_2018.zip", overwrite = T)
glist <- list.files(pattern = "tif$", full.names = T) ## anchored pattern avoids matching auxiliary (e.g., .tif.xml) files
grids <- stack(glist)

# Project GeoSurvey coords to grid CRS
geos.proj <- as.data.frame(project(cbind(geos$lon, geos$lat), 
                                   "+proj=laea +ellps=WGS84 +lon_0=20 +lat_0=5 +units=m +no_defs"))

colnames(geos.proj) <- c("x","y")
geos <- cbind(geos, geos.proj)
coordinates(geos) <- ~x+y
projection(geos) <- projection(grids)

# Extract raster variables at GeoSurvey locations
geosgrid <- extract(grids, geos)
gsdat <- as.data.frame(cbind(geos, geosgrid)) 
gsdat <- na.omit(gsdat) ## includes only complete cases
# gsdat <- gsdat[!duplicated(gsdat), ] ## removes any duplicates 

An overview map of where the >18k spatially representative Malawi GeoSurvey observations were collected by trained image interpreters in 2018 is generated by the next chunk. You can click and zoom into the individual locations.

# Plot GeoSurvey sample locations
w <- leaflet() %>%
  setView(lng = mean(gsdat$lon), lat = mean(gsdat$lat), zoom = 7) %>%
  addProviderTiles(providers$OpenStreetMap.Mapnik) %>%
  addCircleMarkers(gsdat$lon, gsdat$lat, clusterOptions = markerClusterOptions())
saveWidget(w, 'MW_GS_sample_locs.html', selfcontained = T) ## save widget
w ## plot widget

2.2 Calibration / validation split

To start the fitting processes, the next chunk scrubs some of the extraneous objects in memory, removes any incomplete cases, sets up the labels and features, and creates a randomized (80%/20%) partition between the training and validation dataframes.

# Set output directories
dir.create("results", showWarnings=F)
dir.create("base_learner", showWarnings=F)

# Set calibration/validation set randomization seed
seed <- 12358
set.seed(seed)

# Split data into calibration and validation sets
gsIndex <- createDataPartition(gsdat$BP, p = 4/5, list = F, times = 1)
gs_cal <- gsdat[ gsIndex,]
gs_val <- gsdat[-gsIndex,]

# GeoSurvey calibration labels
labs <- c("BP") ## insert other presence/absence labels (CP, WP) here!
lcal <- as.vector(t(gs_cal[labs]))

# Raster calibration features
fcal <- gs_cal[ ,c(9:42)]

The chunk also sets the calibration labels and the associated raster features. Note that while we are illustrating the code for the “are buildings present?” (BP) label here, you would also need to substitute the cropland and woody cover (>60%) presence variables (CP and WP) and specify those with the labs variable in the chunk directly above. If we were to run all 3 land cover labels at once, a “normal” computer would quickly run out of memory when training the complete predictions for this dataset.
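A hedged sketch of how the per-label runs might be organized, one label at a time to keep memory use manageable (the loop body merely stands in for rerunning the Section 3 chunks; nothing here is part of the original workflow):

# Run the workflow label-by-label to limit memory use (sketch only)
for (lab in c("BP", "CP", "WP")) {
  lcal <- as.vector(t(gs_cal[lab]))
  ## ... rerun the base-learner and stacking chunks of Section 3 here ...
  gc()  ## release memory between labels
}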

3 Spatial multilabel predictions

We will be using stacked generalization here (Wolpert, 1992). This amounts to independently training individual binary classifiers for each land cover label and then combining (stacking) the classifiers to account for dependencies between the labels. Given unseen samples, the stacked model then predicts all of the labels for which the respective base classifiers indicate positive results. This method of dividing the task into multiple binary tasks resembles the more familiar one-vs-rest or one-vs-one methods that are typically used in multiclass classification. However, it is fundamentally different, because binary and multiclass classifiers deal with a single target label rather than multiple labels; see e.g., Probst et al. (2017) for a concise description of how stacking works in the context of multilabel classification.
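The following toy sketch (synthetic data, not the notebook’s spatial pipeline) illustrates the core idea: base-level probabilities for all labels are restacked as the features of a meta-level classifier for each label, which is what lets the stack exploit label dependencies.

# Toy stacked generalization for two dependent labels (synthetic data)
set.seed(1)
n  <- 500
x  <- rnorm(n)                               ## a single shared feature
y1 <- rbinom(n, 1, plogis(2 * x))            ## label 1 depends on the feature
y2 <- rbinom(n, 1, plogis(1.5 * y1 - 0.5))   ## label 2 depends on label 1

# Level 0: one binary base learner per label, raw feature only
p1 <- predict(glm(y1 ~ x, family = binomial), type = "response")
p2 <- predict(glm(y2 ~ x, family = binomial), type = "response")

# Level 1: both base predictions become features for each label's meta-learner
s2 <- glm(y2 ~ p1 + p2, family = binomial)
summary(s2)  ## p1 picks up the y1 -> y2 dependency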

3.1 Base-learner training

This chunk fits 4 base classifiers, using all of the gridded calibration data, with 10-fold cross-validation and default tuning of the relevant hyperparameters. Learn more about how these algorithms work by following the links at: MASS, randomForest, xgboost and nnet. You can use caretEnsemble instead of caret as long as the feature variables (grids in this case) and the trainControl methods are the same for each model in the caretList function. This shortens the script-length of this notebook but does not otherwise affect the overall caret functionality. Note however that the calculations take a bit of time to run on a normal 8-core, 16 GB memory computer. This is not a big problem for a relatively small ROI like Malawi, but it might be computationally challenging for larger countries like the DRC, Tanzania or Ethiopia.

# Start doParallel
set.seed(seed)
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Specify model training controls
tc <- trainControl(method = "cv", number = 10, classProbs = T,
                   summaryFunction = twoClassSummary, allowParallel = TRUE, savePredictions="final")

# Fit 4 base classifiers using all of the raster features
blist <- caretList(fcal, lcal,
                   trControl = tc,
                   tuneList = NULL,
                   methodList = c("glmStepAIC", "rf", "xgbTree", "nnet"),
                   preProcess = c("center","scale"),
                   metric = "ROC")

# Generate spatial predictions
gl.pred <- predict(grids, blist$glmStepAIC, type = "prob")
rf.pred <- predict(grids, blist$rf, type = "prob")
xt.pred <- predict(grids, blist$xgbTree, type = "prob")
nn.pred <- predict(grids, blist$nnet, type = "prob")
spreds <- stack(gl.pred, rf.pred, xt.pred, nn.pred)
names(spreds) <- c("gl","rf","xt","nn")

# Save fitted models
stopCluster(mc)
fname <- paste("./base_learner/", labs, "_blist.rds", sep = "")
saveRDS(blist, fname)

This next chunk then fits and predicts a classifier ensemble based on the initial 4 base-learner models using the calibration data.

# Extract base-learner predictions at GeoSurvey locations
coordinates(gs_cal) <- ~x+y
projection(gs_cal) <- projection(spreds)
gspred <- extract(spreds, gs_cal)
gs_cal <- as.data.frame(cbind(gs_cal, gspred))
fcal <- gs_cal[ ,c(43:46)]

# Start doParallel
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Control setup
set.seed(1385321)
tc <- trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = T, 
                   summaryFunction = twoClassSummary, allowParallel = T)

# Model training
en <- train(fcal, lcal,
            method = "glmStepAIC",
            family = "binomial",
            metric = "ROC",
            trControl = tc)

# Model outputs & predictions
en.pred <- predict(spreds, en, type = "prob") ## ensemble spatial predictions
stopCluster(mc)
fname <- paste("./base_learner/", labs, "_en.rds", sep = "")
saveRDS(en, fname)

# Write out base-learner prediction grids
spreds <- stack(gl.pred, rf.pred, xt.pred, nn.pred, en.pred)
names(spreds) <- c("gl","rf","xt","nn","en")
fname <- paste("./base_learner/", "base_", labs, "_preds.tif", sep = "")
writeRaster(spreds, filename=fname, datatype="FLT4S", options="INTERLEAVE=BAND", overwrite=T)

To save time, we have pre-trained all 15 base learners we’ll be using for the stacking and model validation steps below. This is a bit of a hack. However, you can download all of the associated .rds model and GeoTIFF prediction files from our GeoSurvey OSF repository here. If you’d like to train your own classifiers, make sure to scrub the extraneous objects in memory after the initial training steps, retaining only the gs_cal, gs_val and lcal dataframes for the next steps.
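Reloading a pre-trained base-learner list might look like the following sketch; the file name assumes the saveRDS() convention used in the chunk above.

# Reload pre-trained base learners instead of refitting (sketch)
blist <- readRDS("./base_learner/BP_blist.rds")
blist$rf  ## print the cross-validation summary of one reloaded base learner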

3.2 Multilabel-learner training

The overall training process is quite similar to that for the base-learners. The main difference is that the training features used now are the ensemble-learner predictions that were generated in Section 3.1 above. This is one form of a problem transformation approach to multilabel classification (see e.g., Probst et al., 2017), which handles the dependencies among labels. You can download the 3 pre-trained ensemble prediction grids that are needed for this step from https://osf.io/f4p8g. Make sure to unzip and place those into your base_learner sub-directory.
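A scripted alternative for fetching and unpacking the grids (a sketch: the “/download” suffix is an assumed OSF convention and the local zip file name is hypothetical):

# Scripted download of the pre-trained ensemble prediction grids (sketch)
download.file("https://osf.io/f4p8g/download", "MW_en_preds.zip", mode = "wb")
unzip("MW_en_preds.zip", exdir = "./base_learner", overwrite = TRUE)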

# Scrub extraneous objects in memory
rm(list=setdiff(ls(), c("gs_cal", "gs_val", "labs", "lcal"))) ## scrubs extraneous objects in memory

# Load ensemble-learner raster files
glist <- list.files(path="./base_learner", pattern="_en.tif", full.names = T)
grids <- stack(glist) ## load ensemble-learner grids
coordinates(gs_cal) <- ~x+y
projection(gs_cal) <- projection(grids)

# Extract gridded variables at survey locations
mlcgrid <- extract(grids, gs_cal)
gs_cal <- as.data.frame(cbind(gs_cal, mlcgrid))

The next chunk then fits the stacked multilabel models to the calibration data.

#Select calibration features
fcal <- gs_cal[ , c(47:49)]

# Start doParallel
mc <- makeCluster(detectCores())
registerDoParallel(mc)

# Control setup
set.seed(1385321)
tc <- trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = T, 
                   summaryFunction = twoClassSummary, allowParallel = T)

# Model training
st <- train(fcal, lcal,
            method = "glmStepAIC",
            family = "binomial",
            metric = "ROC",
            trControl = tc)

# Model outputs & predictions
st.pred <- predict(grids, st, type = "prob") ## stacked spatial predictions
stopCluster(mc)
fname <- paste("./results/", labs, "_st.rds", sep = "")
saveRDS(st, fname)

# Write out multilabel-learner prediction grids
fname <- paste("./results/", "stack_", labs, "_pred.tif", sep = "")
writeRaster(st.pred, filename=fname, datatype="FLT4S", overwrite=T)

Figure 3 shows the multilabel-learner predictions (after applying a few GIS cosmetics in GRASS) that will be used in the model validation part of this notebook. Note that the presence of buildings (red) is positively associated with croplands (yellow) in different configurations, and both are negatively associated with the presence of dense (>60%) woody vegetation cover (green). You can download, examine and reuse the pre-trained model objects and spatial predictions at https://osf.io/eyda2. Unzip and place those into your results sub-directory.


**Figure 3:** Multilabel-learner predictions indicating the presence / absence of buildings, croplands, and dense woody vegetation cover.

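A short sketch for inspecting the downloaded artifacts; the file names assume the saveRDS() and writeRaster() conventions used in the chunk above.

# Inspect a pre-trained stacked model and its prediction grid (sketch)
st <- readRDS("./results/BP_st.rds")
summary(st$finalModel)  ## meta-learner (glmStepAIC) coefficients
st.pred <- raster("./results/stack_BP_pred.tif")
plot(st.pred, main = "Stacked BP prediction")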

3.3 Model validation

The models that have been developed have not seen any of the validation (test-set) data up to this stage. The next chunks calculate Receiver Operating Characteristic (ROC) curves for the mapped predictions of the land cover multilabels using the validation data.

# Scrub extraneous objects
unlink("./results/*.xml")
rm(list=setdiff(ls(), c("gs_val", "labs", "lcal"))) ## scrubs extraneous objects in memory

# Load multilabel learner rasters
glist <- list.files(path="./results", pattern="_pred.tif", full.names = T)
grids <- stack(glist) ## load multilabel learner grids

# Extract gridded variables at survey locations
coordinates(gs_val) <- ~x+y
projection(gs_val) <- projection(grids)
mlcgrid <- extract(grids, gs_val)
gs_val <- as.data.frame(cbind(gs_val, mlcgrid))

A ROC curve provides information about a classification test’s performance. The closer the apex of the curve is to the upper left corner, the greater the discriminatory ability of the test (i.e., both the true-positive (sensitivity) and the true-negative (specificity) rates are high). The next chunk does the calculations and plots the ROC curves using the pROC package. An area under the ROC curve (AUC) of 1 indicates perfect discrimination; an AUC of 0.5 is no better than a random guess.

# Buildings
BP_roc <- roc(gs_val$BP, gs_val$stack_BP_pred)
BP_auc <- auc(BP_roc)

# Croplands
CP_roc <- roc(gs_val$CP, gs_val$stack_CP_pred)
CP_auc <- auc(CP_roc)

# Dense woody vegetation cover (>60%)
WP_roc <- roc(gs_val$WP, gs_val$stack_WP_pred)
WP_auc <- auc(WP_roc)
par(pty="s", mar=c(4,4,1,1))
plot(BP_roc, xlim=c(1,0), ylim=c(0,1), col="tomato", cex.axis = 1, cex.lab = 1.3)
lines(CP_roc, col="#feb24c")
lines(WP_roc, col="dark green")

**Figure 4:** Classification ROC curves for the validation set. Buildings (red), croplands (yellow), dense woody vegetation cover (>60%, green).

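The chunk above computes but does not display the AUC values; a one-liner like the following prints them (the exact numbers will depend on your calibration/validation split).

# Report the validation-set AUC values
round(c(BP = BP_auc, CP = CP_auc, WP = WP_auc), 3)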

4 Main takeaways

Any questions, comments or corrections related to this notebook are most welcome via AFSIS.