library(sf)
#> Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
library(mapme.biodiversity)
Objectives
This tutorial gives you some information how to transform the output of the mapme-biodiversity package to a wide format for exchange with other (geospatial-)software, such as QGIS. This is necessary because the package uses the so-called nested-list format by default to support very different indicators that each may come in its own format. However, this format is specific to R and to use the data in other software thus requires some additional steps to be taken. This vignette thus shows you how you can obtain a wide-format for your portfolio that you can easily export and use with other software.
What are long vs. wide tables?
Tabular data can be structured in two different ways, which are usually referred to as long and wide format. Most people are more familiar with the wide format, because this is a format we as humans would naturally structure our data when we work with spreadsheets, e.g. in Excel. In wide-format, the identifier of a observation is included exactly once and does not repeat itself (see Table A). In the long format, the identifier as well as other qualifying variables, might be repeated several times to uniquely identify each observation is a single row (see Table B). The long format is often required when interacting with computers, e.g. to make plots with ggplot2. The content of the two is exactly the same either way, the one might be just more friendly to humans than to computers. If you are familiar with the R tidyverse, you might also have heard of the term tidy data. In terms of tabular data you can imagine tidy data to be referring to data in a long table which naturally fulfills the following requirements:
- each variable has its own column
- each observation has its own row
- each value has its own cell

Fig. 1: Example of a wide and long table holding the same data.
Table A, in that sense, is not tidy since the year variable is not found in its own column but instead it is scattered in two different columns. Table B is a long format with each variable being found in exactly one column. In that sense, each individual row represents exactly one observation, meaning the observation of a specific country in a specific year.
When we structure data in the long format the objects will usually have a larger memory footprint than compared to a wide format. For smaller objects or data types with small memory consumption, this might not pose a serious limitation to the workflow. However, geometry information, here indicated as a WKT string, might quickly accumulate a large proportion of the available memory, even more so if the portfolio consists of a high number of complex geometries that are then copied to fit the the long-format requirement. For that reason, this packages uses a nested-list format to hold tables for indicators as single columns within the portfolio. The remainder of this tutorial will show you in more detail how you can work with those R specific data format.
The simple case - single-row indicators
We start by reading a GeoPackage from disk. For the sake of the argument, we split the original single polygon into 9 distinct polygons to simulate a realistic portfolio consisting of multiple assets.
aoi <- read_sf(
system.file("extdata", "sierra_de_neiba_478140_2.gpkg",
package = "mapme.biodiversity"
)
)
aoi <- st_as_sf(st_make_grid(aoi, n = 3))
print(aoi)
#> Simple feature collection with 9 features and 0 fields
#> Geometry type: POLYGON
#> Dimension: XY
#> Bounding box: xmin: -71.80933 ymin: 18.57668 xmax: -71.33201 ymax: 18.69931
#> Geodetic CRS: WGS 84
#> x
#> 1 POLYGON ((-71.80933 18.5766...
#> 2 POLYGON ((-71.65022 18.5766...
#> 3 POLYGON ((-71.49111 18.5766...
#> 4 POLYGON ((-71.80933 18.6175...
#> 5 POLYGON ((-71.65022 18.6175...
#> 6 POLYGON ((-71.49111 18.6175...
#> 7 POLYGON ((-71.80933 18.6584...
#> 8 POLYGON ((-71.65022 18.6584...
#> 9 POLYGON ((-71.49111 18.6584...
As a simple example, suppose we interested in extracting the average traveltime to cities between 20,000 and 50,000 inhabitants in our portfolio. As usual, we will have to make available the Nelson et al. resource as well as requesting the calculation of the respective indicator.
port <- init_portfolio(
aoi,
years = 2018,
outdir = system.file("res", package = "mapme.biodiversity"),
tmpdir = system.file("tmp", package = "mapme.biodiversity"),
add_resources = TRUE
)
port_w_nelson <- get_resources(port, "nelson_et_al", range_traveltime = "20k_50k")
#> Starting process to download resource 'nelson_et_al'........
#> Skipping existing files in output directory.
port_w_traveltime <- calc_indicators(port_w_nelson, "traveltime", stats_accessibility = "mean")
#> Argument 'engine' for resource 'traveltime' was not specified. Setting to default value of 'extract'.
print(port_w_traveltime)
#> Simple feature collection with 9 features and 2 fields
#> Geometry type: POLYGON
#> Dimension: XY
#> Bounding box: xmin: -71.80933 ymin: 18.57668 xmax: -71.33201 ymax: 18.69931
#> Geodetic CRS: WGS 84
#> # A tibble: 9 × 3
#> assetid traveltime x
#> <int> <list> <POLYGON [°]>
#> 1 1 <tibble [1 × 2]> ((-71.80933 18.57668, -71.65022 18.57668, -71.65022 …
#> 2 2 <tibble [1 × 2]> ((-71.65022 18.57668, -71.49111 18.57668, -71.49111 …
#> 3 3 <tibble [1 × 2]> ((-71.49111 18.57668, -71.33201 18.57668, -71.33201 …
#> 4 4 <tibble [1 × 2]> ((-71.80933 18.61756, -71.65022 18.61756, -71.65022 …
#> 5 5 <tibble [1 × 2]> ((-71.65022 18.61756, -71.49111 18.61756, -71.49111 …
#> 6 6 <tibble [1 × 2]> ((-71.49111 18.61756, -71.33201 18.61756, -71.33201 …
#> 7 7 <tibble [1 × 2]> ((-71.80933 18.65844, -71.65022 18.65844, -71.65022 …
#> 8 8 <tibble [1 × 2]> ((-71.65022 18.65844, -71.49111 18.65844, -71.49111 …
#> 9 9 <tibble [1 × 2]> ((-71.49111 18.65844, -71.33201 18.65844, -71.33201 …
As we can observe from the output, two new columns were added to the
sf object. The first is called assetid
and simply is a
unique identifier of the respective asset. The second column is called
traveltime
and is of type list
indicating that
it represents a nested-list column. What that means is that we are able
to maintain the rectangular shape of our original data (e.g. one polygon
per row), while supporting arbitrarily shaped outputs for indicators.
Let’s observe how the traveltime indicator looks like in this
instance:
print(port_w_traveltime$traveltime[[1]])
#> # A tibble: 1 × 2
#> minutes_mean distance
#> <dbl> <chr>
#> 1 49.8 20k_50k
From the syntax above, you can see how you can access a single object
within the nested list column (e.g. by using the list accessor
[[
). In this case, the shape of the traveltime indicator is
a single-row and two-column tibble with the average minutes and the
distance category as its value. Because the indicator comes as a single
row, we can simply unnest the traveltime column to get to a wide-format
output:
port_wide_1 <- tidyr::unnest(port_w_traveltime, traveltime)
print(port_wide_1)
#> Simple feature collection with 9 features and 3 fields
#> Geometry type: POLYGON
#> Dimension: XY
#> Bounding box: xmin: -71.80933 ymin: 18.57668 xmax: -71.33201 ymax: 18.69931
#> Geodetic CRS: WGS 84
#> # A tibble: 9 × 4
#> assetid minutes_mean distance x
#> <int> <dbl> <chr> <POLYGON [°]>
#> 1 1 49.8 20k_50k ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#> 2 2 104. 20k_50k ((-71.65022 18.57668, -71.49111 18.57668, -71.4…
#> 3 3 112. 20k_50k ((-71.49111 18.57668, -71.33201 18.57668, -71.3…
#> 4 4 84 20k_50k ((-71.80933 18.61756, -71.65022 18.61756, -71.6…
#> 5 5 139. 20k_50k ((-71.65022 18.61756, -71.49111 18.61756, -71.4…
#> 6 6 104. 20k_50k ((-71.49111 18.61756, -71.33201 18.61756, -71.3…
#> 7 7 141. 20k_50k ((-71.80933 18.65844, -71.65022 18.65844, -71.6…
#> 8 8 118. 20k_50k ((-71.65022 18.65844, -71.49111 18.65844, -71.4…
#> 9 9 73.1 20k_50k ((-71.49111 18.65844, -71.33201 18.65844, -71.3…
In this case, the result has 9 rows just like the original data
frame. In this particular scenario, where the only variable is
minutes_mean
, the object port_wide_1
contains
just one value per polygon. This aligns well with the characteristics of
the wide format, which can be easily understood and processed by various
(geospatial) software. So, we can now export our portfolio for example
as a geopackage that can be read directly into e.g. QGIS or as a CSV
without the geospatial information:
st_write(port_wide_1, dsn = tempfile(fileext = ".gpkg"))
#> Writing layer `file25be19ec7058' to data source
#> `/tmp/RtmpXlQGOZ/file25be19ec7058.gpkg' using driver `GPKG'
#> Writing 9 features with 3 fields and geometry type Polygon.
write.csv(st_drop_geometry(port_wide_1), file = tempfile(fileext = ".csv"))
The harder case - indicators with multi-row output
Let’s continue to query an indicator which has a multi-row output, i.e. precipitation statistics from WorldClim.
port_w_chirps <- get_resources(port, "worldclim_precipitation")
#> Starting process to download resource 'worldclim_precipitation'........
#> Skipping existing files in output directory.
port_w_prec <- calc_indicators(port_w_chirps, "precipitation_wc", stats_worldclim = "mean")
#> Argument 'engine' for resource 'precipitation_wc' was not specified. Setting to default value of 'extract'.
print(port_w_prec)
#> Simple feature collection with 9 features and 2 fields
#> Geometry type: POLYGON
#> Dimension: XY
#> Bounding box: xmin: -71.80933 ymin: 18.57668 xmax: -71.33201 ymax: 18.69931
#> Geodetic CRS: WGS 84
#> # A tibble: 9 × 3
#> assetid precipitation_wc x
#> <int> <list> <POLYGON [°]>
#> 1 1 <tibble [12 × 2]> ((-71.80933 18.57668, -71.65022 18.57668, -71.65022…
#> 2 2 <tibble [12 × 2]> ((-71.65022 18.57668, -71.49111 18.57668, -71.49111…
#> 3 3 <tibble [12 × 2]> ((-71.49111 18.57668, -71.33201 18.57668, -71.33201…
#> 4 4 <tibble [12 × 2]> ((-71.80933 18.61756, -71.65022 18.61756, -71.65022…
#> 5 5 <tibble [12 × 2]> ((-71.65022 18.61756, -71.49111 18.61756, -71.49111…
#> 6 6 <tibble [12 × 2]> ((-71.49111 18.61756, -71.33201 18.61756, -71.33201…
#> 7 7 <tibble [12 × 2]> ((-71.80933 18.65844, -71.65022 18.65844, -71.65022…
#> 8 8 <tibble [12 × 2]> ((-71.65022 18.65844, -71.49111 18.65844, -71.49111…
#> 9 9 <tibble [12 × 2]> ((-71.49111 18.65844, -71.33201 18.65844, -71.33201…
We see that similar to the traveltime indicator, the output consists of 9 rows with the precipitation indicator as a single nested-list column. Note, however, the differences when we take a look at the shape of a the indicator for a specific asset:
print(port_w_prec$precipitation_wc[[1]])
#> # A tibble: 12 × 2
#> prec_mean date
#> <dbl> <date>
#> 1 21.1 2018-01-01
#> 2 21.4 2018-02-01
#> 3 61.4 2018-03-01
#> 4 83.6 2018-04-01
#> 5 331. 2018-05-01
#> 6 60.3 2018-06-01
#> 7 74.0 2018-07-01
#> 8 94.0 2018-08-01
#> 9 207. 2018-09-01
#> 10 126. 2018-10-01
#> 11 52.6 2018-11-01
#> 12 20.2 2018-12-01
For a single asset, we obtain a tibble with 12 rows (for each month in the queried year 2018). Now, if we just simply unnest this indicator column observe what happens to the shape of the output:
port_wide_2 <- tidyr::unnest(port_w_prec, precipitation_wc)
print(port_wide_2)
#> Simple feature collection with 108 features and 3 fields
#> Geometry type: POLYGON
#> Dimension: XY
#> Bounding box: xmin: -71.80933 ymin: 18.57668 xmax: -71.33201 ymax: 18.69931
#> Geodetic CRS: WGS 84
#> # A tibble: 108 × 4
#> assetid prec_mean date x
#> <int> <dbl> <date> <POLYGON [°]>
#> 1 1 21.1 2018-01-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#> 2 1 21.4 2018-02-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#> 3 1 61.4 2018-03-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#> 4 1 83.6 2018-04-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#> 5 1 331. 2018-05-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#> 6 1 60.3 2018-06-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#> 7 1 74.0 2018-07-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#> 8 1 94.0 2018-08-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#> 9 1 207. 2018-09-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#> 10 1 126. 2018-10-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#> # ℹ 98 more rows
Instead of 9 rows, we get a tibble with 108 rows (9 assets * 12), with the metadata for each asset as well as the geometry column being repeated. This is not desirable, especially for portfolios with large numbers of assets where the repetition of the geometry data can cause very large objects. Now, let’s investigate how we still are able to get a wide output. A valid approach is to re-shape the indicators per asset to a wide format before unnesting the indicator column. For this, let’s write a function which takes the precipitation indicator of a single asset as input and transforms it to a wide-format.
prec2wide <- function(data) {
pivot_wider(data, names_from = date, values_from = prec_mean, names_prefix = "prec-")
}
prec2wide(port_w_prec$precipitation_wc[[1]])
#> # A tibble: 1 × 12
#> `prec-2018-01-01` `prec-2018-02-01` `prec-2018-03-01` `prec-2018-04-01`
#> <dbl> <dbl> <dbl> <dbl>
#> 1 21.1 21.4 61.4 83.6
#> # ℹ 8 more variables: `prec-2018-05-01` <dbl>, `prec-2018-06-01` <dbl>,
#> # `prec-2018-07-01` <dbl>, `prec-2018-08-01` <dbl>, `prec-2018-09-01` <dbl>,
#> # `prec-2018-10-01` <dbl>, `prec-2018-11-01` <dbl>, `prec-2018-12-01` <dbl>
From the example output above we see that our custom
prec2wide()
functions takes a single tibble as input, puts
in the dates appended by prec-
as column names and the
associated average precipitation values as a single row. Now, we can use
functionality of the purrr
package to map this function
over our indicator column, save the result in a new column (or simply
overwrite the original), and finally simply call unnest to get the
output in a wide format.
port_w_prec$precipitation_wc2 <- purrr::map(port_w_prec$precipitation_wc, prec2wide)
port_w_prec <- select(port_w_prec, assetid, precipitation_wc2)
port_wide_3 <- unnest(port_w_prec, precipitation_wc2)
print(port_wide_3)
#> Simple feature collection with 9 features and 13 fields
#> Geometry type: POLYGON
#> Dimension: XY
#> Bounding box: xmin: -71.80933 ymin: 18.57668 xmax: -71.33201 ymax: 18.69931
#> Geodetic CRS: WGS 84
#> # A tibble: 9 × 14
#> assetid `prec-2018-01-01` `prec-2018-02-01` `prec-2018-03-01`
#> <int> <dbl> <dbl> <dbl>
#> 1 1 21.1 21.4 61.4
#> 2 2 25.9 24.5 62.0
#> 3 3 24.5 24.2 58.4
#> 4 4 32.1 28.1 72.8
#> 5 5 28.7 28.4 68.4
#> 6 6 17.6 19.8 54.4
#> 7 7 24.5 24.3 69.9
#> 8 8 19.4 21.1 62.8
#> 9 9 14.2 17.5 52.2
#> # ℹ 10 more variables: `prec-2018-04-01` <dbl>, `prec-2018-05-01` <dbl>,
#> # `prec-2018-06-01` <dbl>, `prec-2018-07-01` <dbl>, `prec-2018-08-01` <dbl>,
#> # `prec-2018-09-01` <dbl>, `prec-2018-10-01` <dbl>, `prec-2018-11-01` <dbl>,
#> # `prec-2018-12-01` <dbl>, x <POLYGON [°]>
We can now use st_write()
to save this wide-format sf
object to a geospatial format of our choice or drop the geometry
information altogether and dump the data into a csv.
st_write(port_wide_3, dsn = tempfile(fileext = ".gpkg"))
#> Writing layer `file25be52eddcc9' to data source
#> `/tmp/RtmpXlQGOZ/file25be52eddcc9.gpkg' using driver `GPKG'
#> Writing 9 features with 13 fields and geometry type Polygon.
write.csv(st_drop_geometry(port_wide_3), file = tempfile(fileext = ".csv"))