Skip to contents
library(sf)
#> Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(mapme.biodiversity)

Objectives

This tutorial gives you some information how to transform the output of the mapme-biodiversity package to a wide format for exchange with other (geospatial-)software, such as QGIS. This is necessary because the package uses the so-called nested-list format by default to support very different indicators that each may come in its own format. However, this format is specific to R and to use the data in other software thus requires some additional steps to be taken. This vignette thus shows you how you can obtain a wide-format for your portfolio that you can easily export and use with other software.

What are long vs. wide tables?

Tabular data can be structured in two different ways, which are usually referred to as long and wide format. Most people are more familiar with the wide format, because this is a format we as humans would naturally structure our data when we work with spreadsheets, e.g. in Excel. In wide-format, the identifier of a observation is included exactly once and does not repeat itself (see Table A). In the long format, the identifier as well as other qualifying variables, might be repeated several times to uniquely identify each observation is a single row (see Table B). The long format is often required when interacting with computers, e.g. to make plots with ggplot2. The content of the two is exactly the same either way, the one might be just more friendly to humans than to computers. If you are familiar with the R tidyverse, you might also have heard of the term tidy data. In terms of tabular data you can imagine tidy data to be referring to data in a long table which naturally fulfills the following requirements:

  • each variable has its own column
  • each observation has its own row
  • each value has its own cell
Fig. 1: Example of a wide and long table holding the same data.
Fig. 1: Example of a wide and long table holding the same data.

Table A, in that sense, is not tidy since the year variable is not found in its own column but instead it is scattered in two different columns. Table B is a long format with each variable being found in exactly one column. In that sense, each individual row represents exactly one observation, meaning the observation of a specific country in a specific year.

When we structure data in the long format the objects will usually have a larger memory footprint than compared to a wide format. For smaller objects or data types with small memory consumption, this might not pose a serious limitation to the workflow. However, geometry information, here indicated as a WKT string, might quickly accumulate a large proportion of the available memory, even more so if the portfolio consists of a high number of complex geometries that are then copied to fit the the long-format requirement. For that reason, this packages uses a nested-list format to hold tables for indicators as single columns within the portfolio. The remainder of this tutorial will show you in more detail how you can work with those R specific data format.

The simple case - single-row indicators

We start by reading a GeoPackage from disk. For the sake of the argument, we split the original single polygon into 9 distinct polygons to simulate a realistic portfolio consisting of multiple assets.

aoi <- read_sf(
  system.file("extdata", "sierra_de_neiba_478140_2.gpkg",
    package = "mapme.biodiversity"
  )
)
aoi <- st_as_sf(st_make_grid(aoi, n = 3))
print(aoi)
#> Simple feature collection with 9 features and 0 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -71.80933 ymin: 18.57668 xmax: -71.33201 ymax: 18.69931
#> Geodetic CRS:  WGS 84
#>                                x
#> 1 POLYGON ((-71.80933 18.5766...
#> 2 POLYGON ((-71.65022 18.5766...
#> 3 POLYGON ((-71.49111 18.5766...
#> 4 POLYGON ((-71.80933 18.6175...
#> 5 POLYGON ((-71.65022 18.6175...
#> 6 POLYGON ((-71.49111 18.6175...
#> 7 POLYGON ((-71.80933 18.6584...
#> 8 POLYGON ((-71.65022 18.6584...
#> 9 POLYGON ((-71.49111 18.6584...

As a simple example, suppose we interested in extracting the average traveltime to cities between 20,000 and 50,000 inhabitants in our portfolio. As usual, we will have to make available the Nelson et al. resource as well as requesting the calculation of the respective indicator.

mapme_options(outdir = system.file("res", package = "mapme.biodiversity"))

port_w_nelson <- get_resources(aoi, get_nelson_et_al(ranges = "20k_50k"))
#> Skipping existing files in output directory.
port_w_traveltime <- calc_indicators(port_w_nelson, calc_traveltime(stats = "mean"))
#> Found a column named 'assetid'. Overwritting its values with a unique identifier.
print(port_w_traveltime)
#> Simple feature collection with 9 features and 2 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -71.80933 ymin: 18.57668 xmax: -71.33201 ymax: 18.69931
#> Geodetic CRS:  WGS 84
#> # A tibble: 9 × 3
#>   assetid traveltime                                                           x
#>     <int> <list>                                                   <POLYGON [°]>
#> 1       1 <tibble [1 × 2]> ((-71.80933 18.57668, -71.65022 18.57668, -71.65022 …
#> 2       2 <tibble [1 × 2]> ((-71.65022 18.57668, -71.49111 18.57668, -71.49111 …
#> 3       3 <tibble [1 × 2]> ((-71.49111 18.57668, -71.33201 18.57668, -71.33201 …
#> 4       4 <tibble [1 × 2]> ((-71.80933 18.61756, -71.65022 18.61756, -71.65022 …
#> 5       5 <tibble [1 × 2]> ((-71.65022 18.61756, -71.49111 18.61756, -71.49111 …
#> 6       6 <tibble [1 × 2]> ((-71.49111 18.61756, -71.33201 18.61756, -71.33201 …
#> 7       7 <tibble [1 × 2]> ((-71.80933 18.65844, -71.65022 18.65844, -71.65022 …
#> 8       8 <tibble [1 × 2]> ((-71.65022 18.65844, -71.49111 18.65844, -71.49111 …
#> 9       9 <tibble [1 × 2]> ((-71.49111 18.65844, -71.33201 18.65844, -71.33201 …

As we can observe from the output, a new column was added to the sf object. It is called traveltime and is of type list indicating that it represents a nested-list column. What that means is that we are able to maintain the rectangular shape of our original data (e.g. one polygon per row), while supporting arbitrarily shaped outputs for indicators. Let’s observe how the traveltime indicator looks like in this instance:

print(port_w_traveltime$traveltime[[1]])
#> # A tibble: 1 × 2
#>   minutes_mean distance
#>          <dbl> <chr>   
#> 1         49.8 20k_50k

From the syntax above, you can see how you can access a single object within the nested list column (e.g. by using the list accessor [[). In this case, the shape of the traveltime indicator is a single-row and two-column tibble with the average minutes and the distance category as its value. Because the indicator comes as a single row, we can simply unnest the traveltime column to get to a wide-format output:

port_wide_1 <- tidyr::unnest(port_w_traveltime, traveltime)
print(port_wide_1)
#> Simple feature collection with 9 features and 3 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -71.80933 ymin: 18.57668 xmax: -71.33201 ymax: 18.69931
#> Geodetic CRS:  WGS 84
#> # A tibble: 9 × 4
#>   assetid minutes_mean distance                                                x
#>     <int>        <dbl> <chr>                                       <POLYGON [°]>
#> 1       1         49.8 20k_50k  ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#> 2       2        104.  20k_50k  ((-71.65022 18.57668, -71.49111 18.57668, -71.4…
#> 3       3        112.  20k_50k  ((-71.49111 18.57668, -71.33201 18.57668, -71.3…
#> 4       4         84   20k_50k  ((-71.80933 18.61756, -71.65022 18.61756, -71.6…
#> 5       5        139.  20k_50k  ((-71.65022 18.61756, -71.49111 18.61756, -71.4…
#> 6       6        104.  20k_50k  ((-71.49111 18.61756, -71.33201 18.61756, -71.3…
#> 7       7        141.  20k_50k  ((-71.80933 18.65844, -71.65022 18.65844, -71.6…
#> 8       8        118.  20k_50k  ((-71.65022 18.65844, -71.49111 18.65844, -71.4…
#> 9       9         73.1 20k_50k  ((-71.49111 18.65844, -71.33201 18.65844, -71.3…

In this case, the result has 9 rows just like the original data frame. In this particular scenario, where the only variable is minutes_mean, the object port_wide_1 contains just one value per polygon. This aligns well with the characteristics of the wide format, which can be easily understood and processed by various (geospatial) software. So, we can now export our portfolio for example as a geopackage that can be read directly into e.g. QGIS or as a CSV without the geospatial information:

st_write(port_wide_1, dsn = tempfile(fileext = ".gpkg"))
#> Writing layer `file2572d286bf1' to data source 
#>   `/tmp/Rtmpiejr47/file2572d286bf1.gpkg' using driver `GPKG'
#> Writing 9 features with 3 fields and geometry type Polygon.
write.csv(st_drop_geometry(port_wide_1), file = tempfile(fileext = ".csv"))

The harder case - indicators with multi-row output

Let’s continue to query an indicator which has a multi-row output, i.e. precipitation statistics from WorldClim.

port_w_wc <- get_resources(aoi, get_worldclim_precipitation(years = 2010:2018))
#> Skipping existing files in output directory.
port_w_prec <- calc_indicators(port_w_wc, calc_precipitation_wc(stats = "mean"))
#> Found a column named 'assetid'. Overwritting its values with a unique identifier.
print(port_w_prec)
#> Simple feature collection with 9 features and 2 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -71.80933 ymin: 18.57668 xmax: -71.33201 ymax: 18.69931
#> Geodetic CRS:  WGS 84
#> # A tibble: 9 × 3
#>   assetid precipitation_wc                                                     x
#>     <int> <list>                                                   <POLYGON [°]>
#> 1       1 <tibble [12 × 2]> ((-71.80933 18.57668, -71.65022 18.57668, -71.65022…
#> 2       2 <tibble [12 × 2]> ((-71.65022 18.57668, -71.49111 18.57668, -71.49111…
#> 3       3 <tibble [12 × 2]> ((-71.49111 18.57668, -71.33201 18.57668, -71.33201…
#> 4       4 <tibble [12 × 2]> ((-71.80933 18.61756, -71.65022 18.61756, -71.65022…
#> 5       5 <tibble [12 × 2]> ((-71.65022 18.61756, -71.49111 18.61756, -71.49111…
#> 6       6 <tibble [12 × 2]> ((-71.49111 18.61756, -71.33201 18.61756, -71.33201…
#> 7       7 <tibble [12 × 2]> ((-71.80933 18.65844, -71.65022 18.65844, -71.65022…
#> 8       8 <tibble [12 × 2]> ((-71.65022 18.65844, -71.49111 18.65844, -71.49111…
#> 9       9 <tibble [12 × 2]> ((-71.49111 18.65844, -71.33201 18.65844, -71.33201…

We see that similar to the traveltime indicator, the output consists of 9 rows with the precipitation indicator as a single nested-list column. Note, however, the differences when we take a look at the shape of the indicator for a specific asset:

print(port_w_prec$precipitation_wc[[1]])
#> # A tibble: 12 × 2
#>    prec_mean date      
#>        <dbl> <date>    
#>  1      21.1 2018-01-01
#>  2      21.4 2018-02-01
#>  3      61.4 2018-03-01
#>  4      83.6 2018-04-01
#>  5     331.  2018-05-01
#>  6      60.3 2018-06-01
#>  7      74.0 2018-07-01
#>  8      94.0 2018-08-01
#>  9     207.  2018-09-01
#> 10     126.  2018-10-01
#> 11      52.6 2018-11-01
#> 12      20.2 2018-12-01

For a single asset, we obtain a tibble with 12 rows (for each month in the queried year 2018). Now, if we just simply unnest this indicator column observe what happens to the shape of the output:

port_wide_2 <- tidyr::unnest(port_w_prec, precipitation_wc)
print(port_wide_2)
#> Simple feature collection with 108 features and 3 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -71.80933 ymin: 18.57668 xmax: -71.33201 ymax: 18.69931
#> Geodetic CRS:  WGS 84
#> # A tibble: 108 × 4
#>    assetid prec_mean date                                                      x
#>      <int>     <dbl> <date>                                        <POLYGON [°]>
#>  1       1      21.1 2018-01-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#>  2       1      21.4 2018-02-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#>  3       1      61.4 2018-03-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#>  4       1      83.6 2018-04-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#>  5       1     331.  2018-05-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#>  6       1      60.3 2018-06-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#>  7       1      74.0 2018-07-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#>  8       1      94.0 2018-08-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#>  9       1     207.  2018-09-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#> 10       1     126.  2018-10-01 ((-71.80933 18.57668, -71.65022 18.57668, -71.6…
#> # ℹ 98 more rows

Instead of 9 rows, we get a tibble with 108 rows (9 assets * 12), with the metadata for each asset as well as the geometry column being repeated. This is not desirable, especially for portfolios with large numbers of assets where the repetition of the geometry data can cause very large objects. Now, let’s investigate how we still are able to get a wide output. A valid approach is to re-shape the indicators per asset to a wide format before unnesting the indicator column. For this, let’s write a function which takes the precipitation indicator of a single asset as input and transforms it to a wide-format.

prec2wide <- function(data) {
  pivot_wider(data, names_from = date, values_from = prec_mean, names_prefix = "prec-")
}
prec2wide(port_w_prec$precipitation_wc[[1]])
#> # A tibble: 1 × 12
#>   `prec-2018-01-01` `prec-2018-02-01` `prec-2018-03-01` `prec-2018-04-01`
#>               <dbl>             <dbl>             <dbl>             <dbl>
#> 1              21.1              21.4              61.4              83.6
#> # ℹ 8 more variables: `prec-2018-05-01` <dbl>, `prec-2018-06-01` <dbl>,
#> #   `prec-2018-07-01` <dbl>, `prec-2018-08-01` <dbl>, `prec-2018-09-01` <dbl>,
#> #   `prec-2018-10-01` <dbl>, `prec-2018-11-01` <dbl>, `prec-2018-12-01` <dbl>

From the example output above we see that our custom prec2wide() functions takes a single tibble as input, puts in the dates appended by prec- as column names and the associated average precipitation values as a single row. Now, we can use functionality of the purrr package to map this function over our indicator column, save the result in a new column (or simply overwrite the original), and finally simply call unnest to get the output in a wide format.

port_w_prec$precipitation_wc2 <- purrr::map(port_w_prec$precipitation_wc, prec2wide)
port_w_prec <- select(port_w_prec, precipitation_wc2)
port_wide_3 <- unnest(port_w_prec, precipitation_wc2)

print(port_wide_3)
#> Simple feature collection with 9 features and 12 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -71.80933 ymin: 18.57668 xmax: -71.33201 ymax: 18.69931
#> Geodetic CRS:  WGS 84
#> # A tibble: 9 × 13
#>   `prec-2018-01-01` `prec-2018-02-01` `prec-2018-03-01` `prec-2018-04-01`
#>               <dbl>             <dbl>             <dbl>             <dbl>
#> 1              21.1              21.4              61.4              83.6
#> 2              25.9              24.5              62.0              82.8
#> 3              24.5              24.2              58.4              83.9
#> 4              32.1              28.1              72.8              80.9
#> 5              28.7              28.4              68.4              83.3
#> 6              17.6              19.8              54.4              83.6
#> 7              24.5              24.3              69.9              80.1
#> 8              19.4              21.1              62.8              86.8
#> 9              14.2              17.5              52.2              85.6
#> # ℹ 9 more variables: `prec-2018-05-01` <dbl>, `prec-2018-06-01` <dbl>,
#> #   `prec-2018-07-01` <dbl>, `prec-2018-08-01` <dbl>, `prec-2018-09-01` <dbl>,
#> #   `prec-2018-10-01` <dbl>, `prec-2018-11-01` <dbl>, `prec-2018-12-01` <dbl>,
#> #   x <POLYGON [°]>

We can now use st_write() to save this wide-format sf object to a geospatial format of our choice or drop the geometry information altogether and dump the data into a csv.

st_write(port_wide_3, dsn = tempfile(fileext = ".gpkg"))
#> Writing layer `file257258075579' to data source 
#>   `/tmp/Rtmpiejr47/file257258075579.gpkg' using driver `GPKG'
#> Writing 9 features with 12 fields and geometry type Polygon.
write.csv(st_drop_geometry(port_wide_3), file = tempfile(fileext = ".csv"))