Introduction
If you are reading this vignette you are most probably to contribute to the mapme.biodiversity package. This is great news and we are very happy to receive Pull-Requests extending the package’s functionality! Below you will receive important in-depth information about how to add resources and indicators to make the process as seamless as possible for both you and the package’s maintainers. Please make sure to read and understand this guide before opening a PR. If in doubt, especially if you feel that the framework does not support your specific use case, always feel free to raise an issue and we will happily discuss how we can support your ideas. If you have not already done so, make sure to read Terminology vignette to get familiar with the most important concepts of this package.
Note that we use the tidyverse
style guide for the package. That specifically means that function
and variable names should follow the snake case pattern. We also use the
arrow assignment operator (<-
). When submitting a PR
that does not consistently follow the tidyverse style guide, the
maintainers of the package might change the code to adhere to this code
style without further notice before accepting the PR.
Getting started
Ideally, you clone the GitHub repository via the git command in a command line on Linux and MacOS systems or via the GitHub Desktop application on Windows. On Linux, the command would look like this:
git clone https://github.com/mapme-initiative/mapme.biodiversity
We do not accept pushes to main, thus the first step would be to
create a specific branch for your extension. In this tutorial, we will
pretend to reimplement the soilgrids
resources and the
associated soilproperties
indicator, so that we create a
branch reflecting this. Don’t forget to check out to the newly created
branch!
git branch add-soilgrid-indicators
git checkout add-soilgrid-indicators
Below, we will assume that you develop your extension to the package
in R Studio. The general guidelines to follow also apply if you choose
different tooling for your development process, however, it will not be
covered in this vignette. We assume that all R development dependencies
for the state of the package when you used the git clone
command are installed. The easiest way to ensure this is using devtools
when in the package’s directory:
devtools::install_dev_deps()
Adding a resource
Checklist
-
Add the new resource to
R/resources_backlog.R
following the standardized template -
Create a file for all necessary
code to download your resource (
R/get_resource_name.R
) - Include roxygen documentation for your resource following the provided template
- Check user-specified arguments (if any) for correctness
- Retrieve portfolio-wide parameters of interest for your resource from the portfolio
- Match the spatio-temporal extent of the portfolio with your resource
-
Provide your own download
functionality or use
.download_or_skip()
- Delete any intermediate files that are no longer required
- Return the absolute file paths for all matching resource files
-
Write a testthat script testing
all the newly added functionality (except the actual download) and write
it to
test/testthat/test-get_resource_name.R
-
Add a small example data set of
your resource to
inst/res/resource_name/
- Added a new dependency? Make sure to include a supporting statement for that dependency in your PR!
Introducing a new resource to the backlog
A resource is a supported dataset that can be downloaded from a
user’s perspective
via the get_resources()
function. Currently, the package
supports only raster and vector resources. If you wish to submit the
support of a new resource, please be aware that we will only accept new
resources if they are associated with at least one indicator
calculation. The very first step to adding a resource is to add it into
the internal resource backlog function so that the package is aware of
its existence. Once checkout to the new branch and having the project
opened in R Studio, issue the following command to open the resource
backlog file:
file.edit("R/resources_backlog.R")
This file keeps track of all supported resources in a list object.
You will see that each resource shares a common structure and how it is
specified. The name of the list object will be the name the package uses
to identify a specific resource. Most importantly, the type argument
specifies whether a resource is of type ‘raster’ or ‘vector’. If
applicable, the source argument shall contain an URL pointing to a
webpage documenting the resource. The downloader argument is the package
internal function name that is used to download the resource. This
function is the most important code file for a new resource. Then,
arguments and their default value to govern the download process can be
specified. If no additional arguments are needed just enter an empty
list. For the soilgrids
resource, the internal backlog
looks like this (don’t spend too much effort in understanding the
arguments just yet. These will become clearer when we “write” the
downloader. When contributing a new resource, it is usually an iterative
process between the backlog and the downloader to
y those arguments that need to be specified by users):
soilgrids <- list(
type = "raster",
source = "https://www.isric.org/explore/soilgrids",
downloader = ".get_soilgrids",
arguments = list(
layers = "clay",
depths = "0-5cm",
stats = "mean"
)
)
With the resource being backlogged, the package now can find a
resource called soilgrids
of type raster
and
it can also identify the downloader function In this specific case, the
package can also determine the default values of three arguments in case
users did not specify anything. This is important information that will
determine how the get_resources()
function works when
called by users.
Documenting the new resource
By convention, the filename of a downloader MUST
start with get_<resource_name>.R
appended by the name
of the resource. In the case of the soilgrids resource that translates
to get_soilgrids.R
. In the first part of such a downloader,
make sure to include detailed documentation. This documentation should
explain what this resource represents, where it comes from (including a
citation), and the arguments users should specify to control what is
downloaded. Importantly, this documentation MUST
receive the roxygen tag @docType data
as well as the
@keywords resource
tag, so that the documentation can be
identified as a resource. The NULL
value below the
documentation MUST be included. Below is a template
that should be used for documenting a resource.
#' Short title
#'
#' One or more description paragraphs might follow here. Please describe
#' required user arguments here, ideally as itemized lists.
#'
#' @name <the short name of your resource, same as in the backlog>
#' @docType data <we document resources as a dataset>
#' @keywords resource <identifies the documentation as a resource>
#' @format <one sentence on data format and spatial extent>
#' @references <ideally a citable scientific publication>
#' @source <a link in the \url{} tag linking to an online documentation>
NULL
Function inputs for resources
After documenting the resource, you can get started with implementing
the actual downloader. The downloader is a package’s internal function
that users do not directly interact with. By convention, we append
package internal function names with a dot. Similar to the filename
itself, resource downloaders should start with
.get_<resource_name>
. The first argument is always
x
, which corresponds to the portfolio object. Important
attributes (e.g. the spatial-temporal extent) can be derived from this
object. Then additional user-facing arguments might follow. After these
arguments, each resource downloader receives the argument
rundir
which by default should point to the output of
tempdir()
, but will be pointing to an output directory on
disk where the output shall be written to when used by users.
Additionally, a logical called verbose
, by default set to
TRUE, controls the verbosity of the downloader as well as the dots
argument. For the soilgrids
resource, the function header
thus looks like this:
.get_soilgrids <- function(x, layers, depths, stats,
rundir = tempdir(),
verbose = TRUE,
...) {
# downloader coder goes here
}
Check arguments and retrieval of portfolio-wide parameters
Before actually conducting any downloads, it is important that you as
the provider of the new resource check extensively that all required
arguments were correctly specified. That specifically applies to the
user-defined variables that your downloader requires. The package
framework cannot check for the correctness of these arguments. That is
something that each downloader has to take care of. If some arguments
are wrongly specified, the function should fail (via
stop()
) and gracefully inform users which arguments where
misspecified and which values represent valid values. You can head over
to the soilgrids downloader (use
file.edit("R/get_soilgrids.R")
) and analyse the first few
lines of the file (up to line ~130) to see how the inputs are checked
for the soilgrids resource.
Some portfolio-wide parameters that might be important to your
specific downloader can be determined by analysing the x
portfolio attributes. Currently, the following attributes with regard to
a resource download are set, when users initialize their portfolio:
attr(x, "nitems") <- nrow(x)
attr(x, "bbox") <- st_bbox(x)
attr(x, "years") <- years
attr(x, "aria_bin") <- aria_bin
Your resource downloader should take care that with these user-specified arguments and the portfolio-wide parameters the files matching the spatio-temporal extent of the portfolio are downloaded. These can be queried with the following syntax with the temporal extent of the portfolio as an example:
attributes(x)$years
Using helper functions
If you construct several URLs and associated local filenames that you wish to iterate over, the package provides a helper function to download these files and skip already existing ones. The following code snippet shows how to use that function:
aria_bin <- attributes(x)$aria_bin
verbose <- attributes(x)$verbose
.download_or_skip(urls = source_urls, filenames = target_filenames,
verbose = verbose, stubbornnes = 6, check_existencs = TRUE,
aria_bin = aria_bin)
This function will attempt to download the specified URLs to the
corresponding local filenames. URLs for which a corresponding filename
already exists will be skipped and information about this will be issued
as a message if verbose is set to TRUE. The stubbornness controls the
number of retries of failed downloads. If check_existence is set to
TRUE, RCurl::url.exists(url)
will be used to check if a
given URL exists online. This check takes some time to run, but for some
resources, it can be useful if you cannot be sure if the constructed URL
exists on the remote location. Users are enabled to specify an
executable aria2c installation to support parallel downloads. The value
of the aria_bin variable will be NULL
if no valid
executable has been specified. Otherwise, the downloader will use the
aria2c program to download the specified URLs.
You are free and also encouraged to develop helper functions for your
resource to increase the understandability and ease maintenance of your
downloader. Any helper functions associated with your resource should be
located in the same file. If you feel a helper function might serve a
purpose across different resources, feel free to raise this in a comment
and we can consider moving it to R/utils.R
. Since helper
functions are internal to the package, they MUST start
with a dot. Package internal functions do not require a roxygen
documentation. If you wish to include documentation to make your code
easier to understand, make sure to add the
@keywords internal
and @noRd
tags to your
functions.
Defining the output and handling intermediate files
You can create and delete intermediate files and directories within
rundir
. The expected output of a resource downloader is a
character vector including the local file paths to all target files
matching the spatio-temporal extent of the portfolio x
. For
raster resources, a tile index indicating the location on earth for each
raster file and its file path will be constructed and added as a
resource to the portfolio. Vector resources are expected to be
translated to GeoPackages (.gpkg
) and will be appended to
the portfolio as is.
Adding sample resource for package internal testing
We ask you to provide a small subset of your resource to
inst/res/resource_name
so that indicators that depend on
the resource can be tested without the need to actual download the
resource. Because there are some restrictions to the final size of the
package, we ask you to put substantial effort in reducing the size of
the files to a minimum. This includes cropping all resource samples to
the spatial extent of the polygon provided in
inst/extdata/sierra_de_neibe_478140.gpkg
or a polygon of
similar size supplied by you in case it the spatial extent does not
intersect with your resource. For raster resources, if the original
raster is encoded as float, consider changing the data type to integer
by introducing a scale factor. Also, please use a compression algorithm
to further reduce the file size. For vector resources, consider reducing
the number of vertices in case the geometries are very complex.
A note on dependencies for resources
Note, that a resource SHALL NOT add additional dependencies to the package. If you add dependencies we require you to add a supporting statement to your PR explaining why these dependencies are needed and why other approaches would fail. Before accepting your PR, we might request you to change your code to remove these dependencies, if it is feasible to achieve the same functionality without.
Adding an indicator
The process of adding an indicator is very similar to the one for resources. However, some input-output requirements are actually different. Note, that in case that you added a new resource we also expect a new indicator taking advantage of that resource in your PR. As you will see, there are two new important concepts to have in mind when adding an indicator. These are the processing mode and computational engines. We will briefly explain these concepts below, however, you can also head over to the Terminology vignette if you are interested in a more comprehensive definition of these two terms.
Checklist
-
Add the new indicator to
R/indicators_backlog.R
following the standardized template -
Create a file for all necessary
code compute your indicator (
R/calc_indicator_name.R
) - Include roxygen documentation for your indicator following the provided template
- Check user-specified arguments (if any) for correctness
- Retrieve portfolio-wide parameters of interest for your indicator from the asset/portfolio
- Implement different computation engines for your indicator
- If applicable, implement both, asset and portfolio based processing modes
- Return a tibble in long format (no variables “hidden” in column names)
-
Write a testthat script testing
all the newly added functionality write it to
test/testthat/test-calc_indicator_name.R
, use snapshots to check for the correctness of numeric outputs - Added a new dependency? Make sure to include a supporting statement for that dependency in your PR, check if the dependency is installed, and add it to the Suggests field in the DESCRIPTION file
Introducing a new indicator to the backlog
An indicator is a logical routine depending on one or more resources
that extracts numeric outputs for all assets in a portfolio. From a
user’s perspective,
indicators are processed via the calc_indicators()
function. We realized, that for large (potentially global) portfolios,
depending on the spatial resolution of a resource, different processing
modes substantially decrease the time needed for a computation. For high
to medium resolution raster resources, processing on the asset level
benefits computation time. However, spatially cropping coarse resolution
datasets for a high number of assets introduces significant overhead,
thus processing these resources on a portfolio level makes more
sense.
You are asked to provide the most sensible approach to your indicator when you submit it to the internal backlog. When adding an indicator there, the package is aware made aware of its existence. You can issue the following command to open the resource backlog file (or open it manually):
file.edit("R/indicator_backlog.R")
This file keeps track of all supported indicators in a list object.
You will see that each indicator shares a common structure. The name of
the list object will be the name the package uses to identify a specific
indicator. Note, that the name of an indicator MUST NOT
be equal to any other indicator or resource. The processor
argument specifies the name of the function that you will provide for
the indicator calculation. The inputs
argument refers to
the supported resources that are required inputs for your indicator. As
a requirement, at least one resource needs to be specified, but your
indicator can also depend on more resources. We also ask you to provide
the type of the resource (e.g. raster
or
vector
). In the arguments
list object, you can
specify any additional arguments that users need to specify if they want
to call your indicator function. If there are no additional arguments,
simply add an empty list. Please put sensible default values in case
your indicator function requires some arguments. Finally, the
processing_mode
governs what your indicator functions will
receive as inputs. In case you set it to asset
, each call
to your indicator function will receive a single asset and the required
resources crop to the spatial extent of that asset. If you set it to
portfolio
the function will receive the whole portfolio
object and all required resource at the spatial extent of the portfolio.
Below, you will find an example for the precipitation
indicator:
precipitation <- list(
processor = ".calc_precipitation",
inputs = list(chirps = "raster"),
arguments = list(
scales_spi = NULL,
engine = "extract"
),
processing_mode = "portfolio"
)
If neither of the two processing modes lead to satisfactory processing times for your indicator, please leave an issue/comment to discuss the addition of another processing mode with the maintainers of the package.
Documenting the new indicator
By convention, the filename of a indicator processor
MUST start with calc_indicator_name.R
appended by the name of the resource. In the case of the precipitation
indicator that translates to calc_precipitation.R
. In the
first part of such a indicator processor, make sure to include detailed
documentation. This documentation should explain how it derives its
numeric output, which resources are required for its calculation, and
the arguments users should specify to control its functioning.
Importantly, this documentation MUST receive the
roxygen tag @docType data
as well as the
@keywords indicator
tag, so that the documentation can be
identified as an indicator The NULL
value below the
documentation MUST be included. Below is a template
that should be used for documenting an indicator
#' Short title
#'
#' One or more description paragraphs might follow here. Please describe
#' required resource and user arguments here, ideally as itemized lists.
#' Please document which processing engines are available for your indicator
#' and briefly describe how the indicator is derived from its inputs.
#'
#' @name <the short name of your indicator, same as in the backlog>
#' @docType data <we document indicators as a dataset>
#' @keywords indicator <identifies the documentation as an indicator>
#' @format <one sentence on number of columns, columns names of ouput tibble>
NULL
Function inputs for indicators
After documenting the indicator, you can get started with
implementing the actual processor. An indicator processor is a package’s
internal function that users do not directly interact with. By
convention, we append package internal function names with a dot.
Similar to the filename itself, indicator processors should start with
.calc_indicator_name>
. The first argument and indicator
processor receives is always shp
, which corresponds to a
single asset of the portfolio if you specified
processing_mode
as "asset"
or the entire
portfolio object if you specified "portfolio"
. Important
attributes (e.g. the spatial-temporal extent) can be derived from this
object, irrespective if it represents a single asset or the whole
portfolio. Then, one or more of the required resource are to be
specified with the exact names as they are included in the resource
backlog. Again, in case the processing_mode
is set to
"asset"
, these resource will be spatial cropped to the
extent of the single asset, in case it is set to
"portfolio"
the complete spatial extent of the portfolio is
included. After the resource, any user-defined additional arguments will
follow. You can assume that if users did not specify an argument, the
default values from the indicator backlog will be inserted instead.
Note, that we define engine
as an argument which is set by
the users in order to give them more fine-control how the output is
computed. We will have a close look at engines in below.
All indicator functions also receive the argument
rundir
, where intermediate files can be written to. You do
not have to take care of cleaning that directory, since the framework
will clean up after the processing is done. Also, a logical controlling
the verbosity is handed over that you should use to decide whether or
not to print additional informative messages. For raster resources, we
included a logical todisk
governing if intermediate raster
files shall be kept in memory or written to disk. Currently, the
decision has to be handled by your indicator processor, however, we are
evaluating possibilities to determine the behaviour within the package
itself. Last, the processing mode is specified. This can be helpful if
you wish to supply your processor with the possibility to support either
of the two modes. For the precipitation
indicator, the
function header will look like this:
.calc_precipitation <- function(shp,
chirps,
scales_spi = NULL,
engine = "extract",
rundir = tempdir(),
verbose = TRUE,
todisk = FALSE,
processing_mode = "portfolio",
...) {
# processor logic goes here
}
Check arguments and retrieval of portfolio-wide parameters
Before actually conducting any computation, it is important that you
as the provider of the new indicator check extensively that all required
arguments were correctly specified. That specifically applies to the
user-defined variables that your processor requires. The package
framework cannot check for the correctness of these arguments. If some
arguments are wrongly specified, the function should fail (via
stop()
) and gracefully inform users which arguments where
misspecified and which values represent valid values. You can head over
to the precipitation processor (use
file.edit("R/calc_precipitation.R")
) and analyse the first
few lines of the file (up to line ~76) to see how the inputs are
checked. Also note, that in case the required resource are
NULL
(that is the default value if an asset does not
intersect with a resource), or if any other configurations (e.g. years
only smaller 1981 for the case of the precipitation indicator) prevent a
sensible processing of an indicator, we simply return NA
.
The package will then fill in the tibble values for that asset with
NA
, so users now that the given indicator could not be
calculated there.
Some portfolio-wide parameters that might be important to your
specific indicator routine, and they can be derived via the
attributes()
function, e.g. for the years attribute:
years = attributes(shp)$years
Using helper functions
There are some package internal helper function that we found to be
of use for multiple indicators that you are free to use in your
indicator processor. You will find them in R/utils.R
. These
helpers currently are:
- .check_available_years(): Checks if a given target year vector is available for a given indicator
- .check_engine(): Checks if a user-specified engine is available
- .check_stats(): Stats if a user-specified zonal statistic is available
You are encouraged to write your own helper function that are needed
for your indicator processor. These should be located in the same file
as the main processor, start with a dot and should not be exported. If
you wish to include roxygen documentation for your helpers, make sure to
add the @keywords internal
and @noRd
tags to
your functions. If you feel that one or more of your helper functions
would be of benefit to more that just one indicator, please comment in
and issue/pull-request to discuss with the package maintainers if your
helper function could be moved to R/utils.R
.
Adding engines to your indicator processor
In writing this package we realized that depending on the structure of a portfolio (i.e. the number of assets, their size and geometric complexity), different engines might lead to better processing times. We thus included three different engines for most of our indicators, and we would invite you to do the same for your contribution. Engines are mostly used in the very last step of an indicator calculation, that is when some kind of zonal statistics are calculated for a specific asset. The currently used engines are:
- terra::extract(): Takes a SpatRaster and a SpatVector as input and computes a zonal statistic for all pixels within the SpatVector
- terra::zonal(): Takes two SpatRasters as input, one the with the target variable(s) the other representing the rasterized input polygon. Then a zonal statistic for the pixels that correspond to the asset extent is calculated
- exactextractr::exact_extract(): Takes a SpatRaster and an sf-object as input and calculates a zonal statistic. It is implemented in C, thus promising fast processing even for very large extents.
If you wish to include another processing engine for your indicator,
please indicate this in a comment so that it can be discussed with the
packages maintainers. Note, that indicators ideally should not add new
dependencies if possible. If they do, please add a supporting statement
why this dependency is necessary for your indicator. We also ask you to
add dependencies to the Suggests
field of the
DESCRIPTION
file and that you check whether or not this
dependency is installed at the beginning of your indicator routine.
Defining the output of indicator functions
Indicator functions, should return a tibble in long format as their
output without “hiding” any variables in column names. Beside from that
requirement, the output of your indicator does not need to follow any
specific shape, except that columns shall be equal across all assets. In
case that you cannot calculate the indicator for a specific indicator
(e.g. because the extents do not overlap), simply return
NA
. The package will handle these values internally and
fill in NA
a single row for that asset with the same column
names as any other assets, its values set to NA
.