InVEST Model Specification and Implementation Recommendations

Owner: James Douglass jdouglass@stanford.edu

While InVEST is a suite of models that each have their own requirements, it is helpful for us to have some shared development standards. What is presented here is the current state of those standards, and it will continue to be updated as our understanding evolves. This guidance is not intended to be prescriptive or formulaic, nor can it possibly handle all of the cases that will come up when developing software. But it does offer a few defaults and things to consider based on the use cases we have regularly encountered in our software development. And if anything doesn't make sense or should be updated, let's talk about it!

Functional Requirements of an InVEST Model

An InVEST model should:

  • Solve an interesting scientific problem
  • Run in reasonable time with respect to the size of the data (algorithmically as well as in wall-clock time)
  • Use a reasonable amount of disk space and memory
  • Be documented in the InVEST User's Guide, with documentation reviewed by the science lead for the model
  • Have sample data and a datastack parameter set in the invest-data repository and accessible by the Windows Installer
  • Have documented functions (see PEP 257). These can be programmatically verified with pydocstyle (pip install pydocstyle) and docstring RST syntax can be verified with the RST docstring linter (pip install flake8-rst-docstrings).
  • Use PEP 8 for code style where reasonable. This can be programmatically validated with pycodestyle (pip install pycodestyle).
  • Use taskgraph where there would be statistically significant gains in runtime performance for the use case of the model (it's better to avoid the complexity unless it helps.)

InVEST Model Implementation Notes

Supported versions of Python

InVEST development should, at a minimum, be able to import and execute on whichever two versions of Python 3.x are currently supported by the Python core developers. Support for additional versions can be considered as needed. Python 2.7 is no longer supported by InVEST as of InVEST 3.8.0.

Where to save a model's source code

Most models can be contained within a single python file, at src/natcap/invest/<model>.py. Models with compiled components (*.pyx, *.c, *.cpp), resource files (*.js, *.png), or multiple tools (such as preprocessors) might have their own subpackage of natcap.invest. In this case, the main model entrypoint would be at src/natcap/invest/<model>/<model>.py, and model resources would be saved into the same directory.

File authoring

It's helpful if we can all agree on a few things about how InVEST models should be written.

  • Use UTF-8 file encodings where possible.
    • Use # coding: UTF-8 at the top of your python files.
    • The full spec for how python interprets this is in PEP 263
    • Check your editor documentation for setting this as well, in case the above isn't recognized automatically.
  • Expand tabs to 4 spaces.
  • Line endings can be either Linux- or DOS-style, but whichever is established in the first PR that adds a file must be kept consistent in future changes to that file. (If James starts a file with Linux line endings, Rich should keep them consistent when he edits the file on Windows later.)

The execute function

Every InVEST model has a function called execute that takes a single parameter, a dict called args, containing arguments to the model. The execute function has a few consistent behaviors (a minimal sketch follows this list):

  • When execute is called, it begins executing the model with the user's inputs and blocks until the model completes successfully or raises an exception.
  • The execute function does not return a value; it implicitly returns None.
  • It is expected that execute will attempt to reasonably validate inputs early in its execution and raise an exception if errors are found.
  • execute should not have any known side effects other than writing temporary, intermediate and output files to the defined workspace. The model should not modify the incoming args dict in any way.
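
A minimal sketch of that structure (the function body here is illustrative, not from any real model):

import os


def execute(args):
    """Execute the model with the user's args.

    Args:
        args (dict): model arguments, keyed as described below.

    Returns:
        None
    """
    # Validate early; raise if the inputs don't make sense.
    validation_warnings = validate(args)
    if validation_warnings:
        raise ValueError(str(validation_warnings))

    # Everything the model writes goes into the workspace.
    os.makedirs(args['workspace_dir'], exist_ok=True)
    # ... geoprocessing happens here, blocking until the model completes ...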

The args dict

The args dict passed to execute should have the following structure:

  • Keys should be python strings of ASCII lowercase alphanumeric words separated by underscores.

  • Keys should be named sensibly according to the value, action, or option that they represent. Keys should not reflect how they happen to be visually represented in a particular user interface. Examples to emulate include:

    • landcover_raster_path
    • farms_vector_path
    • landcover_biophysical_table_path
    • do_valuation
  • Several keys are standardized across all InVEST models:

    • workspace_dir
      • Required parameter.
      • Represents a directory on the local filesystem where temporary, intermediate, and output files and folders created by the model will be saved. If this folder does not exist, it (and any needed parent folders) will be created as part of the model run. The user must have write access to this path.
    • results_suffix
      • If this parameter is included in args, the string provided will be appended to the end of all files (not directories) created by the model run within the workspace.
    • n_workers
      • If this parameter is included in args, the value provided must cast to an integer. Represents the number of computational workers the model's graph of tasks may use. If the model does not use taskgraph to execute its tasks, this args parameter should be ignored.
  • Values should be serializable (str, int, float, None, ...). Nested python data structures (dict, list, tuple) are ok where it makes sense to use them.

  • If a value is a string, it should be encoded as UTF-8.
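
For example, a call to a model's execute function might look like this (the args keys shown are illustrative and do not match any particular model's spec):

import natcap.invest.carbon

args = {
    'workspace_dir': 'C:/InVEST/carbon_workspace',
    'results_suffix': 'scenario1',
    'n_workers': 4,  # must cast to an integer
    'lulc_raster_path': 'C:/InVEST/data/landcover.tif',
    'do_valuation': False,
}
natcap.invest.carbon.execute(args)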

The ARGS_SPEC

All InVEST models will own a data structure with information about the model’s inputs. This will be a dictionary (ARGS_SPEC) with the structure detailed below. The ARGS_SPEC will be used in a validation function, reducing the amount of work needed to properly and effectively validate model inputs. As of 3.10.0, ARGS_SPEC data is also the source of text for the Workbench and for the User's Guide "Data Needs" sections.

  • "model_name": “Habitat Risk Assessment” (The human-readable name of the model)
  • "module": “natcap.invest.hra” (The python-importable module name, in practice, use __name__)
  • "userguide_html": “habitat_risk_assessment.html” (The html page of the UG for this model, relative to the sphinx HTML build root)
  • "args_with_spatial_overlap":
    • "spatial_keys": [list of string keys]
    • "different_projections_ok": True or False
  • "args": (A dict describing all possible args accepted by the model's execute)
    • <args_key>: (the args key, e.g. 'workspace_dir')
    • "name": The human-readable name of this input. The workbench UI displays the "name" property as a label for each input. As such, we want to keep them consistent:
      • The name should be as short as possible. Extra description should go in "about" which becomes the tooltip text.
      • It should be all lower-case, except for things that are always capitalized (acronyms, proper names). Any capitalization rules such as "always capitalize the first letter" will be applied on the workbench side.
    • "type": <string> (one of the following) or <set> (e.g. {"raster", "vector"}):
      • "directory" - a directory that may or may not exist on disk
      • "file" - a file that may or may not exist on disk
      • "raster" - a raster that can be opened with GDAL
      • "vector" - a vector that can be opened with GDAL
      • "csv" - a CSV on disk (comma- or semicolon-delimited, possibly with a UTF-8 BOM)
      • "number" - a scalar value
      • "integer" - commonly used for biophysical table LULC codes
      • "ratio" - a decimal number between 0 and 1
      • "percent" - a number between 0 and 100
      • "freestyle_string" - a string that the user may customize with any valid character
      • "option_string" - a string where the value must belong to a set of options
      • "boolean" - either true or false (or something that can be cast to True or False)
    • "units": <pint.Unit>
      • generally used for any numeric type input
      • import the Unit Registry from spec_utils.py
    • "expression": <string> (optional)
      • an expression that can be evaluated by python to validate the arg's value
      • see natcap.invest.validation._evaluate_expression
      • e.g. "value > 0"
      • e.g. "value in {1, 2, 3, 4, 5}"
    • "bands": <dict> (only for raster type - example below)
      • the value is a nested arg spec, keyed by the band number used by the model.
    • "columns": <dict> (only for csv type - example below)
      • the values are nested arg specs, keyed by the required column names in the csv.
    • "fields": <dict> (only for vector type - example below)
      • the values are nested arg specs, keyed by the required field names in the vector.
    • "projected": <bool> (only for vector/raster type)
      • If True the dataset must have a projected (as opposed to geographic) coordinate system
      • Can be omitted instead of set to False
    • "projection_units": <pint.Unit> (only for vector/raster)
      • Used if the model has a strict requirement for the linear units of the coordinate system
    • "geometries": <set> (only for vector type)
      • the set of acceptable geometry types for input.
      • types are defined in spec_utils.py and should be imported from there.
    • "options": <list | dict> (only for option_string type)
      • list of strings representing all possible acceptable values
      • or a dict, where keys are the acceptable input values, and dict values are meaningful descriptions.
    • "required": True | False | <boolean expression of args_keys>
      • If this attribute is omitted, it defaults to True
      • If True, the input is required.
      • If False, the input is optional.
      • If an expression, the input is conditionally required based on evaluating the expression. Any args keys within the expression evaluate to True or False depending on whether the key is present in args, has a value associated with it, and that value is truthy.
    • "about": String text about this input.

Some args contain further nested args. Here is an example with a csv arg. Each column is treated like an args key, and its value follows the same pattern defined above.

"carbon_pools_path": {
    "type": "csv",
    "columns": {
        "lucode": {"type": "integer"},
        "c_above": {"type": "number", "units": u.metric_ton/u.hectare},
        "c_below": {"type": "number", "units": u.metric_ton/u.hectare},
        "c_soil": {"type": "number", "units": u.metric_ton/u.hectare},
        "c_dead": {"type": "number", "units": u.metric_ton/u.hectare}
    },
    "about": (
        "A table that maps the each LULC class from the LULC map(s)to "
        "the amount of carbon in their carbon pools."),
    "name": "Carbon Pools"
},

A vector arg can follow the same pattern, but it will have "fields" instead of "columns".

A raster arg will have a "bands" attribute that is treated as a nested arg:

"pawc_path": {
    "type": "raster",
    "bands": {1: {"type": "ratio"}},
...

Specs for args that are common across many InVEST models (e.g. DEM, LULC, AOI) are defined in spec_utils.py and should be imported and used from there.
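
Putting the pieces together, the top-level structure looks something like the sketch below (abbreviated; it assumes spec_utils.py exports shared specs such as WORKSPACE, SUFFIX, and N_WORKERS for the standardized keys):

from natcap.invest import spec_utils

ARGS_SPEC = {
    "model_name": "Habitat Risk Assessment",
    "module": __name__,
    "userguide_html": "habitat_risk_assessment.html",
    "args_with_spatial_overlap": {
        "spatial_keys": ["habitats_path", "aoi_vector_path"],
        "different_projections_ok": True,
    },
    "args": {
        "workspace_dir": spec_utils.WORKSPACE,
        "results_suffix": spec_utils.SUFFIX,
        "n_workers": spec_utils.N_WORKERS,
        # ... model-specific args follow the patterns described above ...
    },
}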

What happens within execute()

Much of what happens within the model is very specific to the model at hand. However, there are a few problems that are common across most (if not all) models.

Spatial alignment

We've found it to be very useful to have a step in the model where inputs are all 'aligned'. By alignment, we mean that the set of spatial inputs to be processed are mutated to a state where:

  1. The bounding boxes of all spatial inputs intersect in a way that makes sense for the model. Often, but not always, this is the intersection of all of the inputs' bounding boxes.
  2. The resolution and extents of the rasters to be processed all match perfectly. Note that this requires an interpolation scheme appropriate for each input, which is context-dependent for the model (a DEM might be linearly interpolated while an LULC might use mode).
  3. Inputs that must be in the same projection are warped if needed. Most models use one of the inputs as the reference projection and all other inputs are warped to this. In hydrological models, for example, the DEM is used as the source of truth. The source of truth should make sense for the problem at hand.

For rasters, this step is primarily handled by a combination of pygeoprocessing.align_and_resize_raster_stack and pygeoprocessing.warp_raster. Vector alignment is a bit different and varies by model according to the model's use case. As always, the spatial alignment performed should be appropriate and necessary for the model.
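
A sketch of a typical raster alignment step with pygeoprocessing (the paths, args keys, and resampling choices are illustrative, and intermediate_dir is assumed to be defined in context):

import os

import pygeoprocessing

aligned_dem_path = os.path.join(intermediate_dir, 'aligned_dem.tif')
aligned_lulc_path = os.path.join(intermediate_dir, 'aligned_lulc.tif')

# Use the DEM as the source of truth for pixel size.
target_pixel_size = pygeoprocessing.get_raster_info(
    args['dem_raster_path'])['pixel_size']

pygeoprocessing.align_and_resize_raster_stack(
    [args['dem_raster_path'], args['lulc_raster_path']],
    [aligned_dem_path, aligned_lulc_path],
    ['bilinear', 'mode'],  # interpolation appropriate to each input
    target_pixel_size,
    'intersection')  # how the bounding boxes are combined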

Filepath management

InVEST models write a variety of files as their primary outputs, and so, regardless of the contents of the file, the model will need to decide where a file should be saved. We don't have a single, great way to handle filepaths, so what you use will depend on your use case. The two approaches you'll see most commonly are:

  1. A file registry object
  2. The static definition of filepaths according to string patterns, where the pattern is replaced by some string derived from user input.

Note that approach #1 is commonly used when the model always produces the same files (see SDR) and approach #2 is commonly used when the files produced depend greatly on user input (see Pollination). Within approach #2, some models define patterns as module-level variables, and others define the paths within execute. Use your best judgment to determine what makes sense for the model.
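
A sketch of approach #2, with a module-level string pattern filled in from user input via the suffix helper described under File suffixes below (names are illustrative):

import os

from natcap.invest import utils

# Module-level pattern; '%s' is replaced with the file suffix at runtime.
CARBON_MAP_PATTERN = 'carbon_map%s.tif'


def execute(args):
    suffix = utils.make_suffix_string(args, 'results_suffix')
    carbon_map_path = os.path.join(
        args['workspace_dir'], CARBON_MAP_PATTERN % suffix)
    # ... the rest of the model ...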

File suffixes

File suffixes allow for the interleaving of files from various runs all within a single workspace. If a suffix is provided within args via the standard key, it should be used in determining filepaths.

A suffix is constructed with the following rules:

  • If the results_suffix args key is present but the string has no characters in it, ignore the suffix.
  • If the suffix starts with an underscore, ignore the leading underscore.
  • Otherwise, prepend an underscore to the file suffix.

An implementation of the file suffix construction rules is available in the natcap.invest.utils.make_suffix_string(args, suffix_key) function.
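
A minimal sketch of those rules (the utils function above is the canonical implementation):

def make_suffix(args):
    suffix = args.get('results_suffix', '')
    if suffix and not suffix.startswith('_'):
        # Prepend an underscore; a leading underscore provided by the
        # user is not doubled.
        suffix = '_' + suffix
    return suffix  # '' when no suffix was provided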

Directories within the workspace

Within the workspace, it's common for models to have folders such as:

  • workspace/output
  • workspace/intermediate

For models that use taskgraph, a cache directory is stored within the workspace as well. Temporary files and temporary folders are created within the workspace, sometimes within the intermediate directory, sometimes in their own directory. The location for each of these should make sense in the context of the problem being solved.

Temporary files

When processing large datasets in a memory-efficient way, it is sometimes necessary to temporarily write files to disk to avoid keeping more than is needed in memory while computing the target output. Temporary files should be written to a location within the workspace (args['workspace_dir']), and if the files are not intended for public consumption, these files should be removed before the model run completes. If a temporary file is intended for public consumption, it might be better suited as a non-temporary file.

For directory creation, consider using the natcap.invest.utils function make_directories(). For creating temporary directories, consider using tempfile.mkdtemp() with some of the optional arguments to clarify the purpose and parent folder of the new directory.

Consider removing temporary files when they are no longer needed.
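
For example (directory and prefix names are illustrative, and args is assumed to be in scope):

import os
import shutil
import tempfile

from natcap.invest import utils

intermediate_dir = os.path.join(args['workspace_dir'], 'intermediate')
output_dir = os.path.join(args['workspace_dir'], 'output')
utils.make_directories([intermediate_dir, output_dir])

# A clearly-named temporary directory within the workspace.
temp_dir = tempfile.mkdtemp(prefix='aligned_rasters_', dir=intermediate_dir)
# ... write and use temporary files ...
shutil.rmtree(temp_dir)  # remove temporary files when no longer needed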

Geospatial file formats

InVEST models should support reading any raster and vector format supported by whichever version of GDAL InVEST is built against.

For output spatial files, however, InVEST should write rasters as GeoTIFFs and vectors as GeoPackages (or ESRI Shapefiles) unless some other format makes sense for the domain of the model.

Here are a few other notes about working with geospatial files created within InVEST:

  • If GeoPackages are used for output formats, the GeoPackage's tables should also have the file suffix appended. This makes it easier to work with these layers in GIS software.
  • local_op functions passed to pygeoprocessing.raster_calculator should be careful to handle (and test) the case where a nodata value might not be defined; this is a valid configuration of a raster and will cause errors when not handled correctly (see the sketch after this list). See #228 for more information.
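
Here is a sketch of a local_op that guards against an undefined nodata value (the raster paths and the operation itself are illustrative):

import numpy
import pygeoprocessing
from osgeo import gdal

target_nodata = -1.0
source_nodata = pygeoprocessing.get_raster_info(
    source_raster_path)['nodata'][0]  # may legitimately be None


def double_op(value_array):
    """Double pixel values, handling the case where nodata is undefined."""
    result = numpy.full(
        value_array.shape, target_nodata, dtype=numpy.float32)
    if source_nodata is None:
        valid_mask = numpy.ones(value_array.shape, dtype=bool)
    else:
        valid_mask = ~numpy.isclose(value_array, source_nodata)
    result[valid_mask] = value_array[valid_mask] * 2
    return result


pygeoprocessing.raster_calculator(
    [(source_raster_path, 1)], double_op, target_raster_path,
    gdal.GDT_Float32, target_nodata)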

Tabular file format: CSV

InVEST models should read and write CSV files when a table is needed. If it's convenient to load the data into a python dictionary, utils.build_lookup_from_csv is helpful and can handle a variety of edge cases. If it's convenient to load into a pandas dataframe, use utils.read_csv_to_dataframe.

If needed, the python stdlib csv module can be useful, especially for handling nonstandard table layouts. Be forewarned, however, that the csv module's support for unicode strings is severely lacking. Prefer pandas if possible.

Variable Names

It's helpful to have clear, descriptive variable names that help a reader to understand what a variable is and what it represents. Like with args keys, some specific recommendations would be to follow the sort of convention used by pygeoprocessing. Some examples of this include:

  • Variables representing filepaths end with _path
    • If you see something with uri in it, please rename or delete it. We used to call filepaths uris. This is incorrect; InVEST uses local filepaths, not URIs.
  • Input file path variable names start with base_
  • Variables for file paths written by a function start with target_
  • If a variable indicates a raster, vector or table, include raster, vector, or table in the variable name
    • Historically, we referred to rasters as datasets and vectors as datasources (and occasionally shapefiles), after the internal GDAL nomenclature. Instead, please use raster and vector here unless the specific format used calls for more specific nomenclature.
  • If a variable represents a list, append _list to the variable name

Some tools like pylint will suggest that variable names be capped to a certain length. It is OK to have a longer variable name if it helps to clarify what it is and how it's used. Short variable names can also be OK if it helps with clarity within its context. We're trying to make maintainable software, not adhere to pylint's arbitrary rules.

Taskgraph

InVEST models can be thought of as managed geoprocessing workflows that can be broken up into a variety of functions that must be executed in a certain order. By defining these functions as tasks within a directed, acyclic graph with our library 'taskgraph', we're allowing a model to be able to:

  1. Re-use results from a previous execution of the task if the parameters have not changed
  2. Execute tasks in parallel.

Specific suggestions about taskgraph are:

  • Use the 'n_workers' args key (cast as an int) as the n_workers parameter to taskgraph (see the sketch after this list). If the user doesn't provide a valid value, or doesn't provide the parameter at all, assume an n_workers value of -1, or whatever value currently indicates synchronous execution and task management.

  • Taskgraph requires a directory parameter for where it should store information about the tasks it has already computed ("work tokens"). This should be a directory within args['workspace_dir'].
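
A sketch of typical taskgraph setup within execute (the cache directory name and the task shown are illustrative):

import os

import pygeoprocessing
import taskgraph

try:
    n_workers = int(args['n_workers'])
except (KeyError, ValueError, TypeError):
    n_workers = -1  # synchronous execution

task_graph = taskgraph.TaskGraph(
    os.path.join(args['workspace_dir'], 'taskgraph_cache'), n_workers)

align_task = task_graph.add_task(
    func=pygeoprocessing.align_and_resize_raster_stack,
    args=(base_raster_path_list, aligned_raster_path_list,
          resample_method_list, target_pixel_size, 'intersection'),
    target_path_list=aligned_raster_path_list,
    task_name='align input rasters')

task_graph.close()
task_graph.join()  # block until all tasks complete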

Loops are often (but not always) slow

In Python, loops incur nontrivial overhead relative to the equivalent iteration in C. While this overhead is inconsequential for limited iteration, it becomes noticeable when iterating over, say, all the pixels within a raster. Here are a couple suggestions for improving the speed of iteration:

Use comprehensions for mutating a sequence

Python's comprehension notation is often a compact way to represent an operation that produces either a list or a generator. Comprehensions are typically about 30% faster than for-loops.

Note that comprehensions themselves, while often used within a list, can be used to create generators instead. Using a generator in this context would eliminate the need to create a copy of the original data structure in memory.
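
For example, the same summation written as a list comprehension and as a generator expression, where the generator avoids building the intermediate list in memory:

# List comprehension: builds the whole list before summing.
total = sum([count * 2 for count in pixel_counts])

# Generator expression: items are produced lazily, no intermediate list.
total = sum(count * 2 for count in pixel_counts)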

Use numpy

If your data is already stored in a numpy array, try to use numpy's library of operations to index into and manipulate arrays. This is often several orders of magnitude faster than a simple loop in python.

As a sub-topic of numpy it's also worth noting that you can save array indexes for later use to avoid recomputing an index. Most InVEST models use numpy's boolean array indexing, saving the mask of valid pixels to an array called valid_mask and then using it later with a local operation to minimize the number of pixel-stack operations.

When creating numpy arrays, use the array creation routines that make sense for your use case. numpy.zeros, numpy.ones, numpy.full and numpy.empty are often good choices. For the sake of speed and efficiency, be sure to set an appropriate dtype parameter to ensure your array only uses the memory required. Numpy array dtypes often default to a larger amount of memory than you expect.
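
A short example combining these points, with an explicit dtype and a reusable valid_mask (the arrays involved are illustrative):

import numpy

nodata = -1.0
# numpy.full would default to float64; float32 is all we need here.
result = numpy.full(depth_array.shape, nodata, dtype=numpy.float32)

# Compute the mask of valid pixels once and reuse it.
valid_mask = ~numpy.isclose(depth_array, nodata)
result[valid_mask] = depth_array[valid_mask] * porosity_array[valid_mask]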

Write a cython extension

While numpy is incredibly useful and has most of the functions an InVEST model will need for most cases, some operations can really benefit from a lower-level implementation. This should be a last resort, as writing and maintaining a cython extension takes significantly longer to develop and is significantly harder to debug. It is also, however, sometimes the best way to handle lots of custom looping, or (more commonly) randomly walking across pixel values in a raster without exhausting available memory. Talk to Rich or James if you think a cython extension is needed before taking this on.

Iterating over raster values

GDAL-compatible rasters are especially interesting because their data is laid out on disk in sequential blocks. Since GDAL reads and writes whole blocks at a time, the most efficient way to iterate over the contents of a raster is to read and write whole blocks (or groups of contiguous blocks) at a time. We have two helper tools for this purpose:

  • pygeoprocessing.raster_calculator, for when:
    • Operations are local only to a stack of aligned pixels
    • The operation being performed on a pixel stack doesn't care about where it is within a raster
    • An output raster needs to be created
  • pygeoprocessing.iterblocks, for when:
    • You need to iterate over one or more aligned rasters and read pixel values to compute something (for example, 'what is the set of unique values in this LULC raster?')
    • You need to know where a pixel is within a block or the raster

In both of these cases, the functions merely handle the reading and writing to and from the rasters ... numpy operations are usually the best way to interact with the arrays returned.
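
For example, answering 'what is the set of unique values in this LULC raster?' with iterblocks:

import numpy
import pygeoprocessing

unique_values = set()
for block_offsets, lulc_block in pygeoprocessing.iterblocks(
        (lulc_raster_path, 1)):
    # iterblocks handles the block-wise I/O; numpy does the real work.
    unique_values.update(numpy.unique(lulc_block))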

Handling warnings in numpy operations

When operating on large numpy arrays (which is especially common with local_ops passed to pygeoprocessing.raster_calculator), numpy will throw warnings when it cannot perform a mathematical operation on a given pixel. Although numpy won't crash in these cases, they should be treated as errors and fixed.

If you are experiencing numpy warnings, it can be useful to cause numpy to raise them as exceptions in order to halt model execution. This can be set by calling numpy.seterr:

numpy.seterr(all='raise')
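
If you would rather scope this to a single block of code than set it process-wide, numpy.errstate works as a context manager:

with numpy.errstate(all='raise'):
    # A divide-by-zero here now raises FloatingPointError.
    ratio_array = numerator_array / denominator_array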

The validate function

In the same python module as the execute(args) function, there should also be a function with the signature validate(args, limit_to=None) and decorated by @validation.invest_validator. This function is called by the user interface layer (and often by execute as well) to provide fast, informative feedback to the user when they select inputs to the model.

At a minimum, the validate function should call:

validation_warnings = validation.validate(
        args, ARGS_SPEC['args'], ARGS_SPEC['args_with_spatial_overlap'])

and

return validation_warnings
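
Putting that together, a minimal validate function looks like:

from natcap.invest import validation


@validation.invest_validator
def validate(args, limit_to=None):
    return validation.validate(
        args, ARGS_SPEC['args'], ARGS_SPEC['args_with_spatial_overlap'])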

validate may perform other validation that is specific to the model and that is outside of the scope of what is provided in the standard ARGS_SPEC-based validation. Please ensure the extra validation completes quickly and reliably and adheres to the validation API.

The complete spec of the validation API is defined in the Validation Design Doc. For the most part, validation functions should be fairly complete across InVEST, though they may not be fully tested.

Reporting progress to the user: logging

InVEST uses Python's stdlib logging library for handling log messages. This library allows us to do fancy things like pass log messages between processes (we use this in taskgraph), or decide which messages to write to the UI's progress dialog, the command line, and the logfile written during a model run. With logging, each of these streams can be handled separately.

As a consequence of this, however, print statements will not be captured and written to logs.

Usage of logging

import logging

LOGGER = logging.getLogger(__name__)
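
Log messages then go through the module-level logger rather than print:

LOGGER.info('Aligning input rasters')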

If you see an error message about No handlers could be found for logger "<name of logger>", ask James or Rich. We believe this issue to be taken care of within InVEST, but we could be wrong.

Please do not use logging.basicConfig() within InVEST. This is a function that should only be used for an entry point, and only the UI or CLI is an entry point for InVEST.

Log level recommendations

A lot of metadata is included with every log message, but one of the most visible is the log level. InVEST uses various logging levels to indicate the severity of a message. This then allows us (and users) to decide which sorts of messages to see. For developing InVEST, here are a few suggestions about which level to use:

  • logging.DEBUG: information intended for a developer or model maintainer.
  • logging.INFO: user-facing information like progress logging.
  • logging.WARNING: something doesn't quite make sense or is likely to produce an error.
    • N.B.: It's often more useful to fix or prevent the cases where a warning would be needed than to warn the user about it. (Real-world example: if a user passes two rasters with the "same" projection but slightly different WKT, GDAL interprets them as different. In this case, rather than raise an exception, InVEST logs a warning and proceeds. If the output looks off, the user can check the logged projection warning to see whether it is relevant.)
  • logging.ERROR: something went very wrong but no Exception will be raised. An example of this might be a server processing function that is tolerant of a spotty network connection. If an operation fails and it makes sense to try again, we'd log an ERROR rather than terminate with an Exception.

While the logging system supports custom levels, the standard log levels are probably good enough for our purposes.

Managing dependencies of natcap.invest

When used effectively, the right dependencies can make reading, writing, and interpreting a program or application much easier. Dependencies come at a cost, however, observable in more complicated build, distribution, and installation processes, and they can also carry legal liabilities.

Unfortunately, adopting new dependencies can be costly:

  • When APIs change, we will eventually need to update how we use the package.
  • There are sometimes conflicts between packages. Sometimes this is merely a namespace issue; sometimes these conflicts can cause serious application crashes.
    • Real-world example: the interplay between the python package Shapely and the GDAL/OGR library, both of which are compiled against the C++ library GEOS, used for geometric operations. Each was built against a GEOS compiled with slightly different flags, leading to a hard crash under certain circumstances. See this github issue for how things shook out.
  • Software libraries are developed under myriad different licenses, which makes distribution tricky and, in some cases (as with the GPL), could add a variety of legal liabilities to The Natural Capital Project.

Adding dependencies can be a Very Good Thing, but it should be done with tech lead approval or team consensus.

Documenting code within natcap.invest

Docstrings

InVEST uses Google-style docstrings, rendered with Sphinx, which is one of the more readable of the docstring format standards. Important sections of the docstring to consider are:

  • Args, for describing the parameters of the function, including the type, whether it's optional, and any required structure or format details.
  • Returns, for describing any return values of the function.
  • Raises, if the function raises exceptions as part of the expected interface of the function. So, if someone can call this function, have the function raise an exception and have that be part of the normal flow of the program, that exception should be documented. No need to document all possible error states that might arise.
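
A sketch of a Google-style docstring with these sections (the function is illustrative):

def reclassify_carbon(lulc_raster_path, carbon_pools, target_raster_path):
    """Map per-LULC-class carbon storage onto an LULC raster.

    Args:
        lulc_raster_path (string): path to an LULC raster on disk.
        carbon_pools (dict): maps int LULC codes to per-pixel carbon
            storage (float). Must not be empty.
        target_raster_path (string): path to where the output raster
            will be written.

    Returns:
        None

    Raises:
        ValueError: if ``carbon_pools`` is empty.
    """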

Inline comments

Inline comments should be considered part of the source code itself and should provide helpful contextual information. Consider future maintenance of the software when writing these comments. What is not 100% clear from reading the source code itself? Why was a constant chosen? How was a given technical decision or approximation arrived at?