Zoo API Docs#

zoo#

class pyearthtools.zoo.BaseForecastModel#

Setup BaseForecastModel

A child must at least implement the `.load` function to pass back a `pyearthtools.training.wrapper.Predictor`.

## Setup
  • Setting _default_config_path provides a default config path. This should be given, otherwise it must be set by the user each time.

  • Setting _times allows a model to specify which time deltas need to be retrieved for predictions. Used for live download.

  • Setting _download_paths specifies files to download.

  • Setting _name provides the name of the model. It is best to set this identical to where the model is registered to. If not given, will be the class name. Use ‘/’ to set categories.

### _download_paths

Setting _download_paths in the class will allow those assets to be automatically retrieved and stored. They are then accessible underneath a directory retrievable from self.assets.

If given as a str, the text after the last ‘/’ will be used as the name. If given as a tuple, the first element is the link and the second is the name.

These paths can be to either a file or a zip file on a server or on the local machine.

If the assets should be downloaded each time, set _redownload_each_time to True.
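The naming rule for _download_paths entries can be sketched as below. This is an illustration of the rule as described, not the library's implementation: `resolve_download_entry` is a hypothetical helper and the URLs are placeholders.

```python
from typing import Tuple, Union

DownloadEntry = Union[str, Tuple[str, str]]


def resolve_download_entry(entry: DownloadEntry) -> Tuple[str, str]:
    """Return (link, name) for one `_download_paths` entry (hypothetical helper)."""
    if isinstance(entry, tuple):
        # Tuple form: the first element is the link, the second the asset name.
        link, name = entry
        return link, name
    # String form: the text after the last '/' becomes the asset name.
    return entry, entry.rsplit("/", 1)[-1]


entries = [
    "https://example.com/weights/model.ckpt",  # placeholder URL
    ("https://example.com/archive.zip", "static_fields"),  # placeholder URL
]
resolved = [resolve_download_entry(entry) for entry in entries]
```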

## Config Folder

The config folder for a model can use the following conventions to ease setup:

  • Data/ - Location for all data loaders

  • Pipeline/ - Location for all pipelines

It is assumed that most data configs will have a pipeline with an identical name for loading and preparing the data; however, the following exception applies. If a Data config name contains a (), with a str inside, it represents a different data source that uses the same pipeline. This can be useful for selecting between different sources of the same data, such as a link for downloading, archived data, or experiments.

Additionally, any data config with a - in its name represents an ancillary source, e.g. forcings, and will not be included in the available data sources. Any text prior to the - represents the parent source, and any text after it is its purpose.

Getting ancillary_pipeline will give back a dictionary of ancillary pipelines associated with the chosen source.

When creating a model subclass set _default_config_path to the default path.

A user can provide a config_path during __init__ to allow access to user-defined configs. This allows experiments to be easily run, and will follow the conventions outlined above, i.e. providing a data config with () will use the base pipeline.

### Examples

Consider the following structure.

>>> ├── Data
>>> │   ├── ERA5-Forcings.yaml
>>> │   ├── ERA5(cds)-Forcings.yaml
>>> │   ├── ERA5(cds).yaml
>>> │   └── ERA5.yaml
>>> └── Pipeline
>>>     └── ERA5.yaml

A user can request either ERA5 or ERA5(cds) as the data source; both sources are then loaded and use Pipeline/ERA5.yaml as their pipeline.

When getting ancillary_pipeline, either ERA5-Forcings or ERA5(cds)-Forcings will be used, dependent on the data source as detailed above. If a Pipeline/ERA5-Forcings.yaml existed, both sources would then use this as their pipeline.
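The naming conventions can be sketched in plain Python as follows. This only illustrates how the file names in the example tree relate to each other; it does not reproduce how pyearthtools.zoo resolves configs, and `strip_source` is a hypothetical helper.

```python
import re
from typing import Dict, List

# File stems from the example tree above (".yaml" removed).
data_configs: List[str] = ["ERA5-Forcings", "ERA5(cds)-Forcings", "ERA5(cds)", "ERA5"]
pipeline_configs: List[str] = ["ERA5"]


def strip_source(name: str) -> str:
    """Drop a '(source)' marker: 'ERA5(cds)' -> 'ERA5' (hypothetical helper)."""
    return re.sub(r"\(.*?\)", "", name)


# Names containing '-' are ancillary and are not offered as data sources.
sources = [name for name in data_configs if "-" not in name]

# Each source uses the pipeline that shares its name once '(...)' is ignored.
source_pipeline: Dict[str, str] = {
    name: strip_source(name)
    for name in sources
    if strip_source(name) in pipeline_configs
}

# Ancillary configs are grouped under their parent source, keyed by purpose.
ancillaries: Dict[str, Dict[str, str]] = {name: {} for name in sources}
for name in data_configs:
    if "-" in name:
        parent, purpose = name.split("-", 1)
        if parent in ancillaries:
            ancillaries[parent][purpose] = name
```

Here `source_pipeline` maps both ERA5 and ERA5(cds) to the ERA5 pipeline, and `ancillaries` pairs each source with its own Forcings config.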

### Configurable Config Path

Using pyearthtools.config, the paths in which config files for pyearthtools.zoo are found can be adjusted. This can be done by either setting configs in ~/.config/pyearthtools/models.yaml or setting pyearthtools_MODELS__CONFIGS in the environment.

An environment can define a list of paths split by : at pyearthtools_MODELS__CONFIGS. These will be added to the valid pipelines, with the model class name added to the end.

For most models this should be the full categorical path of the model; see each model for its _name. If not set, it will be the class name.
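A minimal sketch of how a ':'-separated pyearthtools_MODELS__CONFIGS value could expand into per-model config paths, assuming the model name is simply joined onto each entry. `config_paths_from_env` is a hypothetical helper, not part of pyearthtools.

```python
import os
from pathlib import Path
from typing import List, Mapping


def config_paths_from_env(
    model_name: str, environ: Mapping[str, str] = os.environ
) -> List[Path]:
    """Split the ':'-separated variable and append the model name to each path."""
    raw = environ.get("pyearthtools_MODELS__CONFIGS", "")
    # Empty entries (e.g. from a trailing ':') are skipped.
    return [Path(entry) / model_name for entry in raw.split(":") if entry]


# A stand-in environment is passed explicitly so the example has no side effects.
example_env = {"pyearthtools_MODELS__CONFIGS": "/opt/configs:/home/user/configs"}
paths = config_paths_from_env("Category/MODEL", example_env)
```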

### Config Assignments

Specifying a ‘{}’ after a config selection allows a user to specify replacement keys for the pipeline.

All keys in a pipeline need to be surrounded by ‘__’, so that a key ID corresponds in the config to __ID__. Say the ERA5 pipeline contains a key ‘__ID__’. To allow a user to select a certain ID at the time of running, the config can be specified as:

` 'ERA5{ID=42}' `

This will replace __ID__ inside the config before it is loaded with ‘42’. The replacement value will be a str.

#### Default Assignments

| Key | Value |
| --- | ----- |
| pyearthtools_ASSETS | Asset path to this model |
| pyearthtools_MODELS_DEFAULT_CONFIG | Default config path for a model |
| FILE | Folder containing the loaded config |
| OUTPUT_DIR | Output location as specified by the user |

If a ‘:’ follows the key while still inside the ‘__*__’ markers, anything after it is treated as the default value. For example, __ID:42__ would fall back to ‘42’ when no assignment for ID is given.
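The replacement rules above, including defaults, can be sketched with a regular expression. This is an assumption-laden illustration rather than the code pyearthtools uses: `apply_assignments` and the pattern are hypothetical, and unknown keys without a default are simply left untouched here.

```python
import re
from typing import Mapping

# Matches __KEY__ or __KEY:default__ (hypothetical pattern for illustration).
_KEY_PATTERN = re.compile(r"__(\w+?)(?::(.*?))?__")


def apply_assignments(text: str, assignments: Mapping[str, str]) -> str:
    """Replace __KEY__ markers in config text before it is loaded."""

    def _substitute(match: "re.Match[str]") -> str:
        key, default = match.group(1), match.group(2)
        if key in assignments:
            return str(assignments[key])  # replacement values are strings
        if default is not None:
            return default  # fall back to the text after ':'
        return match.group(0)  # no assignment and no default: leave as-is

    return _KEY_PATTERN.sub(_substitute, text)


replaced = apply_assignments("id: __ID__", {"ID": "42"})
defaulted = apply_assignments("level: __LEVEL:500__", {})
```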

#### Class assignments

Assignments like those shown above can also be provided within self._default_assignments, which will be used when loading a Pipeline.

` self._default_assignments = {'ID': 42} `

## Assets

Assets will be saved at the location given in the config at models.assets. This can be changed by either setting assets in ~/.config/pyearthtools/models.yaml or setting pyearthtools_MODELS__ASSETS in the environment.

The model name is appended to this path, so specify only the overall pyearthtools asset path.

This asset path is then accessible from self.assets.

## Caching Inputs

Setting data_cache allows the inputs of a model to be cached before inference. This may be especially useful for sanity checking, or for preloading downloaded data before switching to a compute node.

data_cleanup defines how to manage this cache. By default, it removes any data over 1 day old and limits the directory size to 10GB; see pyearthtools.data.indexes.CachingIndex for more information.

The model _name and pipeline name are automatically added to the path to prevent collisions. So:

If data_cache is /data/goes/here, and the model is Model/Name, with pipeline PipelineName

The full path is /data/goes/here/Model/Name/PipelineName

The pattern of the cache will then take over.
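The path composition described above amounts to joining the cache location, the model name, and the pipeline name. A small sketch follows; `cache_directory` is a hypothetical helper, and `PurePosixPath` is used only so the result is the same on every platform.

```python
from pathlib import PurePosixPath


def cache_directory(data_cache: str, model_name: str, pipeline_name: str) -> PurePosixPath:
    """Join the cache root, the model name, and the pipeline name."""
    return PurePosixPath(data_cache) / model_name / pipeline_name


full_path = cache_directory("/data/goes/here", "Model/Name", "PipelineName")
```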

Must be implemented by a child class to set up a model.

A child must at least implement the .load function to pass back a pyearthtools.training.wrapper.Predictor wrapper.

Parameters:
  • pipeline_name (Optional[str]) – Pipeline name to use, must be in valid_pipeline

  • pipeline – Already-loaded pipeline object (alternative to the pipeline name)

  • output (Optional[os.PathLike]) – Location to save predictions.

  • config_path (Optional[os.PathLike]) – Override for config path to find Data & Pipelines. Defaults to None.

  • data_cache (Optional[os.PathLike]) – Location to set a data cache for, automatically adds model name & pipeline to path. Defaults to None.

  • data_cleanup (dict[str, Any] | str | None) – Config for cleanup for data_cache. Defaults to None.

  • delete_cache (Optional[bool]) – Delete all data in cache. Defaults to False.

  • download_assets (bool, optional) – Whether to download assets. Will be called anyway upon first call to .index. Defaults to False.

  • **kwargs (Any, optional) – All extra kwargs used when getting the DataIndex.

Raises:

ValueError – If pipeline not in ._valid_pipeline() and a valid loaded pipeline is not supplied

property ancillary_pipeline: dict[str, 'pyearthtools.pipeline.Pipeline']#

Ancillary Pipelines

Get all ancillary pipelines associated with the selected one.

Ancillaries are marked with a ‘-’: the text before it names the core source, and the text after it names the ancillary.

Returns:

Name of ancillary: Loaded Pipeline

Return type:

(dict[str, pyearthtools.pipeline.Pipeline])

property assets: Path#

Get assets directory. Set in config by models.assets, therefore can be configured by the user in ~/.config/pyearthtools/models.yaml, or by setting pyearthtools_MODELS__ASSETS in the environment.

property cache: Path | None#

Get cache directory

data(basetime)#

Get data from pipeline

Used to download for live runs

Parameters:

basetime (str) – Time that a prediction would be run at

Return type:

list[Any]

download_assets()#

Download all assets in _download_paths, and store in .assets

Return type:

None

classmethod get_all_config_paths(config_path)#

Get all config paths associated with this model.

Parameters:

config_path (PathLike | None) – Defined Config path to add.

Returns:

All config paths

Return type:

(tuple[Path, …])

Raises:

ValueError – If no config paths found.

classmethod get_config(key, default=None)#

Get config for key from pyearthtools.config

Parameters:
  • key (str)

  • default (Any)

Return type:

Any

classmethod get_name()#

Get name of this class.

Can be overridden by setting _name; if not given, will be cls.__name__.

Return type:

str

property index: pyearthtools.training.MLDataIndex#

Get pipeline as an MLDataIndex

classmethod is_valid_pipeline(pipeline_name, config_path=None)#

Check if pipeline is a valid pipeline

Parameters:
  • pipeline_name (str) – Pipeline name to check if valid

  • config_path (PathLike | None) – Path to search for configuration

Returns:

If pipeline is valid.

Return type:

(bool)

abstractmethod load(*args, **kwargs)#

Load pyearthtools.training.wrapper.Predictor, and provide kwargs for pyearthtools.training.MLDataIndex.

Must accept user passed kwargs.

Return type:

tuple[‘pyearthtools.training.wrapper.Predictor’, dict[str, Any]]

load_pipeline(pipeline, data=True, ancillary=None, **kwargs)#

Hook to allow modification of how pipeline is loaded.

Parameters:
  • pipeline (str) – Path to pipeline file to open.

  • data (bool) – Whether the file being loaded is a data source (True) or a pipeline (False).

  • ancillary (Optional[str]) – Name of ancillary pipeline if ancillary pipeline.

  • kwargs (Any) – Assignments to pass to pyearthtools.pipeline.load

Return type:

pyearthtools.pipeline.Pipeline

Returns: Loaded pipeline

Usage:

A child model could override this to assign values to __KEY__ keys inside the Pipeline, or to add a step.

classmethod log()#

Model specific logger

Return type:

Logger

property pipeline: pyearthtools.pipeline.Pipeline#

Get pipeline as configured in the init.

run(*args, **kwargs)#

Run model

Using the pipeline and the overridden load function, create a DataIndex for the model and run a prediction.

All args and kwargs are passed through.

Raises:

RuntimeError – If a DataNotFoundError occurs

Returns:

Result of running the index

Return type:

(Any)

search(*args, **kwargs)#

Run a safe search on the index, skipping override

Return type:

dict[str, Path]

timer(title)#

Get timer context local to this object.

Parameters:

title (str) – Name of timer

Returns:

Timer context

Return type:

(Timer)

classmethod valid_pipelines(ancillary=False, *, config_path=None)#

Get valid pipeline list at config_path.

See _valid_pipeline for full docs.

Parameters:
  • ancillary (bool)

  • config_path (PathLike | None)

pyearthtools.zoo.register(name, exists='warn')#

Register a custom model for pyearthtools.zoo.

Any registered model is accessible underneath pyearthtools.zoo.Models.*

By setting the key with ‘/’ the categories of the model can be set.

Example

>>> register('Category/MODEL')(MODEL)
>>> # Accessible at `pyearthtools.zoo.Models.Category.MODEL`

Parameters:
  • name (str) – Name under which the model should be registered. A warning is issued if this name conflicts with a preexisting model.

  • exists (Literal['warn', 'ignore', 'error'])

Return type:

Callable[[…], Any]

zoo.exceptions#

class pyearthtools.zoo.exceptions.ModelException#

Base model exception

class pyearthtools.zoo.exceptions.ModelRegistrationException#

Model Registration exception

zoo.model#

class pyearthtools.zoo.model.Timer#

Record and log the execution time of code within this context manager.

Parameters:
  • title (str)

  • logger (logging.Logger | None)
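A timing context manager of this kind can be sketched as follows. `TimerSketch` is a hypothetical stand-in written for illustration; the log format and the `elapsed` attribute are assumptions, not the behaviour of pyearthtools.zoo.model.Timer.

```python
import logging
import time
from typing import Optional


class TimerSketch:
    """Record and log how long the enclosed block takes to run."""

    def __init__(self, title: str, logger: Optional[logging.Logger] = None):
        self.title = title
        self.logger = logger or logging.getLogger(__name__)
        self.elapsed: Optional[float] = None

    def __enter__(self) -> "TimerSketch":
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, traceback) -> bool:
        self.elapsed = time.perf_counter() - self._start
        self.logger.info("%s took %.3f s", self.title, self.elapsed)
        return False  # never swallow exceptions raised inside the block


with TimerSketch("sum of squares") as timer:
    total = sum(i * i for i in range(1000))
```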

zoo.predict#

pyearthtools.zoo.predict.data(model, time, pipeline, data_cache=None, config_path=None, **kwargs)#

Get the data needed for a model to run.

Can be used to precache data for ‘live’ runs.

Parameters:
  • model (str) – Model name to load

  • time (str) – Isoformat of time to get data for

  • pipeline (str) – Pipeline config to use

  • data_cache (Path | str | None) – Where to cache data. Defaults to None

  • config_path (Path | str | None) – Override for config path. Defaults to None

  • kwargs (dict[Any, Any]) – Extra keyword arguments to send to the model.

Raises:

RuntimeError – If an error occurred, re-raised with a clearer error message.

Returns:

Loaded data needed for the model.

Return type:

list[Any]

pyearthtools.zoo.predict.predict(model, time, pipeline_name, output, data_cache=None, config_path=None, **kwargs)#

Run a prediction for a given model, pipeline, and time.

Parameters:
  • model (str) – Model name to load

  • time (str) – Isoformat of time to run prediction for

  • pipeline_name (str) – Pipeline config to use

  • output (Path | str) – Location to save data

  • data_cache (Path | str | None) – Where to cache data. Defaults to None

  • config_path (Path | str | None) – Override for config path. Defaults to None

  • kwargs (dict[Any, Any]) – Extra keyword arguments to send to the model.

Raises:

RuntimeError – If an error occurred, re-raised with a clearer error message.

Returns:

Loaded Predictions.

Return type:

(Any)

zoo.utils#

class pyearthtools.zoo.utils.Colour#

Colour helper

class pyearthtools.zoo.utils.CategorisedObjects(name, categories=None, *, _parse=None, **objects)#

Generic class to allow access into categorised objects.

Categories are formed from nested kwargs and dictionaries, and can be set later with __setitem__. Keys must be hashable, just like in a dictionary.

Examples

>>> record = CategorisedObjects('Example', category_1 = {'sub_cat': 10})
>>> record.category_1
>>> ─┬ category_1 ──
>>>   └──sub_cat

## Parsing

Overriding _parse allows custom classes to be parsed when retrieved. Overriding _name allows custom classes names to be retrieved when displaying what is available.

Construct a Category, can itself have sub categories.

If any object is a dictionary, create another CategorisedObjects at that entry.

Parameters:
  • name (str) – Name of this category.

  • categories (dict[str, Any | CategorisedObjects] | None, optional) – Dictionary to configure categories to allow access to. If element is dictionary, will be configured as a sub category. Defaults to None

  • _parse (Callable, None, optional) – Init arg to override _parse function, to allow parsing of object upon retrieval. Must be a callable expecting self and one argument.

  • **objects (Any | CategorisedObjects | dict[str, Any | CategorisedObjects]) – Kwargs form of categories, kwarg key is top level category.

property available: tuple[str, ...]#

Get list of available objects

items() a generator object providing a view on Category's items#

keys() a generator object providing a view on Category's keys#

update(_CategorisedObjects__dict=None, **kwargs)#

Update CategorisedObjects

Can be given as a full path separated by ‘/’.

Value can be dictionary, which will be expanded.

Parameters:
  • _CategorisedObjects__dict (dict[Any, Any] | None)

  • kwargs (Any)

values() a generator object providing a view on Category's values#

class pyearthtools.zoo.utils.AvailableModels#

Get all available models as defined by entrypoints underneath pyearthtools.zoo.register.

Categorise these entry points by separating layers with _.

Examples

>>> # Entrypoints
>>> # NESM_modelNAME
>>> AvailableModels()
>>> ─┬ Available Models ──
>>>   └─┬ NESM ──
>>>      └──modelNAME

A model can be retrieved by getting attributes one layer at a time, by getattr(self, 'NESM/modelNAME'), or, if the last name is unique, by that name alone.

If NESM/modelNAME exists within the AvailableModels, it can be retrieved in the following ways:

>>> AvailableModels.NESM.modelNAME
>>> AvailableModels['NESM/modelNAME']
>>> AvailableModels.modelNAME  # Only works if `modelNAME` is unique.
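The link between underscore-separated entry point names and '/'-separated lookups can be sketched with nested dictionaries. `categorise` and `lookup` are hypothetical helpers for illustration and do not reflect how AvailableModels is implemented.

```python
from typing import Any, Dict


def categorise(entry_points: Dict[str, Any], sep: str = "_") -> Dict[str, Any]:
    """Nest objects by splitting each entry point name on `sep`."""
    tree: Dict[str, Any] = {}
    for name, obj in entry_points.items():
        *categories, leaf = name.split(sep)
        node = tree
        for category in categories:
            node = node.setdefault(category, {})
        node[leaf] = obj
    return tree


def lookup(tree: Dict[str, Any], path: str, sep: str = "/") -> Any:
    """Walk the nested dictionary one layer per path component."""
    node: Any = tree
    for part in path.split(sep):
        node = node[part]
    return node


tree = categorise({"NESM_modelNAME": "model object"})
found = lookup(tree, "NESM/modelNAME")
```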

Construct object containing all available models

Raises:

ValueError – If a model will get overwritten by a duplicate key.

refresh()#

Refresh available models

class pyearthtools.zoo.utils.TabCompleter#

A tab completer that can either complete from the filesystem or from a list.

create_list_completer(ll)#

This is a closure that creates a method that autocompletes from the given list.

Since the autocomplete function can’t be given a list to complete from a closure is used to create the listCompleter function with a list to complete from.

Parameters:

ll (list | str)
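The closure pattern described for create_list_completer can be sketched as below. Python's `readline.set_completer` expects a function called as `function(text, state)` that returns the next match or None, so the list has to be captured by the enclosing function. `make_list_completer` is a hypothetical name, and prefix matching is an assumption.

```python
from typing import Callable, List, Optional


def make_list_completer(options: List[str]) -> Callable[[str, int], Optional[str]]:
    """Build a readline-style completer bound to `options`."""

    def list_completer(text: str, state: int) -> Optional[str]:
        # `state` counts up from 0 until the completer returns None.
        matches = [option for option in options if option.startswith(text)]
        return matches[state] if state < len(matches) else None

    return list_completer


completer = make_list_completer(["ERA5", "ERA5(cds)", "GFS"])
```

Where the readline module is available, the returned function could be passed to `readline.set_completer(completer)`.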

path_completer(text, state)#

This is the tab completer for system paths. Only tested on Linux systems.

pyearthtools.zoo.utils.parse_str(item)#

Parse a str to a boolean if it represents a bool.

Parameters:

item (str)

Return type:

str | int | float | bool

pyearthtools.zoo.utils.find_demlim(value, delim_options)#

Find which delimiter is being used out of delim_options

Defaults to ‘-’ if none found

Parameters:
  • value (str)

  • delim_options (list[str])
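A sketch of the documented behaviour, assuming the first option found in the value wins. `find_delimiter` is a hypothetical re-implementation for illustration only.

```python
from typing import List


def find_delimiter(value: str, delim_options: List[str], default: str = "-") -> str:
    """Return the first delimiter from `delim_options` present in `value`."""
    for delim in delim_options:
        if delim in value:
            return delim
    return default  # documented fallback when none is found


found_delim = find_delimiter("2021/01/01", ["/", "_"])
fallback_delim = find_delimiter("20210101", ["/", "_"])
```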

pyearthtools.zoo.utils.delta_conversion(value, unit='hour')#

Attempt to convert a given value to an integer of the given unit.

If the value cannot be converted, it is quietly returned unchanged.

Parameters:
  • value (Any) – Value to convert

  • unit (str, optional) – Unit to convert in to. Defaults to ‘hour’.

Returns:

Time delta in unit

Return type:

(int)

pyearthtools.zoo.utils.create_mapping(list1, list2)#

Creates a dictionary mapping elements from list1 to list2, ignoring text in ().

Allows data to be associated with pipelines designed to be generic. If no element found in the second list, value will be None. – Generated by Bard

Parameters:
  • list1 (list[str]) – A list of strings.

  • list2 (list[str]) – A list of strings.

Returns:

A dictionary mapping elements from list1 to list2, ignoring text in ().

Return type:

(dict[str, str | None])

Examples

Given two lists [‘era5’, ‘era5(test)’] and [‘era5’], the mapping would be {‘era5’: ‘era5’, ‘era5(test)’: ‘era5’}.
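A minimal sketch of this mapping behaviour, assuming that only the entries of list1 carry a parenthesised part. `create_mapping_sketch` is a hypothetical re-implementation, not the library function.

```python
import re
from typing import Dict, List, Optional


def create_mapping_sketch(list1: List[str], list2: List[str]) -> Dict[str, Optional[str]]:
    """Map each list1 entry to its list2 match, ignoring text in ()."""
    mapping: Dict[str, Optional[str]] = {}
    for item in list1:
        base = re.sub(r"\(.*?\)", "", item)  # 'era5(test)' -> 'era5'
        mapping[item] = base if base in list2 else None
    return mapping


mapping = create_mapping_sketch(["era5", "era5(test)"], ["era5"])
```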

pyearthtools.zoo.utils.get_annotation(val)#

Get annotation from a signature value

Parameters:

val (Parameter)

pyearthtools.zoo.utils.get_arguments(function)#

Get arguments of a function

Parameters:

function (Callable) – Function to get arguments of

Returns:

[Required arguments, Type hints], [Defaulted arguments, defaults or type hints]

Return type:

(tuple[dict[str,Any], dict[str, Any]])
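Splitting a signature into required and defaulted arguments can be done with the standard `inspect` module. The sketch below is a hypothetical re-implementation: it stores only defaults for defaulted arguments, and it does not treat `*args` or `**kwargs` specially.

```python
import inspect
from typing import Any, Callable, Dict, Tuple


def get_arguments_sketch(function: Callable) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return ({required: type hint}, {defaulted: default value})."""
    required: Dict[str, Any] = {}
    defaulted: Dict[str, Any] = {}
    for name, parameter in inspect.signature(function).parameters.items():
        if parameter.default is inspect.Parameter.empty:
            required[name] = parameter.annotation
        else:
            defaulted[name] = parameter.default
    return required, defaulted


def example(model: str, time: str, data_cache=None, retries: int = 3):
    """Placeholder function used only to demonstrate the split."""


required, defaulted = get_arguments_sketch(example)
```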

pyearthtools.zoo.utils.split_name_assignment(config)#

Split config into name and assignment components.

Assignment is given enclosed in {}, and multiple assignments can be split by ‘,’.

If no assignment, return it as None

Parameters:

config (str) – Pipeline config to parse

Raises:

ValueError – If too many elements discovered

Returns:

config name, dictionary of assignments if any or None

Return type:

(tuple[str, dict[str, str | int] | None])
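The parsing described above can be sketched as follows. `split_name_assignment_sketch` is a hypothetical re-implementation: values are kept as strings, and the conditions that raise `ValueError` are assumptions.

```python
from typing import Dict, Optional, Tuple


def split_name_assignment_sketch(config: str) -> Tuple[str, Optional[Dict[str, str]]]:
    """Split 'NAME{KEY=VALUE,...}' into the name and its assignments."""
    if "{" not in config:
        return config, None  # no assignment given
    if config.count("{") > 1 or not config.endswith("}"):
        raise ValueError(f"Cannot parse assignments from {config!r}")
    name, _, body = config[:-1].partition("{")
    assignments: Dict[str, str] = {}
    for pair in body.split(","):
        if not pair.strip():
            continue  # tolerate empty braces or a trailing ','
        key, _, value = pair.partition("=")
        assignments[key.strip()] = value.strip()
    return name, assignments


single = split_name_assignment_sketch("ERA5{ID=42}")
plain = split_name_assignment_sketch("ERA5")
```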

zoo.warnings#

class pyearthtools.zoo.AccessorRegistrationWarning#

Warning for conflicts in accessor registration.