Data API Docs


data.archive#

class pyearthtools.data.archive.ZarrIndex(store, variables=None, *, template=False, transforms=None, open_kwargs=None, save_kwargs=None, remote=False, **kwargs)#

Zarr Index

Can be used to access local/remote zarr archives, with the ability to write into them.

Examples:

>>> zarr_archive = Zarr(PATH_TO_ZARR_ARCHIVE)
>>> zarr_archive()

For time aware indexing, use ZarrTime.

Additionally, this class can be used to create an ‘empty’ archive, with all metadata prepopulated.

This is useful to premake an archive, and then use many distributed processes to write subsets into it.

Template Example:

>>> zarr_archive = Zarr(PATH_TO_ZARR_ARCHIVE, template = True)
>>> zarr_archive.make_template(SINGLE_SAMPLE, time = EXPANDED_TIME)
>>>
>>> for subsample in TOTALSAMPLES: # Can be done distributedly
>>>     zarr_archive.save(subsample)

Zarr Archive

Saving supports sa as a mode, meaning ‘safe append’: the existing archive is inspected, and only data that is missing along append_dim is appended.

If template is True, exists will always be False.

Parameters:
  • store (PathLike) – Store or path to directory in local or remote file system.

  • variables (str | list[str] | None, optional) – Variables within the dataset to subset to. Defaults to None.

  • template (bool, optional) – Whether this archive is a template. If True, exists always returns False, which allows a cacher to write to this archive even though it appears to exist on disk. Defaults to False.

  • transforms (Transform | TransformCollection | None, optional) – Base Transforms to be applied to data. Transforms are applied on the retrieval of data, i.e. index[] but not when directly getting the data, index.get(). Defaults to TransformCollection().

  • open_kwargs (dict[str, Any] | None, optional) – Kwargs to use when opening the zarr archive. See https://docs.xarray.dev/en/stable/generated/xarray.open_zarr.html Defaults to None.

  • save_kwargs (dict[str, Any] | None, optional) – Kwargs to use when saving the zarr archive. See https://docs.xarray.dev/en/latest/generated/xarray.Dataset.to_zarr.html Defaults to None.

  • remote (bool) – If this flag is set, then the store variable is an fsspec style URL to a remote Zarr store, so will not be treated like a local path.

exists(search_dict=None, **kwargs)#

Check if zarr archive exists

If template == True, always return False.

Parameters:
  • search_dict (dict[str, Any] | None, optional) – Key / val to check for in data. Defaults to None.

  • kwargs – Kwargs form of search_dict.

Returns:

If zarr archive / data in archive exists.

Return type:

(bool)

get()#

Get zarr archive

Used within the index's data access flows.

Subset on variables if given, but applies no other subsetting.

make_template(dataset, *, chunk=None, encoding=None, overwrite=True, append_dimension=None, expand_coords=None, **kwargs)#

Make a template dataset out of one sample of data, dataset.

A sample should contain all of the variables this full dataset should have. It must also contain all values along the coordinates not included in expand_coords that can be expected, i.e. all latitude values.

A sample does not need to include all values specified in expand_coords; this function reindexes it to include them.

The full dataset is defined as the sample expanded by expand_coords.
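The relationship between a sample and the full dataset can be pictured with a small stand-alone sketch. It is not the pyearthtools implementation: expand_sample_coords is a hypothetical helper, and plain dictionaries of lists stand in for the coordinates of an xarray dataset.

```python
from typing import Any, Dict, List


def expand_sample_coords(
    sample_coords: Dict[str, List[Any]],
    expand_coords: Dict[str, List[Any]],
) -> Dict[str, List[Any]]:
    """Coordinates of the full dataset: the sample's, with expanded ones swapped in."""
    full = dict(sample_coords)  # copy, so the sample itself is left untouched
    for name, values in expand_coords.items():
        full[name] = list(values)
    return full


# One time step, but every latitude and longitude that can be expected.
sample = {
    "time": ["2000-01-01T00"],
    "latitude": [-90.0, 0.0, 90.0],
    "longitude": [0.0, 120.0, 240.0],
}
full = expand_sample_coords(
    sample, {"time": ["2000-01-01T00", "2000-01-01T06", "2000-01-01T12"]}
)
shape = {name: len(values) for name, values in full.items()}
print(shape)  # {'time': 3, 'latitude': 3, 'longitude': 3}
```

Only the coordinates named in expand_coords grow; everything else is taken from the sample as it is.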

Parameters:
  • dataset (xr.Dataset) – Single sample of full dataset. All metadata will be taken from this sample.

  • chunk (Literal['auto'] | None | dict[str, Literal['auto'] | int ], optional) – Override for chunks of zarr archive. Any key in expand_coords will be chunked ‘auto’. Defaults to None.

  • overwrite (bool, optional) – Whether to overwrite an existing zarr archive. Defaults to True.

  • append_dimension (str | None, optional) – Dimension to append on, if to append. Defaults to None.

  • expand_coords (dict[str, list[Any]] | None) – Coordinates to reindex. Allows for a single sample to be passed, but full archive created of larger data. Defaults to None.

  • kwargs – Kwargs form of expand_coords

  • encoding (dict[str, dict[str, Any]] | None)

Raises:

FileExistsError – If file exists and overwrite == False.

Examples

>>> era5 = pyearthtools.data.archive.ERA5.sample()
>>>
>>> full_time_values = list(map(lambda x: x.datetime64(), pyearthtools.data.TimeRange('1980', '2020', '6 hour')))
>>>
>>> zarr_archive = Zarr(PATH_TO_ZARR, template = True)
>>> zarr_archive.make_template(era5['2000-01-01T00'], time = full_time_values)
... # Will create a zarr archive like `era5` but across all of `full_time_values`

save(data, save_kwargs=None, **kwargs)#

Save data into the zarr archive

See https://docs.xarray.dev/en/latest/generated/xarray.Dataset.to_zarr.html

Saving supports sa as a mode, meaning ‘safe append’: the existing archive is inspected, and only data that is missing along append_dim is appended.

Parameters:
  • data (Dataset) – Dataset to save

  • save_kwargs (dict[str, Any] | None) – Extra kwargs to pass to .to_zarr, in addition to init.save_kwargs. Defaults to None.

  • **kwargs – Kwargs form of save_kwargs
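The ‘safe append’ idea can be illustrated with a minimal sketch. This is an assumption about the general approach, not the library's code: missing_along_dim is a hypothetical helper, and plain lists stand in for the coordinate values along append_dim.

```python
from typing import Hashable, Iterable, List


def missing_along_dim(existing: Iterable[Hashable], incoming: Iterable[Hashable]) -> List[Hashable]:
    """Return the incoming coordinate values that are not yet in the archive.

    Hypothetical helper: a 'safe append' would write only these values.
    """
    already_saved = set(existing)
    # Keep the incoming order so the appended data stays ordered as supplied.
    return [value for value in incoming if value not in already_saved]


archive_times = ["2000-01-01T00", "2000-01-01T06"]
new_times = ["2000-01-01T06", "2000-01-01T12", "2000-01-01T18"]

to_append = missing_along_dim(archive_times, new_times)
print(to_append)  # ['2000-01-01T12', '2000-01-01T18']
```

Filtering on the append dimension before writing means that repeating a save with overlapping data does not duplicate entries.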

search()#

Get path of zarr archive

Return type:

str | Path

class pyearthtools.data.archive.ZarrTimeIndex(store, variables=None, *, template=False, transforms=None, open_kwargs=None, save_kwargs=None, remote=False, **kwargs)#

Time index aware zarr archive

Allows for [] with a time value, and subsetting accordingly.

Zarr Archive

Saving supports sa as a mode, meaning ‘safe append’: the existing archive is inspected, and only data that is missing along append_dim is appended.

If template is True, exists will always be False.

Parameters:
  • store (PathLike) – Store or path to directory in local or remote file system.

  • variables (str | list[str] | None, optional) – Variables within the dataset to subset to. Defaults to None.

  • template (bool, optional) – Whether this archive is a template. If True, exists always returns False, which allows a cacher to write to this archive even though it appears to exist on disk. Defaults to False.

  • transforms (Transform | TransformCollection | None, optional) – Base Transforms to be applied to data. Transforms are applied on the retrieval of data, i.e. index[] but not when directly getting the data, index.get(). Defaults to TransformCollection().

  • open_kwargs (dict[str, Any] | None, optional) – Kwargs to use when opening the zarr archive. See https://docs.xarray.dev/en/stable/generated/xarray.open_zarr.html Defaults to None.

  • save_kwargs (dict[str, Any] | None, optional) – Kwargs to use when saving the zarr archive. See https://docs.xarray.dev/en/latest/generated/xarray.Dataset.to_zarr.html Defaults to None.

  • remote (bool) – If this flag is set, then the store variable is an fsspec style URL to a remote Zarr store, so will not be treated like a local path.

exists(querytime=None, **kwargs)#

Check for existence.

If querytime is given, check that it is in the zarr archive.

Parameters:

querytime (str | None)

retrieve(querytime=None, *args, transforms=None, **kwargs)#

If supplied, retrieve the data subset for the specified time

Return type:

Any

pyearthtools.data.archive.extensions.register_archive(name, *, sample_kwargs=None)#

Register a custom archive underneath pyearthtools.data.archive.

Parameters:
  • name (str) – Name under which the archive should be registered. A warning is issued if this name conflicts with a preexisting archive.

  • sample_kwargs (dict[str, Any] | None, optional) – Keyword arguments to initialise a sample index for demonstration. Can be retrieved with .sample

Return type:

Callable

pyearthtools.data.archive.reset_root()#

Reset all root directories

pyearthtools.data.archive.set_root(root_dir=None, **kwargs)#

Change root directory for data sources.

Can set value of dictionary to None which will result in the root directory being reset to the default value.

Parameters:
  • root_dir (dict[str, str | None] | None, optional) – Dictionary with root directory replacements. Defaults to None.

  • **kwargs (dict[str,str | None]) – Kwargs version of root_dir
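The replace-or-reset behaviour of set_root can be sketched as follows. The names DEFAULT_ROOTS, ROOTS and set_root_sketch, along with the paths, are hypothetical; only the rule that a None value restores the default comes from the description above.

```python
from typing import Dict, Optional

DEFAULT_ROOTS = {"era5": "/data/era5", "access": "/data/access"}  # hypothetical defaults
ROOTS = dict(DEFAULT_ROOTS)


def set_root_sketch(
    root_dir: Optional[Dict[str, Optional[str]]] = None, **kwargs: Optional[str]
) -> None:
    """Apply root directory replacements given as a dictionary and/or keywords."""
    replacements = dict(root_dir or {}, **kwargs)
    for source, path in replacements.items():
        # None means "go back to the default root for this source".
        ROOTS[source] = DEFAULT_ROOTS[source] if path is None else path


set_root_sketch({"era5": "/scratch/era5"})
print(ROOTS["era5"])  # /scratch/era5
set_root_sketch(era5=None)
print(ROOTS["era5"])  # /data/era5
```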

pyearthtools.data.archive.config_root()#

Setup Root Directories

data.derived#

class pyearthtools.data.derived.DerivedValue(transforms=TransformCollection(), *args, add_default_transforms=True, preprocess_transforms=None, **kwargs)#

Base class for Derived data

Subclassed from DataIndex so transforms can be used.

Child must implement derive.

Introduce [transforms][pyearthtools.data.transforms] to data loading

Parameters:
  • transforms (Transform | TransformCollection, optional) – Base Transforms to be applied to data. Defaults to TransformCollection().

  • add_default_transforms (bool, optional) – Add Default Transformations. Defaults to True

  • preprocess_transforms (Transform | TransformCollection | Callable | None) – Transforms to apply in preprocessing for datasets. Does not work on other file formats. Defaults to None.

abstractmethod derive(*args, **kwargs)#

Get derived value.

Will only be passed most specific key, so if a function of time, expect a time.

Child class must implement

Return type:

Dataset

get(*args, **kwargs)#

Override for get to use derive.
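The division of labour between derive and get can be sketched with the standard abc module. DerivedSketch and HourOfDay are hypothetical classes written for illustration; the real DerivedValue also handles transforms, which are left out here.

```python
from abc import ABC, abstractmethod


class DerivedSketch(ABC):
    """Stand-in for a derived index: subclasses only implement derive."""

    @abstractmethod
    def derive(self, *args, **kwargs):
        """Compute the value for the most specific key (for example a time)."""

    def get(self, *args, **kwargs):
        # get simply defers to derive, so the usual retrieval flow still works.
        return self.derive(*args, **kwargs)


class HourOfDay(DerivedSketch):
    def derive(self, time: str) -> int:
        # Expects an ISO-like string such as '2000-01-01T06'.
        return int(time.split("T")[1][:2])


print(HourOfDay().get("2000-01-01T06"))  # 6
```

Marking derive as abstract means a subclass that forgets to implement it cannot be instantiated.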

classmethod like(dataset, **kwargs)#

Setup DerivedValue taking coords from dataset if key in __init__.

If cls takes latitude and longitude, and those coords are in dataset, their values are taken and passed to __init__.

Examples:

>>> era = pyearthtools.data.archive.ERA5.sample()
>>> derived = DerivedValue.like(era['2000-01-01T00'])

Parameters:

dataset (Dataset | DataArray)
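One way such a like constructor could match coordinates to __init__ parameters is with inspect.signature. This is a sketch under assumptions, not the library's code: GridValue is a hypothetical class, and a plain dictionary stands in for the coords of a dataset.

```python
import inspect


class GridValue:
    """Hypothetical derived class whose __init__ takes latitude and longitude."""

    def __init__(self, latitude, longitude, scale=1.0):
        self.latitude = latitude
        self.longitude = longitude
        self.scale = scale

    @classmethod
    def like(cls, coords, **kwargs):
        accepted = inspect.signature(cls.__init__).parameters
        # Keep only the coords whose names are also __init__ parameters.
        from_coords = {name: values for name, values in coords.items() if name in accepted}
        return cls(**{**from_coords, **kwargs})


coords = {"time": ["2000-01-01T00"], "latitude": [-45.0, 45.0], "longitude": [0.0, 180.0]}
derived = GridValue.like(coords, scale=2.0)
print(derived.latitude, derived.longitude, derived.scale)
```

The time coordinate is ignored because __init__ has no parameter of that name.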

class pyearthtools.data.derived.TimeDerivedValue(data_interval=None, **kwargs)#

Temporally derived value Index

Derived value which is a function of time.

Hooks into TimeDataIndex to allow for series retrieval

Parameters:

data_interval (tuple[int, str] | int | str | TimeDelta | None, optional) – Default interval of data. Defaults to None.

class pyearthtools.data.derived.AdvancedTimeDerivedValue(data_interval=None, split_time=False, **kwargs)#

Advanced Temporally Derived Index

Allows for time-resolution-based retrieval.

Example:

>>> index = AdvancedTimeDerivedValue('6 hours')
>>> index['2000-01-01'] # Will get four steps 00,06,12,18

Parameters:
  • data_interval (tuple[int, str] | int | str | TimeDelta | None) – Interval of derivation, if given allows for [] to get multiple samples based on resolution.

  • split_time (bool) – Whether to split a series call into each individual time, or pass list of times.

Derived value which is a function of time.

Hooks into TimeDataIndex to allow for series retrieval

Parameters:
  • data_interval (tuple[int, str] | int | str | TimeDelta | None, optional) – Default interval of data. Defaults to None.

  • split_time (bool)
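The resolution-based retrieval in the example above (a day-resolution key with a 6 hour interval giving four steps) can be sketched with the standard library. steps_within_day is a hypothetical helper and handles only day-resolution keys; the library's own time handling is more general.

```python
from datetime import datetime, timedelta
from typing import List


def steps_within_day(day: str, interval: timedelta) -> List[datetime]:
    """List every step of the given interval that falls inside one day."""
    start = datetime.fromisoformat(day)
    end = start + timedelta(days=1)
    steps = []
    current = start
    while current < end:  # the following midnight belongs to the next day
        steps.append(current)
        current += interval
    return steps


hours = [step.strftime("%H") for step in steps_within_day("2000-01-01", timedelta(hours=6))]
print(hours)  # ['00', '06', '12', '18']
```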

series(start, end, interval=None, **_)#

Index into Provided Data function to create a continuous series of Data

Parameters:
  • DataFunction (AdvancedTimeIndex) – Data function, must be AdvancedTimeIndex or child

  • start (str | datetime.datetime | Petdt) – Timestep to begin series at

  • end (str | datetime.datetime | Petdt) – Timestep to end series at

  • interval (tuple[float, str]) – Time interval between samples. Use pandas.to_timedelta notation, (10, ‘minute’)

  • inclusive (bool, optional) – Whether end time is included in retrieval. Defaults to False.

  • skip_invalid (bool, optional) – Whether to skip invalid data. Defaults to False.

  • transforms (Transform | TransformCollection, optional) – Extra [Transform’s][pyearthtools.data.transforms.Transform] to be applied to data. Defaults to TransformCollection().

  • verbose (bool, optional) – Print logging messages. Defaults to False.

  • force_get (bool, optional) – Use series method which loads each dataset using .get. WARNING: Takes significantly longer, as it does not use dask. Defaults to False.

  • subset_time (bool, optional) – Whether to force subset time dim. Defaults to True.

  • tolerance (tuple | pd.Timedelta, optional) – Tolerance for time subsetting. Defaults to None.

Returns:

Loaded xarray dataset

Return type:

(xr.Dataset)

class pyearthtools.data.derived.Insolation(latitude, longitude, interval=None, *, S=1.0, daily=False, clip_zero=True)#

Calculate the approximate solar insolation for given dates.

Use like to mimic a dataset; the dataset must have latitude and longitude in its coords.

For an example reference, see: https://brian-rose.github.io/ClimateLaboratoryBook/courseware/insolation.html

Parameters:
  • latitude (np.ndarray | list) – 1d or 2d array of latitudes

  • longitude (np.ndarray | list) – 1d or 2d array of longitudes (0-360deg). If 2d, must match the shape of latitude.

  • interval (tuple[int, str] | int | str | None, optional) – TimeDelta of data, e.g. 6 hour. Used for series retrieval. If None, the index has no default interval awareness. Defaults to None.

  • S (float, optional) – scaling factor (solar constant). Defaults to 1.0.

  • daily (bool, optional) – if True, return the daily max solar radiation (lat and day of year dependent only). Defaults to False.

  • clip_zero (bool, optional) – if True, set values below 0 to 0. Defaults to True.

Raises:

ValueError – If latitude or longitude are invalid.
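A common textbook approximation treats instantaneous insolation as S times the cosine of the solar zenith angle. The sketch below follows that approach; it is an assumption about the kind of calculation involved, not the formula used by Insolation. insolation_sketch and its declination approximation are illustrative choices made here.

```python
import math


def insolation_sketch(lat_deg, lon_deg, day_of_year, utc_hour, S=1.0, clip_zero=True):
    """Approximate insolation as S * cos(solar zenith angle)."""
    # Assumed declination approximation: -23.44 deg * cos(2*pi*(N + 10) / 365).
    declination = math.radians(-23.44) * math.cos(2.0 * math.pi * (day_of_year + 10) / 365.0)
    # Hour angle is zero at local solar noon; longitude is in degrees east.
    hour_angle = math.radians(15.0 * (utc_hour - 12.0) + lon_deg)
    lat = math.radians(lat_deg)
    cos_zenith = (
        math.sin(lat) * math.sin(declination)
        + math.cos(lat) * math.cos(declination) * math.cos(hour_angle)
    )
    value = S * cos_zenith
    # With clip_zero, night-time (negative) values are set to 0.
    return max(value, 0.0) if clip_zero else value


noon = insolation_sketch(0.0, 0.0, day_of_year=80, utc_hour=12)
midnight = insolation_sketch(0.0, 0.0, day_of_year=80, utc_hour=0)
print(round(noon, 3), midnight)
```

Near the March equinox the sun is almost overhead at the equator at local noon, so the first value is close to S, while the midnight value is clipped to zero.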

derive(time)#

Get derived value.

Will only be passed most specific key, so if a function of time, expect a time.

Child class must implement

Parameters:

time (Timestamp)

Return type:

Dataset

data.download#

class pyearthtools.data.download.arcoera5.ARCOERA5(variables=None, level=None, transforms=None, **kwargs)#

Analysis-Ready, Cloud Optimized ERA5

google-research/arco-era5

Carver, Robert W, and Merose, Alex. (2023): ARCO-ERA5: An Analysis-Ready Cloud-Optimized Reanalysis Dataset. 22nd Conf. on AI for Env. Science, Denver, CO, Amer. Meteo. Soc, 4A.1, https://ams.confex.com/ams/103ANNUAL/meetingapp.cgi/Paper/415842

Analysis-Ready, Cloud Optimized ERA5 integrated within pyearthtools.

Allows for access to a cloud ERA5 archive.

Parameters:
  • variables (str | list[str] | None, optional) – Variables to retrieve, can be either short_name or long_name. Defaults to None, to retrieve all variables.

  • level (int | list[int] | None, optional) – Pressure levels to select. Defaults to None, to select all levels.

  • transforms (Transform | TransformCollection | None, optional) – Transforms to apply to dataset. Defaults to None.

property dataset: Dataset#

Get full dataset for this obj

get(time)#

Get timestep from dataset

Parameters:

time (str)

classmethod sample()#

Example subset of the dataset

pyearthtools.data.download.arcoera5.LEVELS = [1, 2, 3, 5, 7, 10, 20, 30, 50, 70, 100, 125, 150, 175, 200, 225, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 775, 800, 825, 850, 875, 900, 925, 950, 975, 1000]#

valid ARCO-ERA5 level values
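A request for pressure levels can be checked against this list before any data is accessed. check_levels is a hypothetical helper, and raising ValueError for an unknown level is an assumption made for this sketch rather than documented behaviour.

```python
LEVELS = [1, 2, 3, 5, 7, 10, 20, 30, 50, 70, 100, 125, 150, 175, 200, 225, 250,
          300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 775, 800, 825, 850,
          875, 900, 925, 950, 975, 1000]


def check_levels(level=None):
    """Normalise a level selection to a list of valid levels."""
    if level is None:
        return list(LEVELS)  # None selects every level
    requested = [level] if isinstance(level, int) else list(level)
    invalid = [value for value in requested if value not in LEVELS]
    if invalid:
        raise ValueError(f"Invalid ARCO-ERA5 levels: {invalid}")
    return requested


print(check_levels([500, 850]))  # [500, 850]
print(len(check_levels()))       # 37
```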

pyearthtools.data.download.arcoera5.LONG_NAMES#

mapping from long variable names to short variable names

pyearthtools.data.download.arcoera5.SHORT_NAMES#

mapping from short variable names to long variable names

class pyearthtools.data.download.weatherbench.WeatherBench2(dataset_url, license_url, *, variables=None, level=None, transforms=None, chunks='auto', download_dir=None, license_ok=False, **kwargs)#

WeatherBench2 cloud-optimized ground truth and baseline datasets

google-research/weatherbench2

Stephan Rasp, Stephan Hoyer, Alexander Merose, Ian Langmore, Peter Battaglia, Tyler Russel, Alvaro Sanchez-Gonzalez, Vivian Yang, Rob Carver, Shreya Agrawal, Matthew Chantry, Zied Ben Bouallegue, Peter Dueben, Carla Bromberg, Jared Sisk, Luke Barrington, Aaron Bell and Fei Sha (2024): WeatherBench 2: A benchmark for the next generation of data-driven global weather models Journal of Advances in Modeling Earth Systems, 16, e2023MS004019 https://doi.org/10.1029/2023MS004019

If a download_dir folder is provided, the selected subset (i.e. variables and levels) of the dataset will be first downloaded into the folder, in a subfolder named with the hash of the url. In this subfolder, each variable and level is saved as a separate compressed zarr dataset. Once downloaded, any subsequent access will use the local version.

Later, if you select a different set of variables and levels, make sure to use the same folder, as only the missing variables and levels will then be downloaded.

Parameters:
  • dataset_url (str) – URL of the zarr dataset

  • license_url (str) – License of the dataset

  • variables (str | list[str] | None, optional) – Variables to retrieve, can be either short_name or long_name. Defaults to None, to retrieve all variables.

  • level (int | list[int] | None, optional) – Pressure levels to select. Defaults to None, to select all levels.

  • transforms (Transform | TransformCollection | None, optional) – Transforms to apply to dataset. Defaults to None.

  • chunks (int | dict | Literal["auto"], optional) – Chunking used to load data into Dask arrays. Defaults to “auto”.

  • download_dir (str | Path, optional) – Folder where to save a copy of the dataset. Defaults to None.

  • license_ok (bool, optional) – License has been read. Defaults to False.
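The download_dir caching described above can be pictured as two steps: derive a subfolder from the dataset URL, then download only the variable and level pairs that are not already on disk. The sketch below is illustrative only. The helper names, the URL and the choice of SHA-256 are assumptions; the library's hash function and folder layout may differ.

```python
import hashlib
from itertools import product
from pathlib import Path


def local_subfolder(download_dir, dataset_url):
    """Subfolder named after a hash of the URL (SHA-256 assumed for illustration)."""
    digest = hashlib.sha256(dataset_url.encode("utf-8")).hexdigest()
    return Path(download_dir) / digest


def missing_pairs(requested_vars, requested_levels, already_downloaded):
    """Variable and level pairs that still need to be downloaded."""
    wanted = set(product(requested_vars, requested_levels))
    return sorted(wanted - set(already_downloaded))


folder = local_subfolder("/tmp/wb2", "gs://example-bucket/dataset.zarr")  # hypothetical URL
on_disk = {("temperature", 500), ("temperature", 850)}
todo = missing_pairs(["temperature", "geopotential"], [500, 850], on_disk)
print(todo)  # [('geopotential', 500), ('geopotential', 850)]
```

Reusing the same folder for a later, different selection is what lets the set difference skip the pairs that were already saved.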

property dataset: Dataset#

Get full dataset for this obj

get(time)#

Get timestep from dataset

Parameters:

time (str)

license()#

Get the license for this dataset

Return type:

str

class pyearthtools.data.download.weatherbench.WB2ERA5(*, resolution='64x32', **kwargs)#

WeatherBench2 cloud-optimized ground truth ERA5 dataset

ERA5 datasets downloaded from the Copernicus Climate Data Store with a time range from 1959 to 2023 (incl.). The data have been downsampled to 6h and 13 levels, except for the “raw” dataset. The raw dataset is hourly with a 0.25 degree spatial resolution and 37 levels.

https://weatherbench2.readthedocs.io/en/latest/data-guide.html#era5

Stephan Rasp, Stephan Hoyer, Alexander Merose, Ian Langmore, Peter Battaglia, Tyler Russel, Alvaro Sanchez-Gonzalez, Vivian Yang, Rob Carver, Shreya Agrawal, Matthew Chantry, Zied Ben Bouallegue, Peter Dueben, Carla Bromberg, Jared Sisk, Luke Barrington, Aaron Bell and Fei Sha (2024): WeatherBench 2: A benchmark for the next generation of data-driven global weather models Journal of Advances in Modeling Earth Systems, 16, e2023MS004019 https://doi.org/10.1029/2023MS004019

See pyearthtools.data.download.weatherbench.WeatherBench2 for additional parameters.

Parameters:

resolution (str, optional) – Dataset resolution, one of “raw”, “1440x721”, “240x121” and “64x32”. The “raw” dataset is not subsampled, i.e. is hourly with 37 levels. Defaults to “64x32”.

classmethod sample()#

Example subset of the dataset

class pyearthtools.data.download.weatherbench.WB2ERA5Clim(*, resolution='64x32', period='1990-2017', **kwargs)#

WeatherBench2 cloud-optimized ground truth ERA5 climatology dataset

For WeatherBench 2, the climatology was computed using a running window for smoothing (see paper and script) for each day of year and sixth hour of day. Climatologies have been computed for 1990-2017 and 1990-2019.

https://weatherbench2.readthedocs.io/en/latest/data-guide.html#era5-climatology

Stephan Rasp, Stephan Hoyer, Alexander Merose, Ian Langmore, Peter Battaglia, Tyler Russel, Alvaro Sanchez-Gonzalez, Vivian Yang, Rob Carver, Shreya Agrawal, Matthew Chantry, Zied Ben Bouallegue, Peter Dueben, Carla Bromberg, Jared Sisk, Luke Barrington, Aaron Bell and Fei Sha (2024): WeatherBench 2: A benchmark for the next generation of data-driven global weather models Journal of Advances in Modeling Earth Systems, 16, e2023MS004019 https://doi.org/10.1029/2023MS004019

See pyearthtools.data.download.weatherbench.WeatherBench2 for additional parameters.

Parameters:
  • resolution (str, optional) – Dataset resolution, one of “1440x721”, “512x256”, “240x121” and “64x32”. Defaults to “64x32”.

  • period (str, optional) – Covered time period, either “1990-2017” or “1990-2019”. Defaults to “1990-2017”.

classmethod sample()#

Example subset of the dataset

data.indexes#

class pyearthtools.data.indexes.Index(*args, **kwargs)#

Base Level Index to define the structure

To use, subclass and define the .get function; any calls will be passed through to it.

abstractmethod get(*args, **kwargs)#

Base Level .get call, used to retrieve data from args

class pyearthtools.data.indexes.DataIndex(transforms=TransformCollection(), *args, add_default_transforms=True, preprocess_transforms=None, **kwargs)#

Index to introduce [transforms][pyearthtools.data.transforms] to data loading

Transforms are applied on a retrieve or __call__, but not on get

Introduce [transforms][pyearthtools.data.transforms] to data loading

Parameters:
  • transforms (Transform | TransformCollection, optional) – Base Transforms to be applied to data. Defaults to TransformCollection().

  • add_default_transforms (bool, optional) – Add Default Transformations. Defaults to True

  • preprocess_transforms (Transform | TransformCollection | Callable | None) – Transforms to apply in preprocessing for datasets. Does not work on other file formats. Defaults to None.

retrieve(*args, transforms=TransformCollection(), **kwargs)#

Retrieve data for the given time step, applying the supplied transforms

The untransformed data is obtained using get, which must be implemented by the user

Parameters:

transforms (Transform | TransformCollection, optional) – Extra transforms to apply. Defaults to TransformCollection().

Returns:

Loaded data with transforms applied

Return type:

(Any)
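The difference between get (untransformed) and retrieve or __call__ (transformed) can be sketched with plain callables standing in for transforms. TransformingIndex and double are hypothetical, and applying the base transforms before the extra ones is an ordering assumed here.

```python
from typing import Any, Callable, Iterable


class TransformingIndex:
    """Stand-in index: get returns raw data, retrieve applies transforms."""

    def __init__(self, transforms: Iterable[Callable[[Any], Any]] = ()):
        self.base_transforms = list(transforms)

    def get(self, key):
        # Raw, untransformed data; a real index would load it from storage.
        return {"key": key, "value": 10.0}

    def retrieve(self, key, transforms: Iterable[Callable[[Any], Any]] = ()):
        data = self.get(key)
        # Assumed order: base transforms first, then the extra ones.
        for transform in [*self.base_transforms, *transforms]:
            data = transform(data)
        return data

    __call__ = retrieve


def double(data):
    return {**data, "value": data["value"] * 2}


index = TransformingIndex(transforms=[double])
print(index.get("2000-01-01")["value"])       # 10.0
print(index.retrieve("2000-01-01")["value"])  # 20.0
```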

class pyearthtools.data.indexes.FileSystemIndex(*args, **kwargs)#

Index addon to load data from a File System

Provides basic loading functions and allows for an index to be ‘searched’.

exists(*args, **kwargs)#

First, use search(*args, **kwargs) to find all matching files, Paths or identifiers. Then use check_existence to confirm that the found data object exists.

Returns:

If data exists

Return type:

(bool)

filesystem(*args)#

Find datafiles given args on local filesystem.

Must be implemented by child class to specify data.

Can return a dictionary[str, str], tuple, list or path representing the files to load.

Return type:

Path | dict[str, str]

get(*args, **kwargs)#

Get data by loading it from the search

Passes all args to search() and all kwargs to load()

Raises:

DataNotFoundError – Data could not be found

Returns:

Loaded Data

Return type:

(Any)

load(files, **kwargs)#

Load a given list of files.

Automatically determines the loading method from the file extension.

Supported:
  • netcdf

  • pandas [csv]

  • numpy

Parameters:
  • files (dict[str, str | Path] | Path | list[str | Path] | tuple[str | Path]) – Files to load

  • **kwargs (Any, optional) – Kwargs passed to underlying loading function

Raises:

InvalidDataError – If an error arose when loading file

Returns:

Loaded data

Return type:

(Any)
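Choosing a loader from the file extension can be sketched as a simple lookup. The suffix table and pick_loader are hypothetical; the extensions the library actually recognises, and the error it raises, may differ.

```python
from pathlib import Path

# Hypothetical suffix table covering the three supported families listed above.
LOADERS = {".nc": "netcdf", ".csv": "pandas", ".npy": "numpy"}


def pick_loader(file):
    """Return the name of the loader family for a file, based on its suffix."""
    suffix = Path(file).suffix.lower()
    try:
        return LOADERS[suffix]
    except KeyError:
        raise ValueError(f"No loader known for {suffix!r} files") from None


print(pick_loader("era5/2000/t2m.nc"))  # netcdf
print(pick_loader("stations.CSV"))      # pandas
```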

search(*args, **kwargs)#

Find file name/path, with the underlying functionality defined by discovered location.

All arguments passed to underlying function.

Parameters:
  • *args (Any, optional) – Arguments passed to underlying search function

  • **kwargs (Any, optional) – Keyword Arguments passed to underlying search function

Returns:

Path to data defined by arguments

Return type:

(Path | list[str | Path] | dict[str, str | Path])

class pyearthtools.data.indexes.TimeIndex(data_interval=None, round=False, **kwargs)#

Introduce general time based Indexing with [Petdt][pyearthtools.data.time.Petdt].

Allow for multiple time retrievals.

Setup TimeIndex.

Will warn a user if date is of incorrect resolution

Parameters:
  • data_interval (tuple[int, str] | str | int | TimeDelta | None) –

    Interval of data. Must follow format for [TimeDelta][pyearthtools.data.time.TimeDelta].

    • E.g.

    >>> (1, 'h')   # 1 hour
    >>> (10, 'D')  # 10 days
    >>> 10         # 10 minutes
    

  • round (bool) – Default value for round when retrieving data.
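The data_interval formats listed above map naturally onto a time delta. The sketch below uses datetime.timedelta as a stand-in for TimeDelta; to_timedelta and its unit table are hypothetical and cover only the two units shown in the examples.

```python
from datetime import timedelta

UNITS = {"h": "hours", "D": "days"}  # only the units from the examples above


def to_timedelta(data_interval):
    """Convert (value, unit) tuples or a bare integer (minutes) to a timedelta."""
    if isinstance(data_interval, int):
        return timedelta(minutes=data_interval)  # a bare integer means minutes
    value, unit = data_interval
    return timedelta(**{UNITS[unit]: value})


print(to_timedelta((1, "h")))   # 1:00:00
print(to_timedelta((10, "D")))  # 10 days, 0:00:00
print(to_timedelta(10))         # 0:10:00
```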

aggregation(start, end, interval, *, aggregation='mean', aggregation_dim='time', save_location=None, skip_invalid=True, num_divisions=1, transforms=TransformCollection(), verbose=False, **kwargs)#

Get aggregation of [TimeIndex][pyearthtools.data.TimeIndex] over given dimension

!!! Warning:

Any num_divisions not a factor of the number of data steps will result in some data being missed.

Parameters:
  • DataFunction (TimeIndex) – TimeIndex to retrieve Data

  • start (str | datetime.datetime | Petdt) – Start Date

  • end (str | datetime.datetime | Petdt) – End Date

  • interval (tuple[float, str]) – Interval between samples. Use pandas.to_timedelta notation, (10, ‘minute’)

  • aggregation (str, optional) – Aggregation Function to apply. Defaults to “mean”.

  • aggregation_dim (str, optional) – Dimension to aggregate over. Defaults to “time”.

  • save_location (str | Path | None, optional) – Location to automatically save the result. Defaults to None.

  • skip_invalid (bool, optional) – Whether to skip invalid data. Defaults to True.

  • num_divisions (int, optional) – Number of times to divide series to alleviate memory issues. Defaults to 1.

  • transforms (Transform | TransformCollection, optional) – Extra Transforms to be applied. Defaults to TransformCollection().

  • verbose (bool, optional) – Whether to log progress messages. Defaults to False.

Returns:

Dataset with aggregation applied

Return type:

xr.Dataset
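The warning about num_divisions can be illustrated by assuming that the series is cut into equally sized chunks. That chunking rule is an assumption made for this sketch, and split_steps is a hypothetical helper; it only shows why a value that does not divide the number of steps leaves some steps out.

```python
def split_steps(steps, num_divisions):
    """Cut steps into num_divisions chunks of equal size (remainder is dropped)."""
    size = len(steps) // num_divisions
    return [steps[i * size:(i + 1) * size] for i in range(num_divisions)]


steps = list(range(10))

even = split_steps(steps, 5)    # 5 divides 10, so every step is used
uneven = split_steps(steps, 3)  # 3 does not divide 10, so one step is lost
covered = [step for chunk in uneven for step in chunk]

print(sum(len(chunk) for chunk in even))  # 10
print(covered)                            # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```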

range(start, end, interval, *, skip_invalid=True, num_divisions=1, transforms=TransformCollection(), **kwargs)#

Find Minimum and Maximum of a [TimeIndex][pyearthtools.data.TimeIndex] in the given time range

Parameters:
  • DataFunction (TimeIndex) – TimeIndex to retrieve Data

  • start (str | Petdt) – Start Date

  • end (str | Petdt) – End Date

  • interval (tuple[float, str]) – Interval between samples. Use pandas.to_timedelta notation, (10, ‘minute’)

  • skip_invalid (bool, optional) – Whether to skip invalid data. Defaults to True.

  • num_divisions (int, optional) – Number of times to divide series to alleviate memory issues. Defaults to 1.

  • transforms (Transform | TransformCollection, optional) – Extra Transforms to be applied. Defaults to TransformCollection().

Returns:

Dictionary with max and min populated

Return type:

dict

safe_series(start, end, interval, **kwargs)#

Safely index into the provided Data function to create a continuous series of Data.

Called by the series method

Uses [series][pyearthtools.data.operations.index_routines.series], but provides an automatic interpolation.

!!! Warning

If data is missing or if a resolution higher than the actual data resolution is provided, those missing time steps will be interpolated.

Parameters:
  • DataFunction (AdvancedTimeIndex) – Data function, must be AdvancedTimeIndex or child

  • start (str | Petdt) – Timestep to begin series at

  • end (str | Petdt) – Timestep to end series at

  • interval (TimeDelta) – Time interval between samples. Use pandas.to_timedelta notation, (10, ‘minute’)

  • **kwargs (dict, optional) – Any extra keyword arguments to pass to [series][pyearthtools.data.operations.index_routines.series]

Returns:

Loaded xarray dataset

Return type:

(xr.Dataset)

series(start, end, interval, *, inclusive=False, skip_invalid=False, transforms=TransformCollection(), verbose=False, force_get=False, subset_time=True, time_dim=None, tolerance=None, **kwargs)#

Index into Provided Data function to create a continuous series of Data

Parameters:
  • DataFunction (AdvancedTimeIndex) – Data function, must be AdvancedTimeIndex or child

  • start (str | datetime.datetime | Petdt) – Timestep to begin series at

  • end (str | datetime.datetime | Petdt) – Timestep to end series at

  • interval (tuple[float, str]) – Time interval between samples. Use pandas.to_timedelta notation, (10, ‘minute’)

  • inclusive (bool, optional) – Whether end time is included in retrieval. Defaults to False.

  • skip_invalid (bool, optional) – Whether to skip invalid data. Defaults to False.

  • transforms (Transform | TransformCollection, optional) – Extra [Transform’s][pyearthtools.data.transforms.Transform] to be applied to data. Defaults to TransformCollection().

  • verbose (bool, optional) – Print logging messages. Defaults to False.

  • force_get (bool, optional) – Use series method which loads each dataset using .get. WARNING: Takes significantly longer, as it does not use dask. Defaults to False.

  • subset_time (bool, optional) – Whether to force subset time dim. Defaults to True.

  • tolerance (tuple | pd.Timedelta, optional) – Tolerance for time subsetting. Defaults to None.

  • time_dim (str | None)

Returns:

Loaded xarray dataset

Return type:

(xr.Dataset)
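The effect of inclusive on the time steps of a series can be sketched with the standard library. series_times is a hypothetical helper that only generates the timestamps; loading and concatenating the data is left out.

```python
from datetime import datetime, timedelta
from typing import List


def series_times(start: str, end: str, interval: timedelta, inclusive: bool = False) -> List[datetime]:
    """Timestamps from start towards end, with the end included only on request."""
    current, stop = datetime.fromisoformat(start), datetime.fromisoformat(end)
    times = []
    while current < stop or (inclusive and current == stop):
        times.append(current)
        current += interval
    return times


six_hourly = timedelta(hours=6)
print(len(series_times("2000-01-01", "2000-01-02", six_hourly)))                  # 4
print(len(series_times("2000-01-01", "2000-01-02", six_hourly, inclusive=True)))  # 5
```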

class pyearthtools.data.indexes.SingleTimeIndex(data_interval=None, round=False, **kwargs)#

Introduce single time based Indexing with [Petdt][pyearthtools.data.time.Petdt].

While [Index][pyearthtools.data.indexes.indexes.Index] assumes nothing about the selection arguments, this will attempt to convert them to a [Petdt][pyearthtools.data.time.Petdt], and select that time from the data.

[Petdt][pyearthtools.data.time.Petdt] keeps a record of the resolution of the given date string, which allows for more informative warnings.

Setup TimeIndex.

Will warn a user if date is of incorrect resolution

Parameters:
  • data_interval (tuple[int, str] | str | int | TimeDelta | None) –

    Interval of data. Must follow format for [TimeDelta][pyearthtools.data.time.TimeDelta].

    • E.g.

    >>> (1, 'h')   # 1 hour
    >>> (10, 'D')  # 10 days
    >>> 10         # 10 minutes
    

  • round (bool) – Default value for round when retrieving data.

retrieve(querytime, *args, select=False, round=None, **kwargs)#

Retrieve Data at given timestep, uses [Index][pyearthtools.data.indexes.Index] to load data.

While [Index][pyearthtools.data.indexes.Index] assumes nothing, this will attempt to select time.

Parameters:
  • querytime (str | datetime.datetime | Petdt) – Timestep to retrieve data at

  • select (bool, optional) – Select querytime in dataset. Defaults to False.

  • round (bool, optional) – Select nearest time, when selecting. Can be configured in init. Defaults to False.

Returns:

Loaded data, with time selected

Return type:

(Any)
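Selecting the nearest available time, as round allows, can be sketched as below. select_time is a hypothetical helper; its nearest argument plays the role of round (renamed to avoid shadowing the Python builtin), and raising KeyError for a missing time is an assumption made here.

```python
from datetime import datetime


def select_time(available, querytime, nearest=False):
    """Pick querytime from the available timestamps, optionally the closest one."""
    query = datetime.fromisoformat(querytime)
    if query in available:
        return query
    if not nearest:
        raise KeyError(f"{querytime} is not in the data")
    # Fall back to the timestamp with the smallest absolute difference.
    return min(available, key=lambda stamp: abs(stamp - query))


available = [datetime(2000, 1, 1, hour) for hour in (0, 6, 12, 18)]
print(select_time(available, "2000-01-01T07:00", nearest=True))  # 2000-01-01 06:00:00
```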

set_interval(data_interval=None)#

Set interval of data

Parameters:

data_interval (tuple[float, str] | str | int | TimeDelta | None) –

Interval of data. Must follow format for [TimeDelta][pyearthtools.data.time.TimeDelta].

  • E.g.

    >>> (1, 'h')   # 1 hour
    >>> (10, 'D')  # 10 days
    >>> 10         # 10 minutes
    

class pyearthtools.data.indexes.TimeDataIndex(transforms=TransformCollection(), preprocess_transforms=None, data_interval=None, **kwargs)#

Setup TimeDataIndex

For indexing with time and applying transforms

Parameters:
  • transforms (Transform | TransformCollection, optional) – Transforms to add when retrieving data. Defaults to TransformCollection().

  • data_interval (tuple[int, str] | int, optional) – Temporal Interval of Data. Defaults to None.

  • preprocess_transforms (Transform | TransformCollection, optional) – Transforms to apply in preprocessing for datasets. Does not work on other file formats. Defaults to None.

aggregation(*args, **kwargs)#

Get aggregation of [TimeIndex][pyearthtools.data.TimeIndex] over given dimension

Warning

Any num_divisions not a factor of the number of data steps will result in some data being missed.

Parameters:
  • DataFunction (TimeIndex) – TimeIndex to retrieve Data

  • start (str | datetime.datetime | Petdt) – Start Date

  • end (str | datetime.datetime | Petdt) – End Date

  • interval (tuple[float, str]) – Interval between samples. Use pandas.to_timedelta notation, (10, ‘minute’)

  • aggregation (str, optional) – Aggregation Function to apply. Defaults to “mean”.

  • aggregation_dim (str, optional) – Dimension to aggregate over. Defaults to “time”.

  • save_location (str | Path | None, optional) – Location to automatically save the result. Defaults to None.

  • skip_invalid (bool, optional) – Whether to skip invalid data. Defaults to True.

  • num_divisions (int, optional) – Number of times to divide series to alleviate memory issues. Defaults to 1.

  • transforms (Transform | TransformCollection, optional) – Extra Transforms to be applied. Defaults to TransformCollection().

  • verbose (bool, optional) – Whether to log progress messages. Defaults to False.

Returns:

Dataset with aggregation applied

Return type:

xr.Dataset
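
The warning about `num_divisions` can be made concrete with a sketch. It assumes one plausible splitting scheme, equal-sized chunks obtained by integer division; the actual implementation may divide the series differently. `divide_steps` is a hypothetical helper.

```python
def divide_steps(num_steps, num_divisions):
    """Split step indices into equal chunks; return the chunks and the leftover count."""
    size = num_steps // num_divisions
    chunks = [list(range(i * size, (i + 1) * size)) for i in range(num_divisions)]
    # Steps beyond the last full chunk are never visited under this scheme.
    missed = num_steps - size * num_divisions
    return chunks, missed


even_chunks, even_missed = divide_steps(12, 3)      # 3 is a factor of 12
uneven_chunks, uneven_missed = divide_steps(10, 3)  # 3 is not a factor of 10
```

Under this assumption, choosing a `num_divisions` that divides the number of data steps exactly avoids losing the trailing steps.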

range(*args, **kwargs)#

Find Minimum and Maximum of a [TimeIndex][pyearthtools.data.TimeIndex] in the given time range

Parameters:
  • DataFunction (TimeIndex) – TimeIndex to retrieve Data

  • start (str | Petdt) – Start Date

  • end (str | Petdt) – End Date

  • interval (tuple[float, str]) – Interval between samples. Use pandas.to_timedelta notation, (10, ‘minute’)

  • skip_invalid (bool, optional) – Whether to skip invalid data. Defaults to True.

  • num_divisions (int, optional) – Number of times to divide series to alleviate memory issues. Defaults to 1.

  • transforms (Transform | TransformCollection, optional) – Extra Transforms to be applied. Defaults to TransformCollection().

Returns:

Dictionary with max and min populated

Return type:

dict

safe_series(*args, **kwargs)#

Safely index into the provided Data function to create a continuous series of Data.

Called by the series method

Uses [series][pyearthtools.data.operations.index_routines.series], but provides an automatic interpolation.

Warning

If data is missing or if a resolution higher than the actual data resolution is provided, those missing time steps will be interpolated.

Parameters:
  • DataFunction (AdvancedTimeIndex) – Data function, must be AdvancedTimeIndex or child

  • start (str | Petdt) – Timestep to begin series at

  • end (str | Petdt) – Timestep to end series at

  • interval (TimeDelta) – Time interval between samples. Use pandas.to_timedelta notation, (10, ‘minute’)

  • **kwargs (dict, optional) – Any extra keyword arguments to pass to [series][pyearthtools.data.operations.index_routines.series]

Returns:

Loaded xarray dataset

Return type:

(xr.Dataset)

series(*args, **kwargs)#

Index into Provided Data function to create a continuous series of Data

Parameters:
  • DataFunction (AdvancedTimeIndex) – Data function, must be AdvancedTimeIndex or child

  • start (str | datetime.datetime | Petdt) – Timestep to begin series at

  • end (str | datetime.datetime | Petdt) – Timestep to end series at

  • interval (tuple[float, str]) – Time interval between samples. Use pandas.to_timedelta notation, (10, ‘minute’)

  • inclusive (bool, optional) – Whether end time is included in retrieval. Defaults to False.

  • skip_invalid (bool, optional) – Whether to skip invalid data. Defaults to False.

  • transforms (Transform | TransformCollection, optional) – Extra [Transforms][pyearthtools.data.transforms.Transform] to be applied to data. Defaults to TransformCollection().

  • verbose (bool, optional) – Print logging messages. Defaults to False.

  • force_get (bool, optional) – Use series method which loads each dataset using .get. WARNING: Takes significantly longer, as it does not use dask. Defaults to False.

  • subset_time (bool, optional) – Whether to force subset time dim. Defaults to True.

  • tolerance (tuple | pd.Timedelta, optional) – Tolerance for time subsetting. Defaults to None.

Returns:

Loaded xarray dataset

Return type:

(xr.Dataset)
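
The effect of `start`, `end`, `interval` and `inclusive` on which timesteps are visited can be sketched with the standard library. `timesteps` is a hypothetical helper, not the library's series routine, and it only enumerates times rather than loading data.

```python
from datetime import datetime, timedelta


def timesteps(start, end, interval, inclusive=False):
    """List the times from start to end, stepping by interval."""
    steps = []
    current = start
    # The end time is only kept when inclusive is True.
    while current < end or (inclusive and current == end):
        steps.append(current)
        current += interval
    return steps


start = datetime(2021, 1, 1, 0)
end = datetime(2021, 1, 1, 1)
half_hour = timedelta(minutes=30)

exclusive_steps = timesteps(start, end, half_hour)
inclusive_steps = timesteps(start, end, half_hour, inclusive=True)
```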

class pyearthtools.data.indexes.AdvancedTimeIndex(data_interval=None, round=False, **kwargs)#

Extend Time based indexing for Advanced uses, using the provided data_interval

Overrides retrieve, to allow a series of data to be retrieved based upon given date resolution.

Tip

“New retrieve Behaviour”

Consider a dataset with 10 minute resolution

| Date               | Behaviour              |
| ------------------ | ---------------------- |
| `2021-01-01T00:00` | Exact Data             |
| `2021-01-01T00`    | All Data in that hour  |
| `2021-01-01`       | All Data in that day   |
| `2021-01`          | All Data in that month |
| `2021`             | All Data in that year  |

Important

Many features of this class require the data_interval to be specified

Setup TimeIndex.

Will warn a user if date is of incorrect resolution

Parameters:
  • data_interval (tuple[int, str] | str | int | TimeDelta | None) –

    Interval of data. Must follow format for [TimeDelta][pyearthtools.data.time.TimeDelta].

    • E.g.

    >>> (1, 'h') = 1 Hour
    >>> (10, 'D') = 10 Days.
    >>> 10 = 10 minutes.
    

  • round (bool) – Default value for round when retrieving data.

retrieve(querytime, *, aggregation=None, select=True, use_simple=False, **kwargs)#

Retrieve data at timestep, but will use the resolution of the time to infer large scale retrievals.

Tip

“Date Behaviour”

| Date               | Behaviour              |
| ------------------ | ---------------------- |
| '2021-01-01T00:00' | Exact Data             |
| '2021-01-01'       | All Data in that day   |
| '2021-01'          | All Data in that month |
| '2021'             | All Data in that year  |

Parameters:
  • querytime (str | datetime | Petdt) – Timestep to retrieve data at, can be exact data or range as described above.

  • aggregation (str | None) – If data becomes a range, can specify an aggregation method.

  • select (bool) – Whether to attempt to select the given timestep if date is either fully qualified or data_interval not given.

  • use_simple (bool) – Whether to simply use the DataIndex.retrieve instead.

  • kwargs – Kwargs passed to downstream retrieval function

Returns:

Loaded Dataset with transforms applied, and aggregated if aggregation is given.

Raises:

DataNotFoundError – If Data not found at timestep.

Return type:

Dataset

Note

Extra transforms can be supplied, using `transforms = `
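
The date-resolution behaviour described for retrieve can be sketched as a mapping from a date string to a time range. This is a stdlib-only illustration; `time_range` and its format table are hypothetical, and the library itself tracks resolution through Petdt rather than through `strptime`.

```python
from datetime import datetime, timedelta

# Formats ordered from finest to coarsest resolution (assumed for illustration).
_FORMATS = [
    ("%Y-%m-%dT%H:%M", "minute"),
    ("%Y-%m-%dT%H", "hour"),
    ("%Y-%m-%d", "day"),
    ("%Y-%m", "month"),
    ("%Y", "year"),
]


def time_range(date_string):
    """Return (resolution, start, end), with an exclusive end, for a date string."""
    for fmt, resolution in _FORMATS:
        try:
            start = datetime.strptime(date_string, fmt)
        except ValueError:
            continue
        if resolution == "minute":
            end = start  # fully qualified: exact data only
        elif resolution == "hour":
            end = start + timedelta(hours=1)
        elif resolution == "day":
            end = start + timedelta(days=1)
        elif resolution == "month":
            # First day of the following month, rolling over the year in December.
            end = datetime(start.year + start.month // 12, start.month % 12 + 1, 1)
        else:
            end = datetime(start.year + 1, 1, 1)
        return resolution, start, end
    raise ValueError(f"Unrecognised date string: {date_string!r}")


month_range = time_range("2021-01")
year_range = time_range("2021")
december_range = time_range("2021-12")
```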

class pyearthtools.data.indexes.AdvancedTimeDataIndex(transforms=TransformCollection(), preprocess_transforms=None, data_interval=None, **kwargs)#

Combine AdvancedTimeIndex and DataIndex.

Allows advanced temporal indexing with transforms applied.

Setup TimeDataIndex

For indexing with time and applying transforms

Parameters:
  • transforms (Transform | TransformCollection, optional) – Transforms to add when retrieving data. Defaults to TransformCollection().

  • data_interval (tuple[int, str] | int, optional) – Temporal Interval of Data. Defaults to None.

  • preprocess_transforms (Transform | TransformCollection, optional) – Transforms to apply in preprocessing for datasets. Does not work on other file formats. Defaults to None.

class pyearthtools.data.indexes.BaseTimeIndex(data_interval=None, round=False, **kwargs)#

Indexer to combine transforms, file system searching and basic Time

Combines TimeIndex, DataIndex and FileSystemIndex, to allow transforms and searching on filesystems based on times.

Setup TimeIndex.

Will warn a user if date is of incorrect resolution

Parameters:
  • data_interval (tuple[int, str] | str | int | TimeDelta | None) –

    Interval of data. Must follow format for [TimeDelta][pyearthtools.data.time.TimeDelta].

    • E.g.

    >>> (1, 'h') = 1 Hour
    >>> (10, 'D') = 10 Days.
    >>> 10 = 10 minutes.
    

  • round (bool) – Default value for round when retrieving data.

class pyearthtools.data.indexes.DataFileSystemIndex(transforms=TransformCollection(), *args, add_default_transforms=True, preprocess_transforms=None, **kwargs)#

Indexer to combine transforms and file system searching

Combines DataIndex and FileSystemIndex, to allow transforms and searching on filesystems.

Introduce [transforms][pyearthtools.data.transforms] to data loading

Parameters:
  • transforms (Transform | TransformCollection, optional) – Base Transforms to be applied to data. Defaults to TransformCollection().

  • add_default_transforms (bool, optional) – Add Default Transformations. Defaults to True

  • preprocess_transforms (Transform | TransformCollection | Callable | None) – Transforms to apply in preprocessing for datasets. Does not work on other file formats. Defaults to None.

class pyearthtools.data.indexes.ArchiveIndex(transforms=TransformCollection(), preprocess_transforms=None, data_interval=None, **kwargs)#

Default Archive Indexer, for use by on disk datasets.

Combines DataIndex, FileSystemIndex and AdvancedTimeIndex, to allow transforms, searching, and advanced temporal indexing.

Help

“Initialisation Arguments”

transform

Setup TimeDataIndex

For indexing with time and applying transforms

Parameters:
  • transforms (Transform | TransformCollection, optional) – Transforms to add when retrieving data. Defaults to TransformCollection().

  • data_interval (tuple[int, str] | int, optional) – Temporal Interval of Data. Defaults to None.

  • preprocess_transforms (Transform | TransformCollection, optional) – Transforms to apply in preprocessing for datasets. Does not work on other file formats. Defaults to None.

search(*args, **kwargs)#

Find file name/path, with the underlying functionality defined by discovered location.

All arguments passed to underlying function.

Parameters:
  • *args (Any, optional) – Arguments passed to underlying search function

  • **kwargs (Any, optional) – Keyword Arguments passed to underlying search function

Returns:

Path to data defined by arguments

Return type:

(Path | list[str | Path] | dict[str, str | Path])

class pyearthtools.data.indexes.ForecastIndex(data_interval=None, round=False, **kwargs)#

Index into Forecast data, where Temporal indexing and selection is invalid.

Combines DataIndex, FileSystemIndex and TimeIndex.

Setup TimeIndex.

Will warn a user if date is of incorrect resolution

Parameters:
  • data_interval (tuple[int, str] | str | int | TimeDelta | None) –

    Interval of data. Must follow format for [TimeDelta][pyearthtools.data.time.TimeDelta].

    • E.g.

    >>> (1, 'h') = 1 Hour
    >>> (10, 'D') = 10 Days.
    >>> 10 = 10 minutes.
    

  • round (bool) – Default value for round when retrieving data.

aggregation(querytime, aggregation, *, preserve_dims=None, reduce_dims=None, transforms=TransformCollection(), **kwargs)#

API Function of [aggregation][pyearthtools.data.index_operations.index_operations.aggregation] for each ForecastIndex

Parameters:
  • querytime (str | Petdt) – Time to get data at

  • aggregation (str | Callable) – Aggregation method to apply.

  • transforms (TransformCollection | Transform, optional) – Extra Transforms to apply. Defaults to TransformCollection().

  • preserve_dims (list | None)

  • reduce_dims (list | None)

Returns:

Aggregation of data

Return type:

(xr.Dataset)

retrieve(basetime, *args, querytime=None, **kwargs)#

Retrieve data from a forecast product, allowing separate specification of basetime and querytime.

Parameters:
  • basetime (str | Petdt) – Basetime to get forecast from

  • querytime (str | Petdt | None, optional) – Time to select from forecast. Defaults to None.

Raises:

IndexError – If Unable to select

Returns:

Retrieved data

Return type:

(Any)

search(*args, **kwargs)#

Find file name/path, with the underlying functionality defined by discovered location.

All arguments passed to underlying function.

Parameters:
  • *args (Any, optional) – Arguments passed to underlying search function

  • **kwargs (Any, optional) – Keyword Arguments passed to underlying search function

Returns:

Path to data defined by arguments

Return type:

(Path | list[str | Path] | dict[str, str | Path])

series(start, end, interval, *, lead_time=None, inclusive=False, skip_invalid=False, transforms=TransformCollection(), verbose=False)#

Index into Provided Data function to create a continuous series of Data

Parameters:
  • DataFunction (AdvancedTimeIndex) – Data function, must be AdvancedTimeIndex or child

  • start (str | datetime.datetime | Petdt) – Timestep to begin series at

  • end (str | datetime.datetime | Petdt) – Timestep to end series at

  • interval (tuple[float, str]) – Time interval between samples. Use pandas.to_timedelta notation, (10, ‘minute’)

  • inclusive (bool, optional) – Whether end time is included in retrieval. Defaults to False.

  • skip_invalid (bool, optional) – Whether to skip invalid data. Defaults to False.

  • transforms (Transform | TransformCollection, optional) – Extra [Transforms][pyearthtools.data.transforms.Transform] to be applied to data. Defaults to TransformCollection().

  • verbose (bool, optional) – Print logging messages. Defaults to False.

  • force_get (bool, optional) – Use series method which loads each dataset using .get. WARNING: Takes significantly longer, as it does not use dask. Defaults to False.

  • subset_time (bool, optional) – Whether to force subset time dim. Defaults to True.

  • tolerance (tuple | pd.Timedelta, optional) – Tolerance for time subsetting. Defaults to None.

  • lead_time (tuple[float, str] | TimeDelta)

Returns:

Loaded xarray dataset

Return type:

(xr.Dataset)

class pyearthtools.data.indexes.StaticDataIndex(transforms=TransformCollection(), *args, add_default_transforms=True, preprocess_transforms=None, **kwargs)#

Index into Static Data

Introduce [transforms][pyearthtools.data.transforms] to data loading

Parameters:
  • transforms (Transform | TransformCollection, optional) – Base Transforms to be applied to data. Defaults to TransformCollection().

  • add_default_transforms (bool, optional) – Add Default Transformations. Defaults to True

  • preprocess_transforms (Transform | TransformCollection | Callable | None) – Transforms to apply in preprocessing for datasets. Does not work on other file formats. Defaults to None.

class pyearthtools.data.indexes.CachingIndex(cache, pattern=None, pattern_kwargs={}, *, transforms=TransformCollection(), cleanup=None, override=None, verbose=False, save_kwargs=None, **kwargs)#

Standard CachingIndex which behaves like a standard archive but with cached data

Base FileSystemCacheIndex Object to Cache data on the fly

If only cache is given, ExpandedDate, or TemporalExpandedDate will be used by default. If cache and pattern not given, will not save data, and the point of this class is lost.

cache can also be ‘temp’ to set to a TemporaryDirectory created on __init__, or include any environment variables, with $NOTATION.

Warning

Existing Cache

If the cache is set to an existing cache location, and the pattern being created matches the one that already exists there, pattern_kwargs default to the existing cache’s kwargs and are then updated by any that are given.

Parameters:
  • cache (str | Path | None) – Location to save data to.

  • pattern (str | type | PatternIndex | None) – String of pattern to use or defined pattern. Defaults to ExpandedDate, or TemporalExpandedDate.

  • pattern_kwargs (dict[str, Any] | str) – Kwargs to pass to initialisation of new pattern if pattern is str.

  • transforms (Transform | TransformCollection) – Base Transforms to apply.

  • cleanup (dict[str, Any] | float | int | str | None) –

    Cache cleanup settings.

    If a number type, assumed to represent age of file in days.

    If dictionary type, the following keys can be used:

    | Key       | Purpose | Type |
    | --------- | ------- | ---- |
    | delta     | Time delta to delete files past | int, float, tuple, TimeDelta |
    | dir_size  | Maximum allowed directory size. Deletes oldest according to key | int, float, str, ByteSize (if str, use ‘100 GB’ format) |
    | key       | Key to use to find time of file for other time based delete steps. Default ‘modified’. | Literal[‘modified’, ‘created’] |
    | data_time | Maximum difference in time the data is of and current time | int, float, tuple, TimeDelta |
    | verbose   | Print files being deleted | bool |

    Cleanup is run on each initialisation and deletion of the CacheIndex, and can be triggered manually with .cleanup()

    Defaults to None.

  • override (bool, optional) – Override cached data. Defaults to False.

  • save_kwargs (dict[str, Any], optional) – Kwargs to pass to save function. Defaults to None.

  • verbose (bool)

Raises:

ValueError – If cache and pattern not given.
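
Age-based cleanup, the case where `cleanup` is a number of days, can be sketched with the standard library. `cleanup_by_age` is a hypothetical helper that only mimics the documented idea; the library's own cleanup may treat directories, keys and deltas differently.

```python
import os
import tempfile
import time
from pathlib import Path


def cleanup_by_age(directory, max_age_days, key="modified"):
    """Delete files older than max_age_days and return the removed file names."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for path in Path(directory).iterdir():
        if not path.is_file():
            continue
        stat = path.stat()
        # st_ctime is only an approximation of creation time on some platforms.
        timestamp = stat.st_mtime if key == "modified" else stat.st_ctime
        if timestamp < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)


with tempfile.TemporaryDirectory() as tmp:
    old_file = Path(tmp) / "old.nc"
    new_file = Path(tmp) / "new.nc"
    old_file.write_text("old")
    new_file.write_text("new")
    # Backdate the first file by ten days.
    ten_days_ago = time.time() - 10 * 86400
    os.utime(old_file, (ten_days_ago, ten_days_ago))

    removed = cleanup_by_age(tmp, max_age_days=5)
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```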

pyearthtools.data.indexes.CachingForecastIndex#

alias of TimeCachingIndex

class pyearthtools.data.indexes.IntakeIndex(catalog_file, transforms=TransformCollection(), *, add_default_transforms=True, filter_dict=None, **kwargs)#

Index designed to operate on Intake ESM Catalogs

Will not cache the data anywhere.

Example:

>>> import pyearthtools.data
>>> import intake_esm
>>>
>>> cat_url = intake_esm.tutorial.get_url("google_cmip6")
>>>
>>> intakeIndex = pyearthtools.data.IntakeIndex(cat_url)
>>> intakeIndex(experiment_id=["historical", "ssp585"],table_id="Oyr",variable_id="o2",grid_label="gn")

Intake ESM Catalog Index

Parameters:
  • catalog_file (str | Path) – Intake ESM Catalog location

  • transforms (Transform | TransformCollection) – Transforms to add to data.

  • add_default_transforms (bool) – Add default transforms.

  • filter_dict (dict[str, Any] | None) – Filter dictionary for Intake search.

Raises:

ImportError – if intake cannot be imported.

property filter: dict[str, Any]#

Get filters applied to data retrieval

Returns:

Intake ESM search kwargs

Return type:

(dict)

get(**kwargs)#

Get data directly from intake

See ._get_from_intake for docs.

pop_filter(pop=[], *args)#

Pop filter elements from intake searching

Parameters:
  • pop (list[str], optional) – Items to pop from filter

  • *args (str, optional) – Args form of pop.

Return type:

None

search(filter={}, **kwargs)#

Override for Index search.

As this is primarily an Intake Index, search the Intake Catalog.

Uses the filters set through init and update_filter, as well as those given here.

Parameters:
  • filter (dict[str, Any], optional) – Intake search filter, updates filters given in init. Defaults to {}.

  • kwargs (Any) – Extra kwargs for filter.

Returns:

Intake catalog after search

Return type:

(intake_esm.core.esm_datastore)

search_intake(filter_dict={}, **kwargs)#

Search Intake Catalog

Uses filter set through init and update_filter

Parameters:
  • filter_dict (dict, optional) – Updates to filters. Defaults to {}.

  • kwargs (Any)

Returns:

Intake catalog after search

Return type:

(intake_esm.core.esm_datastore)

update_filter(filter_dict=None, **kwargs)#

Update filter for intake searching

Parameters:

filter_dict (dict[str, Any], optional) – Filter update. Defaults to {}.

Return type:

None
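
How the stored filters, update_filter, pop_filter and per-call search filters might combine can be sketched with plain dictionaries. `FilterState` is a hypothetical class written for illustration; it does not touch Intake and only assumes that later filters override earlier ones.

```python
class FilterState:
    """Minimal stand-in for the filter bookkeeping described above."""

    def __init__(self, filter_dict=None, **kwargs):
        self._filter = {**(filter_dict or {}), **kwargs}

    def update_filter(self, filter_dict=None, **kwargs):
        self._filter.update(filter_dict or {})
        self._filter.update(kwargs)

    def pop_filter(self, pop=(), *args):
        # Accept keys both as a list and as positional arguments.
        for key in (*pop, *args):
            self._filter.pop(key, None)

    def search_kwargs(self, filter=None, **kwargs):
        # Per-call filters override stored ones without modifying them.
        return {**self._filter, **(filter or {}), **kwargs}


state = FilterState({"table_id": "Oyr"}, grid_label="gn")
state.update_filter({"variable_id": "o2"})
merged = state.search_kwargs(experiment_id="historical")

state.pop_filter(["grid_label"], "table_id")
after_pop = state.search_kwargs()
```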

class pyearthtools.data.indexes.IntakeIndexCache(catalog_file, cache=None, pattern_kwargs=None, *, transforms=TransformCollection(), filter_dict=None, **kwargs)#

Intake ESM Index which caches to a local location.

Uses ArgumentExpansion in the same order as the catalog itself.

Effectively builds a local copy of the intake catalog.

Note

“Multiple Keys”

As the data is saved according to the given filters, lists or tuples in the filters will be split during the filesystem search and handled one after the other. As a result, the underlying pattern cannot be used directly as a cache; the elements passed to it have to be atomic.
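
Splitting list-valued filters into atomic combinations can be sketched with `itertools.product`. `split_filters` is a hypothetical helper; the order in which the library handles the combinations is an assumption here.

```python
from itertools import product


def split_filters(filters):
    """Expand list or tuple values into one atomic filter dict per combination."""
    keys = list(filters)
    options = [
        value if isinstance(value, (list, tuple)) else [value]
        for value in filters.values()
    ]
    return [dict(zip(keys, combo)) for combo in product(*options)]


combos = split_filters(
    {"experiment_id": ["historical", "ssp585"], "variable_id": "o2"}
)
```

Each resulting dictionary holds only scalar values, so it can be mapped to a single path in the cache.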

Caching Intake ESM Index

Parameters:
  • catalog_file (str | Path) – Intake ESM Catalog to load.

  • cache (str | Path | None, optional) – Cache Location. If set to None, does not cache. Defaults to None.

  • filter_dict (dict, optional) – Default filters for searching the Intake ESM Catalog. Defaults to {}.

  • **kwargs (Any, optional) – Additional filters.

  • pattern_kwargs (dict[str, Any] | None)

  • transforms (Transform | TransformCollection)

See pyearthtools.data.indexes.BaseCacheIndex for remaining arguments docs.

filesystem(**kwargs)#

Search for generated data if cache is given. If data does not exist yet, generate it, save it, and return the path to it

Data is generated here if cache is given, so that .series operations can work on the filesystem and dask-based operations work well.

Parameters:
  • args (Any) – Args to search for / generate data for

  • self (IntakeIndexCache)

  • kwargs (Any)

Returns:

Filepath to discovered / generated data

Return type:

(Path | list[str | Path] | dict[str, str | Path])

Raises:

NotImplementedError – If cache is not set, cannot cache data.

generate(**kwargs)#

Using the child class's implementation of _generate, generate data and save it using the pattern.

Return the saved data as managed by the pattern.

Only args is passed to save pattern to find the path to save at.

Returns:

Saved and reloaded data

Return type:

(Any)

get(**kwargs)#

Retrieve Data given filter kwargs

If cache is given, automatically check to see if the file is generated, else, generate it and return the data

If cache is not given, just generate and return the data

Parameters:

**kwargs (Any) – Kwargs to generate with

Returns:

Loaded data

Return type:

(xr.Dataset | dict[str, xr.Dataset])

search(*args, **kwargs)#

Find file name/path, with the underlying functionality defined by discovered location.

All arguments passed to underlying function.

Parameters:
  • *args (Any, optional) – Arguments passed to underlying search function

  • **kwargs (Any, optional) – Keyword Arguments passed to underlying search function

Returns:

Path to data defined by arguments

Return type:

(Path | list[str | Path] | dict[str, str | Path])

class pyearthtools.data.indexes.cacheIndex.BaseCacheIndex(transforms=TransformCollection(), *args, add_default_transforms=True, preprocess_transforms=None, **kwargs)#

Base CacheIndex

Cannot be used directly, see MemCache or FileSystemCacheIndex.

Introduce [transforms][pyearthtools.data.transforms] to data loading

Parameters:
  • transforms (Transform | TransformCollection, optional) – Base Transforms to be applied to data. Defaults to TransformCollection().

  • add_default_transforms (bool, optional) – Add Default Transformations. Defaults to True

  • preprocess_transforms (Transform | TransformCollection | Callable | None) – Transforms to apply in preprocessing for datasets. Does not work on other file formats. Defaults to None.

property global_override#

Get a context manager within which data will be overridden in all caches.

property override#

Get a context manager within which data will be overridden in the cache.
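
A property that returns an override context manager could look like the sketch below. `OverridableCache` is a hypothetical toy class; it assumes that overriding means regenerating data even when a cached copy exists, and says nothing about how global_override coordinates several caches.

```python
from contextlib import contextmanager


class OverridableCache:
    """Toy in-memory cache with an `override` context manager."""

    def __init__(self):
        self._override = False
        self._store = {}
        self.generated = 0

    @property
    def override(self):
        @contextmanager
        def _manager():
            previous = self._override
            self._override = True
            try:
                yield self
            finally:
                # Restore the earlier state even if the body raises.
                self._override = previous

        return _manager()

    def get(self, key):
        if self._override or key not in self._store:
            self.generated += 1
            self._store[key] = f"data-{key}-v{self.generated}"
        return self._store[key]


cache = OverridableCache()
first = cache.get("a")
cached = cache.get("a")          # served from the cache
with cache.override:
    regenerated = cache.get("a")  # regenerated despite the cached copy
after = cache.get("a")
```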

class pyearthtools.data.indexes.cacheIndex.MemCache(pattern=None, pattern_kwargs=None, *, max_size=None, compute=False, transforms=TransformCollection(), add_default_transforms=True, **kwargs)#

Memory Cache

Examples

>>> import pyearthtools.data
>>>
>>> mem_cache = pyearthtools.data.indexes.FunctionalMemCacheIndex(function = pyearthtools.data.archive.ERA5.sample())
>>> mem_cache('2000-01-01T00')
... # Cached into memory

Cache into memory

Uses either a hash of args and kwargs or the pattern to create the key.

Parameters:
  • pattern (str | type | PatternIndex | None) – Pattern to use to create path to act as key. Defaults to None.

  • pattern_kwargs (dict[str, Any] | None) – Kwargs for pattern if given. Defaults to None.

  • max_size (str | ByteSize | None) – Max size of cache, set to None for no limit. Defaults to None.

  • compute (bool) – Compute xarray / dask objects when given. Defaults to False.

  • transforms (Transform | TransformCollection) – Transforms to add upon data retrieval. Defaults to TransformCollection().

  • add_default_transforms (bool)

cleanup(complete=False)#

Cleanup cache, limiting size to max_size if given.

Parameters:

complete (bool, optional) – Completely remove cache. Defaults to False.

get(*args, **kwargs)#

Get data from Memory Cache

get_hash(*args)#

Get hash of args for unique key of data

If pattern is set, use it to create a path.

Return type:

str
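
Deriving a cache key from call arguments can be sketched with `hashlib`. This is an assumption about the general approach, not the library's actual hashing scheme; `make_key` is a hypothetical helper and relies on the arguments having stable `repr` output.

```python
import hashlib


def make_key(*args, **kwargs):
    """Build a deterministic string key from positional and keyword arguments."""
    # Sort kwargs so that keyword order does not change the key.
    payload = repr((args, sorted(kwargs.items())))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = make_key("2000-01-01T00", level=500, variable="t")
key_b = make_key("2000-01-01T00", variable="t", level=500)
key_c = make_key("2000-01-02T00", level=500, variable="t")
```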

property pattern: PatternIndex | None#

Get Pattern from __init__ args

property size#

Size of current cache.

Will fully count size of xarray objects even if delayed

class pyearthtools.data.indexes.cacheIndex.FileSystemCacheIndex(cache, pattern=None, pattern_kwargs={}, *, transforms=TransformCollection(), cleanup=None, override=None, verbose=False, save_kwargs=None, **kwargs)#

DataIndex object whose data does not initially exist on disk, but is generated from other sources and saved in the given cache.

Data Flowchart

    graph LR
    A[Data Request '.get'] --> B{Cache Given?};
    B --> | Yes | C{Data Exists...};
    C --> | No  | G;
    C --> | Yes | D[Get Data from Cache];
    B --> | No  | G[Generate Data];
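
The flowchart above can be sketched as a small function. Everything here is hypothetical: `generate` stands in for the data generation step, and text files in a temporary directory stand in for the pattern-managed cache.

```python
import tempfile
from pathlib import Path

calls = []


def generate(key):
    """Stand-in for an expensive data generation step."""
    calls.append(key)
    return f"data for {key}"


def get(key, cache=None):
    """Follow the flowchart: use the cache if given, generating on a miss."""
    if cache is None:
        # Cache Given? No: generate without saving.
        return generate(key)
    path = Path(cache) / f"{key}.txt"
    if not path.exists():
        # Data Exists? No: generate and save.
        path.write_text(generate(key))
    # Data Exists? Yes: read from the cache.
    return path.read_text()


with tempfile.TemporaryDirectory() as tmp:
    first = get("2021-01-01", cache=tmp)   # generated and saved
    second = get("2021-01-01", cache=tmp)  # read back from the cache
uncached = get("2021-01-01")               # no cache: generated again
```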
    

Base FileSystemCacheIndex Object to Cache data on the fly

If only cache is given, ExpandedDate, or TemporalExpandedDate will be used by default. If cache and pattern not given, will not save data, and the point of this class is lost.

cache can also be ‘temp’ to set to a TemporaryDirectory created on __init__, or include any environment variables, with $NOTATION.

Warning

Existing Cache

If the cache is set to an existing cache location, and the pattern being created matches the one that already exists there, pattern_kwargs default to the existing cache’s kwargs and are then updated by any that are given.

Parameters:
  • cache (str | Path | None) – Location to save data to.

  • pattern (str | type | PatternIndex | None) – String of pattern to use or defined pattern. Defaults to ExpandedDate, or TemporalExpandedDate.

  • pattern_kwargs (dict[str, Any] | str) – Kwargs to pass to initialisation of new pattern if pattern is str.

  • transforms (Transform | TransformCollection) – Base Transforms to apply.

  • cleanup (dict[str, Any] | float | int | str | None) –

    Cache cleanup settings.

    If a number type, assumed to represent age of file in days.

    If dictionary type, the following keys can be used:

    | Key       | Purpose | Type |
    | --------- | ------- | ---- |
    | delta     | Time delta to delete files past | int, float, tuple, TimeDelta |
    | dir_size  | Maximum allowed directory size. Deletes oldest according to key | int, float, str, ByteSize (if str, use ‘100 GB’ format) |
    | key       | Key to use to find time of file for other time based delete steps. Default ‘modified’. | Literal[‘modified’, ‘created’] |
    | data_time | Maximum difference in time the data is of and current time | int, float, tuple, TimeDelta |
    | verbose   | Print files being deleted | bool |

    Cleanup is run on each initialisation and deletion of the CacheIndex, and can be triggered manually with .cleanup()

    Defaults to None.

  • override (bool, optional) – Override cached data. Defaults to False.

  • save_kwargs (dict[str, Any], optional) – Kwargs to pass to save function. Defaults to None.

  • verbose (bool)

Raises:

ValueError – If cache and pattern not given.

cleanup(complete=False)#

Cleanup cache directory using cleanup as provided in __init__.

Parameters:

complete (bool, optional) – Complete directory cleanup. If set to True, this will delete all data in the cache. Defaults to False.

filesystem(*args)#

Search for generated data if cache is given. If data does not exist yet, generate it, save it, and return the path to it

Data is generated here if cache is given, so that .series operations can work on the filesystem and dask-based operations work well.

Parameters:

args (Any) – Args to search for / generate data for

Returns:

Filepath to discovered / generated data

Return type:

(Path | list[str | Path] | dict[str, str | Path])

Raises:

NotImplementedError – If cache is not set, cannot cache data.

generate(*args, **kwargs)#

Using the child class's implementation of _generate, generate data and save it using the pattern.

Return the saved data as managed by the pattern.

Only args is passed to save pattern to find the path to save at.

Returns:

Saved and reloaded data

Return type:

(Any)

get(*args, **kwargs)#

Retrieve Data given a key

If cache is given, automatically check to see if the file is generated, else, generate it and return the data

If cache is not given, just generate and return the data

Parameters:
  • *args (Any) – Arguments to generate data for

  • **kwargs (Any) – Kwargs to generate with

Returns:

Loaded data

Return type:

xr.Dataset

property pattern: PatternIndex#

Get Pattern from __init__ args

save_record()#

Save record of the cache and pattern within the cache directory.

class pyearthtools.data.indexes.cacheIndex.CacheFactory(basecache, index, *, name=None, doc=None)#

Create Cache Subclasses

Parameters:
  • basecache (type)

  • index (type[Index])

  • name (str | None)

  • doc (str | None)

Return type:

type
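
Creating a combined subclass at runtime can be sketched with the three-argument form of `type`. This is a guess at the general technique only: the function `cache_factory`, the example classes, the default name and the base-class order are all assumptions and may not match CacheFactory.

```python
def cache_factory(basecache, index, *, name=None, doc=None):
    """Build a new class that inherits from both basecache and index."""
    name = name or f"{basecache.__name__}{index.__name__}"
    return type(name, (basecache, index), {"__doc__": doc})


class ToyCache:
    """Hypothetical cache mixin that wraps whatever `load` returns."""

    def get(self, key):
        return f"cached:{self.load(key)}"


class ToyIndex:
    """Hypothetical index providing `load`."""

    def load(self, key):
        return key.upper()


Combined = cache_factory(ToyCache, ToyIndex, doc="Toy cache over a toy index.")
result = Combined().get("era5")
```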

class pyearthtools.data.indexes.cacheIndex.FunctionalCache(*args, function, **kwargs)#

Introduce [transforms][pyearthtools.data.transforms] to data loading

Parameters:
  • transforms (Transform | TransformCollection, optional) – Base Transforms to be applied to data. Defaults to TransformCollection().

  • add_default_transforms (bool, optional) – Add Default Transformations. Defaults to True

  • preprocess_transforms (Transform | TransformCollection | Callable | None) – Transforms to apply in preprocessing for datasets. Does not work on other file formats. Defaults to None.

  • function (Callable[[Any], Any])

class pyearthtools.data.indexes.combine.InterpolationIndex(*ind, indexes=None, transforms=TransformCollection(), data_interval=None, **kwargs)#

Setup TimeDataIndex

For indexing with time and applying transforms

Parameters:
  • transforms (Transform | TransformCollection, optional) – Transforms to add when retrieving data. Defaults to TransformCollection().

  • data_interval (tuple[int, str] | int, optional) – Temporal Interval of Data. Defaults to None.

  • preprocess_transforms (Transform | TransformCollection, optional) – Transforms to apply in preprocessing for datasets. Does not work on other file formats. Defaults to None.

  • indexes (Index | dict)

get(*args, **kwargs)#

Base Level .get call, used to retrieve data from args

retrieve(querytime, *, aggregation=None, select=True, use_simple=False, **kwargs)#

Retrieve data at timestep, but will use the resolution of the time to infer large scale retrievals.

Tip

“Date Behaviour”

| Date               | Behaviour              |
| ------------------ | ---------------------- |
| '2021-01-01T00:00' | Exact Data             |
| '2021-01-01'       | All Data in that day   |
| '2021-01'          | All Data in that month |
| '2021'             | All Data in that year  |

Parameters:
  • querytime (str | Petdt) – Timestep to retrieve data at, can be exact data or range as described above.

  • aggregation (str) – If data becomes a range, can specify an aggregation method.

  • select (bool) – Whether to attempt to select the given timestep if date is either fully qualified or data_interval not given.

  • use_simple (bool) – Whether to simply use the DataIndex.retrieve instead.

  • kwargs – Kwargs passed to downstream retrieval function

Returns:

Loaded Dataset with transforms applied, and aggregated if aggregation is given.

Raises:

DataNotFoundError – If Data not found at timestep.

Return type:

Dataset

Note

Extra transforms can be supplied, using `transforms = `
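The "Date Behaviour" table for retrieve can be illustrated with a small resolution check. This is only a sketch under the assumption that the resolution is inferred from the shape of an ISO-like string; `infer_resolution` and `behaviour` are hypothetical names, and the library's own date parsing may differ.

```python
import re

# Patterns ordered from coarsest to finest resolution.
_RESOLUTIONS = [
    ("year", r"\d{4}"),
    ("month", r"\d{4}-\d{2}"),
    ("day", r"\d{4}-\d{2}-\d{2}"),
    ("minute", r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}"),
]


def infer_resolution(querytime: str) -> str:
    """Return the resolution implied by an ISO-like date string."""
    for name, pattern in _RESOLUTIONS:
        if re.fullmatch(pattern, querytime):
            return name
    raise ValueError(f"Unrecognised date format: {querytime!r}")


def behaviour(querytime: str) -> str:
    """Describe what a retrieval at this resolution would cover."""
    resolution = infer_resolution(querytime)
    if resolution == "minute":
        return "Exact data"
    return f"All data in that {resolution}"
```

For instance, `behaviour('2021-01')` describes a whole-month retrieval, while a fully qualified timestamp maps to an exact lookup.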

class pyearthtools.data.indexes.fake.FakeIndex(variable='data', *, interval=(1, 'hour'), max_value=1.0, data_size=(128, 128), random=True, **kwargs)#

Get fake random seed data at a given interval.

Appears to be a latitude longitude dataset.

As this implements the AdvancedTimeDataIndex, selecting lower resolutions behaves correctly.

Setup fake data indexer

Parameters:
  • variable (list[str] | str, optional) – Name/Names of variables. Defaults to “data”.

  • interval (tuple, optional) – Interval of data. Defaults to (1, “hour”).

  • max_value (float, optional) – Maximum value in random data. Defaults to 1.0.

  • data_size (tuple[int, int], optional) – Lat, Lon size. Defaults to (128, 128).

  • random (bool) – Whether to make random data, if not, will make data with max_value as all values.

get(time)#

Base Level .get call, used to retrieve data from args

Parameters:

time (Petdt | str)

Return type:

Dataset

pyearthtools.data.indexes.extensions.register_accessor(name, object=<class 'pyearthtools.data.indexes._indexes.Index'>)#

Register a custom accessor on pyearthtools.data indexes.

Any decorated class will receive the pyearthtools.data.Index as its first and only argument.

Parameters:
  • name (str) – Name under which the accessor should be registered. A warning is issued if this name conflicts with a preexisting attribute.

  • object (str | type | ModuleType, optional) – pyearthtools.data.indexes object to register accessor to. By default this will add to the base level index, so is available from all. Defaults to Index.

Return type:

Callable

Examples

In your library code:

>>> @pyearthtools.data.register_accessor("geo", 'DataIndex')
... class GeoAccessor:
...     def __init__(self, pyearthtools_obj):
...         self._obj = pyearthtools_obj
...
...     # Using the pyearthtools.data.Index, retrieve data and do something.
...     def plot(self):
...         # Run plotting
...         pass

Back in an interactive IPython session:

>>> era5 = pyearthtools.data.archive.ERA5(
...     variables = '2t', level = 'single'
... )
>>> era5.geo.plot()  # plots index on a map
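To make the mechanism concrete, here is a self-contained sketch of how an accessor registry of this kind can work. It does not reproduce the pyearthtools internals: the `Index` and `ERA5Like` classes and this `register_accessor` are hypothetical stand-ins written only for illustration.

```python
import warnings


class Index:
    """Hypothetical stand-in for the base index class."""


def register_accessor(name, cls=Index):
    """Attach the decorated class to ``cls`` under ``name`` as a property."""

    def decorator(accessor):
        if hasattr(cls, name):
            warnings.warn(f"{name!r} overrides an existing attribute of {cls.__name__}")

        def getter(self):
            # The accessor receives the index as its first and only argument.
            return accessor(self)

        setattr(cls, name, property(getter))
        return accessor

    return decorator


@register_accessor("geo")
class GeoAccessor:
    def __init__(self, index_obj):
        self._obj = index_obj

    def describe(self):
        return f"accessor on {type(self._obj).__name__}"


class ERA5Like(Index):
    """Hypothetical archive; inherits the accessor from Index."""
```

Because the accessor is attached to the base class, every subclass instance exposes it, for example `ERA5Like().geo.describe()`.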

data.modifications#

class pyearthtools.data.modifications.Modification(variable, index_class, index_kwargs, variable_keyword)#

Modifications to variables for Data Indexes

These are to be used when modifying variables at a core level, where more information is needed than what is returned by a simple index into the data.

For example, creating an accumulation:

When getting data at a particular time step, an accumulation cannot be found directly, as it requires prior information. A modification can retrieve that prior data and create the accumulation.

This is how it differs from a transform: a transform changes data after it has been retrieved, whereas a modification creates and modifies the data as it is being retrieved.

Implementing:

To implement a Modification, single and series must be provided.

single takes a single timestep and expects a dataset to be returned with the variable as modified.

series takes a start, end and interval, as can be parsed by pyearthtools.data.TimeRange, and expects a dataset to be returned with the variable as modified but all timesteps as defined by the range.

variable contains the variable being modified.

data contains the TimeDataIndex to retrieve the data from.

attribute_update can be overridden to specify a dictionary to update the attributes with.
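The single/series contract can be sketched without the library. Everything here is hypothetical: `SketchModification` only mimics the shape of the interface, `SOURCE` is made-up hourly data, and the choice to include the requested time in the window is an assumption, not documented behaviour.

```python
from abc import ABC, abstractmethod
from datetime import datetime, timedelta

# Hypothetical hourly source data: the value at hour h is h.
SOURCE = {datetime(2021, 1, 1) + timedelta(hours=h): float(h) for h in range(24)}


class SketchModification(ABC):
    """Stand-in for the Modification interface: single and series are required."""

    @abstractmethod
    def single(self, time): ...

    @abstractmethod
    def series(self, start, end, interval): ...


class SketchAccumulate(SketchModification):
    def __init__(self, period):
        self.period = period

    def single(self, time):
        # Needs data from *before* `time`, which a plain index lookup lacks.
        return sum(v for t, v in SOURCE.items() if time - self.period < t <= time)

    def series(self, start, end, interval):
        out, current = {}, start
        while current < end:
            out[current] = self.single(current)
            current += interval
        return out


accum = SketchAccumulate(period=timedelta(hours=6))
six_hour_total = accum.single(datetime(2021, 1, 1, 6))
```

The point of the sketch is that `single` has to reach back over the preceding period, which is why this cannot be expressed as a transform on one retrieved timestep.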

Setup Modification

Parameters:
  • variable (str) – Variable being modified

  • index_class (TimeDataIndex) – Class where data is being sourced from

  • index_kwargs (dict[str, Any]) – Kwargs used to init index_class

  • variable_keyword (str) – Keyword for variable when initing index_class

property attribute_update#

Attributes to update on variable

property data#

Get the TimeDataIndex as specified by the user in which to find the modification.

abstractmethod series(start, end, interval)#

Get the modification for a series of timesteps

Return type:

Dataset

abstractmethod single(time)#

Get the modification for a single timestep

Return type:

Dataset

property variable#

Variable being modified as given by the user.

pyearthtools.data.modifications.register_modification(name)#

Register a modification for use with @pyearthtools.data.indexes.decorators.variable_modifications.

Parameters:

name (str) – Name under which the modification should be registered. A warning is issued if this name conflicts with a preexisting modification.

Return type:

Callable

pyearthtools.data.modifications.variable_modifications(variable_keyword='variable', *, remove_variables=False, skip_if_invalid_class=False)#

Allow modifications of variables dynamically.

Parameters:
  • variable_keyword (str, optional) – Parameter name of variables to parse. Defaults to “variable”.

  • remove_variables (bool, optional) – Whether to remove variables from the initialisation of the underlying class. Defaults to False.

  • skip_if_invalid_class (bool, optional) – Whether to skip if discovered class is invalid. Is invalid if class is not a subclass of TimeIndex and DataIndex

Raises:
  • KeyError – If cannot find variable_keyword in init args.

  • TypeError – If class is not a subclass of TimeIndex and DataIndex and not skip_if_invalid_class.

Return type:

Callable[[C], C]

Syntax:

Within the specification of the variables, a user can set the modifications with either a string or a dictionary.

A string of the form '!accumulate[period: "6 hourly"]:tcwv>accum_tcwv', where:

  • !accumulate references the function to apply,

  • [init kwargs] specifies the kwargs required by the function, supplied in JSON form,

  • the string after : is the normal variable specification, with anything after > being the new name.

Or a dictionary with the following keys:

  • source_var (REQUIRED) Variable to modify

  • modification (REQUIRED) Modification to apply

  • target_var Rename of variable

  • ** Any other keys for modification

This will be transparent to the user, and only act upon retrieval of data.
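The correspondence between the string form and the dictionary form can be sketched with a small parser. This is not the library's parser: `parse_specification` is a hypothetical helper, and its kwargs handling is deliberately simplified to comma-separated `key: "value"` pairs rather than full JSON.

```python
import re

SPEC = re.compile(
    r"^!(?P<modification>\w+)"      # !accumulate
    r"(?:\[(?P<kwargs>[^\]]*)\])?"  # [period: "6 hourly"] (optional)
    r":(?P<source_var>[^>]+)"       # :tcwv
    r"(?:>(?P<target_var>.+))?$"    # >accum_tcwv (optional)
)


def parse_specification(spec: str) -> dict:
    """Convert a string specification into the equivalent dictionary form."""
    match = SPEC.match(spec)
    if match is None:
        raise ValueError(f"Not a modification specification: {spec!r}")
    parsed = {
        "source_var": match["source_var"],
        "modification": match["modification"],
    }
    if match["target_var"]:
        parsed["target_var"] = match["target_var"]
    # Simplified kwargs handling: comma-separated `key: "value"` pairs only.
    for pair in filter(None, (match["kwargs"] or "").split(",")):
        key, _, value = pair.partition(":")
        parsed[key.strip()] = value.strip().strip("'\"")
    return parsed
```

Under these assumptions, the documented string maps onto the dictionary keys source_var, modification and target_var, with the bracketed kwargs passed through as extra keys.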

Available modifications include:

  • !accumulate

  • !mean

  • !aggregate

Examples

>>> class Archive(ArchiveIndex):
...     @variable_modifications(variable_keyword = 'variable')
...     def __init__(self, variable):
...         ...
...
>>> # Then usage of that Archive
>>> Archive('!accumulate[period: "6 hourly"]:tcwv')

Notes

If using this decorator with check_arguments, put this one above it; with alias_arguments, put it below.

class pyearthtools.data.modifications.aggregations.Aggregation(period, align='past', **kwargs)#

Root class for the creation of an aggregated variable

Cannot be directly used.

time dimension will be renamed aggregate_dim and is expected to be aggregated over for single.

Setup aggregator

Parameters:
  • period (str) – Period to aggregate over. Used here to extend time.

  • inclusive (bool, optional) – Include end time. Defaults to False.

  • align (Literal['past'])

class pyearthtools.data.modifications.aggregations.AggregationGeneral(method, period, align='past', **kwargs)#

Create a general aggregation over time variable.

Aggregates as a rolling window of size period

Usage: !aggregation[method: 'max', period: "6 hours"]

General aggregation

Parameters:
  • method (str) – Method name to use

  • period (str) – Period to apply method over

  • inclusive (bool, optional) – Include end time. Defaults to False.

  • align (Literal['past'])
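A rolling-window aggregation aligned to the past can be sketched on a plain list. This is an illustration only: `rolling_aggregate` and `METHODS` are hypothetical, the window is counted in samples rather than parsed from a period string, and including the current sample in the window is an assumption.

```python
import statistics

# Method names mapped to plain-Python reducers (hypothetical lookup table).
METHODS = {"max": max, "min": min, "sum": sum, "mean": statistics.fmean}


def rolling_aggregate(values, window, method):
    """Aggregate each point with the `window - 1` points before it."""
    func = METHODS[method]
    return [func(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]


hourly = [1.0, 3.0, 2.0, 5.0, 4.0]
rolling_max = rolling_aggregate(hourly, window=3, method="max")
rolling_mean = rolling_aggregate(hourly, window=2, method="mean")
```

Early points simply use a shorter window in this sketch; a real implementation would instead extend the retrieval backwards in time by the period.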

property attribute_update#

Attributes to update on variable

series(start, end, interval)#

Get the modification for a series of timesteps

Return type:

Dataset

single(time)#

Get the modification for a single timestep

Return type:

Dataset

class pyearthtools.data.modifications.aggregations.Mean(period, align='past', **kwargs)#

Create a mean over time variable

Averages as a rolling window of size period

Usage: !mean[period: "6 hours"]

Setup aggregator

Parameters:
  • period (str) – Period to aggregate over. Used here to extend time.

  • inclusive (bool, optional) – Include end time. Defaults to False.

  • align (Literal['past'])

property attribute_update#

Attributes to update on variable

series(start, end, interval)#

Get the modification for a series of timesteps

Return type:

Dataset

single(time)#

Get the modification for a single timestep

Return type:

Dataset

class pyearthtools.data.modifications.aggregations.Accumulate(period, align='past', **kwargs)#

Create a variable accumulated over time

Accumulates as a rolling window of size period

Usage: !accumulate[period: "6 hours"]

Setup aggregator

Parameters:
  • period (str) – Period to aggregate over. Used here to extend time.

  • inclusive (bool, optional) – Include end time. Defaults to False.

  • align (Literal['past'])

property attribute_update#

Attributes to update on variable

series(start, end, interval)#

Get the modification for a series of timesteps

Return type:

Dataset

single(time)#

Get the modification for a single timestep

Return type:

Dataset

class pyearthtools.data.modifications.constants.Constant(query=None, memory=True, **kwargs)#

Force a variable to remain constant no matter the time requested.

Uses query if given; otherwise the first time requested becomes the query. Use memory to control whether the retrieved data is held in memory.

Usage: !constant[query: '2000-01-01T00', memory: True]:variable

Setup constant modification

Parameters:
  • query (Optional[str]) – Query to use. If None, will use first time retrieved. Defaults to None.

  • memory (bool) – Whether to hold the data in memory. Defaults to True.
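The interaction between query and memory can be sketched as below. `SketchConstant` and the `load` callable are hypothetical; the sketch only mirrors the documented idea that the first requested time is reused when no query is given.

```python
class SketchConstant:
    """Hypothetical illustration of holding a variable constant across requests."""

    def __init__(self, loader, query=None, memory=True):
        self.loader = loader  # callable: time -> data
        self.query = query    # fixed time to load, or None
        self.memory = memory
        self._held = None

    def single(self, time):
        if self.query is None:
            # No query given: the first requested time becomes the query.
            self.query = time
        if self.memory and self._held is not None:
            return self._held
        data = self.loader(self.query)
        if self.memory:
            self._held = data
        return data


calls = []


def load(time):
    calls.append(time)
    return f"data@{time}"


const = SketchConstant(load)
a = const.single("2000-01-01T00")
b = const.single("2021-06-01T12")
```

Both requests return the data for the first time, and with memory enabled the loader runs only once.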

property attribute_update#

Attributes to update on variable

series(start, end, interval)#

Get the modification for a series of timesteps

Return type:

Dataset

single(time)#

Get the modification for a single timestep

Return type:

Dataset

class pyearthtools.data.modifications.decorator.VariableModification(specification)#
Parameters:

specification (str | dict[str, Any])

class pyearthtools.data.modifications.decorator.Modifier(index, modifications, index_kwargs, variable_keyword)#

Transform to apply the modification to variables

Setup Modifier

Parameters:
  • index (TimeDataIndex) – Base Index in which data is being modified

  • modifications (dict[str, tuple[Type['Modification'], dict[str, Any]]]) –

    Dictionary of modifications:

    variable: (Modification Class, modification init kwargs)

  • index_kwargs (dict[str, Any]) – Kwargs used to initialise index, used to recreate the indexes.

  • variable_keyword (str) – Keyword name for variable

apply(dataset)#

Apply modifications to dataset

Will replace each variable being modified if in dataset.

Parameters:

dataset (Dataset)

Return type:

Dataset

property modifiers: dict[str, Modification]#

Get initialised dictionary of modifiers

variable: Modifier

to_repr_dict()#

Convert to dictionary ready for repr

class pyearthtools.data.modifications.reductions.Reduction(variable, index_class, index_kwargs, variable_keyword)#

Setup Modification

Parameters:
  • variable (str) – Variable being modified

  • index_class (TimeDataIndex) – Class where data is being sourced from

  • index_kwargs (dict[str, Any]) – Kwargs used to init index_class

  • variable_keyword (str) – Keyword for variable when initing index_class

class pyearthtools.data.modifications.reductions.Groupby(time_component, method, **kwargs)#

Setup Modification

Parameters:
  • variable (str) – Variable being modified

  • index_class (TimeDataIndex) – Class where data is being sourced from

  • index_kwargs (dict[str, Any]) – Kwargs used to init index_class

  • variable_keyword (str) – Keyword for variable when initing index_class

  • time_component (str)

  • method (str)

series(start, end, interval)#

Get the modification for a series of timesteps

Return type:

Dataset

single(time)#

Get the modification for a single timestep

Return type:

Dataset

pyearthtools.data.modifications.reductions.Hourly(method='mean', **kwargs)#
Parameters:

method (str)

pyearthtools.data.modifications.reductions.Daily(method='mean', **kwargs)#
Parameters:

method (str)

pyearthtools.data.modifications.reductions.Monthly(method='mean', **kwargs)#
Parameters:

method (str)

pyearthtools.data.modifications.register.register_modification(name)#

Register a modification for use with @pyearthtools.data.indexes.decorators.variable_modifications.

Parameters:

name (str) – Name under which the modification should be registered. A warning is issued if this name conflicts with a preexisting modification.

Return type:

Callable

data.operations#

pyearthtools.data.operations.percentile(dataset, percentiles)#

Find Percentiles of given data

Parameters:
  • dataset (xr.DataArray | xr.Dataset) – Dataset to find percentiles of

  • percentiles (float | list[float]) – Percentiles to find either float or list[float]

Returns:

Dataset with percentiles

Return type:

(xr.Dataset)

Examples

>>> percentile(dataset, [1, 99])
# Dataset containing 1st and 99th percentiles
pyearthtools.data.operations.aggregation(dataset, aggregation, reduce_dims=None, *, preserve_dims=None)#

Run an aggregation method over a given dataset

Warning

Either reduce_dims or preserve_dims must be given, but not both.

Parameters:
  • dataset (xr.Dataset) – Dataset to run aggregation over

  • aggregation (str | Callable) – Aggregation method, can be defined function or xarray function

  • reduce_dims (list | str, optional) – Dimensions to reduce over. Defaults to None.

  • preserve_dims (list | str, optional) – Dimensions to keep. Defaults to None.

Raises:

ValueError – If invalid reduce_dims or preserve_dims are given

Returns:

Dataset with aggregation method applied

Return type:

(xr.Dataset)
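The "either reduce_dims or preserve_dims, but not both" rule can be sketched as a small resolver. `resolve_reduce_dims` is a hypothetical helper operating on dimension names only; it is not the library's validation code.

```python
def resolve_reduce_dims(all_dims, reduce_dims=None, preserve_dims=None):
    """Return the dimensions to reduce over, given exactly one of the two options."""
    if (reduce_dims is None) == (preserve_dims is None):
        raise ValueError("Give exactly one of reduce_dims or preserve_dims.")

    def as_list(dims):
        return [dims] if isinstance(dims, str) else list(dims)

    requested = as_list(reduce_dims if reduce_dims is not None else preserve_dims)
    unknown = [dim for dim in requested if dim not in all_dims]
    if unknown:
        raise ValueError(f"Dimensions not in data: {unknown}")
    if reduce_dims is not None:
        return requested
    # preserve_dims given: reduce over everything else.
    return [dim for dim in all_dims if dim not in requested]
```

Preserving `"time"` on a `(time, lat, lon)` dataset is then equivalent to reducing over `lat` and `lon`.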

pyearthtools.data.operations.binning(data, setup, *, dimension='time', expand=True, offset=None)#

Bin data based on a binning setup.

If expand is True, use DELTA to create new bins until all data is included.

Implemented:

| Name     | Description                           |
| -------- | ------------------------------------- |
| seasonal | Daily up till first week, then weekly |
| daily    | Daily grouping                        |
| weekly   | Weekly grouping                       |

Parameters:
  • data (xr.Dataset | xr.DataArray) – Data to bin

  • setup (str) – Binning config to use.

  • dimension (str, optional) – Dimension to bin across. Defaults to ‘time’.

  • expand (bool, optional) – Whether to expand bins to encompass all the data. Defaults to True.

  • offset (int | TimeDelta | None, optional) – Offset to add to starting time. Will be the minimum value upon time axis. Defaults to None.

Raises:
  • ValueError – If setup not available, or not in DELTA while expand is True.

  • AttributeError – If dimension not in data.

Returns:

Data binned according to config.

Return type:

(xr.DatasetGroupBy | xr.DataArrayGroupBy)
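One way to picture the bin setups and the effect of expand is the sketch below, which works on integer days since the start of the data. The `SETUPS` table, its edge values, and the expansion steps are assumptions made for illustration; they are not taken from the library's DELTA configuration.

```python
from bisect import bisect_right

# Hypothetical setups: initial bin edges (days since start) and the expansion step.
SETUPS = {
    "seasonal": (list(range(0, 8)), 7),  # daily for the first week, then weekly
    "daily": ([0, 1], 1),
    "weekly": ([0, 7], 7),
}


def bin_edges(setup, max_day, expand=True):
    if setup not in SETUPS:
        raise ValueError(f"Unknown binning setup: {setup!r}")
    edges, delta = SETUPS[setup]
    edges = list(edges)
    while expand and edges[-1] <= max_day:
        edges.append(edges[-1] + delta)  # grow until all data is covered
    return edges


def assign_bins(days, setup, expand=True):
    edges = bin_edges(setup, max(days), expand)
    # Bin i covers edges[i] <= day < edges[i + 1].
    return [bisect_right(edges, day) - 1 for day in days]
```

With the seasonal setup, days 0 to 6 each get their own bin and later days fall into week-long bins.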

class pyearthtools.data.operations.SpatialInterpolation(*datasets, reference_dataset=None, merge=True, method='linear', include_reference=True, **kwargs)#

Spatially interpolate datasets together. Uses [pyearthtools.data.transforms.interpolation][pyearthtools.data.transforms.interpolation.InterpolateTransform], thus all kwargs are passed there.

Parameters:
  • *datasets (Dataset) – All datasets to be spatially interpolated

  • reference_dataset (Dataset | None) – Reference Dataset to use as base, if not given use first dataset.

  • merge (bool) – Whether to merge datasets together.

  • method (str) – Spatial interpolation method. Uses [xarray interpolation][xarray.interpolation], which itself uses [scipy.interpolate][scipy.interpolate.interpn].

  • include_reference (bool) – Whether to include reference datasets.

  • **kwargs – Extra kwargs passed to [pyearthtools.data.transforms.interpolation][pyearthtools.data.transforms.interpolation.InterpolateTransform.like]

Returns:

List of datasets if merge is False, else one merged dataset

Return type:

list[Dataset] | Dataset

class pyearthtools.data.operations.TemporalInterpolation(*datasets, reference_dataset=None, aggregation_function='mean', merge=True, include_reference=True, **kwargs)#

Temporally interpolate datasets together. Uses [pyearthtools.data.transforms.Aggregation][pyearthtools.data.transforms.aggregation.AggregateTransform.over], thus all kwargs are passed there.

Behaviour

All timesteps will be aggregated to match the time dim of the reference dataset. Only time before the given timestep will be used.

Parameters:
  • *datasets (xr.Dataset) – All datasets to be temporally interpolated

  • reference_dataset (xr.Dataset, optional) – Reference Dataset to use as base, if not given use first dataset. Defaults to None.

  • aggregation_function (Callable | str, optional) – Aggregation function to use. Uses [pyearthtools.data.transforms.Aggregation][pyearthtools.data.transforms.aggregation.AggregateTransform.over]. Defaults to “mean”.

  • merge (bool, optional) – Whether to merge datasets together. Defaults to True.

  • include_reference (bool, optional) – Whether to include reference datasets. Defaults to True.

Raises:

ValueError – If time dim not present in reference_dataset

Returns:

List of datasets if merge is False, else one merged dataset

Return type:

(list[xr.Dataset] | xr.Dataset)
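The documented behaviour (aggregate timesteps onto the reference time axis, looking only backwards) can be sketched on plain lists. `aggregate_to_reference` is hypothetical, and treating each window as "after the previous reference step, up to and including this one" is an assumption about where the boundaries fall.

```python
from statistics import fmean


def aggregate_to_reference(times, values, reference_times, func=fmean):
    """Aggregate values onto reference_times using only earlier-or-equal samples.

    Each reference step collects samples after the previous reference step
    and up to (and including) itself. Inputs are assumed sorted by time.
    """
    result = []
    previous = float("-inf")
    for ref in reference_times:
        window = [v for t, v in zip(times, values) if previous < t <= ref]
        result.append(func(window) if window else None)
        previous = ref
    return result


hours = [0, 1, 2, 3, 4, 5, 6]
temps = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
three_hourly = aggregate_to_reference(hours, temps, reference_times=[0, 3, 6])
```

Here hourly values are averaged onto a three-hourly reference axis, and no reference step uses data from its own future.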

class pyearthtools.data.operations.FullInterpolation(*datasets, reference_dataset=None, temporal_reference_dataset=None, spatial_method='linear', aggregation_function='mean', merge=True, include_reference=True)#

Interpolate Datasets both spatially and temporally

Parameters:
  • *datasets (xr.Dataset) – All datasets to be spatially and temporally interpolated

  • reference_dataset (xr.Dataset, optional) – Reference Dataset to use as base, if not given use first dataset. Defaults to None.

  • temporal_reference_dataset (xr.Dataset, optional) – Temporal Reference Dataset to use as base, if not given use reference_dataset. Defaults to None.

  • spatial_method (str, optional) – Spatial interpolation method. Defaults to “linear”.

  • aggregation_function (Callable | str, optional) – Aggregation function to use. Uses [pyearthtools.data.transforms.Aggregation][pyearthtools.data.transforms.aggregation.AggregateTransform.over]. Defaults to “mean”.

  • merge (bool, optional) – Whether to merge datasets together. Defaults to True.

  • include_reference (bool, optional) – Whether to include reference datasets. Defaults to True.

Returns:

List of datasets if merge is False, else one merged dataset

Return type:

(list[xr.Dataset] | xr.Dataset)

pyearthtools.data.operations.index_routines.series(DataFunction, start, end, interval, *, inclusive=False, skip_invalid=False, transforms=TransformCollection(), verbose=False, force_get=False, subset_time=True, time_dim=None, tolerance=None, **kwargs)#

Index into Provided Data function to create a continuous series of Data

Parameters:
  • DataFunction (AdvancedTimeIndex) – Data function, must be AdvancedTimeIndex or child

  • start (str | datetime.datetime | Petdt) – Timestep to begin series at

  • end (str | datetime.datetime | Petdt) – Timestep to end series at

  • interval (tuple[float, str]) – Time interval between samples. Use pandas.to_timedelta notation, (10, ‘minute’)

  • inclusive (bool, optional) – Whether end time is included in retrieval. Defaults to False.

  • skip_invalid (bool, optional) – Whether to skip invalid data. Defaults to False.

  • transforms (Transform | TransformCollection, optional) – Extra [Transform’s][pyearthtools.data.transforms.Transform] to be applied to data. Defaults to TransformCollection().

  • verbose (bool, optional) – Print logging messages. Defaults to False.

  • force_get (bool, optional) – Use series method which loads each dataset using .get. WARNING: Takes significantly longer, as it does not use dask. Defaults to False.

  • subset_time (bool, optional) – Whether to force subset time dim. Defaults to True.

  • tolerance (tuple | pd.Timedelta, optional) – Tolerance for time subsetting. Defaults to None.

  • time_dim (str | None)

Returns:

Loaded xarray dataset

Return type:

(xr.Dataset)
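How start, end, interval and inclusive determine the sampled timesteps can be sketched with the standard library. `timesteps` is a hypothetical helper using `datetime.timedelta` for the interval rather than the `(10, 'minute')` tuple notation.

```python
from datetime import datetime, timedelta


def timesteps(start, end, interval, inclusive=False):
    """Yield datetimes from start towards end, spaced by interval."""
    current = start
    while current < end or (inclusive and current == end):
        yield current
        current += interval


half_hourly = list(
    timesteps(datetime(2021, 1, 1), datetime(2021, 1, 1, 1), timedelta(minutes=30))
)
```

With `inclusive=False` the end time itself is left out; passing `inclusive=True` adds it when it falls exactly on the interval grid.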

pyearthtools.data.operations.index_routines.safe_series(DataFunction, start, end, interval, **kwargs)#

Safely index into the provided Data function to create a continuous series of Data.

Called by the series method

Uses [series][pyearthtools.data.operations.index_routines.series], but provides an automatic interpolation.

Warning

If data is missing or if a resolution higher than the actual data resolution is provided, those missing time steps will be interpolated.

Parameters:
  • DataFunction (AdvancedTimeIndex) – Data function, must be AdvancedTimeIndex or child

  • start (str | Petdt) – Timestep to begin series at

  • end (str | Petdt) – Timestep to end series at

  • interval (TimeDelta) – Time interval between samples. Use pandas.to_timedelta notation, (10, ‘minute’)

  • **kwargs (dict, optional) – Any extra keyword arguments to pass to [series][pyearthtools.data.operations.index_routines.series]

Returns:

Loaded xarray dataset

Return type:

(xr.Dataset)
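The effect of the warning can be illustrated with a linear fill over missing steps. The interpolation method used by safe_series is not stated here, so linear interpolation and the helper `fill_missing` are assumptions for illustration only.

```python
def fill_missing(times, known):
    """Linearly interpolate values for requested times absent from `known`.

    `times` are sorted numbers (e.g. hours); `known` maps time -> value.
    Times outside the known range are left as None.
    """
    known_times = sorted(known)
    filled = []
    for t in times:
        if t in known:
            filled.append(known[t])
            continue
        before = [k for k in known_times if k < t]
        after = [k for k in known_times if k > t]
        if not before or not after:
            filled.append(None)  # cannot interpolate beyond the data
            continue
        t0, t1 = before[-1], after[0]
        weight = (t - t0) / (t1 - t0)
        filled.append(known[t0] + (known[t1] - known[t0]) * weight)
    return filled
```

Requesting hourly steps from two-hourly data would, under this sketch, return values at the odd hours that were never observed, which is why the warning matters.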

pyearthtools.data.operations.index_operations.split_ds(dataset, divisions=1, dim='time')#

Split an xarray Dataset into a set number of datasets

Parameters:
  • dataset (xr.Dataset) – Dataset to split

  • divisions (int, optional) – Number of divisions to make. Defaults to 1.

  • dim (str, optional) – Which dim to split on. Defaults to “time”.

Returns:

List of Datasets

Return type:

list[xr.Dataset]

pyearthtools.data.operations.index_operations.split_ds_gen(dataset, divisions=1, dim='time')#

Generator version of split_ds

Parameters:
  • dataset (xr.Dataset) – Dataset to split

  • divisions (int, optional) – Number of divisions to make. Defaults to 1.

  • dim (str, optional) – Which dim to split on. Defaults to “time”.

Yields:

list[xr.Dataset] – List of Datasets

Return type:

list[Dataset]
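The relationship between the list and generator versions can be sketched on a plain sequence. `split_gen` and `split` are hypothetical; in particular, letting earlier chunks absorb the remainder is an assumption about how uneven lengths are handled.

```python
def split_gen(items, divisions=1):
    """Yield `divisions` contiguous chunks; earlier chunks absorb any remainder."""
    size, remainder = divmod(len(items), divisions)
    start = 0
    for i in range(divisions):
        stop = start + size + (1 if i < remainder else 0)
        yield items[start:stop]
        start = stop


def split(items, divisions=1):
    """List version, built on the generator."""
    return list(split_gen(items, divisions))
```

The generator form is useful when each chunk should be processed and released before the next one is created.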

pyearthtools.data.operations.index_operations.aggregation(DataFunction, start, end, interval, *, aggregation='mean', aggregation_dim='time', save_location=None, skip_invalid=True, num_divisions=1, transforms=TransformCollection(), verbose=False, **kwargs)#

Get aggregation of [TimeIndex][pyearthtools.data.TimeIndex] over given dimension

Warning

Any num_divisions not a factor of the number of data steps will result in some data being missed.

Parameters:
  • DataFunction (TimeIndex) – TimeIndex to retrieve Data

  • start (str | datetime.datetime | Petdt) – Start Date

  • end (str | datetime.datetime | Petdt) – End Date

  • interval (tuple[float, str]) – Interval between samples. Use pandas.to_timedelta notation, (10, ‘minute’)

  • aggregation (str, optional) – Aggregation Function to apply. Defaults to “mean”.

  • aggregation_dim (str, optional) – Dimension to aggregate over. Defaults to “time”.

  • save_location (str | Path | None, optional) – Location to automatically save the result. Defaults to None.

  • skip_invalid (bool, optional) – Whether to skip invalid data. Defaults to True.

  • num_divisions (int, optional) – Number of times to divide series to alleviate memory issues. Defaults to 1.

  • transforms (Transform | TransformCollection, optional) – Extra Transforms to be applied. Defaults to TransformCollection().

  • verbose (bool, optional) – Whether to log progress messages. Defaults to False.

Returns:

Dataset with aggregation applied

Return type:

xr.Dataset

pyearthtools.data.operations.index_operations.find_range(DataFunction, start, end, interval, *, skip_invalid=True, num_divisions=1, transforms=TransformCollection(), **kwargs)#

Find Minimum and Maximum of a [TimeIndex][pyearthtools.data.TimeIndex] in the given time range

Parameters:
  • DataFunction (TimeIndex) – TimeIndex to retrieve Data

  • start (str | Petdt) – Start Date

  • end (str | Petdt) – End Date

  • interval (tuple[float, str]) – Interval between samples. Use pandas.to_timedelta notation, (10, ‘minute’)

  • skip_invalid (bool, optional) – Whether to skip invalid data. Defaults to True.

  • num_divisions (int, optional) – Number of times to divide series to alleviate memory issues. Defaults to 1.

  • transforms (Transform | TransformCollection, optional) – Extra Transforms to be applied. Defaults to TransformCollection().

Returns:

Dictionary with max and min populated

Return type:

dict
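Finding a range in divisions, so that only part of the series is held at once, can be sketched as a running minimum and maximum. `find_range_sketch` is hypothetical, works on a non-empty list instead of a TimeIndex, and uses ceiling division for the chunk size as an assumption.

```python
def find_range_sketch(values, num_divisions=1):
    """Return {'min': ..., 'max': ...}, scanning the data one chunk at a time.

    Assumes `values` is non-empty.
    """
    chunk = -(-len(values) // num_divisions)  # ceiling division, so no step is missed
    result = {"min": float("inf"), "max": float("-inf")}
    for start in range(0, len(values), chunk):
        part = values[start:start + chunk]  # only this chunk is "in memory"
        result["min"] = min(result["min"], min(part))
        result["max"] = max(result["max"], max(part))
    return result
```

The result does not depend on the number of divisions; only the peak amount of data handled at once changes.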

pyearthtools.data.operations.utils.identify_time_dimension(data)#

Attempt to identify time dimension in dataset.

If cannot be identified, return ‘time’

Parameters:

data (DataArray | Dataset)

Return type:

str

pyearthtools.data.operations.forecast_op.forecast_series(DataFunction, start, end, interval, *, lead_time=None, inclusive=False, skip_invalid=False, transforms=TransformCollection(), verbose=False)#

Return type:

Dataset

pyearthtools.data.operations.forecast_op.forecast_as_basetime(DataFunction, start, end, interval, *, inclusive=False, skip_invalid=False, transforms=TransformCollection(), verbose=False)#

Forecast series, concatenated by basetime

pyearthtools.data.operations.forecast_op.forecast_select_time(DataFunction, start, end, interval, lead_time, *, inclusive=False, skip_invalid=False, transforms=TransformCollection(), verbose=False)#

Forecast Series operation selecting a particular lead time

data.patterns#

class pyearthtools.data.patterns.PatternIndex(*args, root_dir, **kwargs)#

Introduce [transforms][pyearthtools.data.transforms] to data loading

Parameters:
  • transforms (Transform | TransformCollection, optional) – Base Transforms to be applied to data. Defaults to TransformCollection().

  • add_default_transforms (bool, optional) – Add Default Transformations. Defaults to True

  • preprocess_transforms (Transform | TransformCollection | Callable | None) – Transforms to apply in preprocessing for datasets. Does not work on other file formats. Defaults to None.

  • root_dir (str | Path)

cleanup(safe=False)#

Clean up temp_dir if it exists.

If safe is False and no temp_dir exists, raises AttributeError.

Parameters:

safe (bool)

static from_pattern(pattern_function, *args, **kwargs)#

Create Pattern Index from given pattern name

Parameters:
  • *args (Any) – Passed to discovered pattern

  • pattern_function (Callable | str) – Either the function to use, or the pattern name within pyearthtools.data.patterns

  • **kwargs (Any) – Passed to discovered pattern

Raises:
  • KeyError – If pattern not found

  • TypeError – If not callable

Returns:

Loaded Pattern Index

Return type:

PatternIndex
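The lookup rules of from_pattern (name or callable, KeyError if not found, TypeError if not callable) can be sketched with a small registry. `PATTERNS`, `register` and the `Direct` pattern are hypothetical and exist only for this illustration.

```python
PATTERNS = {}


def register(func):
    """Record a pattern under its function name (hypothetical registry)."""
    PATTERNS[func.__name__] = func
    return func


@register
def Direct(root_dir, extension="nc"):
    """Hypothetical pattern: one file per key directly under root_dir."""
    return lambda key: f"{root_dir}/{key}.{extension}"


def from_pattern(pattern_function, *args, **kwargs):
    if isinstance(pattern_function, str):
        if pattern_function not in PATTERNS:
            raise KeyError(f"Pattern {pattern_function!r} not found")
        pattern_function = PATTERNS[pattern_function]
    if not callable(pattern_function):
        raise TypeError(f"{pattern_function!r} is not callable")
    # Remaining args and kwargs are passed to the discovered pattern.
    return pattern_function(*args, **kwargs)
```

Both `from_pattern("Direct", "data")` and `from_pattern(Direct, "data")` build the same pattern in this sketch.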

get_root_dir()#

Get root dir if set.

Raises:

RuntimeError – If root_dir not set

Returns:

Set root_dir

Return type:

(str | Path)

save(data, *args, **kwargs)#

Save data using this pattern to find where to save

Parameters:
  • data (Any) – Data to save

  • *args (Any, optional) – Arguments to pass to search to find filepath

  • **kwargs (Any, optional) – Keyword arguments to pass to search to find filepath

class pyearthtools.data.patterns.PatternTimeIndex(*args, **kwargs)#

Temporal Pattern Index

Used when a pattern supports advanced time indexing, like [series][pyearthtools.data.AdvancedTimeIndex.series]

Setup TimeIndex.

Will warn a user if date is of incorrect resolution

Parameters:
  • data_interval

    Interval of data. Must follow format for [TimeDelta][pyearthtools.data.time.TimeDelta].

    • E.g.

    >>> (1, 'h') = 1 Hour
    >>> (10, 'D') = 10 Days.
    >>> 10 = 10 minutes.
    

  • round – Default value for round when retrieving data.

class pyearthtools.data.patterns.PatternForecastIndex(*args, **kwargs)#

Setup TimeIndex.

Will warn a user if date is of incorrect resolution

Parameters:
  • data_interval

    Interval of data. Must follow format for [TimeDelta][pyearthtools.data.time.TimeDelta].

    • E.g.

    >>> (1, 'h') = 1 Hour
    >>> (10, 'D') = 10 Days.
    >>> 10 = 10 minutes.
    

  • round – Default value for round when retrieving data.

class pyearthtools.data.patterns.PatternVariableAware(variables=None, *args, variable_parse=None, verbose=True, **kwargs)#

Base Pattern class for patterns which are variable aware.

That means, any dataset passed to be saved will be saved in individual variables, and files can be loaded from different variables.

A child class must implement root_pattern, this informs this class which pattern to use when constructing a new PatternIndex for each variable. Using variable_parse allows a user to specify which arguments the variable is added to.

A child class pattern can set a default variable_parse by setting the default_variable_parse property.

Examples

Say a pattern is initialised as ExpandedDateVariable(root_dir = 'test', prefix = 'prefix_1')

If variable_parse was set to root_dir, any variable being requested will be added to the end of root_dir. This new root_dir = test/VARIABLE will be used to create a new pattern solely used for that variable, ExpandedDateVariable(root_dir = 'test/VARIABLE', prefix = 'prefix_1')

Construct a variable aware pattern.

Parameters:
  • variables (str | list[str], optional) – Variables to find by default. When saving, variables will be appended to this list. If not given, can’t really be used to load data, but is useful for saving data. Defaults to []

  • variable_parse (str | list[str], optional) – Initialisation argument/s to add variable to when constructing the new pattern. Defaults to ‘prefix’ if default_variable_parse not set.

  • verbose (bool, optional) – Whether to warn the user if a variable is saved and not already given. Defaults to True.

!!! Note

variable_parse can be used to reference many different types, and the following table details the behaviour.

| Type | Behaviour |
| ---- | --------- |
| Path | Added to the end as directory layer |
| str | Attempt to parse to Path, or just append |
| None | Replaced |
| list | Appended |
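The per-type behaviour in the note above can be sketched as a single dispatch function. This is an illustration only, not the library's code: the name `add_variable` is hypothetical, and for str values the sketch always takes the "parse to Path" branch, which is an assumption about a rule the table leaves open.

```python
from pathlib import PurePosixPath


def add_variable(value, variable):
    """Return `value` with `variable` added, following the table above."""
    if value is None:
        return variable  # None: replaced by the variable
    if isinstance(value, PurePosixPath):
        return value / variable  # Path: new directory layer at the end
    if isinstance(value, list):
        return [*value, variable]  # list: appended
    if isinstance(value, str):
        # str: assumption, this sketch always treats the string as a path.
        return str(PurePosixPath(value) / variable)
    raise TypeError(f"Cannot add variable to {type(value).__name__}")


new_root = add_variable("test", "VARIABLE")
```

With root_dir = 'test' this gives 'test/VARIABLE', matching the worked example in the class description.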

filesystem(*args, variables=None, **kwargs)#

Find paths on disk for all variables given the arguments

Parameters:
  • *args (Any, optional) – Arguments to pass to underlying pattern filesystem

  • variables (list[str] | str, optional) – Extra variables to add to find. Defaults to None

  • **kwargs (Any, optional) – Keyword arguments to pass to underlying pattern filesystem

Returns:

Dictionary of paths to each variable {variable: PathToVariable}

Return type:

(dict)
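As a rough picture of the returned mapping, the sketch below builds one path per variable. It is a simplified stand-in, not the library's logic: `variable_paths` and its `stamp` argument are hypothetical, only the 'prefix' and 'root_dir' targets are handled, and the underscore between prefix and time stamp is taken from the DirectVariable examples in this reference.

```python
from pathlib import PurePosixPath


def variable_paths(root_dir, variables, stamp, extension, variable_parse="prefix"):
    """Return {variable: path}, adding each variable to the prefix or to root_dir."""
    if isinstance(variables, str):
        variables = [variables]
    paths = {}
    for variable in variables:
        if variable_parse == "prefix":
            # Variable becomes the file prefix.
            path = PurePosixPath(root_dir) / f"{variable}_{stamp}.{extension}"
        elif variable_parse == "root_dir":
            # Variable becomes an extra directory layer.
            path = PurePosixPath(root_dir) / variable / f"{stamp}.{extension}"
        else:
            raise KeyError(variable_parse)
        paths[variable] = str(path)
    return paths


by_prefix = variable_paths("/test/", "variable", "202101", "nc")
by_directory = variable_paths("/test/", "variable", "202101", "nc", variable_parse="root_dir")
```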

abstract property root_pattern: PatternIndex#

Get pattern for finding/saving a specific variable

!!! Note

Must be implemented by the child class

Returns:

Uninitialised pattern to use to find location of variable

Return type:

(PatternIndex)

save(data, *save_args, **save_kwargs)#

Save a [dataset][xarray.Dataset] splitting it by variable.

Extra arguments are used in a search call to find path to save data at.

Parameters:
  • data (xr.Dataset) – Data to save

  • *save_args (Any, optional) – Arguments to pass to underlying pattern save

  • **save_kwargs (Any, optional) – Keyword arguments to pass to underlying pattern save

Raises:

TypeError – If data is not a [dataset][xarray.Dataset]

variable_pattern(variable)#

Using the given variable and the root_pattern, parse variable_parse so that the variable is added correctly to init arguments to construct a new pattern specific to that variable.

Parameters:

variable (str) – Variable to make pattern for

Raises:
  • TypeError – If cannot add variable to init argument

  • KeyError – If variable parse not in init_kwargs

Returns:

Initialised pattern to use for the parsed variable

Return type:

(PatternIndex)

class pyearthtools.data.patterns.Argument(root_dir, *, prefix='', extension='pyearthtools', valid_arguments=None, filename_as_arguments=False, filename_delimiter='_', expand_tuples=False, **kwargs)#

Generate FilePath Structure based upon a single argument

The argument specifies the filename, and the path is built out from __init__ params

Examples

>>> pattern = pyearthtools.data.patterns.Argument('/dir/', extension = '.nc')
>>> str(pattern.search('test'))
'/dir/test.nc'

Argument Expansion based DataIndexer.

Parameters:
  • root_dir (str | Path) – Root Path to use

  • prefix (str) – prefix to add.

  • extension (str) – File Extension to use. Used to determine saving and loading function.

  • valid_arguments (list[Any] | None) – Valid arguments to limit usability to.

  • filename_as_arguments (bool) –

    Whether the filename should be constructed from all arguments.

    • E.g.

      >>> ArgumentExpansion('name', 'dir1')
      ... # root_dir/name/dir1/name_dir1.extension
      
    • If False, filename is first argument given.

  • filename_delimiter (str) – delimiter for filename if filename_as_arguments is True.

  • expand_tuples (bool | int) – Whether to expand tuples when given in search. If True, levels = 1. If an int, it gives how many levels to descend into the iterable.

filesystem(filename)#

Get filepath from arguments.

If filename_as_arguments is True, filename will be made from all args. Otherwise, filename will be first arg, with remaining making up the directory.

Parameters:

filename (str)

Return type:

Path

class pyearthtools.data.patterns.ArgumentExpansion(root_dir, *, prefix='', extension='pyearthtools', valid_arguments=None, filename_as_arguments=False, filename_delimiter='_', expand_tuples=False, **kwargs)#

Generate FilePath Structure based upon expansion of arguments

If filename_as_arguments is False:

First argument specifies the FileID, and subsequent arguments are used to create folder path.

Otherwise:

Filename is made from all args, and directory is all args too.
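The two layouts described above can be sketched with plain path arithmetic. This is a minimal illustration, not the library's implementation: `expansion_path` and `expand_tuples` are hypothetical helpers, the prefix is ignored, the extension is assumed to be '.nc', and only one level of tuple expansion is handled.

```python
from itertools import product
from pathlib import PurePosixPath


def expansion_path(root_dir, *args, extension=".nc",
                   filename_as_arguments=False, filename_delimiter="_"):
    """Build one path from string arguments, following the two layouts above."""
    if filename_as_arguments:
        # Every argument is a directory level and also part of the file name.
        directories = args
        filename = filename_delimiter.join(args)
    else:
        # First argument is the file ID; the rest become directory levels.
        filename, *directories = args
    return PurePosixPath(root_dir, *directories, filename + extension)


def expand_tuples(root_dir, *args, **kwargs):
    """One level of tuple expansion: one path per combination of tuple members."""
    options = [a if isinstance(a, tuple) else (a,) for a in args]
    return [expansion_path(root_dir, *combo, **kwargs) for combo in product(*options)]


single = expansion_path("/dir/", "test", "arg")
expanded = expand_tuples("/dir/", "test", ("arg1", "arg2"))
```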

Examples

>>> pattern = pyearthtools.data.patterns.ArgumentExpansion('/dir/', extension = '.nc')
>>> str(pattern.search('test','arg'))
'/dir/arg/test.nc'
>>> str(pattern.search('test','arg', 'another_arg'))
'/dir/arg/another_arg/test.nc'
>>> pattern = pyearthtools.data.patterns.ArgumentExpansion('/dir/', extension = '.nc', filename_as_arguments = True)
>>> str(pattern.search('test','arg'))
'/dir/test/arg/test_arg.nc'
>>> pattern = pyearthtools.data.patterns.ArgumentExpansion('/dir/', extension = '.nc', expand_tuples = True)
>>> [str(x) for x in pattern.search('test',('arg1', 'arg2'))]
['/dir/arg1/test.nc', '/dir/arg2/test.nc']

Argument Expansion based DataIndexer.

Parameters:
  • root_dir (str | Path) – Root Path to use

  • prefix (str) – prefix to add.

  • extension (str) – File Extension to use. Used to determine saving and loading function.

  • valid_arguments (list[Any] | None) – Valid arguments to limit usability to.

  • filename_as_arguments (bool) –

    Whether the filename should be constructed from all arguments.

    • E.g.

      >>> ArgumentExpansion('name', 'dir1')
      ... # root_dir/name/dir1/name_dir1.extension
      
    • If False, filename is first argument given.

  • filename_delimiter (str) – delimiter for filename if filename_as_arguments is True.

  • expand_tuples (bool | int) – Whether to expand tuples when given in search. If True, levels = 1. If an int, it gives how many levels to descend into the iterable.

factory(*, single_argument=False, variable=False, **kwargs)#

Create an ArgumentExpansion pattern based on the requirements

Parameters:
  • single_argument (bool, optional) – Single Argument pattern. Defaults to False.

  • variable (bool, optional) – Variable aware, splits variables when loading and saving. Defaults to False.

  • args (Any)

Returns:

Created ArgumentExpansion pattern.

Return type:

(ArgumentExpansion)

class pyearthtools.data.patterns.ArgumentExpansionVariable(variables=None, *args, variable_parse=None, verbose=True, **kwargs)#

ArgumentExpansion pattern which is variable aware

Will split each variable into a separate file, using the variable as another layer in the root_dir

Examples

>>> pattern = ArgumentExpansionVariable(root_dir = '/test/', variables = 'variable', extension = 'nc')
>>> str(pattern.search('filename', 'arg2'))
{'variable' : '/test/arg2/variable/filename.nc'}

Construct a variable aware pattern.

Parameters:
  • variables (str | list[str], optional) – Variables to find by default. When saving, variables will be appended to this list. If not given, can’t really be used to load data, but is useful for saving data. Defaults to []

  • variable_parse (str | list[str], optional) – Initialisation argument/s to add variable to when constructing the new pattern. Defaults to ‘prefix’ if default_variable_parse not set.

  • verbose (bool, optional) – Whether to warn the user if a variable is saved and not already given. Defaults to True.

!!! Note

variable_parse can be used to reference many different types, and the following table details the behaviour.

| Type | Behaviour |
| ---- | --------- |
| Path | Added to the end as directory layer |
| str | Attempt to parse to Path, or just append |
| None | Replaced |
| list | Appended |

property root_pattern: type[ArgumentExpansion]#

Get pattern for finding/saving a specific variable

!!! Note

Must be implemented by the child class

Returns:

Uninitialised pattern to use to find location of variable

Return type:

(PatternIndex)

class pyearthtools.data.patterns.ArgumentExpansionFactory(*args, single_argument=False, variable=False, **kwargs)#

Create an ArgumentExpansion pattern based on the requirements

Parameters:
  • single_argument (bool, optional) – Single Argument pattern. Defaults to False.

  • variable (bool, optional) – Variable aware, splits variables when loading and saving. Defaults to False.

  • args (Any)

Returns:

Created ArgumentExpansion pattern.

Return type:

(ArgumentExpansion)

class pyearthtools.data.patterns.Direct(root_dir, *, extension='.pyearthtools', prefix=None, file_resolution='minute', delimiter='', **kwargs)#

Generate Filepath structure based on time at given root directory

Examples

>>> pattern = pyearthtools.data.patterns.Direct('/dir/', extension = '.nc')
>>> str(pattern.search('2020-01-02T0030'))
'/dir/20200102T0030.nc'
>>> pattern = pyearthtools.data.patterns.Direct('/dir/', extension = '.nc', delimiter = ('@', '%'))
>>> str(pattern.search('2020-01-02T0030'))
'/dir/2020@01@02T00%30.nc'

Direct time based DataIndexer.

Parameters:
  • root_dir (str | Path) – Root Path to use

  • extension (str, optional) – File extension to load. Defaults to ‘data.patterns.default_extension’

  • prefix (None | str, optional) – File prefix to add. Defaults to None.

  • file_resolution (str | TimeResolution, optional) – Temporal resolution of the file name. Defaults to ‘minute’.

  • delimiter (str | tuple[str | None] | list[str | None], optional) – String/s to separate time values with. If iterable, the first element replaces ‘-’ in the date and the second replaces ‘:’ in the time. Either element can be None to skip that replacement. Defaults to “”

  • kwargs (Any, optional) – Kwargs passed to PatternIndex
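The delimiter rule can be sketched with the standard library alone. This is a hedged illustration rather than the library's code: `format_time` is a hypothetical helper, it works at minute resolution only, and it assumes a single string delimiter replaces both separators, which is consistent with the default “” producing 20200102T0030.

```python
from datetime import datetime


def format_time(time, delimiter=""):
    """Format `time` to minute resolution, swapping '-' and ':' per `delimiter`."""
    if isinstance(delimiter, str):
        # Assumption: a single string replaces both separators.
        date_sep = time_sep = delimiter
    else:
        date_sep, time_sep = delimiter
    date_part, time_part = time.strftime("%Y-%m-%dT%H:%M").split("T")
    if date_sep is not None:
        date_part = date_part.replace("-", date_sep)
    if time_sep is not None:
        # None leaves the ':' in place.
        time_part = time_part.replace(":", time_sep)
    return f"{date_part}T{time_part}"


stamp = format_time(datetime(2020, 1, 2, 0, 30))
custom = format_time(datetime(2020, 1, 2, 0, 30), ("@", "%"))
```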

factory(*, temporal=False, variable=False, forecast=False, **kwargs)#

Create a Direct pattern based on the requirements

Parameters:
  • temporal (bool, optional) – Temporally aware, exclusive with forecast, allows for .series operations. Defaults to False.

  • variable (bool, optional) – Variable aware, splits variables when loading and saving. Defaults to False.

  • forecast (bool, optional) – Forecast product, exclusive with temporal, provides .series but with forecasts. Defaults to False.

  • args (Any)

Raises:

ValueError – If both temporal and forecast set. Cannot be both.

Returns:

Created _Direct pattern.

Return type:

(_Direct)
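How the three flags might map onto the Direct pattern classes listed in this reference is sketched below. The mapping is an assumption inferred from the class names; `select_direct_pattern` is a hypothetical helper that returns a class name rather than constructing a pattern.

```python
def select_direct_pattern(*, temporal=False, variable=False, forecast=False):
    """Return the name of the Direct pattern class matching the flags."""
    if temporal and forecast:
        # The two modes are documented as mutually exclusive.
        raise ValueError("Cannot be both temporal and forecast.")
    name = "Direct"
    if temporal:
        name = "TemporalDirect"
    elif forecast:
        name = "ForecastDirect"
    # Variable-aware classes carry a 'Variable' suffix in this reference.
    return name + "Variable" if variable else name


chosen = select_direct_pattern(temporal=True, variable=True)
```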

to_temporal(data_interval)#

Get pattern as TemporalDirect

Parameters:

data_interval (tuple[int, str] | int)

Return type:

TemporalDirect

class pyearthtools.data.patterns.TemporalDirect(root_dir, *, extension='.pyearthtools', prefix=None, file_resolution='minute', delimiter='', **kwargs)#

Direct PatternIndex which is also an AdvancedTimeIndex

Examples

>>> pattern = pyearthtools.data.patterns.TemporalDirect('/dir/', extension = '.nc', data_interval = (1, 'month'))
>>> str(pattern.search('2020-01-02'))
'/dir/202001.nc'

Direct time based DataIndexer.

Parameters:
  • root_dir (str | Path) – Root Path to use

  • extension (str, optional) – File extension to load. Defaults to ‘data.patterns.default_extension’

  • prefix (None | str, optional) – File prefix to add. Defaults to None.

  • file_resolution (str | TimeResolution, optional) – Temporal resolution of the file name. Defaults to ‘minute’.

  • delimiter (str | tuple[str | None] | list[str | None], optional) – String/s to separate time values with. If iterable, the first element replaces ‘-’ in the date and the second replaces ‘:’ in the time. Either element can be None to skip that replacement. Defaults to “”

  • kwargs (Any, optional) – Kwargs passed to PatternIndex

filesystem(basetime)#

Find datafiles given args on local filesystem.

Must be implemented by child class to specify data.

Can return a dictionary[str, str], tuple, list or path representing the files to load.

Parameters:

basetime (str | Petdt)

Return type:

Path

class pyearthtools.data.patterns.ForecastDirect(root_dir, *, extension='.pyearthtools', prefix=None, file_resolution='minute', delimiter='', **kwargs)#

Direct PatternIndex which is also a ForecastIndex

Direct time based DataIndexer.

Parameters:
  • root_dir (str | Path) – Root Path to use

  • extension (str, optional) – File extension to load. Defaults to ‘data.patterns.default_extension’

  • prefix (None | str, optional) – File prefix to add. Defaults to None.

  • file_resolution (str | TimeResolution, optional) – Temporal resolution of the file name. Defaults to ‘minute’.

  • delimiter (str | tuple[str | None] | list[str | None], optional) – String/s to separate time values with. If iterable, the first element replaces ‘-’ in the date and the second replaces ‘:’ in the time. Either element can be None to skip that replacement. Defaults to “”

  • kwargs (Any, optional) – Kwargs passed to PatternIndex

class pyearthtools.data.patterns.DirectVariable(variables=None, *args, variable_parse=None, verbose=True, **kwargs)#

Direct pattern which is variable aware

Will split each variable into a separate file, using the variable as the prefix

Examples

>>> direct_var = DirectVariable(root_dir = '/test/', variables = 'variable', extension = 'nc')
>>> str(direct_var.search('2021-01'))
{'variable' : '/test/variable_202101.nc'}
>>> direct_var = DirectVariable(root_dir = '/test/', variables = 'variable', extension = 'nc', variable_parse = 'root_dir')
>>> str(direct_var.search('2021-01-01'))
{'variable' : '/test/variable/202101.nc'}

Construct a variable aware pattern.

Parameters:
  • variables (str | list[str], optional) – Variables to find by default. When saving, variables will be appended to this list. If not given, can’t really be used to load data, but is useful for saving data. Defaults to []

  • variable_parse (str | list[str], optional) – Initialisation argument/s to add variable to when constructing the new pattern. Defaults to ‘prefix’ if default_variable_parse not set.

  • verbose (bool, optional) – Whether to warn the user if a variable is saved and not already given. Defaults to True.

!!! Note

variable_parse can be used to reference many different types, and the following table details the behaviour.

| Type | Behaviour |
| ---- | --------- |
| Path | Added to the end as directory layer |
| str | Attempt to parse to Path, or just append |
| None | Replaced |
| list | Appended |

property root_pattern: type[Direct]#

Get pattern for finding/saving a specific variable

!!! Note

Must be implemented by the child class

Returns:

Uninitialised pattern to use to find location of variable

Return type:

(PatternIndex)

class pyearthtools.data.patterns.ForecastDirectVariable(variables=None, *args, variable_parse=None, verbose=True, **kwargs)#

Direct pattern which is variable aware and retrieves Forecasts

Will split each variable into a separate file, using the variable as the prefix

Examples

>>> direct_var = ForecastDirectVariable(root_dir = '/test/', variables = 'variable', extension = 'nc')
>>> str(direct_var.search('2021-01'))
{'variable' : '/test/variable_202101.nc'}
>>> direct_var = ForecastDirectVariable(root_dir = '/test/', variables = 'variable', extension = 'nc', variable_parse = 'root_dir')
>>> str(direct_var.search('2021-01-01'))
{'variable' : '/test/variable/202101.nc'}

Construct a variable aware pattern.

Parameters:
  • variables (str | list[str], optional) – Variables to find by default. When saving, variables will be appended to this list. If not given, can’t really be used to load data, but is useful for saving data. Defaults to []

  • variable_parse (str | list[str], optional) – Initialisation argument/s to add variable to when constructing the new pattern. Defaults to ‘prefix’ if default_variable_parse not set.

  • verbose (bool, optional) – Whether to warn the user if a variable is saved and not already given. Defaults to True.

!!! Note

variable_parse can be used to reference many different types, and the following table details the behaviour.

| Type | Behaviour |
| ---- | --------- |
| Path | Added to the end as directory layer |
| str | Attempt to parse to Path, or just append |
| None | Replaced |
| list | Appended |

property root_pattern: type[ForecastDirect]#

Get pattern for finding/saving a specific variable

!!! Note

Must be implemented by the child class

Returns:

Uninitialised pattern to use to find location of variable

Return type:

(PatternIndex)

class pyearthtools.data.patterns.TemporalDirectVariable(variables=None, *args, variable_parse=None, verbose=True, **kwargs)#

TemporalDirect pattern which is variable aware

Will split each variable into a separate file, using the variable as the prefix.

Examples

>>> direct_var = TemporalDirectVariable(root_dir = '/test/', variables = 'variable', extension = 'nc', data_interval = (1,'year'))
>>> direct_var.search('2021-01-01')
{'variable' : '/test/variable_2021.nc'}
>>> direct_var = TemporalDirectVariable(root_dir = '/test/', variables = 'variable', extension = 'nc', data_interval = (1,'year'), variable_parse = 'root_dir')
>>> direct_var.search('2021-01-01')
{'variable' : '/test/variable/2021.nc'}

Construct a variable aware pattern.

Parameters:
  • variables (str | list[str], optional) – Variables to find by default. When saving, variables will be appended to this list. If not given, can’t really be used to load data, but is useful for saving data. Defaults to []

  • variable_parse (str | list[str], optional) – Initialisation argument/s to add variable to when constructing the new pattern. Defaults to ‘prefix’ if default_variable_parse not set.

  • verbose (bool, optional) – Whether to warn the user if a variable is saved and not already given. Defaults to True.

!!! Note

variable_parse can be used to reference many different types, and the following table details the behaviour.

| Type | Behaviour |
| ---- | --------- |
| Path | Added to the end as directory layer |
| str | Attempt to parse to Path, or just append |
| None | Replaced |
| list | Appended |

property root_pattern: type[TemporalDirect]#

Get pattern for finding/saving a specific variable

!!! Note

Must be implemented by the child class

Returns:

Uninitialised pattern to use to find location of variable

Return type:

(PatternIndex)

class pyearthtools.data.patterns.DirectFactory(*args, temporal=False, variable=False, forecast=False, **kwargs)#

Create a Direct pattern based on the requirements

Parameters:
  • temporal (bool, optional) – Temporally aware, exclusive with forecast, allows for .series operations. Defaults to False.

  • variable (bool, optional) – Variable aware, splits variables when loading and saving. Defaults to False.

  • forecast (bool, optional) – Forecast product, exclusive with temporal, provides .series but with forecasts. Defaults to False.

  • args (Any)

Raises:

ValueError – If both temporal and forecast set. Cannot be both.

Returns:

Created _Direct pattern.

Return type:

(_Direct)

class pyearthtools.data.patterns.ExpandedDate(root_dir, *, extension='.pyearthtools', prefix=None, delimiter='', file_resolution='minute', directory_resolution='day', **kwargs)#

Generate FilePath Structure based upon expanded date pattern

Examples

>>> pattern = pyearthtools.data.patterns.ExpandedDate('/dir/', extension = '.nc')
>>> str(pattern.search('2020-01-02T0030'))
'/dir/2020/01/02/20200102T0030.nc'
>>> pattern = pyearthtools.data.patterns.ExpandedDate('/dir/', extension = '.nc', delimiter = ('#', None))
>>> str(pattern.search('2020-01-02T0030'))
'/dir/2020/01/02/2020#01#02T00:30.nc'

Expanded Date based DataIndex

Parameters:
  • root_dir (str | Path) – Root Path to use

  • extension (str, optional) – File extension to load. Defaults to ‘data.patterns.default_extension’

  • prefix (None | str, optional) – File prefix to add. Defaults to None.

  • delimiter (str | list[str | None] | tuple[str | None], optional) – String/s to separate time values with. If iterable, the first element replaces ‘-’ in the date and the second replaces ‘:’ in the time. Either element can be None to skip that replacement. Defaults to “”

  • file_resolution (str | TimeResolution, optional) – Resolution of the files. Defaults to ‘minute’.

  • directory_resolution (str | TimeResolution, optional) – Resolution of directories. Defaults to ‘day’.

  • kwargs (Any, optional) – Kwargs passed to PatternIndex
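The interaction of file_resolution and directory_resolution can be sketched as follows. This is an illustrative reconstruction from the examples in this reference, not the library's implementation: `expanded_path` and the resolution table are assumptions, the extension is assumed to be '.nc', and prefix and delimiter handling are left out.

```python
from datetime import datetime
from pathlib import PurePosixPath

# Resolutions from coarsest to finest, with the strftime code for each level.
_LEVELS = [("year", "%Y"), ("month", "%m"), ("day", "%d"), ("hour", "%H"), ("minute", "%M")]


def _codes(resolution):
    """strftime codes for every level down to and including `resolution`."""
    names = [name for name, _ in _LEVELS]
    return [code for _, code in _LEVELS[: names.index(resolution) + 1]]


def expanded_path(root_dir, time, *, extension=".nc",
                  file_resolution="minute", directory_resolution="day"):
    """Build root/<one directory per level>/<stamp><extension>."""
    directories = [time.strftime(code) for code in _codes(directory_resolution)]
    codes = _codes(file_resolution)
    # Date codes are joined directly; time-of-day codes follow a 'T'.
    stamp = time.strftime("".join(codes[:3]))
    if len(codes) > 3:
        stamp += "T" + time.strftime("".join(codes[3:]))
    return PurePosixPath(root_dir, *directories, stamp + extension)


default_layout = expanded_path("/dir/", datetime(2020, 1, 2, 0, 30))
monthly_layout = expanded_path("/dir/", datetime(2020, 1, 2),
                               file_resolution="month", directory_resolution="month")
```

The monthly call reproduces the layout shown for TemporalExpandedDate with data_interval = (1, 'month').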

factory(*, temporal=False, variable=False, forecast=False, **kwargs)#

Create an ExpandedDate pattern based on the requirements

Parameters:
  • temporal (bool, optional) – Temporally aware, exclusive with forecast, allows for .series operations. Defaults to False.

  • variable (bool, optional) – Variable aware, splits variables when loading and saving. Defaults to False.

  • forecast (bool, optional) – Forecast product, exclusive with temporal, provides .series but with forecasts. Defaults to False.

  • args (Any)

Raises:

ValueError – If both temporal and forecast set. Cannot be both.

Returns:

Created _ExpandedDate pattern.

Return type:

(_ExpandedDate)

to_temporal(data_interval)#

Get pattern as TemporalExpandedDate

Parameters:

data_interval (tuple[int, str] | int | str)

Return type:

TemporalExpandedDate

class pyearthtools.data.patterns.TemporalExpandedDate(*args, **kwargs)#

ExpandedDate PatternIndex which is also an AdvancedTimeIndex

Will create its path using the data_interval if set.

If using this with data saved using ExpandedDate, set data_interval to (1, ‘min’) so that the paths match.

Examples

>>> pattern = pyearthtools.data.patterns.TemporalExpandedDate('/dir/', extension = '.nc', data_interval = (1, 'month'))
>>> str(pattern.search('2020-01-02'))
'/dir/2020/01/202001.nc'
>>> str(pattern.search('2020-01'))
'/dir/2020/01/202001.nc'

Expanded Date based DataIndex

Parameters:
  • root_dir (str | Path) – Root Path to use

  • extension (str, optional) – File extension to load. Defaults to ‘data.patterns.default_extension’

  • prefix (None | str, optional) – File prefix to add. Defaults to None.

  • delimiter (str | list[str | None] | tuple[str | None], optional) – String/s to separate time values with. If iterable, the first element replaces ‘-’ in the date and the second replaces ‘:’ in the time. Either element can be None to skip that replacement. Defaults to “”

  • file_resolution (str | TimeResolution, optional) – Resolution of the files. Defaults to ‘minute’.

  • directory_resolution (str | TimeResolution, optional) – Resolution of directories. Defaults to ‘day’.

  • kwargs (Any, optional) – Kwargs passed to PatternIndex

filesystem(basetime)#

Find datafiles given args on local filesystem.

Must be implemented by child class to specify data.

Can return a dictionary[str, str], tuple, list or path representing the files to load.

Parameters:

basetime (str | Petdt)

Return type:

Path

class pyearthtools.data.patterns.ForecastExpandedDate(root_dir, *, extension='.pyearthtools', prefix=None, delimiter='', file_resolution='minute', directory_resolution='day', **kwargs)#

ExpandedDate PatternIndex which is also a ForecastIndex

Expanded Date based DataIndex

Parameters:
  • root_dir (str | Path) – Root Path to use

  • extension (str, optional) – File extension to load. Defaults to ‘data.patterns.default_extension’

  • prefix (None | str, optional) – File prefix to add. Defaults to None.

  • delimiter (str | list[str | None] | tuple[str | None], optional) – String/s to separate time values with. If iterable, the first element replaces ‘-’ in the date and the second replaces ‘:’ in the time. Either element can be None to skip that replacement. Defaults to “”

  • file_resolution (str | TimeResolution, optional) – Resolution of the files. Defaults to ‘minute’.

  • directory_resolution (str | TimeResolution, optional) – Resolution of directories. Defaults to ‘day’.

  • kwargs (Any, optional) – Kwargs passed to PatternIndex

class pyearthtools.data.patterns.ExpandedDateVariable(variables=None, *args, variable_parse=None, verbose=True, **kwargs)#

ExpandedDate pattern which is variable aware

Will split each variable into a separate file, using the variable as another layer in the root_dir

Examples

>>> expanded_var = ExpandedDateVariable(root_dir = '/test/', variables = 'variable', extension = 'nc')
>>> str(expanded_var.search('2020-01-02'))
{'variable' : '/test/variable/2020/01/02/20200102T0000.nc'}
>>> expanded_var = ExpandedDateVariable(root_dir = '/test/', variables = 'variable', extension = 'nc', data_interval = (1,'year'), variable_parse = 'prefix')
>>> str(expanded_var.search('2020-01-02'))
{'variable' : '/test/2020/01/02/variable_20200102T0000.nc'}

Construct a variable aware pattern.

Parameters:
  • variables (str | list[str], optional) – Variables to find by default. When saving, variables will be appended to this list. If not given, can’t really be used to load data, but is useful for saving data. Defaults to []

  • variable_parse (str | list[str], optional) – Initialisation argument/s to add variable to when constructing the new pattern. Defaults to ‘prefix’ if default_variable_parse not set.

  • verbose (bool, optional) – Whether to warn the user if a variable is saved and not already given. Defaults to True.

!!! Note

variable_parse can be used to reference many different types, and the following table details the behaviour.

| Type | Behaviour |
| ---- | --------- |
| Path | Added to the end as directory layer |
| str | Attempt to parse to Path, or just append |
| None | Replaced |
| list | Appended |

property root_pattern: type[ExpandedDate]#

Get pattern for finding/saving a specific variable

!!! Note

Must be implemented by the child class

Returns:

Uninitialised pattern to use to find location of variable

Return type:

(PatternIndex)

class pyearthtools.data.patterns.ForecastExpandedDateVariable(variables=None, *args, variable_parse=None, verbose=True, **kwargs)#

ForecastExpandedDate pattern which is variable aware and retrieves Forecasts

Will split each variable into a separate file, using the variable as another layer in the root_dir

Examples

>>> expanded_var = ForecastExpandedDateVariable(root_dir = '/test/', variables = 'variable', extension = 'nc')
>>> str(expanded_var.search('2020-01-02'))
{'variable' : '/test/variable/2020/01/02/20200102T0000.nc'}
>>> expanded_var = ForecastExpandedDateVariable(root_dir = '/test/', variables = 'variable', extension = 'nc', data_interval = (1,'year'), variable_parse = 'prefix')
>>> str(expanded_var.search('2020-01-02'))
{'variable' : '/test/2020/01/02/variable_20200102T0000.nc'}

Construct a variable aware pattern.

Parameters:
  • variables (str | list[str], optional) – Variables to find by default. When saving, variables will be appended to this list. If not given, can’t really be used to load data, but is useful for saving data. Defaults to []

  • variable_parse (str | list[str], optional) – Initialisation argument/s to add variable to when constructing the new pattern. Defaults to ‘prefix’ if default_variable_parse not set.

  • verbose (bool, optional) – Whether to warn the user if a variable is saved and not already given. Defaults to True.

!!! Note

variable_parse can be used to reference many different types, and the following table details the behaviour.

| Type | Behaviour |
| ---- | --------- |
| Path | Added to the end as directory layer |
| str | Attempt to parse to Path, or just append |
| None | Replaced |
| list | Appended |

property root_pattern: type[ForecastExpandedDate]#

Get pattern for finding/saving a specific variable

!!! Note

Must be implemented by the child class

Returns:

Uninitialised pattern to use to find location of variable

Return type:

(PatternIndex)

class pyearthtools.data.patterns.TemporalExpandedDateVariable(variables=None, *args, variable_parse=None, verbose=True, **kwargs)#

TemporalExpandedDate pattern which is variable aware

Will split each variable into a separate file, using the variable as another layer in the root_dir

Examples

>>> expanded_var = TemporalExpandedDateVariable(root_dir = '/test/', variables = 'variable', extension = 'nc', data_interval = (1,'year'))
>>> str(expanded_var.search('2020-01'))
{'variable' : '/test/variable/2020/2020.nc'}
>>> expanded_var = TemporalExpandedDateVariable(root_dir = '/test/', variables = 'variable', extension = 'nc', data_interval = (1,'year'), variable_parse = 'prefix')
>>> str(expanded_var.search('2020-01'))
{'variable' : '/test/2020/variable_2020.nc'}

Construct a variable aware pattern.

Parameters:
  • variables (str | list[str], optional) – Variables to find by default. When saving, variables will be appended to this list. If not given, can’t really be used to load data, but is useful for saving data. Defaults to []

  • variable_parse (str | list[str], optional) – Initialisation argument/s to add variable to when constructing the new pattern. Defaults to ‘prefix’ if default_variable_parse not set.

  • verbose (bool, optional) – Whether to warn the user if a variable is saved and not already given. Defaults to True.

!!! Note

variable_parse can be used to reference many different types, and the following table details the behaviour.

| Type | Behaviour |
| ---- | --------- |
| Path | Added to the end as directory layer |
| str | Attempt to parse to Path, or just append |
| None | Replaced |
| list | Appended |

property root_pattern: type[TemporalExpandedDate]#

Get pattern for finding/saving a specific variable

!!! Note

Must be implemented by the child class

Returns:

Uninitialised pattern to use to find location of variable

Return type:

(PatternIndex)

class pyearthtools.data.patterns.ExpandedDateFactory(*args, temporal=False, variable=False, forecast=False, **kwargs)#

Create an ExpandedDate pattern based on the requirements

Parameters:
  • temporal (bool, optional) – Temporally aware, exclusive with forecast, allows for .series operations. Defaults to False.

  • variable (bool, optional) – Variable aware, splits variables when loading and saving. Defaults to False.

  • forecast (bool, optional) – Forecast product, exclusive with temporal, provides .series but with forecasts. Defaults to False.

  • args (Any)

Raises:

ValueError – If both temporal and forecast set. Cannot be both.

Returns:

Created _ExpandedDate pattern.

Return type:

(_ExpandedDate)

class pyearthtools.data.patterns.Static(file, variables=None, *, enforce_existence=True, capture_arguments=False, transforms=TransformCollection(), **load_kwargs)#

Retrieve Static File for any date retrieval

Static File based data index

Parameters:
  • file (str | Path) – File to load

  • variables (str | list[str], optional) – Variables to trim loaded data to. Defaults to None.

  • enforce_existence (bool, optional) – Enforce that file exists. Defaults to True.

  • capture_arguments (bool, optional) – Capture arguments given to retrieval without throwing an error. Defaults to False.

  • transforms (Transform | TransformCollection, optional) – Base Transforms to apply. Defaults to TransformCollection().

  • load_kwargs (dict)

Raises:

FileNotFoundError – If File not found

Examples

>>> pattern = pyearthtools.data.patterns.Static('/dir/file.nc', enforce_existence = False)
>>> str(pattern.search())
'/dir/file.nc'

filesystem(*args, **kwargs)#

Find datafiles given args on local filesystem.

Must be implemented by child class to specify data.

Can return a dictionary[str, str], tuple, list or path representing the files to load.

load(*args, **kwargs)#

Load a given list of files.

Automatically determine method to load files for file extension

Supported:
  • netcdf

  • pandas [csv]

  • numpy

Parameters:
  • files (dict[str, str | Path] | Path | list[str | Path] | tuple[str | Path]) – Files to load

  • **kwargs (Any, optional) – Kwargs passed to underlying loading function

Raises:

InvalidDataError – If an error arose when loading file

Returns:

Loaded data

Return type:

(Any)

class pyearthtools.data.patterns.ParsingPattern(root_dir, parse_str, *, transforms=TransformCollection(), add_default_transforms=True, preprocess_transforms=None, **kwargs)#

PatternIndex to parse and format paths from str formats.

Values for the formatting are expected in kwargs, or are added from the data when it is saved.

Will split datasets based on what is specified in the parse_str. If a kwarg is given as a list, will look for all permutations.

Create pattern from a formatting string

If being used to retrieve data without saving it first, set values in parse_str through kwargs or when using search.

Parameters:
  • root_dir (str) – Root directory to begin the path, can be ‘temp’ for temp directory.

  • parse_str (str) – str to parse to find paths. Use ‘variable’ for data vars, e.g. ‘{level}/{variable}/{time:%Y%M}’.

  • transforms (Transform | TransformCollection, optional) – Transforms to add on retrieval. Defaults to TransformCollection().

  • add_default_transforms (bool, optional) – Whether to add default transforms. Defaults to True.

  • preprocess_transforms (Transform | TransformCollection | Callable | None, optional) – Transforms to always add. Defaults to None.

  • kwargs (Any, optional) – Any values to fill parse_str with. If a value is given as a list, will look for all permutations.

Examples

>>> pattern = ParsingPattern('temp', '{level:04d}.nc', level = 10)
>>> pattern.search()
[PosixPath('/temp/0010.nc')]
>>> pattern = ParsingPattern('temp', '{level:04d}.nc', level = [10,20])
>>> pattern.search()
[PosixPath('/temp/0010.nc'), PosixPath('/temp/0020.nc')]
>>> pattern = ParsingPattern('temp', '{time:%Y}.nc')
>>> pattern.save(data)
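As a rough illustration of the expansion shown in the examples above, the sketch below formats a parse string once per combination of list-valued keyword arguments. `expand_paths` is a hypothetical helper written for this illustration only. It is not part of pyearthtools, and the real class may resolve paths differently.

```python
from datetime import datetime
from itertools import product
from pathlib import PurePosixPath


def expand_paths(root_dir, parse_str, **values):
    """Hypothetical helper: one path per combination of the given values."""
    keys = list(values)
    # Wrap scalar values in a list so every key takes part in the product.
    options = [v if isinstance(v, list) else [v] for v in values.values()]
    return [
        PurePosixPath(root_dir) / parse_str.format(**dict(zip(keys, combo)))
        for combo in product(*options)
    ]


# A list-valued kwarg yields one path per value.
levels = expand_paths('/temp', '{level:04d}.nc', level=[10, 20])
# datetime objects accept strftime codes inside the format spec.
yearly = expand_paths('/temp', '{time:%Y}.nc', time=datetime(2020, 1, 1))
# Two list-valued kwargs yield every combination.
grid = expand_paths('/temp', '{level}/{variable}.nc', level=[10, 20], variable=['t', 'u'])
```

Using `itertools.product` keeps the expansion independent of how many format fields are given as lists.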
filesystem(options=None, **kwargs)#

Get all paths from this Index

Parameters:
  • **kwargs (Any) – Extra options to provide to the parser

  • options (dict[str, Any] | None)

Return type:

list[Path]

get(*args, load_kwargs=None, **kwargs)#

Get data by loading it from the search.

All args & kwargs are passed through to search to allow extra supply of format values

Parameters:

load_kwargs (dict[str, Any] | None, optional) – kwargs to pass to the .load function. Defaults to None.

Raises:

DataNotFoundError – Data could not be found

Returns:

Loaded Data

Return type:

(Any)

save(data, *_)#

Save data with this pattern.

Will split the dataset according to what is given in parse_str.

E.g.

If data contains a level coord, and level is in parse_str, the data will be split accordingly.

Parameters:

data (xr.Dataset | xr.DataArray) – Dataset to save

Raises:

KeyError – If variable is being split on, and not a xr.Dataset.

data.save#

pyearthtools.data.save.save(data, callback, *args, save_kwargs={}, **kwargs)#

Save data at location specified by an Index

Automatically infers how to save data based on its type

Uses args and kwargs in callback.search to find path

Parameters:
  • data (Any) – Data to be saved

  • callback (FileSystemIndex) – FileSystemIndex to use to discover where to save data

  • *args (Any, optional) – Arguments to be passed to callback.search to find file path

  • save_kwargs (dict, optional) – Kwargs to pass to underlying save function

  • **kwargs (Any, optional) – Keyword arguments to be passed to callback.search to find file path

Raises:

TypeError – If type that is not known is passed

Returns:

Location where data was saved

Return type:

(Path)
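The type-based inference described above can be pictured as a dispatch on the type of `data`. The following is a minimal sketch using `functools.singledispatch`. The function name `save_by_type` and the two registered savers are assumptions for illustration, and the library's own dispatch mechanism may be implemented differently.

```python
import json
import tempfile
from functools import singledispatch
from pathlib import Path


@singledispatch
def save_by_type(data, path):
    """Fallback for types with no registered saver."""
    raise TypeError(f"Unable to save data of type {type(data).__name__}")


@save_by_type.register
def _(data: dict, path):
    # Dictionaries are written as JSON.
    Path(path).write_text(json.dumps(data))
    return Path(path)


@save_by_type.register
def _(data: str, path):
    # Plain strings are written as text.
    Path(path).write_text(data)
    return Path(path)


with tempfile.TemporaryDirectory() as tmp:
    saved = save_by_type({'answer': 42}, Path(tmp) / 'data.json')
    loaded = json.loads(saved.read_text())
```

An unregistered type falls through to the base function, which mirrors the documented TypeError.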

class pyearthtools.data.save.ManageFiles(files, timeout=5, *, lock=True, uuid=False, prefix='.tmp')#

Automatically manage the saving of files.

Using this, representative temporary files are provided to save to, and then automatically renamed.

If lock == True, prevent multiple processes from writing to the same temp files by creating lock files, and checking for their existence.

If a lock file is encountered, and after its removal the real file exists, the user is informed, as this data may have been saved by another process running concurrently, and may not need to be saved again.

Example:

>>> import os
>>> with ManageFiles('important_file.txt') as (filename, _):
>>>     print(filename) # '.tmp_important_file.txt'
>>>     with open(filename, 'w') as fd:
>>>         fd.write('42')
>>> print(os.path.exists('important_file.txt'))
True
>>> print(open('important_file.txt').read())
42

Manage the saving of files. Save to temp file first, and lock that file.

Parameters:
  • files (VALID_PATH_TYPES) – Files for this to manage. Will return temporary files representing each file, in the same type.

  • timeout (float | int) – Maximum time to wait for a lock release, in seconds. If timeout < 0, never time out and simply block until release.

  • lock (bool) – Attempt to lock temp files when saving. Mutually exclusive with uuid. Locking enables the check for whether the temp file was locked and the real file now exists, which may indicate that it has been made by a concurrent process. If lock is False, this behaves exactly like ManageTemp, and always returns exist = False.

  • uuid (bool) – Add unique identifier to temp files. Mutually exclusive with lock.

  • prefix (str) – Prefix to add to indicate temp file.
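To make the lock-file idea concrete, here is a minimal sketch of exclusive lock-file creation with a timeout. `hold_lock`, the `.lock` suffix, and the polling interval are assumptions for illustration. ManageFiles itself may name and handle its lock files differently.

```python
import os
import tempfile
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def hold_lock(target, timeout=5.0, poll=0.05):
    """Hypothetical lock: create '<target>.lock' exclusively, remove it on exit."""
    lock = Path(f"{target}.lock")
    start = time.monotonic()
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            # A negative timeout blocks until the lock is released.
            if timeout >= 0 and time.monotonic() - start > timeout:
                raise TimeoutError(f"Timed out waiting for {lock}")
            time.sleep(poll)
    os.close(fd)
    try:
        yield lock
    finally:
        lock.unlink()


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'important_file.txt'
    with hold_lock(target) as lock:
        locked_during = lock.exists()
        try:
            # A second attempt while the lock is held should time out.
            with hold_lock(target, timeout=0.1):
                second_acquired = True
        except TimeoutError:
            second_acquired = False
    locked_after = lock.exists()
```

Creating the lock with `O_CREAT | O_EXCL` makes the existence check and the creation a single step, so two processes cannot both believe they hold the lock.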

check_if_locked()#

Check if data is locked

Return type:

bool

class pyearthtools.data.save.ManageTemp(files, uuid=False, prefix='.tmp')#

Provide temporary files to save to. When used as a context manager, these are automatically renamed to the real files on exit.

Can also be used without a context manager, with calls to .temp_files and .rename.

Example:

>>> import os
>>> with ManageTemp('important_file.txt') as (filename, _):
>>>     print(filename) # '.tmp_important_file.txt'
>>>     with open(filename, 'w') as fd:
>>>         fd.write('42')
>>> print(os.path.exists('important_file.txt'))
True
>>> print(open('important_file.txt').read())
42

Unlike ManageFiles, this does not provide locking, and is therefore not thread safe.

Temp files may not exist when expected, as another thread may have already renamed them.

Create a temporary file manager

Automatically creates temp file names next to the real ones for saving. Upon exit, or .rename call, these are renamed to the real files.

Parameters:
  • files (VALID_PATH_TYPES) – Real files to manage and make temp files for

  • uuid (bool, optional) – Add a unique identifier to each temp file name. Defaults to False.

  • prefix (str, optional) – Prefix to add to indicate temp file. Defaults to ‘.tmp’.
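The save-to-temp-then-rename pattern described here can be sketched with the standard library alone. `temp_then_rename` and the `<prefix>_<name>` naming are assumptions for illustration, chosen to match the `.tmp_important_file.txt` name in the example above. This is not the ManageTemp implementation.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temp_then_rename(real_file, prefix='.tmp'):
    """Hypothetical manager: yield a temp path, rename it to real_file on success."""
    real = Path(real_file)
    temp = real.with_name(f"{prefix}_{real.name}")
    yield temp
    # Only reached when the with-block finished without raising.
    os.replace(temp, real)


with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / 'important_file.txt'
    with temp_then_rename(real) as temp:
        temp_name = temp.name
        temp.write_text('42')
        visible_during = real.exists()
    content = real.read_text()
    temp_left_behind = temp.exists()
```

Because the rename happens only after the block completes, readers of the real path never observe a partially written file.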

exists()#

Check if temporary files exist

Return type:

bool

property real_files#

Real files being used

remove()#

Remove temporary files if they exist

rename()#

Rename temporary files to real ones

Raises:
  • FileNotFoundError – If temporary files do not exist

  • TypeError – If paths cannot be renamed.

property temp_files#

Temporary files being managed by this object.

Has exactly the same type and form as the input files.

pyearthtools.data.save.array.save(dataarray, callback, *args, save_kwargs={}, try_thread_safe=True, **kwargs)#
Parameters:
  • dataarray (ndarray)

  • callback (FileSystemIndex)

  • save_kwargs (dict[str, Any])

  • try_thread_safe (bool)

pyearthtools.data.save.dask.save(dataarray, *args, **kwargs)#
Parameters:

dataarray (Array)

pyearthtools.data.save.dataset.save(dataset, callback, *args, zarr=None, save_kwargs, **kwargs)#
Parameters:
  • zarr (bool | None)

  • save_kwargs (dict[str, Any])

pyearthtools.data.save.dataset.to_netcdf(dataset, callback, *args, save_kwargs=None, try_thread_safe=True, **kwargs)#

Saves a dataset based on a callback to an index.

Parameters:
  • dataset (tuple[Dataset] | DataArray | Dataset) – The xarray object to convert to netcdf

  • callback (FileSystemIndex) – Uses callback.search() to fetch a Path, str, or dictionary of either. If a dictionary is returned, will only save dataset, and will only save specified keys.

  • save_kwargs (dict[str, Any] | None)

  • try_thread_safe (bool)

pyearthtools.data.save.dataset.to_zarr(dataset, callback, *args, save_kwargs=None, **kwargs)#
Parameters:
  • dataset (DataArray | Dataset)

  • callback (FileSystemIndex)

  • save_kwargs (dict[str, Any] | None)

pyearthtools.data.save.jsonsave.save(data, callback, *args, save_kwargs={}, **kwargs)#

Save json files

pyearthtools.data.save.plot.save(plot, callback, *args, save_kwargs={}, **kwargs)#

Save plot objects

pyearthtools.data.save.save_utils.check_if_exists(path)#

Check if path/s exist

Parameters:

path (VALID_PATH_TYPES) – Path/s to check existence of

Returns:

Path/s existence

Return type:

(bool)

pyearthtools.data.save.save_utils.make_new_filename(path, *, add_uuid=False, prefix='.tmp', remove_suffix=False)#

Create temporary files using given path/s

Adds prefix to all paths, and if add_uuid adds a unique identifier. Can also strip suffix.

Parameters:
  • path (VALID_PATH_TYPES) – Path/s to create tmp files of

  • add_uuid (bool, optional) – Whether to add a unique identifier. Defaults to False.

  • prefix (str, optional) – Prefix to indicate temporary file to add. Defaults to ‘.tmp’

  • remove_suffix (bool, optional) – Whether to remove the suffix when creating new file name.

Returns:

path with temporary flags added to it. Is the exact same type as input.

Return type:

(VALID_PATH_TYPES)
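A minimal sketch of how such a temporary name might be built for a single path is shown below. The `<prefix>_[<uuid>_]<name>` layout and the helper `temp_filename` are assumptions for illustration. The actual function also accepts collections of paths and may order the parts differently.

```python
import uuid
from pathlib import PurePosixPath


def temp_filename(path, *, add_uuid=False, prefix='.tmp', remove_suffix=False):
    """Hypothetical naming scheme: '<prefix>_[<uuid>_]<name>' beside the real file."""
    path = PurePosixPath(path)
    # Optionally strip the extension before building the new name.
    name = path.stem if remove_suffix else path.name
    parts = [prefix, uuid.uuid4().hex, name] if add_uuid else [prefix, name]
    return path.with_name('_'.join(parts))


plain = temp_filename('/data/file.nc')
stripped = temp_filename('/data/file.nc', remove_suffix=True)
unique_a = temp_filename('/data/file.nc', add_uuid=True)
unique_b = temp_filename('/data/file.nc', add_uuid=True)
```

Adding a unique identifier gives each writer its own temporary file, which is why the ManageFiles parameters describe uuid and lock as alternatives.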

class pyearthtools.data.save.save_utils.keep_clear(path, enter=True, exit=True)#

Keep a given path clear

Delete paths upon entrance and/or exit. The path given is a fully-qualified path/filename. This is useful for temporary files with known names, which can be deleted if they already exist.

Parameters:
  • path (VALID_PATH_TYPES) – Path/s to delete if they exist upon entrance or exit

  • enter (bool, optional) – Delete on entrance. Defaults to True.

  • exit (bool, optional) – Delete on exit. Defaults to True.

delete()#

Delete given files if they exist
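The behaviour can be approximated for a single file with a small context manager. `clear_path` is a hypothetical stand-in written for this illustration and handles only one file path, whereas keep_clear accepts any of the valid path types.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def clear_path(path, enter=True, exit=True):
    """Hypothetical equivalent: delete path on entry and/or exit if it exists."""
    path = Path(path)
    if enter:
        path.unlink(missing_ok=True)
    try:
        yield path
    finally:
        # Runs even if the block raised, so the scratch file never lingers.
        if exit:
            path.unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    scratch = Path(tmp) / 'scratch.txt'
    scratch.write_text('stale')
    with clear_path(scratch) as path:
        cleared_on_entry = not path.exists()
        path.write_text('fresh')
    cleared_on_exit = not scratch.exists()
```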

data.transforms#

class pyearthtools.data.transforms.Transform(docstring=None)#

Base Transform class to abstract a transform process.

A child class must implement .apply(self, dataset: xr.Dataset), and .info.

When using this transform, simply call it like a function. Can also add another transform to this.

Initialise root Transform class

Cannot be used as is, a child must implement the .apply function.

Parameters:

docstring (str, optional) – Docstring to set this Transform to. Defaults to None.

Raises:

TypeError – If cannot parse docstring

abstractmethod apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES
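The contract described above, where a child implements `apply` and the transform is then called like a function, can be sketched as follows. `SketchTransform` and `Scale` are hypothetical classes for illustration, and plain dictionaries of lists stand in for xarray objects so the example has no dependencies.

```python
from abc import ABC, abstractmethod


class SketchTransform(ABC):
    """Hypothetical stand-in for a base transform; not the pyearthtools class."""

    @abstractmethod
    def apply(self, dataset):
        """Return the transformed dataset."""

    def __call__(self, dataset):
        # Calling the transform like a function delegates to apply.
        return self.apply(dataset)


class Scale(SketchTransform):
    """Multiply every value of every variable by a constant factor."""

    def __init__(self, factor):
        self.factor = factor

    def apply(self, dataset):
        return {
            name: [value * self.factor for value in values]
            for name, values in dataset.items()
        }


scaled = Scale(2)({'t2m': [1, 2, 3]})
```

Marking `apply` as abstract means the base class cannot be instantiated directly, which matches the note that it cannot be used as is.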

class pyearthtools.data.transforms.TransformCollection(*transforms, apply_default=False, intelligence_level=100)#

A Collection of Transforms to be applied to Data

Can be added to or appended to & called to apply all transforms in order.

Setup new TransformCollection

Parameters:
  • *transforms (Transform | TransformCollection | Callable | None | list) – Transforms to include

  • apply_default (bool, optional) – Apply default transforms. Defaults to False.

  • intelligence_level (int, optional) – Intelligence level of default transforms. Defaults to 100.

append(transform)#

Append a transform/s to the collection

Parameters:

transform (list | FunctionType | Transform | TransformCollection) – Transform/s to add

Raises:

TypeError – If transform cannot be understood

apply(dataset)#

Apply Transforms to a Dataset

Parameters:

dataset (xr.Dataset) – Dataset to apply transforms to

Returns:

Same as input type with transforms applied

Return type:

(Any)

pop(index=-1)#

Remove and return item at index (default last). Raises IndexError if list is empty or index is out of range.

Parameters:

index (int, optional) – Index to pop from list at. Defaults to -1.

Returns:

Transform popped out

Return type:

Transform

remove(key)#

Remove first occurrence of value.

Parameters:

key (type | str | Transform) – Key to search for

Raises:

ValueError – If the value is not present.

to_repr_dict()#

Convert to dictionary ready for repr
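Applying transforms in order amounts to feeding the output of each into the next. The sketch below shows that idea with plain callables. `SketchCollection` is a hypothetical class for illustration and omits default transforms, `remove`, and the repr support of the real collection.

```python
class SketchCollection:
    """Hypothetical ordered collection of callables, applied first to last."""

    def __init__(self, *transforms):
        self._transforms = []
        for transform in transforms:
            self.append(transform)

    def append(self, transform):
        if isinstance(transform, (list, tuple)):
            # Lists are flattened so order is preserved.
            for item in transform:
                self.append(item)
        elif isinstance(transform, SketchCollection):
            self._transforms.extend(transform._transforms)
        elif callable(transform):
            self._transforms.append(transform)
        elif transform is not None:
            raise TypeError(f"Cannot understand transform {transform!r}")

    def pop(self, index=-1):
        return self._transforms.pop(index)

    def __call__(self, data):
        for transform in self._transforms:
            data = transform(data)
        return data


def add_one(x):
    return x + 1


def double(x):
    return x * 2


collection = SketchCollection(add_one, [double])
result = collection(3)        # add_one first, then double
popped = collection.pop()     # removes the last transform
after_pop = collection(3)     # only add_one remains
```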

class pyearthtools.data.transforms.FunctionTransform(function)#

Transform Function which applies a given function

Transform Function to apply a user given function

Parameters:

function (Callable) – User given function to apply

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.Derive(equation=None, drop=False, **equations)#

Derive new variables

Derive new variables from a dataset using provided equations.

Allows other variables to be used by indicating only their name, which are then evaluated accordingly.

Each numerical or reference component in an equation must be separated by a space.

If using function based symbols like ‘sqrt’ or ‘sin’, the next item will be evaluated using said function. These functions can be given with brackets next to them.

Without brackets given, the equation will be evaluated left -> right.

E.g.

    equation = {'new_variable' : 'old_variable_1 * old_variable_2'}
    equation = {'new_variable' : 'sqrt(old_variable_1 * old_variable_2)'}

Warning:

This will evaluate an equation left -> right, but respects brackets.

‘var_1 + 9.8 * var_2’ != ‘var_1 + (9.8 * var_2)’

Warning:

Components of an equation should be split by ‘ ‘, a whitespace.

    a*b   # Bad
    a * b # Good

Parameters:
  • equation (dict[str, str | tuple[str, dict[str, str]]] | None, optional) – Equation configuration. If str, equation is evaluated. If tuple, first element is assumed to be equation, and the second a dictionary to update the new vars attributes with. Defaults to None.

  • drop (bool, optional) – Drop variables used in the calculation. Can be overwritten per equation, by setting drop in attributes dictionary. Defaults to False.

  • **equations (dict[str, str | tuple[str, dict[str, str]]], optional) – Keyword arg form of equation.

Raises:

EquationException – If equation cannot be parsed

Returns:

Transform to apply derivation.

Return type:

(Transform)

Examples

>>> derive(new_variable = 'old_variable_1 * old_variable_2', drop = True)
# Create a `new_variable` as the product of the old two.
>>> derive(new_variable = 'old_variable_1 * 9.8', drop = True)
# Scale `old_variable_1` by 9.8
>>> derive(new_variable = ('old_variable_1 * 9.8', {'long_name': 'Scaled old_variable_1'}), drop = False)
# Scale `old_variable_1` by 9.8, and update the `long_name` to be 'Scaled old_variable_1', leaving the old var there
>>> derive(new_variable = 'old_variable_1 - old_variable_2 * 9.8', drop = True)
# Set `new_variable` as the difference scaled by 9.8. In effect acts as (old_variable_1 - old_variable_2) * 9.8
>>> derive(new_variable = 'old_variable_1 - (9.8 * old_variable_2)', drop = True)
# Multiply `old_variable_2` by 9.8 and then find difference with `old_variable_1`
>>> derive(new_var = 'sqrt(old_var)')
# Square root of the old variable
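To show why left-to-right evaluation differs from conventional operator precedence, here is a minimal sketch of such an evaluator over whitespace-separated tokens. `evaluate_left_to_right` is a hypothetical function for illustration. It handles only the four basic operators, with no brackets or functions such as sqrt, and is not the parser used by the library.

```python
import operator

# Binary operators recognised by this sketch.
OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul, '/': operator.truediv}


def evaluate_left_to_right(eq, variables):
    """Evaluate 'a op b op c ...' strictly left to right, ignoring precedence."""
    tokens = eq.split()

    def value(token):
        # A token is either a variable name or a numeric literal.
        return variables[token] if token in variables else float(token)

    result = value(tokens[0])
    # Tokens alternate: operand, operator, operand, operator, ...
    for op, operand in zip(tokens[1::2], tokens[2::2]):
        result = OPS[op](result, value(operand))
    return result


variables = {'var_1': 1.0, 'var_2': 2.0}
left_to_right = evaluate_left_to_right('var_1 + 9.8 * var_2', variables)  # (1.0 + 9.8) * 2.0
conventional = 1.0 + (9.8 * 2.0)
```

Splitting on whitespace is also why `a*b` is rejected in this scheme: it arrives as a single token that is neither a variable name nor a number.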
apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

pyearthtools.data.transforms.aggregation.over(*, dimension, method)#

Get Aggregation Transform to run aggregation method over given dimensions

Parameters:
  • method (Callable | str | dict) – Method to use, can be known method or user defined

  • dimension (str | list[str]) – Dimensions to run aggregation over

Returns:

Transform to apply aggregation

Return type:

(Transform)

pyearthtools.data.transforms.aggregation.leaving(method, dimension)#

Get Aggregation Transform to run aggregation method leaving only given dimensions

Parameters:
  • method (Callable | str | dict) – Method to use, can be known method or user defined

  • dimension (str | list[str]) – Dimensions to leave after aggregation

Returns:

Transform to apply aggregation

Return type:

(Transform)
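The difference between `over` and `leaving` is which dimensions end up being reduced. The sketch below computes that set for each case. `dims_to_reduce` is a hypothetical helper for illustration and performs no aggregation itself.

```python
def dims_to_reduce(all_dims, *, over=None, leaving=None):
    """Hypothetical helper: which dimensions an aggregation collapses."""
    if (over is None) == (leaving is None):
        raise ValueError("Give exactly one of 'over' or 'leaving'")
    chosen = over if over is not None else leaving
    chosen = [chosen] if isinstance(chosen, str) else list(chosen)
    if over is not None:
        # 'over' reduces exactly the named dimensions.
        return chosen
    # 'leaving' keeps the named dimensions and reduces everything else.
    return [dim for dim in all_dims if dim not in chosen]


dims = ('time', 'latitude', 'longitude')
reduced_over = dims_to_reduce(dims, over='time')
reduced_leaving = dims_to_reduce(dims, leaving='time')
```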

class pyearthtools.data.transforms.aggregation.Aggregate(method, reduce_dims=None, keep_dims=None)#

Aggregation Transform.

Initialise root Transform class

Cannot be used as is, a child must implement the .apply function.

Parameters:
  • docstring (str, optional) – Docstring to set this Transform to. Defaults to None.

  • method (Callable | str | dict[str, Callable | str])

  • reduce_dims (Optional[list[str] | str])

  • keep_dims (Optional[list[str] | str])

Raises:

TypeError – If cannot parse docstring

apply(dataset, **kwargs)#

Apply Aggregation to Dataset

Parameters:
  • dataset (xr.Dataset) – Dataset to apply aggregation to

  • method (Callable | str) – Method of aggregation, either func or string

  • dimension (str | list[str]) – Dimension to apply aggregation on

Returns:

Aggregated Dataset

Return type:

(xr.Dataset)

class pyearthtools.data.transforms.attributes.SetAttributes(attrs=None, reference=None, apply_on='dataset', **attributes)#

Set Attributes

Modify Attributes to a dataset

Parameters:
  • attrs (dict[str, Any] | None) – Attributes to set, key: value pairs. Set apply_on to choose where attributes are applied. Defaults to None.

    | Key | Description |
    | --- | --- |
    | dataset | Attributes updated on dataset |
    | dataarray | If applied on a dataset, update each dataarray inside the dataset |
    | both | Do both above |
    | per_variable | Treat attrs as a dictionary of dictionaries, applying on dataarray if in dataset. |

  • apply_on (Literal['dataset', 'dataarray', 'both'], optional) – On what type to update attributes. Defaults to ‘dataset’.

  • **attributes (dict) – Keyword argument form of attrs.

  • reference (xr.DataArray | xr.Dataset | None)

Returns:

Transform to set attributes

Return type:

(Transform)

apply(data_obj)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.attributes.SetEncoding(encoding=None, reference=None, limit=None, **variables)#

Set Encoding

Set encoding of a dataset.

Can get encoding from a reference dataset. That dataset is then not used, as the encoding has already been retrieved.

Parameters:
  • encoding (dict[str, dict[str, Any]] | None) – Variable value pairs assigning encoding to the given variable. Can set key to ‘all’ to apply to all variables. Defaults to None.

  • reference (xr.DataArray | xr.Dataset | None, optional) – Reference object to retrieve and update encoding from. Defaults to None.

  • limit (list[str] | None, optional) – When getting encoding from reference object, limit the retrieved encoding. If not given will get ['units', 'dtype', 'calendar', '_FillValue', 'scale_factor', 'add_offset', 'missing_value']. Defaults to None.

  • **variables (dict) – Keyword argument form of encoding

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.attributes.SetType(dtype=None, **variables)#

Set type of variables

Set type of variables/coordinates.

At least one of dtype or variables must be set.

Applies “same_kind” casting

Parameters:
  • dtype (str | dict[str, str] | None) – Datatype to set to. If only dtype is given, this will set all coordinates of the dataset to this dtype. Defaults to None.

  • **variables (Any, optional) – Variable dtype configuration.

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.attributes.Rename(names=None, **extra_names)#

Rename Components inside Dataset

Rename Dataset components

Parameters:
  • names (dict[str, Any] | None) – Dictionary assigning name replacements [old: new]. Defaults to None.

  • **extra_names (Any, optional) – Keyword args form of names.

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

pyearthtools.data.transforms.coordinates.get_longitude(data, transform=True)#

From a given data source, attempt to identify the orientation of the longitude coordinate.

Either ‘0-360’ or ‘-180-180’

Parameters:
  • data (xr.Dataset | xr.DataArray) – Data to check

  • transform (bool, optional) – Whether to return a Transform to set to the same orientation. Defaults to True.

Raises:

ValueError – If unable to identify the longitude coordinate orientation

Returns:

Either a str giving the orientation, or a Transform that sets the longitude of a data source to the same orientation as data, depending on the value of transform.

Return type:

(str | Transform)

class pyearthtools.data.transforms.coordinates.StandardLongitude(type='-180-180', longitude_name='longitude')#

Standardise format of longitude.

Shifts the longitude coordinate to the specified convention. type must be one of [“-180-180”, “0-360”].

Parameters:

type (VALID_COORDINATE_DEFINITIONS) – Longitude Specification. Defaults to “-180-180”.
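The arithmetic behind switching between the two conventions is a modulo shift. The sketch below works on plain lists of longitudes. The function names, the handling of the boundary values (180 and 360), and the orientation heuristic are assumptions for illustration rather than the library's behaviour.

```python
def to_180(longitudes):
    """Map longitudes onto the half-open range [-180, 180)."""
    return [((lon + 180) % 360) - 180 for lon in longitudes]


def to_360(longitudes):
    """Map longitudes onto the half-open range [0, 360)."""
    return [lon % 360 for lon in longitudes]


def guess_orientation(longitudes):
    """Hypothetical check: values below 0 or above 180 reveal the convention."""
    if min(longitudes) < 0:
        return '-180-180'
    if max(longitudes) > 180:
        return '0-360'
    # Values entirely within [0, 180] fit either convention.
    raise ValueError('Unable to identify the longitude orientation')


shifted = to_180([0, 90, 270, 359])
wrapped = to_360([-180, -90, 0, 90])
```

After shifting, the values are generally no longer monotonic, so a real transform would also need to re-sort the data along the longitude coordinate.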

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.coordinates.ReIndex(coordinates=None, **coords)#

Reindex Coordinates

Reindex coordinates, either by sorting or reversing the current values, or to a given list of values.

Parameters:

coordinates (dict[str, Literal['reversed','sorted'] | Iterable | xr.Coordinates] | None, optional) – Coordinate to reindex, and Iterable to reindex at. If ‘reversed’ or ‘sorted’, take current coord and sort. If xr.Coordinates, use any coordinates with len > 1. Defaults to None.

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.coordinates.StandardCoordinateNames(replacement_dictionary=None, **repl_kwargs)#

Convert xr.Dataset Coordinate Names into Standard Naming Scheme

Parameters:
  • replacement_dictionary (dict | None, optional) – Dictionary assigning name replacements [old: new]. One of replacement_dictionary or repl_kwargs must be provided. Defaults to None.

  • **repl_kwargs (dict, optional) – Kwarg version of replacement_dictionary

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.coordinates.Select(indexers=None, *, ignore_missing=False, tolerance=None, isel=False, **indexers_kwargs)#

Select on Coordinates

Select values on coordinates

Parameters:
  • indexers (dict[str, Any] | None, optional) – A dict with keys matching dimensions and values giving the selection for each. One of indexers or indexers_kwargs must be provided. Defaults to None.

  • **indexers_kwargs (dict) – Index keyword arguments

  • ignore_missing (bool, optional) – Ignore coordinates not in dataset. Defaults to False

  • tolerance (float | None, optional) – Tolerance for selection. Defaults to None.

  • isel (bool, optional) – Whether to use isel. Defaults to False.

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.coordinates.Drop(coordinates=None, *extra_coords, ignore_missing=False)#

Drop items from Dataset

Drop Items from xr.Dataset

Parameters:
  • coordinates (list[Hashable] | tuple[Hashable] | Hashable | None) – Coordinates to drop. Defaults to None.

  • ignore_missing (bool, optional) – Ignore coordinates not in dataset. Defaults to False

  • extra_coords (Hashable)

Returns:

Transform to apply drop

Return type:

(Transform)

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.coordinates.Assign(coordinates=None, as_dataarray=False, **coordinate_kwargs)#

Assign coordinates to object

Assign coordinates to Xarray Object.

Uses .assign_coords

Parameters:
  • coordinates (dict[str, Any] | None, optional) – Coordinates to assign. Defaults to None.

  • as_dataarray (bool, optional) – Assign coordinates separately to each variable. Defaults to False.

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.coordinates.Pad(coordinates=None, **kwargs)#

Pad data

This will automatically pad the coordinate values with an odd reflection to allow periodicity.

Parameters:
  • coordinates (dict[str, Any] | None) – Coordinate pad_width. From xarray docs: Mapping with the form of {dim: (pad_before, pad_after)} describing the number of values padded along each dimension. {dim: pad} is a shortcut for pad_before = pad_after = pad

  • **kwargs – Any kwargs to pass to .pad
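An odd reflection extends a coordinate by mirroring it about its edge value, so evenly spaced coordinates stay evenly spaced beyond the original range. The sketch below applies it to a plain list. `pad_odd_reflect` is a hypothetical helper for illustration, following the usual definition (twice the edge value minus the reflected value), and is not the library's code.

```python
def pad_odd_reflect(values, pad_before, pad_after):
    """Extend values at both ends using an odd reflection about each edge."""
    first, last = values[0], values[-1]
    # New value = 2 * edge - reflected value, walking outwards from the edge.
    before = [2 * first - values[i] for i in range(pad_before, 0, -1)]
    after = [2 * last - values[-1 - i] for i in range(1, pad_after + 1)]
    return before + list(values) + after


padded_index = pad_odd_reflect([0, 1, 2, 3], 2, 2)
padded_longitude = pad_odd_reflect([0, 90, 180, 270], 1, 1)
```

For a longitude axis, the padded values -90 and 360 continue the grid spacing past both ends, which is what lets the padded data be treated as periodic.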

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

pyearthtools.data.transforms.default.get_default_transforms(intelligence_level=2)#

Get Default Transforms to be applied to all datasets

Parameters:

intelligence_level (int, optional) – Level of Intelligence in operation. Defaults to 2.

Returns:

Collection of default transforms

Return type:

pyearthtools.data.transforms.TransformCollection

pyearthtools.data.transforms.derive.evaluate(eq, *, dataset=None)#
Evaluate a given equation

Use dataset to set variables.

Each numerical or reference component in an equation must be separated by a space.

If using function based symbols like ‘sqrt’ or ‘sin’, the next item will be evaluated using said function. These functions can be given with brackets next to them.

Without brackets given, the equation will be evaluated left -> right.

Parameters:
  • eq (str) – Equation to solve

  • dataset (xr.Dataset | None, optional) – Dataset to get variables from. Defaults to None

Returns:

Result of equation

Return type:

(xr.DataArray | float)

pyearthtools.data.transforms.derive.derive_equations(dataset, equation=None, *, drop=False, **equations)#

Derive new variables from specified equation/s, and set variables in the dataset accordingly

Parameters:
  • dataset (xr.Dataset) – Dataset to get variables from, and to set new ones on

  • equation (dict[str, str | tuple[str, dict[str, Any]]] | None, optional) – Dictionary of equations, key represents new variable name. Can be tuple to set equation, and attribute update dictionary. Defaults to None.

  • drop (bool, optional) – Drop variables used in calculations. Defaults to False.

Returns:

Dataset with equations applied to it

Return type:

(xr.Dataset)

class pyearthtools.data.transforms.dimensions.StandardDimensionNames(replacement_dictionary=None, **kwargs)#

Standardise dimension names

Convert Dataset Dimension Names into Standard Naming Scheme

Parameters:
  • replacement_dictionary (dict[Hashable, Hashable]) – Dictionary assigning dimension name replacements [old: new]

  • kwargs (str)

Returns:

Transform to replace dimension names

Return type:

(Transform)

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.dimensions.Expand(dim=None, axis=None, as_dataarray=True, missing='error', exists='error', **kwargs)#

Expand Dimensions

Expand Dimensions.

Uses xarray .expand_dims.

Parameters:
  • dim (list[str] | dict | str | None, optional) – Dimensions to include on the new variable. If provided as str or sequence of str, then dimensions are inserted with length 1. If provided as a dict, then the keys are the new dimensions and the values are either integers (giving the length of the new dimensions) or sequence/ndarray (giving the coordinates of the new dimensions).

  • axis (int | list[int] | None, optional) – Axis position(s) where new axis is to be inserted (position(s) on the result array). If a sequence of integers is passed, multiple axes are inserted. In this case, dim arguments should be same length list. If axis=None is passed, all the axes will be inserted to the start of the result array.

  • as_dataarray (bool, optional) – Expand each variable independently. Defaults to True.

  • missing (Literal['skip','error'], optional) – What to do when a missing dim is given. Defaults to ‘error’.

  • kwargs (int) – Keywords form of dim.

  • exists (Literal['skip', 'error'])

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.interpolation.Interpolate(method='linear', keep_encoding=False, skip_missing=False, pad=False, **kwargs)#

Interpolation Transform

Interpolation Transform passing kwargs

Parameters:
  • **kwargs – Kwargs to pass to xr.interp. Should be variables with new coordinates to interpolate to. e.g. latitude = [-90,-80,...,80,90]

  • method (InterpOptions) – Method to use for interpolate. Defaults to “linear”. Must be one of xarray.interp methods “linear”, “nearest”, “zero”, “slinear”, “quadratic”, “cubic”, “polynomial”, “barycentric”, “krogh”, “pchip”, “spline”, “akima”

  • keep_encoding (bool) – Whether to keep the encoding of the incoming dataset.

  • skip_missing (bool) – Skip missing dimensions as given in kwargs but not in dataset.

  • pad (bool | int) – Whether to pad all coords by 1. If an int, the size to pad by.

Returns:

Transform to interpolate datasets

Return type:

Transform

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.interpolation.XESMF(reference_dataset=None, method='bilinear', **coords)#

Interpolate using xesmf

Create Transform using xesmf

Either reference_dataset or coords must be given

Parameters:
  • reference_dataset (xr.Dataset | None) – Reference Dataset.

  • **coords – Coordinates to create reference_dataset from. Can be fully created or tuple to use to fill np.arange. Either lat = (["lat"], np.arange(16, 75, 1.0)) or lat = (16, 75, 1.0)

  • method (str) – Interpolation method to use.

Raises:
  • ImportError – xesmf could not be imported

  • KeyError – No arguments given

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.interpolation.InterpolateNan(dim, method='linear', keep_encoding=False, fill_value='extrapolate', **kwargs)#

Interpolate NaNs

Interpolate NaN Transform.

Uses xarray.Dataset.interpolate_na; see its documentation for all kwargs.

Automatically reindexes the data to be monotonic, and reverts the ordering before returning it.

Parameters:
  • **kwargs (Any) – Kwargs to pass to xr.interpolate_na

  • method (InterpOptions, optional) –

    Method to use for interpolation. Defaults to “linear”. Must be one of the xarray.interp methods:

    “linear”, “nearest”, “zero”, “slinear”, “quadratic”, “cubic”, “polynomial”, “barycentric”, “krogh”, “pchip”, “spline”, “akima”

  • keep_encoding (bool, optional) – Whether to keep the encoding of the incoming dataset. Defaults to False.

  • fill_value (str | None, optional) – See scipy.interpolate.interp1d.

  • dim (str)
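The "reindex to be monotonic, then revert" behaviour described above can be pictured with a pure-Python sketch on a single dimension. This is not the library's implementation: the function name interpolate_nan_1d is hypothetical, only linear filling between known neighbours is shown, and edge NaNs are left untouched rather than extrapolated.

```python
import math


def interpolate_nan_1d(coords, values):
    """Linearly fill interior NaNs along one dimension, tolerating unsorted coords."""
    # Sort positions so the coordinate is monotonically increasing.
    order = sorted(range(len(coords)), key=lambda i: coords[i])
    xs = [coords[i] for i in order]
    ys = [values[i] for i in order]

    known = [(x, y) for x, y in zip(xs, ys) if not math.isnan(y)]
    filled = []
    for x, y in zip(xs, ys):
        if not math.isnan(y):
            filled.append(y)
            continue
        left = [(kx, ky) for kx, ky in known if kx < x]
        right = [(kx, ky) for kx, ky in known if kx > x]
        if left and right:
            (x0, y0), (x1, y1) = left[-1], right[0]
            filled.append(y0 + (y1 - y0) * (x - x0) / (x1 - x0))
        else:
            filled.append(y)  # edge NaNs are not extrapolated in this sketch

    # Revert to the caller's original ordering.
    result = [0.0] * len(values)
    for sorted_pos, original_pos in enumerate(order):
        result[original_pos] = filled[sorted_pos]
    return result


lat = [30.0, 20.0, 10.0, 0.0]  # descending coordinate
temperature = [3.0, float("nan"), 1.0, 0.0]
filled_temperature = interpolate_nan_1d(lat, temperature)
print(filled_temperature)
```

The output keeps the descending order of the input, so callers never see the temporary sort.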

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

pyearthtools.data.transforms.interpolation.like(reference_dataset, method='linear', drop_coords=None, pad=False, **kwargs)#

Set up an interpolation transform from a reference dataset

Parameters:
  • reference_dataset (xr.Dataset | str) – Dataset to use to set coords. Can be path to dataset to open

  • method (InterpOptions, optional) – Method to use in interpolation. Defaults to “linear”.

  • drop_coords (str | list[str], optional) – Coords to drop from reference dataset. Defaults to None.

  • pad (bool | int, optional) – Whether to pad all coords by 1. If an int, the size to pad by. Defaults to False.

Returns:

Transform to interpolate dataset like reference_dataset

Return type:

(Transform)

class pyearthtools.data.transforms.mask.UnderlyingMaskTransform(docstring=None)#

Initialise root Transform class

Cannot be used as is, a child must implement the .apply function.

Parameters:

docstring (str, optional) – Docstring to set this Transform to. Defaults to None.

Raises:

TypeError – If cannot parse docstring

filter(data, value, *, replacement_value=nan, operation='==', **kwargs)#

Run filtering. If any of the given kwargs are dictionaries, retrieve the correct element from each.

Raises an error if a key is missing from one dictionary when it was present in another.

Parameters:
  • data (Dataset | DataArray)

  • value (dict | float | str | Path)

  • replacement_value (Dataset | ndarray | float | Path | str | dict[str, Any])

  • operation (Literal['==', '!=', '>', '<', '>=', '<='] | dict[str, Literal['==', '!=', '>', '<', '>=', '<=']])

class pyearthtools.data.transforms.mask.Dataset(value, reference_dataset, operation='==', replacement_value=nan, squeeze='None')#

Mask data using a reference dataset

Will replace data on incoming dataset where condition is met on reference_dataset

Parameters:
  • reference_dataset (xr.Dataset | str | dict) – Reference dataset to calculate mask from. Can be dataset, str as Path, or a dictionary referencing incoming data variables containing the prior types.

  • value (Any) – Value to mask at. Can be array, dataset, string or dictionary.

  • operation (Literal['==', '!=', '>', '<', '>=','<='] | dict, optional) – Criteria to search by. Can be dictionary for dataset keys. Defaults to “==”.

  • replacement_value (float | str | xr.Dataset | dict, optional) – Value to replace with. Can be str pointing to dataset or dataset itself, or a dictionary. Defaults to np.nan

  • squeeze (str | list, optional) – Dims to squeeze on reference dataset. Defaults to ‘None’

Returns:

Transform to apply mask to data

Return type:

(Transform)
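As a rough picture of masking against a reference, the sketch below tests the condition on the reference values and applies the replacement to the incoming data at the same positions. It uses plain lists instead of xarray objects; the function name mask_with_reference and the sample values are hypothetical.

```python
import math
import operator

# Comparison strings accepted by the `operation` parameter, mapped to functions.
COMPARISONS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def mask_with_reference(data, reference, value, operation="==", replacement_value=math.nan):
    """Replace entries of data wherever the reference satisfies the condition."""
    compare = COMPARISONS[operation]
    return [
        replacement_value if compare(ref, value) else item
        for item, ref in zip(data, reference)
    ]


land_sea = [1, 0, 0, 1]  # hypothetical reference: 1 = land, 0 = sea
sst = [290.1, 291.4, 289.9, 288.7]  # incoming data on the same grid
masked = mask_with_reference(sst, land_sea, value=1, replacement_value=-1.0)
print(masked)
```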

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.mask.Replace(value, operation='==', replacement_value=nan)#

Replace Values in dataset with replacement_value when matching criteria

Parameters:
  • value (dict | float | str) – Value to mask at. Can be array, dataset, string or dictionary. Dictionary refers to variables and values.

  • operation (Literal['==', '!=', '>', '<', '>=','<='] | dict, optional) – Criteria to search by. Can be dictionary for dataset keys. Defaults to “==”.

  • replacement_value (float | str | xr.Dataset | dict, optional) – Value to replace with. Can be str pointing to dataset or dataset itself, or a dictionary. Defaults to np.nan

Raises:

KeyError – If invalid operation is provided
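A minimal sketch of the value / operation / replacement_value idea, using plain lists rather than xarray data. The function name replace_values is hypothetical, and the dictionary lookup is only one way an invalid operation could surface as a KeyError.

```python
import math
import operator

OPERATIONS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def replace_values(data, value, operation="==", replacement_value=math.nan):
    """Swap every element matching the criteria for replacement_value."""
    compare = OPERATIONS[operation]  # an unknown operation raises KeyError here
    return [replacement_value if compare(item, value) else item for item in data]


cleaned = replace_values([1.0, -999.0, 3.0], -999.0, replacement_value=0.0)
print(cleaned)
```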

apply(data)#

Apply transformation to Dataset

Parameters:

data (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.optimisation.Rechunk(method)#

Rechunk data

Rechunk data

Parameters:

method (int | dict[str, Any] | Literal['auto', 'encoding']) – Rechunk either by encoding, auto or by variable config.

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

pyearthtools.data.transforms.region.check_shape(data)#

Calculate multiplied shape of xarray data container

Parameters:

data (xr.Dataset | xr.DataArray) – Data to find shape for

Returns:

Multiplied shape of data

Return type:

int
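"Multiplied shape" is read here as the product of the lengths of all dimensions. The sketch below shows that arithmetic on a plain shape tuple; the function name multiplied_shape is hypothetical, and how the library treats a Dataset with several variables is not covered.

```python
import math


def multiplied_shape(shape):
    """Product of all dimension lengths, i.e. the total number of elements."""
    return math.prod(shape)


n_elements = multiplied_shape((24, 181, 360))  # e.g. (time, latitude, longitude)
print(n_elements)
```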

pyearthtools.data.transforms.region.order(*args)#

Order arguments with sort & return as tuple

pyearthtools.data.transforms.region.like(dataset)#

Use Reference Dataset to inform spatial extent & transform geospatial extent accordingly

Parameters:

dataset (xr.Dataset | str) – Reference Dataset to use. Can be path to dataset to load

Returns:

Transform to cut region to extent of given reference dataset

Return type:

(Transform)

class pyearthtools.data.transforms.region.Bounding(min_lat, max_lat, min_lon, max_lon)#

Cut with Bounding box

Use Bounding Coordinates to transform geospatial extent

Parameters:
  • min_lat (float) – Minimum Latitude to slice with

  • max_lat (float) – Maximum Latitude to slice with

  • min_lon (float) – Minimum Longitude to slice with

  • max_lon (float) – Maximum Longitude to slice with

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.region.Select#

Select on a dataset with sel_kwargs

Parameters:

sel_kwargs (dict[str, Any] | None)

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

class pyearthtools.data.transforms.region.ISelect#

Index select on a dataset with sel_kwargs

Parameters:

sel_kwargs (dict[str, Any] | None)

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

pyearthtools.data.transforms.region.PointBox(point, size)#

Create a region bounding box of size around point

Parameters:
  • point (tuple[float]) – Latitude and Longitude point

  • size (float) – Size in degrees to expand the box. Total box width / length = size * 2.

Returns:

Transform to cut region to bounding box around point

Return type:

(Transform)
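The relationship between point, size and the resulting box can be sketched as simple arithmetic. The helper point_box_bounds is hypothetical and assumes the box is centred on the point, which is consistent with the stated total width of size * 2; the returned order mirrors the Bounding arguments.

```python
def point_box_bounds(point, size):
    """Return (min_lat, max_lat, min_lon, max_lon) for a box centred on point."""
    lat, lon = point
    return (lat - size, lat + size, lon - size, lon + size)


bounds = point_box_bounds((-35.0, 149.0), 2.5)
print(bounds)
```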

pyearthtools.data.transforms.region.Lookup(key, regionfile=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/pyearthtools/checkouts/latest/packages/data/src/pyearthtools/data/transforms/RegionLookup.yaml'))#

Use string to retrieve preset lat and lon extent to transform geospatial extent

Parameters:
  • key (str) – Lookup key within the preset file

  • regionfile (str | Path) – Yaml File to look for keys in. Defaults to RegionLookupFILE

Raises:

KeyError – If key not in preset file

Returns:

Transform to cut region to the defined bounding box

Return type:

(Transform)

pyearthtools.data.transforms.region.Geosearch(key, column=None, value=None, crs=None, **kwargs)#

Retrieve a Shapefile using static.geographic (pyearthtools.data.static.geographic). Allows selection of the geopandas file, column and value to filter by.

If neither column nor value is provided, all geometry in the geopandas file is used.

Parameters:
  • key (str) – A Geographic (pyearthtools.data.static.geographic) search key

  • column (str | None, optional) – Column in geopandas to search in. Defaults to None.

  • value (list[str] | str, optional) – Values to search for, can be list. Defaults to None.

  • crs (str | None, optional) –

    Coordinate Reference System (CRS) to apply to data. If not provided, the shapefile is checked for CRS information, which is used if present; otherwise an error is raised.

    Can be any code accepted by geopandas. See https://geopandas.org/en/stable/docs/user_guide/projections.html#coordinate-reference-systems

class pyearthtools.data.transforms.region.ShapeFile(shapefile, crs=None)#

Use Shapefile to create region bounding.

Parameters:
Raises:

ImportError – If geopandas cannot be imported

apply(dataset)#

Apply transformation to Dataset

Parameters:

dataset (XR_TYPES) – Dataset to apply transform to

Raises:

NotImplementedError – Base Transform does not implement this function

Returns:

Transformed Dataset

Return type:

XR_TYPES

pyearthtools.data.transforms.utils.parse_dataset(value)#

Attempt to load dataset if value is str or Path. Return the original value if not.

Parameters:

value (str | Path | Any)

Return type:

Any

data.catalog#

class pyearthtools.data.catalog.Catalog(*, catalog_name=None, entries=None)#

Keep a Catalog of Data Sources

Used to track known kwargs for functions

Can be used for any class which specifies the function to_init_dict, which returns a dictionary with the key being the fully featured class name, and the value, a dictionary with the init kwargs. A name kwarg specifies the CatalogEntry name.

Initialise a new Catalog of Data Sources

Parameters:
  • name (str, optional) – Name for this catalog. Defaults to None.

  • named_entries – {name : (Path, ‘Catalog’, CatalogEntry | pyearthtools.data.Index)} Named entries to add to catalog. Names may be None.

  • catalog_name (Optional[str])

  • entries (Optional[dict])

Examples

>>> test_catalog = Catalog()
>>> def return_function(x, **kwargs):
...     return '-'.join([x,*list(kwargs.keys())])
>>> test_catalog.append(CatalogEntry(return_function, name = 'Test', wow = 1))
>>> test_catalog.Test('entry')
'entry-wow'
append(other, *, name=None)#

Append Elements to Catalog.

Parameters:
  • other (str, Path, Catalog, CatalogEntry | pyearthtools.data.DataIndex | dict) – Items to add to Catalog

  • name (str, optional) – Override for name of entry. Defaults to None.

Raises:
  • KeyError – If pyearthtools.data.Index has no attr catalog

  • TypeError – If other not recognised

static load(catalog_to_load, direct_load=False, **kwargs)#

Load saved catalog file into new catalog object

Tip:

If pointed at a folder, the folder is searched for a .cat catalog file. If one is found, that catalog is loaded instead. This can be used to create folders loadable from pyearthtools.

Parameters:
  • catalog_to_load (str | Path) – Filepath to catalog file. All function pointers are converted from str to function pointer.

  • direct_load (bool, optional) – If Catalog contains one entry, this flag can be used to return that index instead. Defaults to False

Raises:

FileNotFoundError – If file does not exist

Returns:

Loaded Catalog

Return type:

Catalog

pop(key)#

Pop element from Catalog

Parameters:

key (str) – Key to pop

Raises:

KeyError – If key not in catalog

Returns:

Popped entry

Return type:

CatalogEntry

remove(key)#

Remove element from catalog

Parameters:

key (str) – Key to remove

Raises:

KeyError – If key not in catalog

save(output_file=None, direct_load=False)#

Save Catalog to specified file

Auto converts any function pointers to fully qualified path

Parameters:
  • output_file (str | Path | None, optional) – Save file path. Defaults to None.

  • direct_load (bool, optional) – If Catalog contains one entry, this flag can be used so that when the catalog is loaded, the index is returned instead. Defaults to False

Return type:

None | dict

to_dict()#

Get catalog as dictionary

class pyearthtools.data.catalog.CatalogEntry(item_class, args=[], *extra_args, name=None, class_path=None, kwargs={}, **extra_kwargs)#

Catalog Entry

Setup Catalog Entry.

Can be used to catalog any class, and the args and kwargs to initialise it.

Parameters:
  • item_class (Callable | None) – Class for which to setup a catalog entry

  • args (list[Any]) – args to be passed to item_class

  • *extra_args – also passed to item_class

  • name (str | None) – Name of this entry

  • class_path (str | None) – Override for class path. If not given will be auto found.

  • **kwargs (dict) – kwargs to be passed to item_class

property call_underlying_function: Any#

Get underlying Class of the catalog entry

Returns: Underlying Class

del_kwargs(key)#

Remove kwargs

Parameters:

key (str) – Key to remove

Raises:

KeyError – If key not found

static from_dict(init_dict, **kwargs)#

Create CatalogEntry from dictionary

This dictionary can be of two forms, one that is the result of CatalogEntry.to_dict(), and the other a more general form.

Form of the init_dict

>>> {
>>>     CLASS:
>>>         { # All are optional
>>>         args: #Arguments to initialise with
>>>         kwargs: #Keyword arguments to initialise with
>>>         name: #Name of entry
>>>         }
>>>
>>> }
Parameters:
  • init_dict (dict) – Initialisation Dictionary.

  • **kwargs – Kwargs to replace init_dict[‘kwargs’] with.

Return type:

CatalogEntry

Returns: Loaded CatalogEntry
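A concrete instance of the general form above may make the nesting clearer. The class path, arguments and entry name below are hypothetical placeholders, and the loop only shows how the pieces can be read back out; it does not call CatalogEntry.from_dict.

```python
init_dict = {
    # Hypothetical fully qualified class path used as the single top-level key.
    "mypackage.indexes.MyIndex": {
        "args": ["/path/to/data"],                  # positional init arguments
        "kwargs": {"variables": ["temperature"]},   # keyword init arguments
        "name": "my_index",                         # name of the entry
    }
}

# Every inner key is optional, so fall back to empty defaults when reading.
for class_path, spec in init_dict.items():
    args = spec.get("args", [])
    kwargs = spec.get("kwargs", {})
    name = spec.get("name")
    print(class_path, args, kwargs, name)
```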

save(output_file=None, direct_load=True)#

Save this CatalogEntry as a catalog at Path

Parameters:
  • output_file (str | Path | None, optional) – Path to savefile. Defaults to None.

  • direct_load (bool, optional) – Whether the index should be returned directly when this catalog entry is loaded. Defaults to True.

Return type:

None | dict

set_kwargs(**kwargs)#

Add extra kwargs

Parameters:

**kwargs (Any) – Extra kwargs

to_dict()#

Convert CatalogEntry into dict

Returns:

Dictionary containing all info needed to reconstruct the object.

Structure:

item_class: Function class path
name: Catalog Entry name
args: Args used to init
kwargs: Kwargs used to init

Return type:

dict

pyearthtools.data.catalog.get_name(obj)#

Get name of object

Parameters:

obj (Any)

Return type:

str

data.collection#

class pyearthtools.data.collection.Collection(*args, **kwds)#

A modified tuple type object which allows attributes and methods to be accessed.

Attributes and methods will be returned as a Collection, thus allowing their attributes and methods to be accessed.

Any item in a Collection can be accessed by using the [] syntax, and can be iterated over.

Examples

>>> collec = pyearthtools.data.Collection({'item_1':10}, {'item_2':42})
>>> collec
Collection Containing:
    {'item_1': 10}
    {'item_2': 42}
>>> collec.keys()
Collection Containing:
    dict_keys(['item_1'])
    dict_keys(['item_2'])
>>> collec[0]
{'item_1': 10}
Parameters:

args (Any)

class pyearthtools.data.collection.LabelledCollection(*args, **kwds)#

A modified immutable dict-like object which allows attributes and methods of the underlying objects to be accessed, while retaining the original names. This allows for a name to be given to a root object, and any operations or attributes from said object will remain linked to that name.

Attributes and methods will be returned as a LabelledCollection, thus allowing their attributes and methods to be accessed.

Any item in a LabelledCollection can be accessed by its given name, and can be iterated over.

Parameters:

kwargs (Any)

data.exceptions#

class pyearthtools.data.exceptions.InvalidIndexError(message, *args)#

If an invalid index was provided

class pyearthtools.data.exceptions.InvalidDataError(message, *args)#

If data cannot be loaded

class pyearthtools.data.exceptions.DataNotFoundError(message, *args)#

If Data was not found

data.load#

pyearthtools.data.load.load(stream, **kwargs)#

Load a saved pyearthtools.data.Index

Parameters:

stream (Union[str, Path]) – Stream to load, can be either path to config or yaml str

Returns:

Loaded Index

Return type:

(pyearthtools.data.Index)

data.time#

pyearthtools.data.time.multisplit(element, splits)#

Split a str by multiple characters.

Parameters:
  • element (str)

  • splits (tuple[str | int, ...])

Return type:

list[str]
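Splitting on several separators at once can be sketched with the standard library's re.split. The function name multisplit_sketch is hypothetical, and converting each separator with str() is an assumption about how non-string entries in splits might be treated.

```python
import re


def multisplit_sketch(element, splits):
    """Split element on any of the given separators."""
    # Escape each separator so characters like "." are matched literally.
    pattern = "|".join(re.escape(str(separator)) for separator in splits)
    return re.split(pattern, element)


parts = multisplit_sketch("2021-02-03T0000", ("-", "T"))
print(parts)
```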

pyearthtools.data.time.find_components(time)#

Find the time components specified in a given time str (i.e. indicate which of year, month, day, hour etc. are set in the time string)

Parameters:

time (str) – String of time, usually in isoformat e.g. ‘2021-02-03T0000’

Returns:

resolution_component -> flag

Return type:

dict[str, bool]

Examples

>>> pyearthtools.data.time.find_components('2020-01')
{'year': True, 'month': True, 'day': False, 'minute': False, 'second': False}
pyearthtools.data.time.strip_to_common_resolution(component)#

Remove common suffix for time resolution vernacular

Parameters:

component (str)

Return type:

str

pyearthtools.data.time.time_delta(time_amount)#

Create a pandas timedelta

Parameters:
  • time (Any) – Time of delta. Can be an int, in which case a unit of ‘minutes’ is applied automatically, or a tuple (int, str), with the str being the unit.

  • time_amount (Any)

Returns:

Discovered pandas timedelta

Return type:

pd.Timedelta
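The int-versus-tuple handling described above can be sketched with the standard library. This stand-in returns datetime.timedelta rather than pd.Timedelta, the name to_timedelta is hypothetical, and only units that datetime.timedelta accepts as keywords (such as "minutes", "hours", "days") will work.

```python
from datetime import timedelta


def to_timedelta(time_amount):
    """Stdlib stand-in: a bare int means minutes, a tuple carries its own unit."""
    if isinstance(time_amount, tuple):
        amount, unit = time_amount
    else:
        amount, unit = time_amount, "minutes"
    return timedelta(**{unit: amount})


print(to_timedelta(10))
print(to_timedelta((10, "days")))
```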

pyearthtools.data.time.time_delta_resolution(timedelta)#

Find resolution of timedelta

Parameters:

timedelta (pd.Timedelta) – Given timedelta

Returns:

Resolution of timedelta

Return type:

TimeResolution

pyearthtools.data.time.range_samples(start, end, step, inclusive=False)#

Cache generation of time samples

Parameters:
class pyearthtools.data.time.Petdt(time, *, resolution=None)#

PyEarthTools Datetime object which has additional functionality relating to temporal resolution and resolution conversion compared to other libraries, and also supports alternative calendars to some degree.

Examples

>>> str(Petdt('2021-01'))
'2021-01'
>>> str(Petdt('2021-01-12'))
'2021-01-12'
Parameters:
  • time (Any) – Time to get resolution of. Can use ‘today’ to get today

  • resolution (str | TimeResolution | None) – Override for resolution specification. Defaults to None.

Notes

time must be a str or Petdt for resolution awareness to take effect. If a str, it must be in ISO format.

Valid time resolutions are:

“year”, “month”, “day”, “hour”, “minute”, “second”, “nanosecond”.

Time when supplied as a string may be underspecified (e.g. just the year).

The resolution of a supplied time string will be inferred from the time components which are present in the string.

If a resolution is specified lower than the specified time string, the datetime will be down-sampled to match the specified resolution.
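The inference and down-sampling rules in these notes can be mimicked with the standard library. This sketch is independent of Petdt: the names infer_resolution and downsample and the format table are hypothetical, and nanosecond resolution is omitted because datetime does not support it.

```python
from datetime import datetime

# One strptime/strftime pattern per resolution, from coarsest to finest.
FORMATS = {
    "year": "%Y",
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%dT%H",
    "minute": "%Y-%m-%dT%H:%M",
    "second": "%Y-%m-%dT%H:%M:%S",
}


def infer_resolution(time_str):
    """Return the resolution whose pattern matches the whole string."""
    for resolution, pattern in FORMATS.items():
        try:
            datetime.strptime(time_str, pattern)
        except ValueError:
            continue
        return resolution
    raise ValueError(f"Unrecognised time string: {time_str!r}")


def downsample(time_str, resolution):
    """Re-express time_str at a coarser resolution by dropping finer components."""
    parsed = datetime.strptime(time_str, FORMATS[infer_resolution(time_str)])
    return parsed.strftime(FORMATS[resolution])


print(infer_resolution("2021-01"))
print(downsample("2021-01-12", "month"))
```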

at_resolution(resolution)#

Get Petdt at specified resolution

Parameters:

resolution (VALID_RESOLUTIONS | Petdt | TimeResolution | TimeDelta, optional) – Temporal Resolution of resulting pyearthtools_datetime.

Raises:

(KeyError) – If resolution is not recognised

Returns:

Petdt at given resolution

Return type:

(Petdt)

property datetime: datetime#

Get datetime.datetime object

datetime64(time_unit='ns')#

Get Petdt as a np.datetime64 in given unit

Parameters:

time_unit (str, optional) – Time unit to get datetime64 in. Defaults to “ns”.

Returns:

Defined time as a np.datetime64

Return type:

np.datetime64

static is_time(time_to_parse)#

Check if object can be parsed to a Petdt

Attempts to make Petdt but catches all exceptions.

Parameters:

time_to_parse (Any) – Object to check if can be Petdt

Returns:

Whether the object can be parsed to a Petdt

Return type:

(bool)
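The "attempt to construct, catch every exception" pattern described above looks roughly like the following. datetime.fromisoformat stands in for the Petdt constructor, and the name is_iso_time is hypothetical.

```python
from datetime import datetime


def is_iso_time(time_to_parse):
    """Return True if the object parses as an ISO datetime, False otherwise."""
    try:
        datetime.fromisoformat(time_to_parse)
    except Exception:
        # Any failure, including a wrong input type, means "not a time".
        return False
    return True


print(is_iso_time("2021-01-12T00:00"))
print(is_iso_time("not a time"))
```

Catching every exception keeps the check safe to call on arbitrary objects, at the cost of hiding the reason a value was rejected.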

to_cftime(calendar='noleap')#

This method will throw an exception if cftime is not installed.

class pyearthtools.data.time.TimeDelta(timedelta=None, *args)#

Create a TimeDelta Object

Effectively a wrapper around the pandas.Timedelta.

If no units are supplied, minutes is automatically assumed.

Parameters:
  • timedelta (Any) – Timedelta arguments, can be int or tuple

  • *args (Any) – Extra Timedelta arguments. If timedelta is int, set unit.

Examples

>>> TimeDelta(10, 'days')
10 days 00:00:00
>>> TimeDelta((10, 'days'))
10 days 00:00:00
>>> TimeDelta(10)
0 days 00:10:00
property np_timedelta: timedelta64#

Numpy timedelta64 of TimeDelta

property pd_timedelta: Timedelta#

Pandas Timedelta

property resolution: TimeResolution#

Resolution of the TimeDelta

class pyearthtools.data.time.TimeRange(start, end, step, *, inclusive=False, use_tqdm=False, desc='', **kwargs)#

Get all timesteps between two points at an interval

Generate all timesteps between start & end at step interval.

Parameters:
  • start (Petdt | str) – Starting time

  • end (Petdt | str) – Ending Time

  • step (TimeDelta | int | tuple) – Step Interval

  • inclusive (bool, optional) – Include end time. Defaults to False.

  • use_tqdm (bool, optional) – Format iterator with tqdm for interactive use. Defaults to False.

  • desc (str, optional) – Description if use_tqdm == True. Defaults to ''.

  • **kwargs (Any, optional) – If using tqdm, all kwargs passed through
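The effect of start, end, step and inclusive can be pictured with a small generator built on the standard library. It is a stand-in for TimeRange, not its implementation: the name time_range_sketch is hypothetical, and datetime / timedelta are used in place of Petdt / TimeDelta.

```python
from datetime import datetime, timedelta


def time_range_sketch(start, end, step, inclusive=False):
    """Yield every timestep from start up to end, optionally including end."""
    current = start
    while current < end or (inclusive and current == end):
        yield current
        current += step


start = datetime(2021, 1, 1, 0)
end = datetime(2021, 1, 1, 6)
step = timedelta(hours=3)

exclusive_steps = list(time_range_sketch(start, end, step))
inclusive_steps = list(time_range_sketch(start, end, step, inclusive=True))
print(len(exclusive_steps), len(inclusive_steps))
```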

data.warnings#

class pyearthtools.data.warnings.pyearthtoolsDataWarning#

General warning for pyearthtools.data processes.

class pyearthtools.data.warnings.IndexWarning#

Data Index Warning.

class pyearthtools.data.warnings.AccessorRegistrationWarning#

Warning for conflicts in object registration.