Modular Sequence of Operations

While pyearthtools.data has the concept of transforms, operations applied to the data upon retrieval or at the user's whim, these lack the modularity, composability and reversibility needed to prepare data for ML and other downstream tasks.

Enter pyearthtools.pipeline. It is built to provide a way to create pipelines, sequences of operations applied step by step to data, and, crucially, to let those steps be reversed. Ideally, data can be retrieved, run through these operations, and the result then undone with the same pipeline, appearing as if no operations had been applied to it at all.
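To make the idea of a reversible sequence concrete, here is a minimal sketch in plain Python. It does not use pyearthtools at all: `Step`, `ToyPipeline` and the two example steps are hypothetical names invented for illustration, and the real `Pipeline` class may be organised quite differently. The point it shows is that undoing must run the inverse of each step in the opposite order.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class Step:
    """Hypothetical reversible step: a forward function and its inverse."""

    apply: Callable[[Any], Any]
    undo: Callable[[Any], Any]


class ToyPipeline:
    """Illustrative stand-in, not the pyearthtools.pipeline.Pipeline class."""

    def __init__(self, steps: Sequence[Step]):
        self.steps = list(steps)

    def apply(self, data):
        # Forward pass: run each step in the order it was added.
        for step in self.steps:
            data = step.apply(data)
        return data

    def undo(self, data):
        # Reverse pass: invert the steps, last one first.
        for step in reversed(self.steps):
            data = step.undo(data)
        return data


# Two steps that do not commute, so the undo order matters.
kelvin_to_celsius = Step(
    apply=lambda xs: [x - 273.15 for x in xs],
    undo=lambda xs: [x + 273.15 for x in xs],
)
divide_by_ten = Step(
    apply=lambda xs: [x / 10 for x in xs],
    undo=lambda xs: [x * 10 for x in xs],
)

toy = ToyPipeline([kelvin_to_celsius, divide_by_ten])
original = [264.69, 265.45]
prepared = toy.apply(original)
restored = toy.undo(prepared)  # equal to original up to floating-point rounding
```

If `undo` walked the steps in forward order instead, the values would be shifted before being rescaled and would not match the input.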

[1]:

%%capture
import pyearthtools.pipeline
import site_archive_nci

The construction of these pipelines can be quite complex, so it is best to take an iterative approach, slowly adding more steps and checking the output to ensure it is what you expect it to be.

By default, pyearthtools.pipeline provides the basic building blocks needed to prepare data, which should be enough for most cases, but it can easily be extended to add additional features.

As PyEarthTools is an ecosystem of tools, pyearthtools.pipeline builds heavily on pyearthtools.data. Most critically, it expects pyearthtools.data.Indexes as the sources of data.

The crucial class in pyearthtools.pipeline is Pipeline. It is the controller of the operations, handling retrieval, iteration and much more.

[2]:

sample = pyearthtools.pipeline.Pipeline.sample()
sample

Pipeline
    Description                    `pyearthtools.pipeline` Data Pipeline


    Initialisation
             exceptions_to_ignore           None
             iterator                       None
             sampler                        None
    Steps
             ERA5                           {'ERA5': {'level_value': 'None', 'product': "'reanalysis'", 'variables': "['2t']"}}
             conversion.ToNumpy             {'ToNumpy': {'reference_dataset': 'None', 'run_parallel': 'False', 'saved_records': 'None', 'warn': 'True'}}

Graph

[Figure: graph of the pipeline steps]

Here is a basic pipeline, consisting of only two steps, the root pyearthtools.data.Index of ERA5, and a conversion from xarray to numpy.

Just like a pyearthtools.data.Index, this can be indexed, but the operations will be applied in sequential order.

[3]:

sample['2000-01-01T00']

[3]:

array([[[[264.69238383, 264.69238383, 264.69238383, ..., 264.69238383,
          264.69238383, 264.69238383],
         [265.4507953 , 265.4507953 , 265.44750501, ..., 265.45244044,
          265.45244044, 265.45244044],
         [265.80779158, 265.80614644, 265.80450129, ..., 265.81272702,
          265.81108187, 265.80943673],
         ...,
         [242.63429067, 242.63758096, 242.64087125, ..., 242.62277466,
          242.62606495, 242.62935524],
         [243.04722186, 243.05051215, 243.05215729, ..., 243.04228643,
          243.04393157, 243.04557671],
         [243.44041132, 243.44041132, 243.44041132, ..., 243.44041132,
          243.44041132, 243.44041132]]]])

See, we now have a numpy array directly from the ERA5 index.
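The indexing behaviour above can be pictured with a small stand-alone sketch: the key is passed to the source, and the retrieved data is then handed through each operation in turn. `ToyIndex`, `ToyIndexedPipeline` and the two functions are hypothetical names made up for this example, with no connection to the actual pyearthtools implementation.

```python
class ToyIndex:
    """Hypothetical data source keyed by timestamp string."""

    def __init__(self, records):
        self._records = records

    def __getitem__(self, key):
        return self._records[key]


class ToyIndexedPipeline:
    """Illustrative only: retrieve from the source, then apply each function."""

    def __init__(self, source, *functions):
        self.source = source
        self.functions = functions

    def __getitem__(self, key):
        data = self.source[key]  # the retrieval happens first
        for function in self.functions:  # then every operation, in order
            data = function(data)
        return data


calls = []  # records the order in which the operations actually run


def to_tuple(values):
    calls.append("to_tuple")
    return tuple(values)


def round_values(values):
    calls.append("round_values")
    return tuple(round(v) for v in values)


source = ToyIndex({"2000-01-01T00": [264.7, 265.4]})
pipe = ToyIndexedPipeline(source, to_tuple, round_values)
result = pipe["2000-01-01T00"]
```

After the lookup, `calls` holds the operation names in the order they were given to the pipeline, which is the sequential behaviour described above.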

As mentioned above, undo is an important part of pyearthtools.pipeline.

[4]:

sample.undo(sample['2000-01-01T00'])

And thus the data has been fully converted back into its original form.
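A conversion to a bare array discards labels, so a reversible version of such a step presumably has to remember something from the forward pass in order to restore it. How pyearthtools does this is not shown here. The sketch below is an assumption-laden illustration in plain Python, where the hypothetical `ToyToList` step drops dictionary keys on apply and reattaches them on undo.

```python
class ToyToList:
    """Hypothetical stateful step: drops labels on apply, restores them on undo."""

    def __init__(self):
        self._labels = None  # filled in by the forward pass

    def apply(self, labelled: dict) -> list:
        # Remember the labels so that undo can reattach them later.
        self._labels = list(labelled)
        return list(labelled.values())

    def undo(self, values: list) -> dict:
        if self._labels is None:
            raise RuntimeError("undo called before apply; no labels recorded")
        return dict(zip(self._labels, values))


step = ToyToList()
original = {"south": 243.44, "north": 264.69}
plain = step.apply(original)   # labels are gone: [243.44, 264.69]
restored = step.undo(plain)    # labels are back
```

A consequence of this design is that undo only works after the step has seen data at least once, which is why the sketch raises an error otherwise.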
