Interfacing to Data at NCI

PyEarthTools provides a standardised interface to data. It already includes code to access the full ERA5 archive on disk at NCI, not just the 5.625-degree subset from WeatherBench. We’ll start by looking at the PyEarthTools ERA5 interface, then walk through connecting to the low-resolution dataset to show how a new data source is added.

[1]:
import pyearthtools.data
import pyearthtools.pipeline
import warnings

with warnings.catch_warnings(action="ignore"):
    import site_archive_nci
[2]:
pyearthtools.data.archive.NCI?
Type:        module
String form: <module 'site_archive_nci' from '/g/data/kd24/tjl/src/nci/src/site_archive_nci/__init__.py'>
File:        /g/data/kd24/tjl/src/nci/src/site_archive_nci/__init__.py
Docstring:
National Computing Infrastructure specific Indexes

| Name        | Description |
| :---        |       ----: |
| [ERA5][site_archive_nci.ERA5]                | ECWMF ReAnalysis v5       |
| [ACCESS][site_archive_nci.ACCESS]            | Australian Community Climate and Earth-System Simulator       |
| [AGCD][site_archive_nci.AGCD]                | Australian Gridded Climate Data        |
| [BRAN][site_archive_nci.BRAN]                | Bluelink ReANalysis        |
| [OceanMaps][site_archive_nci.OceanMaps]      | Ocean Modelling and Analysis Prediction System        |
| [MODIS][site_archive_nci.MODIS]              | MODerate resolution Imaging Spectroradiometer       |
| [Himawari][site_archive_nci.Himawari]        | Himawari 8/9 satellite data       |
| [BARRA][site_archive_nci.BARRA]              | Bureau of meteorology Atmospheric high-resolution Regional Reanalysis for Australia       |
| [BARPA][site_archive_nci.BARPA]              | Bureau of Meteorology Atmospheric Regional Projections for Australia       |
| [BARRA_V2][site_archive_nci.BARRA_V2]        | Bureau of meteorology Atmospheric high-resolution Regional Reanalysis for Australia v2    |
[3]:
var = ['u', 'v']  # Note: there is currently no straightforward way to list the variables in the archive.
                  # However, requesting an unknown variable makes PyEarthTools list what is available,
                  # with a "did you mean" prompt. A dedicated listing function should be added in future.
UandV = pyearthtools.data.archive.ERA5(var)
UandV
[3]:
ERA5
    Description                    ECWMF ReAnalysis v5
             range                          '1970-current'
             Documentation                  'https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation'


    Initialisation
             level_value                    None
             product                        'reanalysis'
             variables                      ['u', 'v']
    Transforms
             StandardCoordinateNames        {'latitude': "['lat', 'Latitude', 'yt_ocean', 'yt']", 'longitude': "['lon', 'Longitude', 'xt_ocean', 'xt']", 'replacement_dictionary': 'None', 'time': "['Time']"}
             Rename                         {'names': {'t2m': "'2t'", 'u10': "'10u'", 'v10': "'10v'", 'siconc': "'ci'"}}
[4]:
UandV['1984-01-01']  # ERA5 is a reanalysis product, so it is indexed by analysis time. This request
                     # fetches all of the data for the given date.
[4]:
<xarray.Dataset> Size: 15GB
Dimensions:    (longitude: 1440, latitude: 721, level: 37, time: 24)
Coordinates:
  * longitude  (longitude) float32 6kB -180.0 -179.8 -179.5 ... 179.5 179.8
  * latitude   (latitude) float32 3kB 90.0 89.75 89.5 ... -89.5 -89.75 -90.0
  * level      (level) int32 148B 1 2 3 5 7 10 20 ... 875 900 925 950 975 1000
  * time       (time) datetime64[ns] 192B 1984-01-01 ... 1984-01-01T23:00:00
Data variables:
    u          (time, level, latitude, longitude) float64 7GB dask.array<chunksize=(4, 5, 405, 900), meta=np.ndarray>
    v          (time, level, latitude, longitude) float64 7GB dask.array<chunksize=(4, 5, 405, 900), meta=np.ndarray>
Attributes:
    Conventions:  CF-1.6
    license:      Licence to use Copernicus Products: https://apps.ecmwf.int/...
    summary:      ERA5 is the fifth generation ECMWF atmospheric reanalysis o...

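As a quick check on the sizes reported above, the figures can be reproduced from the dimensions in the repr. This is only a back-of-the-envelope sketch: the dimension sizes and the float64 dtype are copied from the output, and nothing here touches the archive. Because the variables are shown as dask arrays, the request describes the data lazily rather than reading all of it up front.

```python
import math

# Dimension sizes copied from the Dataset repr above.
dims = {"time": 24, "level": 37, "latitude": 721, "longitude": 1440}
bytes_per_value = 8  # float64
n_variables = 2      # u and v

values_per_variable = math.prod(dims.values())
bytes_per_variable = values_per_variable * bytes_per_value
total_bytes = bytes_per_variable * n_variables

print(f"{bytes_per_variable / 1e9:.2f} GB per variable")
print(f"{total_bytes / 1e9:.2f} GB in total")
```

One day of two pressure-level variables is therefore already too large to handle casually in memory, which is worth keeping in mind when choosing how much to request at once.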
What this buys you

PyEarthTools abstracts the relationship between data loading and the data source, in this case the filesystem. Once a specific connector to the filesystem has been written, all data sources can be treated in a compatible, generic fashion, and the code that deals with filesystem connections (such as handling different directory layouts, e.g. YYYY, YYYYMM, or YYYYMMDD) is kept modular.
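To make the idea of a filesystem connector concrete, here is a minimal sketch of the kind of date-to-path logic such a connector might encapsulate. It is not the PyEarthTools implementation: the function `path_for`, the `LAYOUTS` table, the root directory, and the file naming scheme are all hypothetical and chosen only for illustration.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical layouts: strftime pattern for the date-based directory.
LAYOUTS = {
    "YYYY": "%Y",
    "YYYYMM": "%Y%m",
    "YYYYMMDD": "%Y%m%d",
}


def path_for(root, variable, when, layout):
    """Return the (hypothetical) file path holding `variable` on `when`."""
    directory = when.strftime(LAYOUTS[layout])
    filename = f"{variable}_{when:%Y%m%d}.nc"
    return PurePosixPath(root) / variable / directory / filename


# Illustrative root only; real archive paths will differ.
example = path_for("/data/example", "u", date(1984, 1, 1), "YYYYMM")
print(example)
```

Keeping this logic in one place means the code that consumes the data only ever asks for a date and never needs to know how the files are arranged on disk.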

It also applies standard coordinate and variable naming so that data sources can be combined more easily. It is common, for example, for one dataset to use ‘lat’ and ‘lon’ instead of ‘latitude’ and ‘longitude’, or ‘t2m’ instead of ‘2t’.
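The renaming step can be illustrated with a small sketch. The alias lists are copied from the StandardCoordinateNames transform shown in the ERA5 repr earlier; the helper `build_rename_map` is hypothetical and is not part of PyEarthTools.

```python
# Aliases taken from the StandardCoordinateNames transform shown earlier.
ALIASES = {
    "latitude": ["lat", "Latitude", "yt_ocean", "yt"],
    "longitude": ["lon", "Longitude", "xt_ocean", "xt"],
    "time": ["Time"],
}


def build_rename_map(names):
    """Map each non-standard name in `names` to its standard equivalent."""
    lookup = {
        alias: standard
        for standard, aliases in ALIASES.items()
        for alias in aliases
    }
    # Names that are already standard, or unknown, are left out of the map.
    return {name: lookup[name] for name in names if name in lookup}


rename_map = build_rename_map(["lat", "lon", "time", "level"])
print(rename_map)
```

With xarray, a mapping like this could be passed to `Dataset.rename` so that every source ends up with the same coordinate names before datasets are merged.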

At this point, it would be possible to write a simple loop such as:

for every date in a training period:
    load it
    call a training update function
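The loop above could be sketched in Python roughly as follows. The only assumption about the data index is that it supports date-string lookup such as `index['1984-01-01']`, as the ERA5 object did earlier. `FakeIndex`, `train_over_period`, and the `update` callback are hypothetical stand-ins so that the example runs without access to the archive.

```python
from datetime import date, timedelta


def daterange(start, end):
    """Yield each date from `start` to `end`, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def train_over_period(index, start, end, update):
    """Load one day at a time from `index` and pass it to `update`."""
    n_days = 0
    for day in daterange(start, end):
        data = index[day.isoformat()]  # e.g. index['1984-01-01']
        update(data)
        n_days += 1
    return n_days


class FakeIndex:
    """Stand-in for a data index; returns a small dict instead of a Dataset."""

    def __getitem__(self, key):
        return {"date": key}


seen = []  # here the "training update" simply records what it was given
n = train_over_period(FakeIndex(), date(1984, 1, 1), date(1984, 1, 3), seen.append)
print(n, seen)
```
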

This is a valid use case. The next notebook, “Data Pipelines with PyEarthTools”, will explore how to make further use of these data objects for data processing and options for presentation to the ML training frameworks.
