Accessing ERA5 Data

  • NCI, NIWA and the Met Office each hold a local copy of the ERA5 data at 5.625 degree resolution.

  • The locations of the data are listed below; uncomment the line for the organisation you have access to.

  • If you work outside these organisations, follow the instructions in the Downloading_ERA5.ipynb notebook to download a local copy.

Assumptions:

  1. You have already downloaded a full or partial copy of ERA5 at 5.625 degree resolution

  2. You have checked out the PyEarthTools monorepo and have a functional PyEarthTools environment into which you can install new packages

This notebook works through creating a new PyEarthTools package that can interface with the ERA5 dataset, referred to hereafter as “ERA5lowres” for convenience.

This notebook will present two things:

  1. The quick install-and-use demo

  2. How it was built, step by step, so you can do the same for new data

[ ]:
# Uncomment the organisation you have access to.

# wbench_data_dir = '/g/data/wb00/NCI-Weatherbench/5.625deg'                # NCI
# wbench_data_dir = "/nesi/nobackup/niwa00004/riom/weatherbench/5.625deg/"  # NIWA HPC
wbench_data_dir = "/data/users/infolab/weatherbench/5.625deg"               # Met Office

# List the contents of the data directory.
!ls $wbench_data_dir
10m_u_component_of_wind  geopotential_500              total_cloud_cover
10m_v_component_of_wind  potential_vorticity           total_precipitation
2m_temperature           relative_humidity             u_component_of_wind
baselines                specific_humidity             v_component_of_wind
constants                temperature                   vorticity
geopotential             temperature_850
Geopotential             toa_incident_solar_radiation
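
Instead of commenting lines in and out by hand, the choice of directory could be automated. The following is a minimal sketch, not part of PyEarthTools: the helper name `first_existing_dir` is hypothetical, and the candidate paths are simply the three listed in the cell above.

```python
import os

# Candidate locations copied from the cell above.
CANDIDATE_DIRS = [
    "/g/data/wb00/NCI-Weatherbench/5.625deg",                # NCI
    "/nesi/nobackup/niwa00004/riom/weatherbench/5.625deg/",  # NIWA HPC
    "/data/users/infolab/weatherbench/5.625deg",             # Met Office
]


def first_existing_dir(candidates):
    """Return the first candidate that exists as a directory, or None."""
    for candidate in candidates:
        if os.path.isdir(candidate):
            return candidate
    return None


# Hypothetical helper: picks whichever site path is present on this machine.
wbench_data_dir = first_existing_dir(CANDIDATE_DIRS)
if wbench_data_dir is None:
    print("No local copy found; see Downloading_ERA5.ipynb")
else:
    print(f"Using {wbench_data_dir}")
```

If none of the paths exist, the sketch falls back to a message pointing at the download notebook rather than raising an error.
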
[ ]:
# Show the first file in the 2m_temperature directory.
!ls $wbench_data_dir/2m_temperature/ | head -n 1
2m_temperature_1979_5.625deg.nc
[ ]:
# List all 2m_temperature files from the 1980s, using '198' between *wildcards*.
!ls $wbench_data_dir/2m_temperature/*198*
/data/users/infolab/weatherbench/5.625deg/2m_temperature/2m_temperature_1980_5.625deg.nc
/data/users/infolab/weatherbench/5.625deg/2m_temperature/2m_temperature_1981_5.625deg.nc
/data/users/infolab/weatherbench/5.625deg/2m_temperature/2m_temperature_1982_5.625deg.nc
/data/users/infolab/weatherbench/5.625deg/2m_temperature/2m_temperature_1983_5.625deg.nc
/data/users/infolab/weatherbench/5.625deg/2m_temperature/2m_temperature_1984_5.625deg.nc
/data/users/infolab/weatherbench/5.625deg/2m_temperature/2m_temperature_1985_5.625deg.nc
/data/users/infolab/weatherbench/5.625deg/2m_temperature/2m_temperature_1986_5.625deg.nc
/data/users/infolab/weatherbench/5.625deg/2m_temperature/2m_temperature_1987_5.625deg.nc
/data/users/infolab/weatherbench/5.625deg/2m_temperature/2m_temperature_1988_5.625deg.nc
/data/users/infolab/weatherbench/5.625deg/2m_temperature/2m_temperature_1989_5.625deg.nc

Here we see that the data is stored in subdirectories with human-readable names, and within those subdirectories the files are named with the variable, the year and the resolution of the data, with one file per year.
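
As an illustration of that naming scheme, a file name can be split back into its parts with a regular expression. This is a minimal sketch: the pattern is inferred only from the file names listed above, and `parse_filename` is a hypothetical helper, not a PyEarthTools function.

```python
import re

# Pattern inferred from the listing above: <variable>_<year>_<resolution>deg.nc
FILENAME_PATTERN = re.compile(
    r"^(?P<variable>.+)_(?P<year>\d{4})_(?P<resolution>[\d.]+)deg\.nc$"
)


def parse_filename(name):
    """Split a file name into (variable, year, resolution), or return None."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    return match["variable"], int(match["year"]), float(match["resolution"])


print(parse_filename("2m_temperature_1984_5.625deg.nc"))
# ('2m_temperature', 1984, 5.625)
```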

This layout is somewhat different to the layout of the full ERA5 dataset as taken directly from CDS, so we will need to tell PyEarthTools how to deal with this difference.

The challenge for PyEarthTools is to understand this directory structure, to work out the shorthand variable names actually used inside the files (e.g. total_cloud_cover is called “tcc” inside the NetCDF file), and to index the whole archive by shorthand variable name and date. That includes interpreting metadata such as each variable’s units and handling any renaming needed to standardise naming conventions.

This requires some configuration, and a lot of PyEarthTools’ code is about this kind of dataset comprehension.
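
To make the indexing problem concrete, here is a rough sketch of the directory-and-date half of it, using only the standard library. It is not how PyEarthTools does it internally: `index_archive` is a hypothetical helper, and it keys files by directory name rather than by the shorthand names (such as “tcc”), which can only be discovered by opening the files.

```python
import re
import tempfile
from pathlib import Path


def index_archive(root):
    """Map (directory_name, year) to the file path of each NetCDF file under root."""
    index = {}
    for path in sorted(Path(root).glob("*/*.nc")):
        # Assumes the year appears as _YYYY_ in the file name, as in the listing above.
        match = re.search(r"_(\d{4})_", path.name)
        if match:
            index[(path.parent.name, int(match.group(1)))] = path
    return index


# Demonstrate on a throwaway directory tree that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    for year in (1979, 1980):
        target = Path(tmp) / "2m_temperature" / f"2m_temperature_{year}_5.625deg.nc"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    demo_index = index_archive(tmp)
    print(sorted(demo_index))
    # [('2m_temperature', 1979), ('2m_temperature', 1980)]
```

A real indexer also has to deal with units, pressure levels and renaming, which is where the configuration mentioned above comes in.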

Let’s summarise the easy way.

  1. Install the era5lowres Python module (part of the tutorial package)

  2. Set an environment variable called ERA5LOWRES

  3. Import the era5lowres Python module

Assuming you have already installed the era5lowres module, let’s do steps 2 and 3.

[ ]:
# Create an environment variable to store the path to the ERA5 data.
%env ERA5LOWRES=$wbench_data_dir
env: ERA5LOWRES=/data/users/infolab/weatherbench/5.625deg
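
The `%env` magic is only available in IPython and Jupyter. In a plain Python script the same variable can be set through `os.environ`, as sketched below. The sketch sets the variable before any PyEarthTools import, mirroring the order used in this notebook; whether that order is strictly required is an assumption.

```python
import os

# Path used in this notebook; substitute your own location.
wbench_data_dir = "/data/users/infolab/weatherbench/5.625deg"

# Equivalent of `%env ERA5LOWRES=$wbench_data_dir` for a plain Python script.
os.environ["ERA5LOWRES"] = wbench_data_dir
print(os.environ["ERA5LOWRES"])
```
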
[ ]:
# Import pyearthtools packages.
import pyearthtools.data
import pyearthtools.tutorial
[ ]:
# Display information about era5lowres.
pyearthtools.data.archive.era5lowres?
Init signature:
pyearthtools.data.archive.era5lowres(
    variables: 'list[str] | str',
    *,
    level_value: 'int | float | list[int | float] | tuple[list | int, ...] | None' = None,
    transforms: 'Transform | TransformCollection | None' = None,
)
Docstring:      ECWMF ReAnalysis v5
Init docstring:
Setup ERA5 Low-Res Indexer

Args:
    variables (list[str] | str):
        Data variables to retrieve
    resolution (Literal[ERA_RES], optional):
        Resolution of data, must be one of 'monthly-averaged','monthly-averaged-by-hour', 'reanalysis'.
        Defaults to 'reanalysis'.
    level_value: (int, optional):
        Level value to select if data contains levels. Defaults to None.
    transforms (Transform | TransformCollection, optional):
        Base Transforms to apply.
        Defaults to TransformCollection().
File:           ~/Projects/PyEarthTools/packages/tutorial/src/pyearthtools/tutorial/ERA5DataClass.py
Type:           ABCMeta
Subclasses:
[27]:
var = ['u', 'v']
# Note - there is no really straightforward way to just list the variables in the archive
# However, mismatches will cause PyEarthTools to list what's available with a "did you mean" prompt
# A specific listing function should be added in future.
UandV = pyearthtools.data.archive.era5lowres(var)
UandV
[27]:
ERA5LowResIndex
    Description                    ECWMF ReAnalysis v5
             range                          '1970-current'
             Documentation                  'https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation'


    Initialisation
             level_value                    None
             variables                      ['u', 'v']
    Transforms
             StandardCoordinateNames        {'latitude': "['lat', 'Latitude', 'yt_ocean', 'yt']", 'longitude': "['lon', 'Longitude', 'xt_ocean', 'xt']", 'replacement_dictionary': 'None', 'time': "['Time']"}
             Rename                         {'names': {'t2m': "'2t'", 'u10': "'10u'", 'v10': "'10v'", 'siconc': "'ci'"}}
[ ]:
# Inspect the data for a specific date as an Xarray dataset.
# Note that the Data Variables include 'u' and 'v' as expected.
data = UandV['1984-01-01']
data
<xarray.Dataset> Size: 5MB
Dimensions:    (latitude: 32, longitude: 64, level: 13, time: 24)
Coordinates:
  * latitude   (latitude) float64 256B -87.19 -81.56 -75.94 ... 81.56 87.19
  * longitude  (longitude) float64 512B 0.0 5.625 11.25 ... 343.1 348.8 354.4
  * level      (level) int32 52B 50 100 150 200 250 300 ... 600 700 850 925 1000
  * time       (time) datetime64[ns] 192B 1984-01-01 ... 1984-01-01T23:00:00
Data variables:
    u          (time, level, latitude, longitude) float32 3MB dask.array<chunksize=(24, 8, 19, 39), meta=np.ndarray>
    v          (time, level, latitude, longitude) float32 3MB dask.array<chunksize=(24, 8, 19, 39), meta=np.ndarray>
Attributes:
    Conventions:  CF-1.6
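
The sizes reported in the output above can be checked by hand from the dimension lengths. The short calculation below is only a sanity check; it counts the two data variables and ignores the small coordinate arrays.

```python
# Dimension sizes taken from the Dataset output above.
dims = {"time": 24, "level": 13, "latitude": 32, "longitude": 64}
bytes_per_value = 4  # float32

values_per_variable = 1
for size in dims.values():
    values_per_variable *= size

bytes_per_variable = values_per_variable * bytes_per_value
total_bytes = 2 * bytes_per_variable  # 'u' and 'v'

print(f"{bytes_per_variable / 1e6:.2f} MB per variable")   # 2.56 MB per variable
print(f"{total_bytes / 1e6:.2f} MB for both variables")    # 5.11 MB for both variables
```

These figures round to the 3MB per variable and 5MB total shown by Xarray.
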
[ ]:
# Inspect the 'u' variable in the dataset.
data.u
<xarray.DataArray 'u' (time: 24, level: 13, latitude: 32, longitude: 64)> Size: 3MB
dask.array<getitem, shape=(24, 13, 32, 64), dtype=float32, chunksize=(24, 8, 19, 39), chunktype=numpy.ndarray>
Coordinates:
  * latitude   (latitude) float64 256B -87.19 -81.56 -75.94 ... 81.56 87.19
  * longitude  (longitude) float64 512B 0.0 5.625 11.25 ... 343.1 348.8 354.4
  * level      (level) int32 52B 50 100 150 200 250 300 ... 600 700 850 925 1000
  * time       (time) datetime64[ns] 192B 1984-01-01 ... 1984-01-01T23:00:00
Attributes:
    units:          m s**-1
    long_name:      U component of wind
    standard_name:  eastward_wind
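
The coordinate values printed above are consistent with a regular 5.625 degree grid whose latitude points sit at cell centres and whose longitudes start at 0. This is inferred from the printed coordinates rather than from the dataset documentation, so treat the sketch below as an assumption.

```python
RESOLUTION = 5.625  # degrees

# 180 / 5.625 = 32 latitude cells, with points at the cell centres.
n_lat = int(180 / RESOLUTION)
latitudes = [-90 + RESOLUTION / 2 + i * RESOLUTION for i in range(n_lat)]

# 360 / 5.625 = 64 longitude points, starting at 0.
n_lon = int(360 / RESOLUTION)
longitudes = [j * RESOLUTION for j in range(n_lon)]

print(latitudes[0], latitudes[-1])    # -87.1875 87.1875
print(longitudes[0], longitudes[-1])  # 0.0 354.375
```
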
[ ]:
# Inspect the time dimension of the 'u' variable.
data.u.time
<xarray.DataArray 'time' (time: 24)> Size: 192B
array(['1984-01-01T00:00:00.000000000', '1984-01-01T01:00:00.000000000',
       '1984-01-01T02:00:00.000000000', '1984-01-01T03:00:00.000000000',
       '1984-01-01T04:00:00.000000000', '1984-01-01T05:00:00.000000000',
       '1984-01-01T06:00:00.000000000', '1984-01-01T07:00:00.000000000',
       '1984-01-01T08:00:00.000000000', '1984-01-01T09:00:00.000000000',
       '1984-01-01T10:00:00.000000000', '1984-01-01T11:00:00.000000000',
       '1984-01-01T12:00:00.000000000', '1984-01-01T13:00:00.000000000',
       '1984-01-01T14:00:00.000000000', '1984-01-01T15:00:00.000000000',
       '1984-01-01T16:00:00.000000000', '1984-01-01T17:00:00.000000000',
       '1984-01-01T18:00:00.000000000', '1984-01-01T19:00:00.000000000',
       '1984-01-01T20:00:00.000000000', '1984-01-01T21:00:00.000000000',
       '1984-01-01T22:00:00.000000000', '1984-01-01T23:00:00.000000000'],
      dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 192B 1984-01-01 ... 1984-01-01T23:00:00
Attributes:
    long_name:  time
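
Indexing with a date string such as '1984-01-01' returned all 24 hourly steps for that day. As a quick cross-check that needs no access to the data, the expected timestamps can be generated with the standard library:

```python
from datetime import datetime, timedelta

# One timestamp per hour for the day requested above.
start = datetime(1984, 1, 1)
expected_times = [start + timedelta(hours=h) for h in range(24)]

print(len(expected_times))  # 24
print(expected_times[0].isoformat(), expected_times[-1].isoformat())
# 1984-01-01T00:00:00 1984-01-01T23:00:00
```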