End-to-end CNN Training Example#

This notebook illustrates how to use a PyEarthTools pipeline to train a simple Convolutional Neural Network (CNN) model on the WeatherBench2 ERA5 dataset.

The general aim of this machine learning project is to predict the future state of the atmosphere: the model takes the current (or a previous) state as input and predicts the state a number of hours ahead.

We will select specific ERA5 variables in our data_pipeline; feel free to experiment by changing these.

Model input data: We will use a set of specific ERA5 variables at T+0hr as our input features.
Model target data: We will try to predict the same variables at T+6hr (try changing this to predict further ahead).
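The input/target pairing described above can be sketched with plain NumPy. This is only an illustration: the toy array `fields`, the helper `make_pairs`, and the shapes used are assumptions made here, and in the notebook itself the pairing is handled by the pipeline rather than by hand.

```python
import numpy as np

def make_pairs(fields: np.ndarray, step_hours: int = 6, lead_hours: int = 6):
    """Pair each time step with the step `lead_hours` later (hypothetical helper).

    `fields` is assumed to have shape (time, variables, height, width),
    with consecutive time steps `step_hours` apart.
    """
    if lead_hours <= 0 or lead_hours % step_hours != 0:
        raise ValueError("lead_hours must be a positive multiple of step_hours")
    offset = lead_hours // step_hours  # time steps between input and target
    inputs = fields[:-offset]          # states at T+0
    targets = fields[offset:]          # states at T+lead_hours
    return inputs, targets

# Toy data: 8 six-hourly time steps, 5 variables, on a 64x32 grid
fields = np.arange(8 * 5 * 64 * 32, dtype=np.float32).reshape(8, 5, 64, 32)
inputs, targets = make_pairs(fields, lead_hours=6)
print(inputs.shape, targets.shape)
```

Predicting further ahead only changes the offset: with `lead_hours=12` each input is paired with the state two steps later, and one fewer pair is available.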

Import packages and set parameters#

[1]:

import sys
from pathlib import Path

import numpy as np
import xarray as xr
import scores
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from lightning import Trainer, LightningModule
from lightning.pytorch.callbacks import RichProgressBar
from rich.progress import track

import pyearthtools.data
import pyearthtools.tutorial
import pyearthtools.pipeline
import pyearthtools.training

[2]:

# training data split
train_start = "2013-01-01T00"
train_end = "2015-01-12T00"

# Validation data covers the same calendar window (1-12 January), one year after the training period ends.
val_start = "2016-01-01T00"
val_end = "2016-01-12T00"

# Test data covers the same calendar window (1-12 January), two years after the training period ends.
test_start = "2017-01-01T00"
test_end = "2017-01-12T00"

# data loader parameters
batch_size = 1
n_workers = 2

# trainer parameters
max_epochs = 10

# folder to download data and cache intermediate results
workdir = Path("cnn_training")
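To get a feel for how much data each split contains, the sketch below counts 6-hourly time steps between the start and end dates. The helper `count_steps` is hypothetical, and treating the end date as excluded is an assumption; the actual number of samples depends on how the pipeline's date range handles its end point and on whether the last steps still have a T+6hr target available.

```python
import numpy as np

def count_steps(start: str, end: str, interval_hours: int = 6) -> int:
    """Count time steps in [start, end) at a fixed interval (end excluded)."""
    times = np.arange(
        np.datetime64(start, "h"),
        np.datetime64(end, "h"),
        np.timedelta64(interval_hours, "h"),
    )
    return len(times)

print(count_steps("2013-01-01T00", "2015-01-12T00"))  # 2964 (training)
print(count_steps("2016-01-01T00", "2016-01-12T00"))  # 44 (validation)
print(count_steps("2017-01-01T00", "2017-01-12T00"))  # 44 (test)
```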

Model fitting#

[15]:

class CNN(LightningModule):
    def __init__(
        self,
        *,
        n_features: int,
        layer_sizes: list[int],
        dropout: float,
        learning_rate: float,
    ):
        super().__init__()
        self.save_hyperparameters()

        layer_sizes = (n_features,) + tuple(layer_sizes)
        layers = []
        for chan_in, chan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
            layers.extend(
                [
                    nn.Conv2d(chan_in, chan_out, kernel_size=3, stride=1, padding=1),
                    nn.ReLU(),
                    nn.Dropout(p=dropout),
                ]
            )
        layers.append(
            nn.Conv2d(layer_sizes[-1], n_features, kernel_size=3, stride=1, padding=1)
        )
        self.cnn = nn.Sequential(*layers)

        self.learning_rate = learning_rate
        self.loss_function = F.l1_loss

    def forward(self, x):
        return self.cnn(x)

    def training_step(self, batch, batch_idx):
        inputs, targets = batch
        outputs = self(inputs)
        loss = self.loss_function(outputs, targets)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        inputs, targets = batch
        outputs = self(inputs)
        loss = self.loss_function(outputs, targets)
        self.log("val_loss", loss)

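    def predict_step(self, batch, batch_idx):
        # A batch from the training pipeline is an (inputs, targets) pair;
        # a batch from a prediction pipeline contains the inputs only.
        if isinstance(batch, (list, tuple)) and len(batch) == 2:
            return self(batch[0])
        return self(batch)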

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        return {"optimizer": optimizer}
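The model is trained with `F.l1_loss`, which with its default `reduction="mean"` is the mean absolute error over all elements. A minimal NumPy sketch of the same quantity follows; the function name `l1_loss` and the toy arrays are illustrative only.

```python
import numpy as np

def l1_loss(outputs: np.ndarray, targets: np.ndarray) -> float:
    """Mean absolute error over all elements."""
    return float(np.mean(np.abs(outputs - targets)))

outputs = np.array([[1.0, 2.0], [3.0, 4.0]])
targets = np.array([[1.5, 2.0], [2.0, 6.0]])
print(l1_loss(outputs, targets))  # (0.5 + 0.0 + 1.0 + 2.0) / 4 = 0.875
```

Compared with a squared-error loss, the absolute error penalises large deviations less strongly, which is one reason it is sometimes preferred for noisy fields.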

[16]:

# Print the shape of the normalised training data for the given start date
print(data_preparation_normed[train_start][0].shape)

# Extract the number of features from the normalised training data
n_features = data_preparation_normed[train_start][0].shape[-3]
print(f"Number of features: {n_features}")

(5, 64, 32)
Number of features: 5
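Indexing the shape with `-3` picks the channel axis whether or not a leading batch axis is present, which is why the cell above uses `shape[-3]` rather than `shape[0]`. A small sketch with placeholder arrays (the batch size of 4 is arbitrary):

```python
import numpy as np

sample = np.zeros((5, 64, 32))     # (channels, height, width), as printed above
batch = np.zeros((4, 5, 64, 32))   # (batch, channels, height, width)

# The channel axis is third from the end in both layouts.
print(sample.shape[-3], batch.shape[-3])  # 5 5
```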

[17]:

# Define the parameters for the CNN model
model_params = {
    'n_features': n_features,
    'layer_sizes': [64, 64],
    'dropout': 0.6,
    'learning_rate': 1e-5
}

# Initialise the CNN model with the specified parameters
model = CNN(**model_params)

[18]:

model

[18]:

CNN(
  (cnn): Sequential(
    (0): Conv2d(5, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): Dropout(p=0.6, inplace=False)
    (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): ReLU()
    (5): Dropout(p=0.6, inplace=False)
    (6): Conv2d(64, 5, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  )
)
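As a sanity check on the architecture printed above, the following sketch works out the number of trainable parameters and the spatial output size by hand. The helper names are made up for this example; the formulas are the standard ones for a 2D convolution with a bias term (ReLU and Dropout layers have no parameters).

```python
def conv2d_params(chan_in: int, chan_out: int, kernel_size: int = 3) -> int:
    """Weights (chan_in * chan_out * k * k) plus one bias per output channel."""
    return chan_in * chan_out * kernel_size**2 + chan_out

def conv2d_out_size(size: int, kernel_size: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Output length along one spatial axis of a convolution."""
    return (size + 2 * padding - kernel_size) // stride + 1

channels = [5, 64, 64, 5]  # n_features -> layer_sizes -> n_features
total = sum(conv2d_params(i, o) for i, o in zip(channels[:-1], channels[1:]))
print(total)                                     # 42757
print(conv2d_out_size(64), conv2d_out_size(32))  # 64 32
```

The total of 42,757 is consistent with the 42.8 K reported in the Lightning model summary during training, and a 3x3 kernel with stride 1 and padding 1 leaves the 64x32 grid unchanged, so the output has the same shape as the input.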

[19]:

# Uncomment the following line to use the CPU even if a GPU is available.
#%env CUDA_VISIBLE_DEVICES=

Integration With PyTorch Lightning#

The Lightning Data Module#

The data module encapsulates all the data-related operations, including:

Data Preparation: Applying the data preparation pipeline to preprocess the data.
Data Splitting: Splitting the data into training and validation sets.
Batching: Creating batches of data for training and validation.
Multiprocessing: Handling data loading in parallel.

The data module integrates seamlessly with PyTorch Lightning, allowing the user to focus on defining their model and training logic without worrying about the data loading and preprocessing details. When one passes the data module to a PyTorch Lightning trainer, it automatically handles the data loading and batching during training and validation.

Note: Here we use the forkserver start method to prevent deadlocks on Linux when using more than one worker in the data loader.
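The forkserver start method is not available on every platform (for example, it does not exist on Windows). A hedged sketch of how one might pick a start method with a fallback, using only the standard library; the helper `pick_start_method` is hypothetical and not part of PyEarthTools:

```python
import multiprocessing as mp

def pick_start_method(preferred: str = "forkserver") -> str:
    """Return `preferred` if this platform supports it, otherwise 'spawn'."""
    available = mp.get_all_start_methods()
    return preferred if preferred in available else "spawn"

method = pick_start_method()
ctx = mp.get_context(method)  # context object for the chosen start method
print(method)
```

The resulting string could then be passed as `multiprocessing_context` in place of the hard-coded value.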

[20]:

# Initialise the lightning data module for training
data_module = pyearthtools.training.data.lightning.PipelineLightningDataModule(
    data_preparation_normed,    # Data preparation pipeline
    train_split=train_split,    # Training data split
    valid_split=val_split,      # Validation data split
    batch_size=batch_size,      # Batch size for training
    num_workers=n_workers,
    multiprocessing_context="forkserver",
    persistent_workers=True     # Keep workers alive between epochs
)

[21]:

data_module

[21]:

PipelineLightningDataModule
    Initialisation                 Pytorch Lightning DataModule.
             batch_size                     1
             iterator_dataset               False
             multiprocessing_context        'forkserver'
             num_workers                    2
             persistent_workers             True
             pipelines                      {'Pipeline': {'__args': '(Pipeline\n\tDescription                    `pyearthtools.pipeline` Data Pipeline\n\n\n\tInitialisation                 \n\t\t exceptions_to_ignore           None\n\t\t iterator                       None\n\t\t sampler                        None\n\tSteps                          \n\t\t weatherbench.WB2ERA5           {\'WB2ERA5\': {\'download_dir\': "PosixPath(\'cnn_training/download\')", \'level\': \'[850]\', \'license_ok\': \'True\', \'resolution\': "\'64x32\'", \'variables\': "[\'2m_temperature\', \'u\', \'v\', \'geopotential\', \'vorticity\']"}}\n\t\t sort.Sort                      {\'Sort\': {\'order\': "[\'2m_temperature\', \'u_component_of_wind\', \'v_component_of_wind\', \'vorticity\', \'geopotential\']", \'strict\': \'False\'}}\n\t\t coordinates.StandardLongitude  {\'StandardLongitude\': {\'longitude_name\': "\'longitude\'", \'type\': "\'0-360\'"}}\n\t\t reshape.CoordinateFlatten      {\'CoordinateFlatten\': {\'__args\': \'()\', \'coordinate\': "\'level\'", \'skip_missing\': \'False\'}}\n\t\t idx_modification.TemporalRetrieval {\'TemporalRetrieval\': {\'concat\': \'True\', \'delta_unit\': \'None\', \'merge_function\': \'None\', \'merge_kwargs\': \'None\', \'samples\': \'((0, 1), (6, 1))\'}}\n\t\t conversion.ToNumpy             {\'ToNumpy\': {\'reference_dataset\': \'None\', \'run_parallel\': \'False\', \'saved_records\': \'None\', \'warn\': \'True\'}}\n\t\t reshape.Rearrange              {\'Rearrange\': {\'rearrange\': "\'c t h w -> t c h w\'", \'rearrange_kwargs\': \'None\', \'reverse_rearrange\': \'None\', \'skip\': \'False\'}}\n\t\t reshape.Squeeze                {\'Squeeze\': {\'axis\': \'0\'}}, Deviation\n\tInitialisation                 Deviation Normalisation\n\t\t deviation                      PosixPath(\'cnn_training/std.npy\')\n\t\t expand                         False\n\t\t mean                           PosixPath(\'cnn_training/mean.npy\'), Cache\n\tInitialisation             
    An `pyearthtools.pipeline` implementation of the `CachingIndex` from `pyearthtools.data`.\n\t\t cache                          \'/var/home/riomaxim/Synced/work/en_cours/PyEarthTools/notebooks/tutorial/cnn_training/cache\'\n\t\t cache_validity                 \'warn\'\n\t\t pattern                        None\n\t\t pattern_kwargs                 {\'extension\': "\'npy\'"}\n\t\t save_kwargs                    None)', 'exceptions_to_ignore': 'None', 'iterator': 'None', 'sampler': 'None'}}
             train_split                    {'DateRandomise': {'iterator': {'DateRange': {'allowlist': 'None', 'blocklist': 'None', 'end': "'2015-01-12T00'", 'interval': "'6h'", 'start': "'2013-01-01T00'"}}, 'seed': '42'}}
             valid_split                    {'DateRange': {'allowlist': 'None', 'blocklist': 'None', 'end': "'2016-01-12T00'", 'interval': "'6h'", 'start': "'2016-01-01T00'"}}

[22]:

%%time
# Initialise the trainer with the specified parameters
trainer = pyearthtools.training.lightning.Train(
    model,                         # The model to be trained
    data_module,                   # The data module for training
    workdir,                       # Directory to save logs and checkpoints
    max_epochs=max_epochs,         # Maximum number of training epochs
    callbacks=[RichProgressBar(refresh_rate=50)]  # Callbacks for training (e.g., progress bar)
)

# Fit the model
trainer.fit(load=False)

💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA GeForce RTX 4060 Laptop GPU') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

┏━━━┳━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃   ┃ Name ┃ Type       ┃ Params ┃ Mode  ┃
┡━━━╇━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ 0 │ cnn  │ Sequential │ 42.8 K │ train │
└───┴──────┴────────────┴────────┴───────┘

Trainable params: 42.8 K
Non-trainable params: 0
Total params: 42.8 K
Total estimated model params size (MB): 0
Modules in train mode: 8
Modules in eval mode: 0

`Trainer.fit` stopped: `max_epochs=10` reached.

Calculated indexes
Calculated indexes

CPU times: user 2min 14s, sys: 15.7 s, total: 2min 30s
Wall time: 2min 40s
