HadISD Tutorial Two - Downloading The HadISD Dataset

NOTE: Before beginning this tutorial, you should first read HadISD Tutorial One - Introduction to Station Data. The tutorials are intended to be completed in order.

Initial Download

Note - if you are at NCI and a member of project kd24, you may be able to use an already-downloaded version.

If you are working at another facility, it is highly recommended to start with just the test data and run the entire sequence of tutorials on it to get the hang of working with the data. The full dataset takes considerably longer to download and process.

That said, this dataset is entirely reasonable to work with on many laptops, workstations or general computing environments. It just requires a little patience to get up and running smoothly.

[4]:

# A spot to put the data on disk. We keep both the data as-downloaded and the reprocessed version, so you might need up to 50GB free in order to make this work.

import requests
from pathlib import Path
from tqdm.auto import tqdm

DOWNLOAD_DIR = Path('/g/data/kd24/data/') / 'hadisd' / 'as_downloaded'  # We will download data here and keep a copy
DOWNLOAD_DIR.mkdir(parents=True, exist_ok=True)  # Create the directory, including any missing parent directories

# For testing, we download just under 4GB of data
testing_download = [
    "000000-029999", "500000-549999", "722000-722999", "800000-849999",
]

# Download list for all files - these approximately map to station IDs
full_download = [
    "000000-029999", "030000-049999", "050000-079999", "080000-099999",
    "100000-149999", "150000-199999", "200000-249999", "250000-299999",
    "300000-349999", "350000-399999", "400000-449999", "450000-499999",
    "500000-549999", "550000-599999", "600000-649999", "650000-699999",
    "700000-709999", "710000-714999", "715000-719999", "720000-721999",
    "722000-722999", "723000-723999", "724000-724999", "725000-725999",
    "726000-726999", "727000-729999", "730000-799999", "800000-849999",
    "850000-899999", "900000-949999", "950000-999999",
]
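
Since the comment above estimates that up to 50GB may be needed, it can be worth checking free disk space before starting. The following is a minimal sketch using the standard library's shutil.disk_usage. The helper name free_space_gb is hypothetical and not part of the tutorial code, and the example checks the current working directory rather than your real download location.

```python
import shutil
from pathlib import Path


def free_space_gb(path):
    """Return the free space, in GB, on the filesystem that holds `path`.

    Hypothetical helper for illustration. If `path` does not exist yet,
    the nearest existing parent directory is checked instead.
    """
    path = Path(path).resolve()
    while not path.exists():
        path = path.parent
    return shutil.disk_usage(path).free / 1024**3


REQUIRED_GB = 50  # upper estimate quoted in the tutorial comment
available = free_space_gb(Path.cwd())  # swap in your own download location
if available < REQUIRED_GB:
    print(f"Only {available:.1f} GB free; the full download may not fit.")
else:
    print(f"{available:.1f} GB free.")
```

In the notebook, passing DOWNLOAD_DIR instead of Path.cwd() would check the filesystem that will actually receive the data.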

[5]:

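def download_wmo_range(wmo_id_range, download_dir):
    """Download one WMO station-range archive into download_dir, resuming a partial file if one exists.

    Returns the local file path and the archive file name.
    """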
    wmo_str = f"WMO_{wmo_id_range}"
    url = f"https://www.metoffice.gov.uk/hadobs/hadisd/v343_2025f/data/{wmo_str}.tar.gz"
    tar_name = f"{wmo_str}.tar.gz"
    filename = download_dir / tar_name

    head = requests.head(url, allow_redirects=True)
    remote_size = int(head.headers.get('content-length', 0))
    local_size = filename.stat().st_size if filename.exists() else 0

    if filename.exists() and local_size == remote_size:
        print(f"File already fully downloaded: {filename} ({local_size/1024**2:.2f} MB)")
        return filename, tar_name

    if filename.exists() and local_size != remote_size:
        # Users may have done this deliberately, so just print a message
        print(f"Local file size of {filename} does not match the remote size. Attempting to resume or restart. If problems persist, delete it and re-download it")

    headers = {}
    mode = 'wb'
    initial_pos = 0
    if filename.exists() and local_size < remote_size:
        headers['Range'] = f'bytes={local_size}-'
        mode = 'ab'
        initial_pos = local_size
        print(f"Resuming download for {filename.name} at {local_size/1024**2:.2f} MB...")
    else:
        print(f"Starting download for {filename.name}...")

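    response = requests.get(url, stream=True, headers=headers)
    response.raise_for_status()  # Stop here on 404 and similar errors rather than saving an error page
    if 'Range' in headers and response.status_code != 206:
        # The server ignored the Range header and is sending the whole file,
        # so overwrite the partial file instead of appending to it
        mode = 'wb'
        initial_pos = 0
    total = remote_size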
    with open(filename, mode) as f, tqdm(
        desc=f"Downloading {filename.name}",
        total=total,
        initial=initial_pos,
        unit='B', unit_scale=True, unit_divisor=1024
    ) as bar:
        for chunk in response.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)
                bar.update(len(chunk))

    final_size = filename.stat().st_size
    if final_size == remote_size:
        print(f"Download complete: {filename} ({final_size/1024**2:.2f} MB)")
    else:
        print(f"Warning: Download incomplete. Local size: {final_size}, Remote size: {remote_size}")

    return filename, tar_name
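
The function above only compares file sizes, so a file with the right size but damaged contents would pass. As an optional extra check, the sketch below reads the whole gzip stream, which makes the standard library verify the stream's end marker and checksum. The helper name gzip_stream_is_intact is hypothetical, and reading a multi-gigabyte archive this way takes some time.

```python
import gzip
import zlib


def gzip_stream_is_intact(path, chunk_size=1024 * 1024):
    """Return True if the gzip file at `path` decompresses cleanly to the end.

    Hypothetical helper for illustration. A truncated file raises EOFError,
    a bad header or checksum raises OSError (gzip.BadGzipFile), and corrupt
    compressed data raises zlib.error. A missing file also returns False.
    """
    try:
        with gzip.open(path, "rb") as f:
            # Read and discard the decompressed data one chunk at a time
            while f.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False
```

For example, gzip_stream_is_intact(filename) could be called on the path returned by download_wmo_range before moving on to unpacking.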

[ ]:

import traceback

# FOR TEST DOWNLOAD
# for wrange in testing_download:
#     download_wmo_range(wrange, DOWNLOAD_DIR)

# FOR FULL STATION DOWNLOAD
# Note - if you are at NCI doing the hackathon, use the pre-downloaded data
# Note - the DOWNLOAD_DIR directory must exist before running this

for wrange in full_download:
    try:
        download_wmo_range(wrange, DOWNLOAD_DIR)
    except Exception:
        # This is a fault-tolerant approach which will print error messages but continue
        # to try to fetch the remaining files. Catching Exception (rather than using a
        # bare except) means a keyboard or kernel interrupt still stops the loop.
        traceback.print_exc()
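
Because the loop above carries on after a failure, it can be handy to list which archives are still absent once it finishes. This is a minimal sketch under the assumption that files are saved as WMO_<range>.tar.gz, as in download_wmo_range. The helper name missing_archives is hypothetical, and it only checks that a file is present, not that it is complete.

```python
from pathlib import Path


def missing_archives(wmo_ranges, download_dir):
    """Return the WMO ranges that have no .tar.gz file in download_dir.

    Hypothetical helper for illustration; presence only, no size check.
    """
    download_dir = Path(download_dir)
    return [
        wmo_range
        for wmo_range in wmo_ranges
        if not (download_dir / f"WMO_{wmo_range}.tar.gz").exists()
    ]
```

In the notebook this could be called as missing_archives(full_download, DOWNLOAD_DIR), and the download loop re-run over just the returned ranges.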

Unpacking the Data

The next step is easiest to do manually, as it is a bit awkward to do from a notebook cell. We eventually want a directory structure something like this:

/<home>/hadisd
   - as_downloaded
   - unpacked
   - processed
   - by_decade

To set this up, choose a top-level ‘home’ directory for your data downloads. On individual workstations for a single user, the default home directory (e.g. Path.home() ) may be suitable. On shared environments (such as HPC environments) a full path is recommended.
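
If you prefer to create this layout from Python rather than by hand, something like the following sketch would do it. The helper name make_hadisd_layout is hypothetical; the subdirectory names are taken from the listing above.

```python
from pathlib import Path

# Subdirectory names from the directory listing in this tutorial
SUBDIRS = ("as_downloaded", "unpacked", "processed", "by_decade")


def make_hadisd_layout(home):
    """Create <home>/hadisd and its subdirectories, returning them by name.

    Hypothetical helper for illustration. Directories that already exist
    are left untouched.
    """
    root = Path(home) / "hadisd"
    paths = {}
    for name in SUBDIRS:
        paths[name] = root / name
        paths[name].mkdir(parents=True, exist_ok=True)
    return paths
```

On a single-user workstation, make_hadisd_layout(Path.home()) would place everything under a hadisd directory in your home directory.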

In your terminal, change into the as_downloaded directory that holds the .tar.gz files. Make a new directory called unpacked alongside it (that is, at ../unpacked), then run the following command to unpack the data into that directory:

for file in *.tar.gz; do tar -xzf "$file" --directory ../unpacked; done

This creates a large number of individual .nc.gz files in the target directory. Once this is done, change into the unpacked directory and unzip the files by running the command:

gunzip *

Once HadISD tutorial three has been run and the data has been reprocessed, it is okay to delete these interim files, but not before.

Running these commands at the command line (terminal) is much faster than trying to use Python to get the job done.
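
If no suitable shell is available (for example on some Windows setups), the same two steps can be approximated in Python. This is a slower fallback sketch rather than the recommended route. The helper name unpack_archives is hypothetical, and it assumes the archives contain .gz files, as described above.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def unpack_archives(download_dir, unpacked_dir):
    """Extract every .tar.gz in download_dir, then gunzip the extracted files.

    Hypothetical helper for illustration, mirroring the tar and gunzip
    commands in this tutorial.
    """
    download_dir, unpacked_dir = Path(download_dir), Path(unpacked_dir)
    unpacked_dir.mkdir(parents=True, exist_ok=True)

    # Use the safer "data" extraction filter where this Python version has it
    extract_kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}

    # Step 1: equivalent of `tar -xzf <file> --directory ../unpacked`
    for archive in sorted(download_dir.glob("*.tar.gz")):
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(unpacked_dir, **extract_kwargs)

    # Step 2: equivalent of `gunzip *`, also covering any subdirectories
    for gz_path in sorted(unpacked_dir.rglob("*.gz")):
        out_path = gz_path.with_suffix("")  # station.nc.gz -> station.nc
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        gz_path.unlink()  # like gunzip, remove the compressed copy
```

Streaming with shutil.copyfileobj keeps memory use low, since no file is loaded into memory in full.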
