Create intake description files
To avoid reading the metadata of a set of files multiple times and to quickly access your dataset of interest (or a subset of it), you are encouraged to create a single .json file that describes the metadata of the full set. This .json file will then be read to speed up the creation of any future xarray dataset.
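As an illustration of what such a reference file enables, here is a minimal sketch of how xarray can open one directly through fsspec's "reference" filesystem. The path is hypothetical, and the zstd option is an assumption matching the compressed .json.zst produced by the script below; the intake catalog created in this section wraps this step for you:

import xarray as xr

# Minimal sketch (hypothetical path): the kerchunk reference file maps onto
# the original netCDF files, so the whole set opens without re-scanning each file.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "/pathto/catalog/kerchunk/dataset.json.zst",
            # assumption: the reference files produced below are zstd-compressed
            "target_options": {"compression": "zstd"},
        },
    },
)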
The creation of the .json file requires the auto-kerchunk pre-processing tool.
First, create the Python environment dedicated to the auto-kerchunk tool:
source "/appli/anaconda/versions/4.8.3/etc/profile.d/conda.csh"
conda create -c conda-forge -n auto-kerchunk python==3.9
conda activate auto-kerchunk
conda install xarray kerchunk ujson h5py zarr fsspec dask rich typer zstandard intake intake-xarray
Then install the dask-hpcconfig and auto-kerchunk packages into your environment following these instructions (use your own extranet login!):
python -m pip install git+https://username@gitlab.ifremer.fr/iaocea/dask-hpcconfig.git
python -m pip install git+https://username@gitlab.ifremer.fr/iaocea/auto-kerchunk.git
or from local clones of the IAOCEA projects:
git clone https://gitlab.ifremer.fr/iaocea/auto-kerchunk
git clone https://gitlab.ifremer.fr/iaocea/dask-hpcconfig
cd pathto/dask-hpcconfig
python -m pip install .
cd pathto/auto-kerchunk
python -m pip install .
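As an optional sanity check (a suggestion, not part of the original instructions), you can verify that both packages import cleanly in the activated environment:
python -c "import dask_hpcconfig, auto_kerchunk"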
Then create the metadata and intake configuration files for your own dataset by submitting a PBS job script such as the following:
#!/bin/bash
#PBS -q mpi_1
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=28:mem=128000mb
#PBS -m e
# Uses auto-kerchunk (see https://gitlab.ifremer.fr/iaocea/auto-kerchunk).
# Requires the auto-kerchunk conda environment created above, with the
# dask-hpcconfig and auto-kerchunk packages installed (in this order):
#   conda create -n auto-kerchunk python==3.9 xarray kerchunk ujson h5py zarr fsspec dask rich typer zstandard intake intake-xarray
#   conda activate auto-kerchunk
#   python -m pip install .
# This script creates an intake configuration file to speed up the reading of a set of netCDF files.
# The user needs to adapt the lines marked TOBECHANGED.
source /usr/share/Modules/3.2.10/init/bash
module purge
# >>> conda initialize >>>
source "/appli/anaconda/versions/4.8.3/etc/profile.d/conda.sh"
which conda
conda activate auto-kerchunk
which python
# TOBECHANGED
# Here, update FILES, NAME and the /pathto/ placeholders below
FILES="file:///pathto/best_estimate/20??/MARC_F2-MARS3D-MENOR1200_20??????T??00Z.nc"
NAME="marc_f2_1200_sn_hourly"
TMP=$TMPDIR/JSONS
CATALOGNAME=$NAME
RESULT="/pathto/catalog/kerchunk/$NAME.json.zst"
INTAKE="/pathto/catalog/intake/$NAME.yaml"
# TOBECHANGED
# create the cluster and wait until the scheduler has started
python -m dask_hpcconfig create datarmor-local --workers 14 --pidfile scheduler_address --silent &
until [ -f scheduler_address ]; do sleep 1; done
# DO NOT TOUCH BELOW
date
# create a .json metadata file for each file of interest
python -m auto_kerchunk single-hdf5-to-zarr \
--cluster $(cat scheduler_address) \
$FILES \
$TMP
# combine the per-file metadata into a single zstd-compressed .json.zst file
python -m auto_kerchunk multi-zarr-to-zarr \
--cluster $(cat scheduler_address) \
--compression zstd \
"file://$TMP/*.json" \
$RESULT
chmod go+w $RESULT
# create the intake .yaml catalog for the selected dataset
python -m auto_kerchunk create-intake \
--catalog-name $CATALOGNAME \
--name $NAME \
"file://$RESULT" \
"file://$INTAKE"
chmod go+w $INTAKE
date
# Optionally, concatenate all intake catalogs into a single one:
#cat /home/datawork-lops-iaocea/catalog/intake/*.yaml >/home/datawork-lops-iaocea/catalog/intake.yaml
# shut down the cluster
python -m dask_hpcconfig shutdown $(cat scheduler_address) --silent
# Once this is done, you should be able to do the following in Python:
#import intake
#cat = intake.open_catalog("file:///home/datawork-lops-iaocea/catalog/intake/marc_f1_2500_agrif_seine_hourly.yaml")
#ds = cat.marc_f1_2500_agrif_seine_hourly.to_dask()
#ds.UZ.mean().compute()
# ds contains all the data you listed in $FILES, on which you can then
# compute a mean, for example.
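For reference, here is the same check as a standalone Python snippet, using the NAME set in the script above (UZ is only an example variable; adapt it to your dataset):

import intake

# open the catalog written by the create-intake step above
cat = intake.open_catalog("file:///pathto/catalog/intake/marc_f2_1200_sn_hourly.yaml")

# lazily build the xarray dataset from the kerchunk metadata
ds = cat.marc_f2_1200_sn_hourly.to_dask()

# trigger an actual computation on one variable (UZ as in the example above)
print(ds.UZ.mean().compute())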