Create intake description files
To avoid reading the metadata of a set of files multiple times and to quickly access your dataset of interest (or a subset of it), you are encouraged to create a single .json file that describes the metadata of the full set. This .json file will then be read to speed up the creation of any future xarray dataset.
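As an illustration of what such a reference file enables, here is a minimal sketch of how xarray can open one directly through fsspec's "reference" filesystem. The path is hypothetical, and the zstd option is an assumption matching the compressed .json.zst produced by the script below; the intake catalog created in this section wraps this step for you:

import xarray as xr

# Minimal sketch (hypothetical path): the kerchunk reference file maps onto
# the original netCDF files, so the whole set opens without re-scanning each file.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "/pathto/catalog/kerchunk/dataset.json.zst",
            # assumption: the reference files produced below are zstd-compressed
            "target_options": {"compression": "zstd"},
        },
    },
)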
The creation of the .json file requires the auto-kerchunk pre-processing tool.
First, create the Python environment dedicated to the auto-kerchunk tool:
source "/appli/anaconda/versions/4.8.3/etc/profile.d/conda.csh"
conda create -c conda-forge -n auto-kerchunk python==3.9
conda activate auto-kerchunk
conda install xarray kerchunk ujson h5py zarr fsspec dask rich typer zstandard intake intake-xarray
Then install the dask-hpcconfig and auto-kerchunk packages into your environment following these instructions (use your own extranet login!):
python -m pip install git+https://username@gitlab.ifremer.fr/iaocea/dask-hpcconfig.git
python -m pip install git+https://username@gitlab.ifremer.fr/iaocea/auto-kerchunk.git
or from local clones of the IAOCEA projects:
git clone https://gitlab.ifremer.fr/iaocea/auto-kerchunk
git clone https://gitlab.ifremer.fr/iaocea/dask-hpcconfig
cd pathto/dask-hpcconfig
python -m pip install .
cd pathto/auto-kerchunk
python -m pip install .
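As an optional sanity check (a suggestion, not part of the original instructions), you can verify that both packages import cleanly in the activated environment:
python -c "import dask_hpcconfig, auto_kerchunk"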
Then create the metadata and intake configuration files for your own dataset by submitting a PBS job script such as the following:
#!/bin/bash
#PBS -q mpi_1
#PBS -l walltime=24:00:00
#PBS -l select=1:ncpus=28:mem=128000mb
#PBS -m e
# Uses auto-kerchunk (see https://gitlab.ifremer.fr/iaocea/auto-kerchunk).
# Requires the auto-kerchunk conda environment created above, with the
# dask-hpcconfig and auto-kerchunk packages installed (in this order):
#   conda create -n auto-kerchunk python==3.9 xarray kerchunk ujson h5py zarr fsspec dask rich typer zstandard intake intake-xarray
#   conda activate auto-kerchunk
#   python -m pip install .
# This script creates an intake configuration file to speed up the reading of a set of netCDF files.
# The user needs to adapt the lines marked TOBECHANGED.
source /usr/share/Modules/3.2.10/init/bash
module purge
# >>> conda initialize >>>
source "/appli/anaconda/versions/4.8.3/etc/profile.d/conda.sh"
which conda
conda activate auto-kerchunk
which python
# TOBECHANGED
# Here, update FILES, NAME and the /pathto/ placeholders below
FILES="file:///pathto/best_estimate/20??/MARC_F2-MARS3D-MENOR1200_20??????T??00Z.nc"
NAME="marc_f2_1200_sn_hourly"
TMP=$TMPDIR/JSONS
CATALOGNAME=$NAME
RESULT="/pathto/catalog/kerchunk/$NAME.json.zst"
INTAKE="/pathto/catalog/intake/$NAME.yaml"
# TOBECHANGED
# create the cluster and wait until the scheduler has started
python -m dask_hpcconfig create datarmor-local --workers 14 --pidfile scheduler_address --silent &
until [ -f scheduler_address ]; do sleep 1; done
# DO NOT TOUCH BELOW
date
# create a .json metadata file for each file of interest
python -m auto_kerchunk single-hdf5-to-zarr \
--cluster $(cat scheduler_address) \
$FILES \
$TMP
# combine the per-file metadata into a single zstd-compressed .json.zst file
python -m auto_kerchunk multi-zarr-to-zarr \
--cluster $(cat scheduler_address) \
--compression zstd \
"file://$TMP/*.json" \
$RESULT
chmod go+w $RESULT
# create the intake .yaml catalog for the selected dataset
python -m auto_kerchunk create-intake \
--catalog-name $CATALOGNAME \
--name $NAME \
"file://$RESULT" \
"file://$INTAKE"
chmod go+w $INTAKE
date
# Optionally, concatenate all intake catalogs into a single one:
#cat /home/datawork-lops-iaocea/catalog/intake/*.yaml >/home/datawork-lops-iaocea/catalog/intake.yaml
# shut down the cluster
python -m dask_hpcconfig shutdown $(cat scheduler_address) --silent
# Once this is done, you should be able to do the following in Python:
#import intake
#cat = intake.open_catalog("file:///home/datawork-lops-iaocea/catalog/intake/marc_f1_2500_agrif_seine_hourly.yaml")
#ds = cat.marc_f1_2500_agrif_seine_hourly.to_dask()
#ds.UZ.mean().compute()
# ds contains all the data you listed in $FILES, on which you can then
# compute a mean, for example.
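For reference, here is the same check as a standalone Python snippet, using the NAME set in the script above (UZ is only an example variable; adapt it to your dataset):

import intake

# open the catalog written by the create-intake step above
cat = intake.open_catalog("file:///pathto/catalog/intake/marc_f2_1200_sn_hourly.yaml")

# lazily build the xarray dataset from the kerchunk metadata
ds = cat.marc_f2_1200_sn_hourly.to_dask()

# trigger an actual computation on one variable (UZ as in the example above)
print(ds.UZ.mean().compute())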