CodeBreak 08/12/2021

Trajectory Filtering

With a source dataset of trajectories identifying the numbered region a particle is in at each point of time, identify which particles are moving consistently from unique region to region and which are circling back and forth between a small number of regions.

The most efficient way will be to use array operations, e.g. find transitions between regions with diff = region[1:] - region[:-1]. Count of total regions entered will then be diff.where(diff != 0).count(). Explored in more detail on the CMS blog

Fortran code strange behaviour

Some inherited Fortran Code seemed to modify the content of an array depending on what index one starts to output of the array. We were not able to resolve this. The code is of questionable coding practices with a lot of implicit typing, questionable scoping, and complete lack of indentation. More in-depth analysis is needed that cannot be completed within the CodeBreak time.

Old Python code crash because of package not updated

An old Python code was unable to locate a required module, `global_land_mask` when run in the environment `conda/analysis3` and `conda/analysis3-unstable`. However it worked in `conda/analysis3-21.04`. It turns out that the library had been installed by the user as it is not a conda package and was located in `.local/lib/python3.8`. Since the Python version has been upgraded since analysis3-21.04, the user installed package needs to be reinstalled by the user. 

Using ERA5 data as input for HySplit

HySplit, a particle trajectory code, needs a very specific format for its input meteorological data. On Github, we found this repositorywith code to prepare ERA5 data in GRIB to the HySplit input format. The requestor wanted to use their own Windows computer but this didn’t work well. We’ll follow up with instructions to install and use on Gadi.

Cosima time slicing

When opening a dataset with Cosima cookbook, the dates specified aren't exact - the cookbook will return to the nearest file boundary, so if a file is containing a decade of data and you ask for 1994 to 1996, the cookbook will return a dataset covering 1990 to 2000 since that's what's in the file. You can then get the exact period you need with .sel(time=slice('1994','1996'))

Understanding xarray, Dask and chunking 

Starting from an example script provided by Ram, we reviewed the basics of accessing big datasets. In particular:

  • You can use ncdump -sh <filename> to find out how the original files are chunked  

The -h option shows the file header, adding -s shows the “hidden” attributes, like chunks, compression and more details on the file format

  • Trying to use chunks which are aligned to the way the files are chunked is important
  • Consider how you will be accessing the data in the analysis, if you are averaging along the entire time axis and you have only 1 timestep per chunk (as it is often true for climate data) then you will need to load the entire variable for each grid point.
  • Keeping chunks size around 200MB 
  • Use of a cluster to run in parallel with dask
  • Use vectorise operations as much as you can. I.e. don’t use a loop to execute an operation on the entire variable.
  • Make use of lazy computation by dask. When running in parallel, Dask won’t compute what you write, it only computes the way to do it. So it’s ok to write operations that might be too big for the memory knowing it will only be computed one subset at a time.

To see more on this  

We also discussed useful commands for jobs on the PBS scheduler: uqstat and qcost. Those are included in the nci_scripts module on hh5. See the information on the nci_scripts repository.