CodeBreak 16/02/2022

Write efficient code

One student had a general question: they wanted pointers towards writing more manageable and reusable code.

They had an NCL script that compared the output of many models and configurations. 

I advised them to make more use of subroutines. I also suggested that they consider moving to Python in the future, and gave them a few examples of how to use functions in Python. They are also using mostly ASCII files for their data, so I recommended that they move to NetCDF.
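As a minimal sketch of the kind of refactoring suggested (the data and names here are made up for illustration), a single function can replace repeated comparison code:

import numpy as np

def model_bias(model, obs):
    """Mean difference between a model field and the observations."""
    return np.mean(model - obs)

obs = np.array([1.0, 2.0, 3.0])
models = {"modelA": np.array([1.1, 2.0, 3.2]),
          "modelB": np.array([0.9, 2.1, 2.8])}

# The same function is reused for every model and configuration
# instead of copy-pasting the comparison code
biases = {name: model_bias(data, obs) for name, data in models.items()}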

Looping and code organisation in Python

First of all, it is generally a bad idea to use explicit loops in Python, as they are very inefficient. It is recommended to use vectorised operations instead, i.e. operations that are designed to work on a whole array of data at once.
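As a hedged illustration (not from the student's script), compare an explicit loop with the equivalent vectorised NumPy operation:

import numpy as np

temps = np.random.rand(1_000_000)

# Slow: an explicit Python loop visits each element one at a time
total = 0.0
for t in temps:
    total += t
mean_loop = total / temps.size

# Fast: a single vectorised operation on the whole array at once
mean_vec = temps.mean()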

The problem was to calculate various heat averages over different time periods. Each time period was defined by a start date, a peak date and an end date, each stored as a list of indexes over the time axis of the heat variables. The averages needed to be calculated between [start, peak] and [peak, end]. Because there were fewer than 100 time periods, all variables covered a very small domain, and given the way the code had been started, we actually used a loop over the time periods. We also explained:

  • How to store the results in a list and then concatenate them into an array
  • How to build a function around the loop so it is easy to calculate the means for the different heat variables and the two different sets of means ([start, peak] and [peak, end]) without copy-pasting too much code
  • How to build the function so that the same function can be used to calculate a mean over time or a mean over time and space (or any subset of dimensions)

You can see here an example of what was produced.
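For readers without access to that example, here is a minimal sketch of the pattern described above; the variable names and index lists are hypothetical:

import numpy as np
import xarray as xr

def period_mean(da, starts, ends, dim="time", extra_dims=None):
    """Mean of da between each pair of start/end indexes along dim.

    extra_dims optionally lists further dimensions (e.g. ["lat", "lon"])
    to average over, so the same function computes a mean over time
    only, or over time and space.
    """
    dims = [dim] + (extra_dims or [])
    means = []
    for i0, i1 in zip(starts, ends):
        means.append(da.isel({dim: slice(i0, i1 + 1)}).mean(dim=dims))
    # Store the results in a list, then concatenate into a single array
    return xr.concat(means, dim="period")

heat = xr.DataArray(np.random.rand(365, 4, 4), dims=("time", "lat", "lon"))
start, peak, end = [10, 100, 200], [20, 120, 230], [30, 150, 260]

onset = period_mean(heat, start, peak)    # means over [start, peak]
decay = period_mean(heat, peak, end)      # means over [peak, end]
onset_spatial = period_mean(heat, start, peak, extra_dims=["lat", "lon"])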

Dynamic height

This was a query about an error when using a function from the Python wrapper for the Gibbs-SeaWater (GSW) Oceanographic Toolbox. With Argo data in xarray objects it worked fine, but after interpolating the data onto fixed pressure levels it no longer worked. The issue was that the function geo_strf_dyn_height required a 2D pressure array, as the pressure levels varied at each location. The interpolated levels were a simple 1D array, so the solution was to manually broadcast the pressure level array to the same dimensions and coordinates as the original pressure data. This is easily done with the xarray broadcast_like method, and it can be done inline in the function call itself, so something like:

Original:

gsw.geo_strf_dyn_height(SA, CT, p, p_ref=0, axis=1)  

Fixed:

gsw.geo_strf_dyn_height(SA, CT, p.broadcast_like(ds.PRESSURE), p_ref=0, axis=1)
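To illustrate what broadcast_like does, here is a small synthetic example (the dimension names are stand-ins for the real Argo data):

import numpy as np
import xarray as xr

# 1D interpolated pressure levels, and a 2D PRESSURE-like variable
# that varies from profile to profile
p = xr.DataArray(np.linspace(0.0, 2000.0, 21), dims="level")
pressure = xr.DataArray(np.sort(np.random.rand(5, 21) * 2000.0),
                        dims=("profile", "level"))

# p2d now has the same dimensions and shape as pressure
p2d = p.broadcast_like(pressure)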

Memory consumption and optimisation

A student wrote a workflow to analyse a combination of several CMIP6 models and experiments. Their workflow runs smoothly when applied to monthly data; however, they were concerned that it might become inefficient when applied to daily or higher temporal resolution data.

As their code is well structured, we instead reviewed how to manage and monitor memory usage on the OOD JupyterLab. How to set up a cluster and the dask-jupyterlab workspace panel is covered in the OOD documentation. The workspace panel offers monitoring tools, such as the task manager and the workers' CPU and memory panels, which show how a running process is utilising (or not) the available resources.
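For example, a minimal local setup looks like this (the worker counts and memory limit are placeholders; on OOD, follow the documentation mentioned above):

from dask.distributed import Client

# Start a local dask cluster; in JupyterLab the returned client object
# displays links to the dashboards used for monitoring
client = Client(n_workers=4, threads_per_worker=1, memory_limit="4GiB")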

When running a parallel task, the chunking of a file can have a massive impact, so we also discussed chunking both in the file itself and when using dask.
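As a hedged example of chunking on the dask side (the file and variable names and the chunk size are hypothetical; ideally the dask chunks align with the chunking stored in the NetCDF file itself, which can be checked with ncdump -hs):

import xarray as xr

# Open the dataset lazily, splitting it into chunks of 120 time steps
ds = xr.open_dataset("simulation.nc", chunks={"time": 120})

# Operations now run chunk by chunk instead of loading
# the whole variable into memory at once
monthly_mean = ds["tas"].groupby("time.month").mean()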

Their use case is also easily parallelised, since they are running the same workflow on several simulations which are all stored in separate files, so we also covered the use of the GNU parallel tool. This allows you to run the same task on several inputs in parallel.

Both chunking and `parallel` are covered in Scott Wales’ Parallel programming in Climate and Weather training material.

UM model crash

A common error when running UM models is a segmentation fault in the bi_linear_h routine. This is generally caused by the model becoming unstable at a grid point, with the values of a variable increasing dramatically with each timestep. These errors can often be avoided by adding a small perturbation to the most recent restart dump and continuing the run from there. You can also try reducing the timestep for a while.
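As a rough sketch of what such a perturbation can look like, here is one way to do it with the mule library for UM files; the file names, perturbation amplitude and choice of field (theta, STASH code 4) are assumptions for illustration, not a prescription:

import mule
import numpy as np

class PerturbOperator(mule.DataOperator):
    """Add a small random perturbation to a field's data."""
    def __init__(self, amplitude=0.01):
        self.amplitude = amplitude
    def new_field(self, source_field):
        return source_field.copy()
    def transform(self, source_field, new_field):
        data = source_field.get_data()
        return data + np.random.uniform(-self.amplitude,
                                        self.amplitude, data.shape)

ff = mule.FieldsFile.from_file("restart_dump")  # hypothetical file name
perturb = PerturbOperator()

# lbuser4 holds the STASH code; 4 is potential temperature (theta)
ff.fields = [perturb(f) if f.lbuser4 == 4 else f for f in ff.fields]
ff.to_file("restart_dump_perturbed")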