The Two Towers¶
This is part two of a three part post detailing my RSE work on the MetaWards project. In part one I talked about trust in software and data, and why my first step was to port the original code from C to Python.
In this part I am going to discuss how I tackled the two major challenges of research software: how to make it fast, while also keeping it correct.
Climbing Orthanc¶
The first tower to climb is that of robustness, reproducibility and provenance. The most important requirement of any piece of research software is that it is correct. But what does correct mean? In my opinion, correct codes are ones which:
are robust - give the correct output for a given input,
are reproducible - give the same outputs for the same inputs,
report provenance - provide enough information in the output to enable someone else to rerun and reproduce it.
Against some, I have not yet been tested¶
Level one is robustness, which is tackled by testing. While you cannot
test everything, it is
really important to set up an infrastructure and workflow that makes
it easy to add tests as a project develops. To this end, the first thing
I did was set up pytest, and made
sure that this was run regularly. I asked Lester in my team to set up
continuous integration using
GitHub actions. He also
kindly added a status badge to the
main website so that it was very clear
when something was broken. I also made sure that we adopted robust
development practices of using devel
as the development branch,
master
as the release branch, and using feature branches for day-to-day
development. As this way of working would be new for some developers,
we also wrote a
detailed developer guide
that showed how to use feature branches, how to write tests, how to
use virtual environments, how to document code, together with a general
coding style guide that would make it easier for others to learn MetaWards.
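To give a flavour of what these tests look like, here is a minimal pytest-style sketch. The module and function names (run_model, total_infected) are purely illustrative and are not the real MetaWards API; the point is that each test pins down one piece of expected behaviour and is cheap to run on every push:

# hypothetical import - the real MetaWards package layout differs
from metawards import run_model


def test_same_seed_gives_same_result():
    """Two runs with identical inputs must agree exactly."""
    first = run_model(seed=15324, ndays=20)
    second = run_model(seed=15324, ndays=20)
    assert first == second


def test_zero_day_run_has_no_spread():
    """A zero-length outbreak should not infect anyone."""
    result = run_model(seed=15324, ndays=0)
    assert result.total_infected == 0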
Some things that should not have been forgotten were lost¶
Level two is reproducibility, which is tackled by controlling inputs. First, you have to look through the code and find all of the inputs. Once you’ve located them all, you then run the code with those inputs, and verify that you always get the same output.
Hunting for inputs is not as easy as you might think. Yes, there are obvious inputs, like input files and command line arguments. But there are also hidden inputs that lurk, like goblins, in the shadows of the code. These goblins include random number seeds, the time of day, the number of parallel threads, compile-time constants, and the size of the variables (floats or doubles). MetaWards had them all!
This became clear when I first tried to run MetaWards. Its command line looks something like this:
./wards.o 15324 Testing/ncovparams.csv 0 1.0
A check in the code revealed that the first argument (15324) was a random number
seed, the second was the name of the input file, the third was the line
in the input file to read, and the last set a variable called UV.
I ran MetaWards several times using these arguments, and each time, it came back with a different answer. Something was definitely up…
Full of darkness and danger¶
A look in the code revealed that this was because the random number seed
was being adjusted using the current time.
On checking with Leon, it turned out he was doing this because he was running
many jobs and was not confident that the bash $RANDOM
variable that he
was using in his job script would have given him good random number seeds.
Because MetaWards is a stochastic program, this method of seeding meant that it was impossible to reproduce any MetaWards output without knowing exactly what time the program was run. I immediately changed this.
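The fix itself was tiny. As a sketch (using Python’s random module for illustration; the real code seeds its own generator in C), the change was essentially:

import random
import time


def seed_rng_old(user_seed):
    # the seed drifted with the wall clock, so no run could ever be
    # repeated exactly without knowing when it was launched
    random.seed(user_seed + int(time.time()))


def seed_rng_new(user_seed):
    # the seed is exactly what the user supplied, so recording it in
    # the output is enough to rerun the calculation
    random.seed(user_seed)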
Now, every time I ran MetaWards I got the same output. This reproducible output was then used as the integration test for my Python code.
To stop me chasing my tail, I double-checked that both C and Python gave the same sequence of random numbers. I changed both the C and Python codes to print the first five generated random numbers to the screen. This way, I would see immediately if there were any differences (there were not) and if anything changed.
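A sketch of that fingerprint check (the real codes share a common C generator rather than Python’s random module, so the function below is illustrative only):

import random


def print_rng_fingerprint(seed, n=5):
    """Print the first few random numbers generated from a seed, so the
    C and Python runs can be compared at a glance."""
    random.seed(seed)
    values = [random.random() for _ in range(n)]
    print("RNG fingerprint:", " ".join(f"{v:.12f}" for v in values))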
His gaze pierces cloud, shadow, earth and flesh¶
A deeper look at the C code showed that there were a significant number of inputs which weren’t specified on the command line. These included compile-time flags to set the disease being modelled or to switch on or off control measures, such as self isolation or whether the model treated weekends differently.
There were also definitions of disease parameters specified in code, as well as lots of hard coded parameters.
The program also relied on reading data from extra files that were bundled
with the repository, and it looked for those files in a
hard-coded directory (e.g. $HOME/GitHub/MetaWards/2011Data).
This reveals something that I find generally true when I read other people’s
code - how it is written gives me a window into how the developer works and how
they think. In this case, the code is written to work well on the developer’s
own computer only. They make changes to inputs in the source code,
compile a new MetaWards binary,
and then run that to get an output. There was a little bit of tracking of
versions by saving main
functions into
files with different names,
but it is clear that the code was being changed, recompiled and re-run, with
inputs, executables and outputs likely becoming jumbled together.
Maintaining reproducibility with such a workflow may be challenging.
It is hardly possible to separate you¶
To encourage adoption of a reproducible workflow, my next step was to separate all data inputs into a different GitHub repository. I divided the inputs into three classes, depending on whether they were used to describe a disease, a model, or simulation parameters.
I used JSON as the format for any new files I created, as this was both very easy for a human to read and edit, and for a computer to load. I then changed my Python code to use these inputs, and verified that my integration test still passed. I then removed the hard-coded paths, and instead supplied either a clear command line option or an environment variable to enable the user to specify the location of their MetaWardsData clone. Once this worked, I discussed this with Leon and we agreed to make similar changes to the C code (which Chris in my team quickly did).
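As a sketch of the path lookup (the option and variable names here are illustrative rather than the exact ones MetaWards settled on), the logic is simply to prefer an explicit command line option, fall back to an environment variable, and fail loudly rather than guess:

import os
from pathlib import Path


def find_data_dir(cli_option=None, env_var="METAWARDSDATA"):
    """Locate the user's clone of the MetaWardsData repository."""
    if cli_option is not None:
        data_dir = Path(cli_option)
    elif env_var in os.environ:
        data_dir = Path(os.environ[env_var])
    else:
        raise FileNotFoundError(
            "Cannot find MetaWardsData. Pass its location on the "
            f"command line or set the {env_var} environment variable.")

    if not data_dir.is_dir():
        raise FileNotFoundError(f"{data_dir} is not a directory")

    return data_dir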
We want a map, not a Gollum¶
With the second level, reproducibility, climbed, the last level was to embed provenance. To understand why provenance is important, it is useful to consider the hobbits’ journey and the role of Gollum.
Gollum is an interesting character because the hobbits needed him to travel with them, despite the danger he posed and the huge personal cost he suffered.
Gollum was needed as he had been to Mordor before, and so could use his memory of that journey to act as a guide. If he’d kept a detailed map, then the hobbits could simply have used that to recreate Gollum’s steps, and much misadventure and many perilous encounters with spiders would have been avoided.
In much the same way, the path to reproducibility is rocky if we have to enlist the original researcher to manually guide us through the process of recreating their outputs. It costs the researcher time that they likely don’t have, plus risks adventure if their memory is imperfect, or the terrain (operating system, libraries or software) has changed since their last visit.
The solution is that the software itself must create a map of how it was used. This map must be automatically generated, and contain sufficient detail to guide others in the future.
Original or Extended Edition?¶
Embedding provenance of every input and piece of software used in a calculation is a way to start the automatic generation of such a map. To this end, Lester in my team used versioneer to automatically add version tags to MetaWards Python. We also had them added to the files in the MetaWardsData directory. We made sure that these versions, together with all inputs and random number seeds, were written to the outputs produced by MetaWards, for example:
***********************
metawards version 0.6.0
***********************
-- Source information --
repository: https://github.com/metawards/MetaWards
branch: devel
revision: fc15e6280c288f6e445780bf04f2feb67de384e2
last modified: 2020-04-09T21:40:51+0100
is printed at the top of every run. Similarly, the versions (and values) of the parameter and disease data were also printed, e.g.:
Using disease
Disease ncov
loaded from /Users/chris/GitHub/MetaWardsData/diseases/ncov.json
repository: https://github.com/metawards/MetaWardsData
repository_branch: master
repository_version: 0.2.0
beta = [0.0, 0.0, 0.95, 0.95, 0.0]
progress = [1.0, 0.1923, 0.909091, 0.909091, 0.0]
too_ill_to_move = [0.0, 0.0, 0.0, 0.0, 0.0]
contrib_foi = [1.0, 1.0, 1.0, 1.0, 0.0]
This meant that someone reading the outputs would know exactly which versions
of the code and data we used. We even added in a check to see whether the
code contained changes that had not been committed to Git. This let us alert
the user that the code was dirty, and that any outputs may not be reproducible.
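The dirty check is simple in principle. As a sketch of one way to do it (the real check may well come from versioneer’s version string rather than from calling git directly):

import subprocess


def repository_is_dirty(repo_dir="."):
    """Return True if the Git working tree has uncommitted changes."""
    status = subprocess.run(["git", "status", "--porcelain"],
                            cwd=repo_dir, capture_output=True, text=True)
    return bool(status.stdout.strip())


if repository_is_dirty():
    print("WARNING: this code contains uncommitted changes (it is "
          "'dirty'), so any outputs may not be reproducible.")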
Climbing Barad-dûr¶
Now that the Python and C codes were reproducible and in agreement, the next tower to climb required fixing the mûmakil in the room, namely that the Python code was horribly slow.
Level one was that I needed something magical that would quickly speed
up the Python. The bulk
of the code was iterating through the Node
data structures that represented
an electoral ward, and the Link
structs that connected them. I briefly
considered using an existing graph library, but put that on hold as it would
have taken too long (weeks+) to completely refactor. What I thought would be
a quick win was refactoring the Python list of Node and list of Link
objects into Nodes and Links structs of lists (i.e. an array-of-structs
to struct-of-arrays transformation). This would allow me to switch to
iterating over numpy arrays.
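To make the transformation concrete, here is a sketch of the before and after (the field names are illustrative, not the full Node definition):

import numpy as np


# Before: an "array of structs" - a Python list of Node objects
class Node:
    def __init__(self, label, susceptibles):
        self.label = label
        self.susceptibles = susceptibles


nodes_aos = [Node(i, 1000.0) for i in range(10)]


# After: a "struct of arrays" - one Nodes object holding one contiguous
# array per field, ready for whole-array operations
class Nodes:
    def __init__(self, n):
        self.label = np.arange(n, dtype=np.int32)
        self.susceptibles = np.full(n, 1000.0)


nodes_soa = Nodes(10)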
I quickly made the change, using test_integration.py as a guardrail. Feeling duly proud of myself, I ran the test and found the code was four times slower! I switched the numpy arrays for standard Python arrays and now the code was just two times slower. How could Python lists be quicker?
I checked the memory usage: Python lists were quicker, but consumed
1.5 GB compared to ~95 MB when using numpy or arrays. I couldn’t stick with
lists because the memory consumption would have been awful, so I double-checked
what was going on with numpy and arrays. Another profile showed that arrays
were two times slower at indexing for reading and writing compared to lists,
while numpy was two times slower for reading, and four times slower for
writing. A bit of searching on the internet revealed the obvious reason:
arrays and numpy are not designed for random element access. They have to do
a lot of boilerplate to verify the index is correct, and then convert the raw
float or int into a Python float or int object.
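You can see this for yourself with a crude micro-benchmark (a sketch; the exact numbers will depend on your machine and Python version, but the element-by-element ranking should be similar):

import array
import timeit

import numpy as np

n = 100_000
as_list = [float(i) for i in range(n)]
as_array = array.array("d", as_list)
as_numpy = np.asarray(as_list)


def index_all(container):
    """Read every element one at a time, as the model loops do."""
    total = 0.0
    for i in range(len(container)):
        total += container[i]
    return total


for name, container in [("list", as_list), ("array", as_array),
                        ("numpy", as_numpy)]:
    elapsed = timeit.timeit(lambda: index_all(container), number=20)
    print(f"{name:>5}: {elapsed:.3f} s")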
I need eagles¶
At this point I realised that I needed something to come to the rescue. In short, I needed to find a way to compile the code that performed random array access so that I could avoid the costs of moving between raw numbers and Python objects. I tried numba, but it didn’t work. The code was still painfully slow. I needed something more powerful. I needed cython.
Now is the hour! Riders of Rohan!¶
Cython is a fantastic mix of C and Python that I have now grown to love
(it is my current hammer).
You write .pyx
files that merge Python and C. These are compiled to
C, which is then compiled into machine code. The basic idea
is that you use cdef
statements to type your objects. For example,
this Python code…
a = 0.5
b = 10.2
total = 0.0
for i in range(0, 10):
    total += a * b
becomes this Cython code
cdef int i = 0
cdef double a = 0.5
cdef double b = 10.2
cdef double total = 0.0
for i in range(0, 10):
    total += a * b
(I know that this is a pretty silly example - take a look at the excellent cython documentation and basic tutorial for some more real-world examples)
There is a great template package
that makes it easy to write a Python setup.py
for a Cython project,
which I used and have adapted. After some quick tests showed that
Cython would really work, I set about the process of Cythonizing
my code. It was a lot of fun :-)
I know what hunts you¶
One of the reasons I enjoyed cythonizing was because I really enjoy the chase of optimising code. Software is slow for unexpected and oftentimes invisible reasons. You have to use profiling to actually find where a code is slow, and then use your good understanding of how computers work to fix the bottleneck and make it quicker. It is the ultimate experiment-hypothesis-implement-test design cycle, but repeated dozens of times as you move through each bottleneck. It is the best crossword puzzle mixed with the best scientific experiment, with enough dark art and magic to make it a creative as well as an intellectual endeavour.
Eyes always watching¶
I like profilers and use them a lot. But they are blunt tools which disrupt the workflow of running a program. I much prefer in-code instrumenting, whereby a simple profiler is embedded in code and switched on or off with a command line argument.
Simple profilers are very easy to write. They are just calls to a timing function either side of the code you want to profile. The difference between the two timing calls is the time it took to run your code. Timers on computers now have nanosecond resolution, and obtaining the time is essentially free, meaning that profiling adds negligible overhead to your code’s runtime. This means that you can leave profiling switched on, and thus always know how fast your code is running, regardless of which machine it is running on. Indeed, (eventually!) I will write some performance regression tests to add to the suite that will use the output of profiling to make sure that I don’t inadvertently slow down the code.
To make embedded profiling easier to use, I created a very simple Profiler class that recursively measured times of instrumented blocks, e.g.
p = Profiler()
p = p.start("loop")
for i in range(0, 10):
    p = p.start(f"iteration_{i}")
    # do stuff
    p = p.stop()
p = p.stop()
print(p)
This gave fantastically useful output that guided my cythonizing, e.g.:
157 406
S: 55833958 E: 215 I: 126 R: 247778 IW: 11 TOTAL POPULATION 56081862
Total time: 121.805 ms (121.805 ms)
\-timing for iteration 157: 121.805 ms (121.723 ms)
\-additional_seeds: 0.008 ms
\-iterate: 86.709 ms (86.687 ms)
\-iterate: 86.687 ms (86.586 ms)
\-setup: 0.035 ms
\-loop_over_classes: 11.721 ms (11.310 ms)
\-work_0: 2.565 ms
\-play_0: 0.044 ms
\-work_1: 2.573 ms
\-play_1: 0.665 ms
\-work_2: 2.426 ms
\-play_2: 0.226 ms
\-work_3: 2.562 ms
\-play_3: 0.249 ms
\-recovery: 19.200 ms
\-fixed: 21.143 ms
\-play: 34.487 ms
\-extract_data: 17.452 ms (17.434 ms)
\-extract_data: 17.434 ms (17.354 ms)
\-loop_over_classes: 17.291 ms
\-write_to_files: 0.063 ms
\-extract_data_for_graphics: 17.554 ms (17.542 ms)
\-extract_data_for_graphics: 17.542 ms (17.515 ms)
\-loop_over_n_inf_classes: 17.515 ms
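For reference, the core of such a recursive profiler is only a few dozen lines. This is a minimal sketch, not the actual MetaWards Profiler class, but it supports the start/stop nesting used above:

import time


class Profiler:
    """A minimal nested timer: start() opens a child block, stop()
    closes the current block and returns to its parent."""

    def __init__(self, name="total", parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self._start_ns = time.perf_counter_ns()
        self._elapsed_ns = None

    def start(self, name):
        child = Profiler(name, parent=self)
        self.children.append(child)
        return child

    def stop(self):
        self._elapsed_ns = time.perf_counter_ns() - self._start_ns
        return self.parent

    def __str__(self):
        return self._report(0)

    def _report(self, depth):
        if self._elapsed_ns is None:
            # block still running (e.g. the root), report time so far
            elapsed_ns = time.perf_counter_ns() - self._start_ns
        else:
            elapsed_ns = self._elapsed_ns
        lines = ["  " * depth +
                 f"\\-{self.name}: {elapsed_ns / 1e6:.3f} ms"]
        for child in self.children:
            lines.append(child._report(depth + 1))
        return "\n".join(lines)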
All we have to decide is what to do with the time that is given us¶
Now that I knew where the code was slow, level two of the tower was to
decide how I could make it faster. Optimising
Python code using cython is a whole blog post in itself as
there are a lot of tips and tricks that I learned. The main advice
I’d give is to disable all of the index checking and other cleverness
that cython uses to make C behave like Python. Do this by adding
the following to the top of your .pyx files:
#cython: boundscheck=False
#cython: cdivision=True
#cython: initializedcheck=False
#cython: cdivision_warnings=False
#cython: wraparound=False
#cython: binding=False
#cython: nonecheck=False
#cython: overflowcheck=False
Note
Update: I now set these directives centrally in my setup.py script (see the sketch below)
While clever, these cython checks add extra functions around your code that prevent vectorisation, increase indirect memory lookups, and just add unnecessary bloat if you already trust that your Python code is looping through arrays correctly.
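As a sketch of the setup.py route (assuming a standard setuptools build; the package name and file paths are illustrative), the same directives can be passed to cythonize once, rather than repeated at the top of every file:

from setuptools import setup
from Cython.Build import cythonize

# the same directives as above, applied to every .pyx file at build time
compiler_directives = {
    "boundscheck": False,
    "cdivision": True,
    "initializedcheck": False,
    "cdivision_warnings": False,
    "wraparound": False,
    "binding": False,
    "nonecheck": False,
    "overflowcheck": False,
}

setup(
    name="metawards",
    ext_modules=cythonize("metawards/*.pyx",
                          compiler_directives=compiler_directives),
)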
Another piece of advice is to make sure you cdef
everything that is
in a loop. Even forgetting a few variables
has a dramatic impact on performance.
The other advice I have is not to use memory views or other abstractions, but instead just to use raw C pointers for contiguous arrays. When I originally used memory views like this:
cdef double [:] a = python_array
the code was really slow. This was because every index lookup involved
cython trying to work out how to convert index i into the corresponding
index in python_array, as cython didn’t know that the array was contiguous.
Changing this to:
cdef double [::1] a = python_array
gave a big speed up, as now cython knew that this was a one-dimensional contiguous array. However, when looking at the generated C code, it was clear that cython was still adding weird pointer redirections and type conversions that were confusing the compiler. In the end, I switched to:
cdef double * get_double_array_ptr(double_array):
"""Return the raw C pointer to the passed double array which was
created using create_double_array
"""
cdef double [::1] a = double_array
return &(a[0])
cdef double *a = get_double_array_ptr(python_array)
Cython now saw the array as just a double pointer, and so the resulting C code was exactly as if I had written it manually.
Where many paths and errands meet. And whither then? I cannot say¶
With the code optimised, and now faster than the C, the third and final level was to parallelise the code. MetaWards is a single-threaded C code, but inspection of its loops suggested that they should be parallelisable.
The great thing about cython is that it is trivial to use OpenMP to parallelise code. Again, how to do this and how to optimise parallel code in cython is worth another blog post (or, indeed, a workshop), but in a few days a serial code that took minutes became a code that scaled over 32-64 cores and took seconds. Indeed, OpenMP was so good that loops that took 10s-100s of milliseconds were sped up to take only 3-5 ms. It is unlikely, given parallel overheads, that I could make them much faster.
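To show just how little is needed, here is a sketch of a parallel loop in a .pyx file (a toy reduction, not one of the real MetaWards loops):

from cython.parallel import prange


def sum_squares(double [::1] values, int num_threads):
    """Sum the squares of a contiguous double array in parallel.
    Everything inside prange must be able to run without the gil,
    and the += on total is converted into an OpenMP reduction."""
    cdef int i
    cdef int n = values.shape[0]
    cdef double total = 0.0

    for i in prange(n, nogil=True, schedule="static",
                    num_threads=num_threads):
        total += values[i] * values[i]

    return total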
Key pointers for performance are to drop the gil (use nogil), double-check
you’ve dropped the gil, and then take a look at the C code to triple-check
that there isn’t anything that you or cython have implicitly added that
causes something to be generated in the C code that takes the #!!#!! gil!
Note
Update: Since writing this I’ve learned that it is easier to put performance
code into a with nogil: section. This causes an easily-readable compiler
error if I accidentally take the gil, and I can then add a with gil: block
when I really want the gil.
Also, any variables you assign to are automatically lastprivate, which means
that this doesn’t do what you expect:
cdef int i
cdef int broken = 0

with nogil, parallel():
    for i in prange(0, 10):
        if i > 10:
            broken = 1
            break

if broken:
    print("This will be printed as 'broken' has become undefined")
This means that if you want to save a value, then you have to put it into
an array (see the sketch below). Also, increment operators, like
val += something, are automatically converted into OpenMP reductions, which
is wonderful except that they are not supported when you use parallel().
Hence I had lots of code to convert back into val = val + something.
Welcome, my lords, to Isengard¶
So two towers climbed, surely I have now finished? Unfortunately, not yet.
Remember I said that there were hidden inputs to programs, like floating point precision and the number of threads? Well, the journey to optimise the Python code exposed these inputs as sources of reproducibility errors.
The first error was when I was optimising the code. Before cythonising, the Python version was so slow that my integration test only compared outputs up to the 20th day of the simulated outbreak. After half-cythonising, I was able to compare a full outbreak and saw that the results were really different.
Line-by-line comparison showed that a few extra simulated agents were infected on day 28 in the C code who were not infected in the Python code. This propagated over the outbreak to give a very different result.
To understand why this was the case I added an if statement to both the C and Python codes to print the values of all variables on day 27 and 28. From this it was clear that one value was very slightly different. This made me realise that I had a precision error. I had mistakenly changed
float(something) / float(something_else)
into
<float>(something) / <float>(something_else)
when cythonizing. This was an error, as float in Python is a double,
but float in cython is a single-precision C float.
I quickly changed all <float> casts into <double>, amended the test
to run more days, and then everything agreed.
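If it is not obvious why such a tiny difference matters, this little sketch shows how far apart a single- and double-precision division can be; when a value like this sits near an infection threshold, the gap is enough to tip one extra agent into the infected state:

import numpy as np

numerator, denominator = 1.0, 3.0

as_double = np.float64(numerator) / np.float64(denominator)
as_single = np.float32(numerator) / np.float32(denominator)

print(f"double: {as_double:.17f}")          # 0.33333333333333331
print(f"single: {float(as_single):.17f}")   # roughly 0.33333334326744080
print("difference:", abs(as_double - float(as_single)))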
March of the Ents¶
But this was not all… Once the code was parallelised I found another
reproducibility problem. I had made the random number generator thread-safe
by giving each thread its own generator, which was deterministically seeded
based on the main random number generator seeded by the user. I had also
been very careful to use techniques that ensured a deterministic order of
operation of calculations, plus order of assignment of work to threads
(e.g. using schedule="static").
This meant that, in theory, MetaWards would give the same answer for the
same input, when using the same random number seed,
and running on the same number of threads.
However, when I ported MetaWards onto Catalyst, one of the University’s large ARM64 supercomputers, I was able to run large numbers of parallel jobs and quickly saw that a small percentage of jobs were giving different outputs.
Again, I saw that the changes occurred at a specific day of the simulated outbreak. Suspecting this was a parallelisation problem, I systematically disabled parallelisation for different parts of the code by changing
with nogil, parallel(num_threads=num_threads):
to
with nogil, parallel(num_threads=1):
This showed that the difference was occurring on a specific day within one of the parallel loops.
The problem was that the add_to_buffer
function I was using to manage
the large number of reductions in the loop was adding the buffer contents
when each thread had filled it up. As different threads
filled the buffer at different times, this meant that the order of addition
was different for different runs. This was not a problem for most reductions,
as MetaWards is predominantly an integer code. But the variables that
hold the force of infection (foi) are doubles, and so the order of these
reductions did matter. While for >95% of runs the order was the same, there
were a few runs where the differing order caused small numerical differences
that allowed one or two more agents to become infected. These small differences
then propagated through the model outbreak to create different results.
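The underlying issue is simply that floating point addition is not associative, so the order of a reduction changes the answer in the last few bits. A tiny sketch shows it:

a, b, c = 0.1, 0.2, 0.3

print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6
print((a + b) + c == a + (b + c))   # False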
For now, I’ve had to leave this section of the code in serial, and so sacrifice a little speed for reproducibility. I’ve worked out how to do the reduction safely and reproducibly, and plan to fix this problem later this week.
Note
This is now fixed and the function runs in parallel.
What’s next?¶
Well done for getting this far. I said at the beginning that this blog was going to be a long read. So now we have walked from a C code to a Python code, and then via cython we have walked MetaWards back to being a very C-like code. But it has now been transformed into something that is much faster, more modular, more reproducible, more robust and more usable. In the next (and final) part, I will show why I have worked so hard to translate MetaWards to Python. It is the King of Languages, and its power is now enabling me to turn MetaWards into an adaptable Minas Tirith that will cope with anything that the next few weeks could bring.