Healthy code, healthy patients: coding best practices in medical Data Science (Part 2)

By Michele Tonutti, Lead Software Engineer
Date: March 6, 2019
Reading time: 10 min

“Will writing tidy code really help patients when they are rushed into the Intensive Care Unit?”

“Who cares if my code is 100 or 1 billion lines long, if the doctor will only see a probability and a graph?”

“If in order to test my code I need to write more code, do I just keep writing code forever to test the tests?”

These (and many other) questions have probably, in one form or another, popped into the mind of all beginner coders who have started a project in medical data science. The answer to all of the above can be wrapped up in one of my favourite programming-related quotes:

The only useful code is production-ready code.

In a nutshell, production-ready code means code that is bug-free, scalable, well documented, easily readable, reusable, and reproducible. Following this principle will save you endless time, costs, and frustration, and it will ensure that the right results are obtained from the very beginning of a project.

The first part of this article covered version control, IDEs, repository structure, and virtual environments. In this second part I will give some insight into how to write production-ready code in medical data science, using some real-life examples from Pacmed’s own software development process. In particular, I will talk about code design, describing the concepts of abstraction and modularity; I will touch upon the importance of code style and documentation; and I will illustrate how and why we should always write extensive tests.

(Once again my examples will use Python, but the principles apply to any other language!)

Abstract code yields concrete results

During the long Dutch winters, staying DRY does not only refer to needing a raincoat while biking in the rain. In the programming world, it stands for Don’t Repeat Yourself, and it should be…well, repeated like a mantra. The concept of abstraction is a cornerstone for scalable software: each distinct functional operation should be present in just one place in the source code, usually in the form of functions or classes. When similar tasks are carried out by different functions, they should be combined by abstracting them out.

An example of code abstraction applied to Intensive Care data: to calculate the maximum value of each measurement for a patient, we can write just one function that takes two inputs, and call it repeatedly.
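
As a minimal sketch of the idea in the caption above (the data, the measurement names, and the function name `get_max_value` are illustrative assumptions, not Pacmed's actual code), one function with two inputs can replace a copy-pasted block per measurement:

```python
from typing import Dict, List

# Hypothetical ICU measurements for one patient: measurement name -> recorded values.
patient_measurements: Dict[str, List[float]] = {
    "heart_rate": [82.0, 95.0, 110.0],
    "creatinine": [0.9, 1.4, 1.1],
    "temperature": [36.8, 38.2, 37.5],
}


def get_max_value(measurements: Dict[str, List[float]], measurement_name: str) -> float:
    """Return the highest recorded value of one measurement."""
    return max(measurements[measurement_name])


# The same function is called once per measurement, so the logic lives in one place.
max_values = {
    name: get_max_value(patient_measurements, name) for name in patient_measurements
}
print(max_values)
```

If the definition of "maximum" ever changes (for example, to ignore implausible outliers), only `get_max_value` needs to be edited and re-tested.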

In the end, we want our code to look a bit like Lego: beautiful, robust, and modular. Indeed, abstraction makes the code look beautiful by enhancing readability: the functionality of tens, or even hundreds, of lines of code can be reduced to just one function call in your application. Abstraction also increases the scalability of our development process, since each individual function only needs to be written and tested once, and can then be reused in any other script, or even other projects. For instance, at Pacmed we have recently reused big portions of the code written for predicting the incidence of Acute Kidney Injury at the VU Medical Center Intensive Care Unit (ICU), in order to build a model that predicts patients’ length of stay in the ICU at the UMC Utrecht. This allowed us to reach a robust version of the data processing pipeline in just a few weeks, rather than the several months it took the first time around.

Sharing is caring

To achieve efficient code sharing, it is important to have one or more central repositories which are well maintained, clean, and readable. At Pacmed we have our own general code repository, fittingly called PacMagic, which is effectively a custom Python package that contains all the functions needed for any step of a data science pipeline, from preprocessing to modeling, from data analysis to visualization.

Splitting data, scaling it, and plotting it can take as few as four lines of code using a shared code repository.

All of our Data Scientists contribute to it, and we make sure every piece of code in PacMagic is fully tested, documented, and properly structured. This means more time for fun modelling, and less time wasted re-writing the same pre-processing code a bazillion times. Once the data has been processed, we can train a model and analyze its results in less than 10 minutes. It is then possible to build up quickly from a working baseline model, and invest the saved time in researching and implementing more complex techniques, such as Natural Language Processing algorithms for emergency care or Bayesian Neural Networks to process Electronic Health Records in the ICU.

Readability means reliability

As mentioned above, readability is a necessary condition for code to be shared and reused across projects and teams. It is pointless to spend time writing modular and abstract functions if the next person is not going to be able to understand how to use them. To write readable code, it is important to properly document it, comment it, and most of all use good syntax and style.

Good documentation saves lives…

In Python especially, where functions are ubiquitous, docstrings (short for documentation strings) are the main and most efficient approach to documenting code. They are small pieces of text that explain what a function does (not how!), and should include a list, description, and data type of every parameter of the function. The Python community has come up with a few handy conventions for writing docstrings; it’s good to pick one of those and stick to it.

An example of a function with a docstring written in numpy style, which is one of the styles natively supported and recognized by PyCharm. Just by looking at the docstring, one is able to fully understand what the function does and what kind of parameters must be used as an input.
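
A minimal sketch of what such a numpy-style docstring can look like is given below. The function `compute_fluid_balance` and its parameters are hypothetical, chosen only to illustrate the layout:

```python
from typing import List


def compute_fluid_balance(intake_ml: List[float], output_ml: List[float]) -> float:
    """
    Compute the net fluid balance of a patient over a period.

    Parameters
    ----------
    intake_ml : list of float
        Volumes of fluid administered, in millilitres.
    output_ml : list of float
        Volumes of fluid lost (e.g. urine output), in millilitres.

    Returns
    -------
    float
        Total intake minus total output, in millilitres.
    """
    return sum(intake_ml) - sum(output_ml)


# The docstring is available at runtime, which is what documentation tools rely on.
print(compute_fluid_balance.__doc__)
```

Note that the docstring says what the function returns and in which units, without describing how the sum is computed.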

In fact, using one of the supported docstring conventions has many advantages. Apart from enforcing consistency and therefore efficiency, a good IDE will be able to create them automatically for you, given the function inputs and the parameter types. Other tools, such as Sphinx, will recognise docstrings in your code and will enable you to automatically generate full documentation for every function in your repository, which can then be stored or hosted on a private webpage for easy consultation.

…but good code should explain itself

At a certain point in their professional life, every programmer is taught that good code is well-commented code. While this is certainly true to some extent, the reality is that the best code is self-explanatory code. The name of variables, functions, classes, and even files should describe exactly what each of them does. In the end, one should be able to understand what a piece of code does without the need for explanation. This is true especially for Python, whose syntax has the advantage of being particularly human-readable.

For instance, in-line comments should only be used to explain small pieces of logic and workflow, or blocks of code that would be difficult to follow otherwise. If you want to print('Hello world'), there’s really no need to add # Printing hello world next to it: this would just add clutter and actually reduce the readability of the whole script.
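
To make the point about naming concrete, here is a small before-and-after sketch. Both functions do the same thing; all names and the threshold value are made up for illustration and are not clinical guidance:

```python
# Hard to follow: the names say nothing, so a comment has to do all the work.
def f(x, t):
    # keep the values above the threshold
    return [v for v in x if v > t]


# Self-explanatory: the names carry the meaning, so no comment is needed.
def select_values_above_threshold(values, threshold):
    return [value for value in values if value > threshold]


glucose_mmol_per_l = [4.8, 7.9, 11.2, 5.5]
HIGH_GLUCOSE_THRESHOLD_MMOL_PER_L = 7.8  # illustrative cut-off only

high_glucose_readings = select_values_above_threshold(
    glucose_mmol_per_l, HIGH_GLUCOSE_THRESHOLD_MMOL_PER_L
)
print(high_glucose_readings)
```

The second version is longer to type, but a reader can understand the call site without opening the function body.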

To further ensure consistency and readability, following a set of style rules helps greatly. In Python, the most widely used style convention is PEP 8: it dictates a number of rules for code style, variable and function naming, and general code design.

Static typing

As enthusiastic Pythonistas, at Pacmed we have whole-heartedly welcomed the relatively recent addition of type hints, which make static type checking possible: you can annotate the data type that a variable, function parameter, or return value is supposed to hold. The annotations are not enforced at runtime and do not make the code run faster, but they yield more understandable code and allow for automatic error checking by more advanced IDEs and by tools such as mypy. While it may seem a bit cumbersome and unfamiliar at first, I recommend taking a look at this tutorial, which will clear things up and introduce you to the magical world of worrying far less about TypeErrors.
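
A minimal sketch of type hints in practice follows. To stay dependency-free it uses plain dictionaries and lists rather than pandas objects, and the function name `mean_per_measurement` is a hypothetical example:

```python
from typing import Dict, List


def mean_per_measurement(measurements: Dict[str, List[float]]) -> Dict[str, float]:
    """Return the mean of each measurement series."""
    return {name: sum(values) / len(values) for name, values in measurements.items()}


vitals = {"heart_rate": [80.0, 90.0], "temperature": [37.0, 38.0]}
means = mean_per_measurement(vitals)
print(means)

# A static checker such as mypy, or a type-aware IDE, would flag the call below
# before the code ever runs, because a list is not a Dict[str, List[float]].
# Python itself does not check the annotation; the call would only fail at
# runtime, with an AttributeError on `.items()`.
# mean_per_measurement([80.0, 90.0])
```

The hints document the expected inputs and outputs directly in the signature, which complements the docstring rather than replacing it.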

The beauty of static typing: the function signature now specifies that the input “df” must be a pandas DataFrame, and PyCharm will issue a warning if a Series is used instead.

No test left unturned

Great! Now we know how to write readable, modular, and abstract code. But we still have no idea whether that code is going to output what we intended it to. Knowing that the code you write and use works as expected is one of the most crucial parts of software development; this can, and must, be done through careful testing.

Generally, there are different types of tests. Unit tests make sure that individual components (units) of a piece of software work as they should, independently from other code. In most cases, these units are single functions. Integration tests, on the other hand, ensure that these units work as expected when put together, for example in a data processing pipeline. In data science projects this is very important, because one often sees only the result of a long sequence of processing steps, without seeing the intermediate outputs. Python’s native unittest module offers everything you need to implement your own unit and integration tests.
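
The difference between the two kinds of test can be sketched with the unittest module. The two processing steps and the tiny pipeline below are hypothetical stand-ins for real preprocessing code:

```python
import unittest
from typing import List, Optional


def remove_missing(values: List[Optional[float]]) -> List[float]:
    """Drop missing (None) entries."""
    return [value for value in values if value is not None]


def scale_to_unit_range(values: List[float]) -> List[float]:
    """Rescale values linearly to the interval [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


def preprocess(values: List[Optional[float]]) -> List[float]:
    """Pipeline that chains the two steps above."""
    return scale_to_unit_range(remove_missing(values))


class TestRemoveMissing(unittest.TestCase):
    # Unit test: one function, checked in isolation.
    def test_drops_none_entries(self):
        self.assertEqual(remove_missing([1.0, None, 3.0]), [1.0, 3.0])


class TestScaleToUnitRange(unittest.TestCase):
    # Unit tests: typical input and a corner case.
    def test_scales_between_zero_and_one(self):
        self.assertEqual(scale_to_unit_range([2.0, 4.0, 6.0]), [0.0, 0.5, 1.0])

    def test_constant_input_does_not_divide_by_zero(self):
        self.assertEqual(scale_to_unit_range([5.0, 5.0]), [0.0, 0.0])


class TestPreprocessIntegration(unittest.TestCase):
    # Integration test: the steps are checked together, end of pipeline only.
    def test_pipeline_handles_missing_values_then_scales(self):
        self.assertEqual(preprocess([10.0, None, 20.0, 15.0]), [0.0, 1.0, 0.5])
```

Saved in a file such as `test_preprocessing.py` (the file name is an assumption), these tests can be run with `python -m unittest`.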

You may have noticed that what we are talking about here is very different from ‘testing’ in the classical data science meaning, which usually refers to obtaining predictions from a model for a set of patients, checking the performance, analyzing the outputs, etc. Software testing refers to making sure that a piece of code, or a whole pipeline, does exactly what it is meant to be doing; even for the best programmers, this is not always guaranteed to happen. Effectively, this means writing extra code to test previously-written code. This may seem extremely annoying, and possibly a waste of time — so why should we do it? Here are some of the countless benefits:

1) Agile software development: code can be changed easily and at any time without breaking old code. When new code is implemented in the codebase, all tests can be run automatically to make sure that the rest of the code has not been affected by the changes. Most version control platforms, such as GitLab, offer the possibility to implement automatic pipelines to run all tests in a repository every time changes are made.

2) Better code quality and software design: You can be confident that the code you are using works as expected. You can safely re-use code written by other people without having to worry about its correctness.

3) Learning: Writing tests also makes you think about why and how your code should work, effectively teaching you to become a better coder in the process.

4) Find bugs early: Many issues can be solved before the code is merged or reviewed by others.

5) Reduce time waste and costs: All of the above ultimately means less time wasted in debugging, which means less frustration for you and less time and money spent by your employer.

6) Make your data engineers happy: an often-overlooked benefit, but actually the most important one! Whoever is going to implement your code in a production environment will be eternally grateful.

When developing medical software, there is also a third type of test, arguably the hardest to implement: the end-to-end test. End-to-end tests are meant to check whether the output of your software makes logical sense, given a raw input; for this reason, they require a high degree of domain knowledge. For instance, if a model predicts the probability of a patient having diabetes, a senior patient with high glucose values and a high BMI will be expected to yield a high probability. At Pacmed, data scientists work side by side with doctors from the very initial phases of every project, in order to design and perform sensible end-to-end tests and obtain meaningful output from the resulting code.
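
The diabetes example above could be turned into an end-to-end sanity check along the following lines. This is only a sketch: `predict_diabetes_probability` is a toy stand-in for a trained model, and its coefficients are invented for illustration, not clinically meaningful:

```python
import math


def predict_diabetes_probability(
    age_years: float, glucose_mmol_per_l: float, bmi: float
) -> float:
    """Toy stand-in for a trained model; the coefficients are invented."""
    score = -12.0 + 0.04 * age_years + 0.6 * glucose_mmol_per_l + 0.12 * bmi
    return 1.0 / (1.0 + math.exp(-score))


def test_high_risk_patient_scores_higher_than_low_risk_patient() -> None:
    # Expectation agreed with domain experts: an older patient with high
    # glucose and a high BMI should receive a high predicted probability.
    high_risk = predict_diabetes_probability(
        age_years=75, glucose_mmol_per_l=13.0, bmi=36.0
    )
    low_risk = predict_diabetes_probability(
        age_years=25, glucose_mmol_per_l=4.8, bmi=21.0
    )
    assert 0.0 <= low_risk <= 1.0
    assert 0.0 <= high_risk <= 1.0
    assert high_risk > low_risk
    assert high_risk > 0.5


test_high_risk_patient_scores_higher_than_low_risk_patient()
```

The value of such a test lies in the choice of patients and expected outcomes, which is exactly where clinical input is needed.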

KISS goodbye to complexity

Ultimately, test everything and test often! In fact, some say it’s even better to write tests before writing the code itself: this forces you to think about what you want your code to do, what you don’t want it to do, and how you want to achieve that.
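
A small sketch of this test-first workflow: the tests are written first and pin down both the wanted behaviour and the unwanted one, and the function is implemented afterwards to satisfy them. The function `length_of_stay_days` is a hypothetical example:

```python
import unittest
from datetime import datetime


# Step 1: write the tests, which act as a specification.
class TestLengthOfStay(unittest.TestCase):
    def test_counts_full_days_between_admission_and_discharge(self):
        admission = datetime(2019, 3, 1, 12, 0)
        discharge = datetime(2019, 3, 4, 12, 0)
        self.assertEqual(length_of_stay_days(admission, discharge), 3.0)

    def test_rejects_discharge_before_admission(self):
        # What we do NOT want: a silent negative length of stay.
        with self.assertRaises(ValueError):
            length_of_stay_days(datetime(2019, 3, 4), datetime(2019, 3, 1))


# Step 2: write the simplest implementation that makes the tests pass.
def length_of_stay_days(admission: datetime, discharge: datetime) -> float:
    """Return the length of stay in days."""
    if discharge < admission:
        raise ValueError("discharge cannot precede admission")
    return (discharge - admission).total_seconds() / 86400
```

Writing the second test first forces the decision about invalid input to be made explicitly, rather than discovered later in production data.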

However, it can be hard to write tests that cover every single corner case; often functions can become so complex that it becomes practically impossible to foresee every single bug. To avoid this, we like to work by the principle that “if your code is hard to test, it’s hard to use”. This can be summarised by the KISS principle: “Keep It Simple, Stupid!”. Simple, modular (see above), and functional code will be better performing, more maintainable, and also more readable.

Conclusion

Data science is, without a doubt, a lot of fun; creativity and curiosity play a huge role in building successful models and getting great results. When building medical software, however, structured and organized coding practices are paramount in order to obtain results that will make a real impact on the lives of patients.

Remember: as we talked about in the first part of the article, good writing tools and development environments will help you follow these guidelines, and allow you to easily produce beautiful, useful, and well-performing code in an efficient way.

Happy coding!