As a data analyst mostly working with high level languages and GUI tools, I thought of GNU Make as a tool (from 1976!) for building Linux binaries from compiled languages, and it didn’t occur to me how it could be useful in my workflow. Make is usually part of any curriculum for software engineers learning C/C++, but I seldom if ever see it mentioned in data science courses or tutorials.
Today, I use Make for all of my software and data science projects, and I
consider it the most important tool for reproducibility. It ships with
Linux and Mac, and there is a Windows installer
available. I love that it’s language-agnostic, so it’s perfect for teams
where different people use R, Python, or other languages for their analysis code.
It also gives you a “single point of contact” for all the scripts and commands
in your project, so you can just type
make [task name] instead of having to remember and re-type each command.
The GNU Make Manual
is quite approachable while also providing great detail, but in a nutshell,
you create a file called
Makefile in your project directory and add to it
a set of “rules” for producing output files from input files, in this form:

```make
targets : prerequisites
	recipe
```
A “target” and a “prerequisite” are usually filenames, but can also be just names for tasks (called “phony” targets).
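For instance, a hypothetical Makefile might pair a file-producing rule with a phony task (the file names and script here are illustrative, not from the example later in this post):

```make
# A real file target: rebuilt only when data.csv is newer than report.txt
report.txt : data.csv
	python report.py data.csv > report.txt

# A "phony" target: a named task, not a file on disk, so it always runs
.PHONY : clean
clean :
	rm -f report.txt
```

Declaring `clean` as `.PHONY` tells Make not to look for a file named `clean`, so `make clean` runs its recipe unconditionally.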
So why is this so useful for data science projects?
Data analysis code typically involves five steps for processing data:
- Getting it
- Cleaning it
- Exploring it
- Modeling it
- Interpreting it
Without a disciplined approach, a typical R or Python script or notebook file will do all of these steps together, perhaps interleaved, making it difficult to pull the steps apart so that they can be executed separately.
Why is it important to separate them? Some steps may process large quantities of data, so you don’t want to re-run everything each time you change your script. Caching intermediate results saves you a lot of time as you’re developing.
Also, in multi-language scenarios (say, you want to visualize your data with ggplot2 before training a predictive model with Keras), it’s easiest to separate these steps into different scripts in different languages, rather than doing something complicated like spawning a Python subprocess from within R, or vice versa.
Here’s an example of what a Makefile might look like, for generating a paper written in LaTeX, containing two plots generated by an R script and a Python script, from a CSV file created by a SQL database query:
```make
default : paper.pdf

paper.pdf : paper.tex figure-01.png figure-02.png
	pdflatex paper.tex

figure-01.png : plot.R data.csv
	Rscript plot.R

figure-02.png : plot.py data.csv
	python plot.py

data.csv : query.sql
	sql2csv --db "sqlite:///database.db" query.sql
```
Given this Makefile, when you type
make, it will execute each of these
recipes in dependency order, and then if you type
make again, it will
do nothing, because the dependencies haven’t changed. Make compares
file modification timestamps to see which steps actually need to be redone.
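You can watch this timestamp logic in action with a minimal sketch (this assumes GNU Make and a POSIX shell; the one-rule Makefile and file names are illustrative):

```shell
# Work in a throwaway directory so nothing in the project is touched
demo=$(mktemp -d)
cd "$demo"

# A single rule: output.txt depends on input.txt
printf 'output.txt: input.txt\n\tcp input.txt output.txt\n' > Makefile
echo hello > input.txt

make            # runs the recipe, since output.txt does not exist yet
make            # does nothing: output.txt is newer than input.txt
touch input.txt # bump the prerequisite's timestamp
make            # runs the recipe again, since input.txt is now newer
```

The second `make` reports that the target is up to date without re-running the recipe, which is exactly the caching behavior that saves time in a long pipeline.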
If you change the
query.sql file and type
make again, it will rerun
everything because that’s the first item in the dependency chain, but if you
only change the
paper.tex file, then only the
paper.pdf recipe will be rerun the next time you type make, because
nothing upstream of it changed. This is an incredible time-saver when some of
your recipes take a long time to run, as is often the case with data science projects.
I’ve seen many other tools designed to solve this problem, most of them
language-specific and much more complex to learn and use than Make.
As far as how to structure a new project, there isn’t really an accepted standard data science code project structure or framework. The closest thing might be Cookiecutter Data Science (which also uses a Makefile), though it’s Python-specific and I haven’t seen it used “in the wild” yet. If you use a Makefile in your project, though, it doesn’t matter much how you structure your directories or even what language you use, as thinking in terms of Make recipes will encourage you to break up your analysis scripts into a pipeline of small, dependent, repeatable steps with cached results. Then when a colleague picks up your project to review your work or enhance it, just looking at the Makefile will give them a clear idea of how it’s put together. So use Make for your data science projects and encourage your coworkers to do the same; you will thank each other!