As a data analyst mostly working with high level languages and GUI tools, I thought of GNU Make as a tool (from 1976!) for building Linux binaries from compiled languages, and it didn’t occur to me how it could be useful in my workflow. Make is usually part of any curriculum for software engineers learning C/C++, but I seldom if ever see it mentioned in data science courses or tutorials.
Today, I use Make for all of my software and data science projects, and I
consider it the most important tool for reproducibility. It ships with
Linux and Mac, and there is a Windows installer
available. I love that it’s language-agnostic, so it’s perfect for teams
where different people use R, Python, or other languages for their analysis code.
It also gives you a “single point of contact” for all the scripts and commands
in your project, so you can just type
make [task name] instead of having to remember and re-type each command.
The GNU Make Manual
is quite approachable while also providing great detail, but in a nutshell,
you create a file called
Makefile in your project directory and add to it
a set of “rules” for producing output files from input files, in this form:

```make
targets : prerequisites
	recipe
```
A “target” and a “prerequisite” are usually filenames, but can also be just names for tasks (called “phony” targets).
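For instance, a hypothetical Makefile might pair a file-producing rule with a phony task (the file names and script here are illustrative, not from the example later in this post):

```make
# A real file target: rebuilt only when data.csv is newer than report.txt
report.txt : data.csv
	python report.py data.csv > report.txt

# A "phony" target: a named task, not a file on disk, so it always runs
.PHONY : clean
clean :
	rm -f report.txt
```

Declaring `clean` as `.PHONY` tells Make not to look for a file named `clean`, so `make clean` runs its recipe unconditionally.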
So why is this so useful for data science projects?
Data analysis code typically involves five steps for processing data:
- Getting it
- Cleaning it
- Exploring it
- Modeling it
- Interpreting it
Without a disciplined approach, a typical R or Python script or notebook file will do all of these steps together, perhaps interleaved, making it difficult to pull the steps apart so that they can be executed separately.
Why is it important to separate them? Some steps may process large quantities of data, so you don’t want to re-run everything each time you change your script. Caching intermediate results saves you a lot of time as you’re developing.
Also, in multi-language scenarios (say, you want to visualize your data with ggplot2 before training a predictive model with Keras), it’s easiest to separate these steps into different scripts in different languages, rather than doing something complicated like spawning a Python subprocess from within R, or vice versa.
Here’s an example of what a Makefile might look like, for generating a paper written in LaTeX, containing two plots generated by an R script and a Python script, from a CSV file created by a SQL database query:
```make
default : paper.pdf

paper.pdf : paper.tex figure-01.png figure-02.png
	pdflatex paper.tex

figure-01.png : plot.R data.csv
	Rscript plot.R

figure-02.png : plot.py data.csv
	python plot.py

data.csv : query.sql
	sql2csv --db "sqlite:///database.db" query.sql
```
Given this Makefile, when you type
make, it will execute each of these
recipes in dependency order, and then if you type
make again, it will
do nothing, because the dependencies haven’t changed. Make compares
file modification timestamps to see which steps actually need to be redone.
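You can watch this timestamp logic in action with a minimal sketch (this assumes GNU Make and a POSIX shell; the one-rule Makefile and file names are illustrative):

```shell
# Work in a throwaway directory so nothing in the project is touched
demo=$(mktemp -d)
cd "$demo"

# A single rule: output.txt depends on input.txt
printf 'output.txt: input.txt\n\tcp input.txt output.txt\n' > Makefile
echo hello > input.txt

make            # runs the recipe, since output.txt does not exist yet
make            # does nothing: output.txt is newer than input.txt
touch input.txt # bump the prerequisite's timestamp
make            # runs the recipe again, since input.txt is now newer
```

The second `make` reports that the target is up to date without re-running the recipe, which is exactly the caching behavior that saves time in a long pipeline.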
If you change the
query.sql file and type
make again, it will rerun
everything because that’s the first item in the dependency chain, but if you
only change the
paper.tex file, then only the
paper.pdf recipe will be rerun the next time you type make, because
nothing upstream of it changed. This is an incredible time-saver when some of
your recipes take a long time to run, as is often the case with data science projects.
I’ve seen many other tools designed to solve this problem, most of them
language-specific and much more complex to learn and use than Make.
As far as how to structure a new project, there isn’t really an accepted standard data science code project structure or framework. The closest thing might be Cookiecutter Data Science (which also uses a Makefile), though it’s Python-specific and I haven’t seen it used “in the wild” yet. If you use a Makefile in your project, though, it doesn’t matter much how you structure your directories or even what language you use, as thinking in terms of Make recipes will encourage you to break up your analysis scripts into a pipeline of small, dependent, repeatable steps with cached results. Then when a colleague picks up your project to review your work or enhance it, just looking at the Makefile will give them a clear idea of how it’s put together. So use Make for your data science projects and encourage your coworkers to do the same; you will thank each other!