<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>Posts on Alex Kyllo</title>
		<link>https://alexkyllo.com/posts/</link>
		<description>Recent content in Posts on Alex Kyllo</description>
		<generator>Hugo -- gohugo.io</generator>
		<language>en-us</language>
		<copyright>Alex Kyllo 2020</copyright>
		<lastBuildDate>Sun, 02 Jan 2022 22:16:24 -0800</lastBuildDate>
		<atom:link href="https://alexkyllo.com/posts/index.xml" rel="self" type="application/rss+xml" />
		
		<item>
			<title>Questions to ask to keep your data science projects on track</title>
			<link>https://alexkyllo.com/posts/questions-to-ask/</link>
			<pubDate>Sun, 02 Jan 2022 22:16:24 -0800</pubDate>
			
			<guid>https://alexkyllo.com/posts/questions-to-ask/</guid>
			<description>Data science projects fail a lot. It&amp;rsquo;s highly speculative work, so as a data scientist, some crucial career skills are knowing how to assess the viability of a project, when to cut your losses, and why things didn&amp;rsquo;t work out, so that you can learn from it and adjust your approach. I&amp;rsquo;ve put together a list of questions to ask at each stage in a project&amp;rsquo;s lifecycle, to help you avoid wasting your precious time on projects that are not going to make it.</description>
			<content type="html"><![CDATA[<p>Data science projects fail a lot. It&rsquo;s highly speculative work, so as a data
scientist, some crucial career skills are knowing how to assess the viability of
a project, when to cut your losses, and why things didn&rsquo;t work out, so that you
can learn from it and adjust your approach. I&rsquo;ve put together a list of
questions to ask at each stage in a project&rsquo;s lifecycle, to help you avoid
wasting your precious time on projects that are not going to make it.</p>
<p>I&rsquo;m assuming the general sort of projects that I have experience with: small
business data science projects (think one to a few people over one to a few
months), which we can roughly break down into a generic five-phase lifecycle:</p>
<ol>
<li>Scoping</li>
<li>Analysis</li>
<li>Development</li>
<li>Validation</li>
<li>Release</li>
</ol>
<p>This is not a prescriptive procedure but a rubric to help you assess progress
and viability. If you&rsquo;re not sure how to move your project forward or whether
it&rsquo;s even worth investing further time into, the answers to these questions may
help you decide the best next steps.</p>
<h2 id="scoping">Scoping</h2>
<p>Have discussions with the proposed project&rsquo;s customers and write a one-pager
that answers the following questions:</p>
<ul>
<li>What is the business decision or problem motivating the project?</li>
<li>Which type of deliverable (presentation, static report, dashboard etc.) is
most appropriate?</li>
<li>Who will use the deliverables?</li>
<li>How will they use the deliverables to achieve business impact?</li>
<li>Have the necessary data been collected, or can they be collected in time to
analyze them and act upon the analysis?</li>
<li>When the project is done, how and when will we know whether it was successful?</li>
<li>Who will agree to review my work for quality?</li>
<li>Where and when will I present or distribute my work to its intended users?</li>
</ul>
<p>Review this document with a peer or your manager first if possible, and then
with the customer, to align their expectations. Revise it as needed to achieve
agreement, but don&rsquo;t force it if the case isn&rsquo;t compelling or you can&rsquo;t get
clarity.</p>
<p>Many, if not most, data science project ideas should be terminated at this stage
for reasons such as there isn&rsquo;t a clear problem-solution fit, there is no
decisionmaker who can act upon the findings, or the desired data doesn&rsquo;t exist
and can&rsquo;t be collected in time.</p>
<p>Beware of project ideas that constitute a post-hoc analysis to justify a
decision that&rsquo;s already been made or a program that&rsquo;s already launched. These
sorts of projects are not actionable, tend to get political, and are subject to
<a href="https://towardsdatascience.com/focus-on-decisions-not-outcomes-bf6e99cf5e4f">outcome
bias</a>,
where a decision is evaluated using information that was unknowable to the
decisionmaker at the time.</p>
<p>Also beware of the &ldquo;solution in search of a problem,&rdquo; where you pick the kind of
model you want to develop first and then try to find data, use cases and users
for it. This mode of work is common in data science teams and it doesn&rsquo;t work
well when your goal is to influence business decisions rather than to conduct
research.</p>
<h2 id="analysis">Analysis</h2>
<p>This is the exploratory data analysis (EDA) phase. Locate and study the data
sources and determine their provenance. Start writing queries to profile and
extract the data. Start an R or Python script for plotting the data
distributions. Don&rsquo;t spend a lot of time on writing clean, maintainable code at
this point, it&rsquo;s likely to be throwaway work. Your goal here is to get familiar
with the data and identify its problems and limitations.</p>
<h3 id="data-provenance-questions">Data provenance questions</h3>
<ul>
<li>Where are the necessary data for this project located?</li>
<li>How was the data collected? What, if any, sampling method was used?</li>
<li>Who owns the data sources, what level of support do they offer, and
how will we contact each other if there is a problem?</li>
<li>How often will we need to refresh the data from the source, and is the source
updated at least that often?</li>
</ul>
<h3 id="data-profiling-questions">Data profiling questions</h3>
<ul>
<li>What is the user/customer ID grain?</li>
<li>What is the time grain?</li>
<li>What are the variables, their data types and business meanings?</li>
<li>What should be the dependent variable? Is it present in the data? How is it
distributed?</li>
<li>What is the distribution of each ordinal and nominal categorical variable?</li>
<li>What are the distributions and moments of each integer or real variable?</li>
<li>What are the pairwise joint distributions and correlations among the variables?</li>
<li>Which variables have missing data and what are the reasons for missingness?
Missing data in business are usually missing not at random (MNAR), so what is
the pattern?</li>
<li>Can I tell if there are missing observations? Are they missing for certain
classes or on certain dates?</li>
<li>Do any variables have extreme outliers? Can I find out how they happen?</li>
<li>Are the observations
<a href="https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables">IID</a>?</li>
</ul>
<p>Reasons to cancel a project at this phase would be discovering that the data are
inaccessible, unreliable, no longer maintained or supported, or are of
insufficient quantity, quality or completeness to produce a deliverable data
product that meets the goals of the project. If you&rsquo;re working on an otherwise
promising project and find a dealbreaker in the data, it behooves you to give
specific, actionable feedback to the customer and data owner so that they have
an opportunity to correct the problem for future work.</p>
<h2 id="development">Development</h2>
<p>The development stage of the project is where you build your deliverable data
product&ndash;you write code, develop models, craft visualizations, and document your
findings. Unlike the EDA phase where your code can be throwaway scripts or
notebooks, here you&rsquo;ll want to start introducing some structure (and comments,
and documentation) to your project because it will allow you to change and
re-run steps without manual rework, and will also help your peers who need to
understand and review your work.</p>
<h3 id="modeling-tasks">Modeling tasks</h3>
<p>For many of us, building a predictive or causal model is the fun part of data
science. But before jumping into modeling, ask yourself a few questions to
determine if it&rsquo;s really worth building a model at all:</p>
<ul>
<li>Is the model attempting to automate a human decision? How often does that
human decision need to be made, how much labor does it require, and what is
the human error rate?</li>
<li>What is a simple heuristic baseline model for the task and how well does it
perform?</li>
<li>Does the improved performance of a more complex model justify the increase in
complexity and resource-intensiveness and decrease in interpretability?</li>
<li>How will the model outputs be consumed by a system, business process, or human
decisionmaker?</li>
<li>What is the cost of a false positive (or overestimation) relative to a false
negative (or underestimation)?</li>
<li>If the data is wholly or partially labeled, where do the labels come from? Can
I detect any label noise?</li>
<li>Could a model predict the label before the point in time when it is needed in
the business process?</li>
<li>What is the appropriate unit for random sampling or assignment?</li>
<li>What validation or model selection strategy is appropriate?</li>
<li>What biases and confounders could exist in the data and how will I control for
them? (This is the paramount question for a causal model, but even for
purely predictive models, biases can cause serious problems!)</li>
<li>How will I assess model residuals to understand how the model fails, and
compare these to the errors produced by the existing business process?</li>
<li>Is there potential for unfair or disparate impact on human subjects? How can I
address this?</li>
</ul>
<p>The field of machine learning is littered with failed projects where a
predictive model sounded like a good idea, but the turned out not to be an
improvement over the existing process. For a timely example, see <em>Nature</em>&rsquo;s
<a href="https://www.nature.com/articles/s42256-021-00307-0">meta-analysis of COVID-19 detection and prognostication
models</a> that found none of
the 62 studies reviewed produced anything usable in practice.</p>
<h3 id="visualization-tasks">Visualization tasks</h3>
<p>It&rsquo;s easy to create clutter and confuse your audience when presenting data in
plots. For each data visualization that will be included in a deck, report or
dashboard, consider whether and why the plot is necessary:</p>
<ul>
<li>Is the visual intended to:
<ul>
<li>show change in a metric or progress toward a goal over time</li>
<li>compare multiple groups by a metric</li>
<li>characterize the distribution of a metric within a group</li>
<li>compare distributions of a metric across two or more groups</li>
<li>depict a relationship between two or more variables</li>
<li>call attention to outliers</li>
<li>show the performance of a model</li>
</ul>
</li>
<li>Does the visual require interactivity? <em>Does it, though?</em></li>
<li>How often should the viewer look at the visual? (Once? Weekly? Monthly?)</li>
<li>Would the plot be better as a table?</li>
</ul>
<h3 id="metrics-tasks">Metrics tasks</h3>
<p>For some data science projects, the primary deliverable is a metric, typically
visualized on some dashboard that&rsquo;s refreshed every hour, day, week or month. If
you are designing and implementing a new metric, consider:</p>
<ul>
<li>How does the metric relate to customer success or other business objectives?</li>
<li>Is higher better or worse?</li>
<li>Is there a goal or target?</li>
<li>How is it flawed as a proxy of what you really want to measure?</li>
<li>Is it being used to measure the performance of a person or team?</li>
<li>How might it be &ldquo;gamed&rdquo; or introduce a perverse incentive or feedback loop?</li>
</ul>
<p>You get what you measure (and reward), for better or for worse. A badly designed
metric can do much more damage than not measuring at all, so <a href="https://en.wikipedia.org/wiki/Goodhart%27s_law">Goodhart&rsquo;s
law</a> and its many corollaries
should be in the front of your mind when designing metrics, especially those
that attempt to quantify the achievements of human workers, because the act of
measuring will cause their behavior to change in unintended ways.</p>
<h3 id="presentation">Presentation</h3>
<p>For each slide or report section, ask yourself critical questions about why it
belongs and what purpose it serves. Your audience is giving you their precious
time and attention, so maximize the value they get in return.</p>
<ul>
<li>What is my finding? (Lead with the punchline)</li>
<li>What is the evidence and how does it support the finding?</li>
<li>What are the limitations of the evidence? What untestable assumptions did I
have to make? (Savvy audiences will ask!)</li>
<li>What is the business implication of the finding?</li>
<li>What am I asking the viewer to do with this information?</li>
<li>What are the obvious questions that a viewer would ask when they see this?</li>
</ul>
<h2 id="validation">Validation</h2>
<p>Validation is a chance to check if you made the right thing and made the thing
right. Seek input from peers to determine if the deliverables need any quality
improvements, and from stakeholders as appropriate to determine if the
deliverables (still) meet the business need identified at the start of the
project. Do this throughout the project, particularly if it&rsquo;s on the larger side
or if you&rsquo;re working on something that makes sense to develop iteratively, but
you should always have some sort of quality gate before you &ldquo;ship.&rdquo;</p>
<ul>
<li>Does the original business problem or need for the project deliverable still
exist? What changed in the meantime?</li>
<li>Are my code files source-controlled and my datasets backed up?</li>
<li>Can I reproduce the deliverable from the raw data in one step? Do the
quantitative results turn out the same after I re-run the entire process?</li>
<li>Has a peer read over my queries, code and report to check them for errors?</li>
<li>Have I shown a draft or pre-release version to a customer to validate
that the deliverable meets their needs?</li>
</ul>
<h2 id="release">Release</h2>
<p>The release stage is where you deliver the presentation, publish the
paper, or deploy the model or dashboard to production.</p>
<p>For a production model or dashboard, someone will need to support it, so before
you deploy it, ask questions like:</p>
<ul>
<li>What are the ongoing maintenance responsibilities?</li>
<li>What computing resources are required to operate in production?</li>
<li>What monitoring and alerts are required?</li>
<li>How will I track user adoption and engagement?</li>
</ul>
<p>For a report or presentation:</p>
<ul>
<li>Did I successfully deliver my work to its intended consumer?</li>
<li>If so, what were the decisions and follow-up actions?</li>
<li>Did the stakeholder commit to taking any recommended action based on my work?
Why or why not?</li>
</ul>
<p>Finally, it&rsquo;s a good idea to have a project retrospective discussion or written
report that summarizes, without blaming any individuals or teams, what went
right, what went wrong, and what lessons were learned that could be applied to
future work.</p>
<p>While I&rsquo;ve attempted to make this guide sufficiently general to apply to a broad
swath of business data science projects, every project is a little different, so
ask the critical questions that make the most sense for your project. So long as
you start small, fail fast, and insist on actionable results, you should see an
improved return on your time investment in your projects.</p>
]]></content>
		</item>
		
		<item>
			<title>Fighting Fakes Fairly</title>
			<link>https://alexkyllo.com/posts/fake-faces/</link>
			<pubDate>Tue, 22 Dec 2020 21:42:53 -0800</pubDate>
			
			<guid>https://alexkyllo.com/posts/fake-faces/</guid>
			<description>I just wrapped up the first quarter of my master&amp;rsquo;s degree program. In my project-driven Machine Learning class, I chose to work on a fake face detection task. I picked this topic mostly because I had no experience with computer vision or convolutional neural networks (CNNs) and wanted to try something totally new to me, plus I knew that there was an obvious ethics component to facial recognition AI and wanted to feature that in my work.</description>
			<content type="html"><![CDATA[<p>I just wrapped up the first quarter of my master&rsquo;s degree program. In
my project-driven Machine Learning class, I chose to work on a fake
face detection task. I picked this topic mostly because I had no
experience with computer vision or convolutional neural networks
(CNNs) and wanted to try something totally new to me, plus I knew that
there was an obvious ethics component to facial recognition AI and
wanted to feature that in my work.</p>
<p>I wasn&rsquo;t expecting the topic to be <em>quite</em> so timely, but after Google
fired Timnit Gebru for questioning both their AI ethics and their
diversity, equality and inclusion practices, I saw <a href="https://twitter.com/Mantzarlis/status/1338220767042002945?s=19">a jaw-dropping
thread</a>
pop up in my Twitter feed. Someone had created a fake account, using
GAN generated fake face profile images, claiming to be one of
Dr. Gebru&rsquo;s former colleagues, in order to smear her reputation.</p>
<p><img src="/alexios-thread.png" alt=""></p>
<p><img src="/jeff-jeffries.jpeg" alt=""></p>
<p><img src="/julia-smith-kleinberg.jpeg" alt=""></p>
<p>Even worse, a professor emeritus from the UW CSE department, Pedro
Domingos, engaged with and even retweeted this account, apparently not
realizing it was a fake:</p>
<p><img src="/julia-smith-kleinberg-thread.jpeg" alt=""></p>
<p>This made it very clear to me that &ldquo;deep fakes&rdquo; is <em>not</em> a toy problem
or a hypothetical future problem, but that the technology is being
deployed <em>right now</em> to spread disinformation online. So this points
to the need for technology to detect and flag fakes, in order to keep
up in the &ldquo;arms race.&rdquo;</p>
<p>My project team took the obvious approach to the problem&ndash;get a
dataset consisting of real human face images and another dataset of
fake ones generated by <a href="https://github.com/NVlabs/stylegan">StyleGAN</a>,
and then train a CNN model to distinguish between the two datasets. We
used Keras, and while it was a bit frustrating to get it configured
and working on top of CUDA, because TensorFlow requires older versions
of the NVIDIA libraries, we were able to get it working.  I hacked
together a little Python framework to run model training with a
specified set of model hyperparameters and image preprocessing steps,
saving the model weights and accuracy results to files to track our
progress. We ended up going with a 10-layer network based on the VGG-16
architecture but simplified to avoid overfitting, because the problem
is much less complex than ImageNet.</p>
<p>While we got promising results (&gt;97% test accuracy) on the original
datasets, we knew that face image datasets tend to have selection bias
toward middle-aged, white faces, so we wanted to test whether our
model performed equally well on subjects of different demographics.</p>
<p>In order to conduct that fairness assessment, we needed a
demographically labeled dataset of both real and fake faces, so we
took the <a href="https://github.com/joojs/fairface">FairFace</a> dataset, then
used
<a href="https://github.com/eladrich/pixel2style2pixel">pixel2style2pixel</a> as
an autoencoder, to embed each image into the latent space of the
StyleGAN network and then extract it back out into an image.</p>
<p>The results of this process were pretty interesting. The
pixel2style2pixel model was able to reconstruct faces very similar to
the originals, removing things like hands partially covering faces and
bruises and blemishes, though it sometimes made mistakes like
misinterpreting head coverings as hair. Here are some samples:</p>
<p><img src="/fair2fake.jpg" alt=""></p>
<p>Here&rsquo;s my own face before and after running through the face falsifier
network, which feels pretty eerie to look at:</p>
<p><img src="/alex-fake.jpg" alt=""></p>
<p>What we found in the process, though, was that our model trained on
the first dataset <strong>totally</strong> failed to generalize to the second
dataset, with accuracy only in the mid 50s, only marginally better
than a coin toss. So we decided not to bother with a fairness
assessment on a clearly useless model.  Instead, we retrained and
scored the model on a combined dataset drawn from both datasets. We
found that this new model performed better than our first model on
both the original dataset and the combined dataset, probably due to
the additional training examples, and it also performed with almost
exactly the same metrics across demographic splits of male/female,
white/non-white, black/non-black, elderly/non-elderly, and
child/non-child.</p>
<p>There&rsquo;s a ton more work to be done in the space, and I still have
doubts about whether the model really detected something intrinsic to
fake faces that will generalize well across more datasets. But for
now, I think this experience demonstrates the value of using a
diverse, heterogenous input dataset, testing the model on truly
out-of-sample data, and considering and planning for a model fairness
assessment up front.</p>
<p>Our code is available on GitHub: <a href="https://github.com/alexkyllo/fake-faces/">https://github.com/alexkyllo/fake-faces/</a></p>
<p>Winter quarter will be a little change of pace as I&rsquo;m taking High
Performance Computing, which will focus on GPU programming with CUDA.
I&rsquo;m pretty excited for it because I hope to brush up on my
computational linear algebra and better understand what is actually
happening when I train a neural network model on my GPU.</p>
]]></content>
		</item>
		
		<item>
			<title>Database Learning Resources</title>
			<link>https://alexkyllo.com/posts/database-learning/</link>
			<pubDate>Sat, 09 May 2020 11:04:29 -0700</pubDate>
			
			<guid>https://alexkyllo.com/posts/database-learning/</guid>
			<description>Over the last few years working in analytics and data science, I&amp;rsquo;ve developed a strong interest in the internal workings of database engines, particularly distributed database for analytical workloads. Here I&amp;rsquo;m going to build a list of computer science concepts and learning resources that I am finding helpful in building understanding of how these systems work and how to implement them.
File I/O Concepts  On Disk I/O a series of blog posts by Alex Petrov, explaining the different methods of file I/O in Linux (buffered, direct, memory-mapped, asynchronous, vectored) and their use cases in the context of database applications, as well as on-disk data structures (SSTables, B+ Trees, Log-Structured Merge Trees)  Probabilistic Data Structures  Bloom Filter A hash bitmap for probabilistic answers to set membership queries (true/false with no false negatives) HyperLogLog A hash based algorithm for fast approximate cardinality (distinct count) estimation Count Min Sketch an approximate frequency table similar to a counting Bloom filter  On-Disk Data Structures  LSM Tree Paper Log Structured Merge Trees&amp;ndash;trees of immutable sorted runs of on-disk data that are periodically merged using external merge sort Google BigTable Paper Application of LSM Trees  Distributed Consensus Algorithms  The Raft Consensus Algorithm Implementing Raft by Eli Bendersky  Comprehensive Database Concept Books  Database System Concepts Database Internals Designing Data Intensive Applications  Open Courses  CMU 15-445 Database Systems covers basics of on-disk database system implementation CMU 15-721 Advanced Database Systems covers advanced topics in in-memory database implementation, with emphasis on concurrency control and optimizations.</description>
			<content type="html"><![CDATA[<p>Over the last few years working in analytics and data science, I&rsquo;ve developed a
strong interest in the internal workings of database engines, particularly
distributed database for analytical workloads. Here I&rsquo;m going to build a list of
computer science concepts and learning resources that I am finding helpful in
building understanding of how these systems work and how to implement them.</p>
<h1 id="file-io-concepts">File I/O Concepts</h1>
<ul>
<li><a href="https://medium.com/databasss/on-disk-io-part-1-flavours-of-io-8e1ace1de017">On Disk I/O</a>
a series of blog posts by Alex Petrov, explaining the different methods of
file I/O in Linux (buffered, direct, memory-mapped, asynchronous, vectored)
and their use cases in the context of database applications, as well as on-disk
data structures (SSTables, B+ Trees, Log-Structured Merge Trees)</li>
</ul>
<h2 id="probabilistic-data-structures">Probabilistic Data Structures</h2>
<ul>
<li><a href="https://www.jasondavies.com/bloomfilter/">Bloom Filter</a> A hash bitmap for
probabilistic answers to set membership queries (true/false with no false negatives)</li>
<li><a href="http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf">HyperLogLog</a> A hash based
algorithm for fast approximate cardinality (distinct count) estimation</li>
<li><a href="https://redislabs.com/blog/count-min-sketch-the-art-and-science-of-estimating-stuff/">Count Min Sketch</a>
an approximate frequency table similar to a counting Bloom filter</li>
</ul>
<h1 id="on-disk-data-structures">On-Disk Data Structures</h1>
<ul>
<li><a href="http://paperhub.s3.amazonaws.com/18e91eb4db2114a06ea614f0384f2784.pdf">LSM Tree Paper</a>
Log Structured Merge Trees&ndash;trees of immutable sorted runs of on-disk data that are periodically
merged using external merge sort</li>
<li><a href="https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf">Google BigTable Paper</a> Application of LSM Trees</li>
</ul>
<h1 id="distributed-consensus-algorithms">Distributed Consensus Algorithms</h1>
<ul>
<li><a href="https://raft.github.io/">The Raft Consensus Algorithm</a></li>
<li><a href="https://eli.thegreenplace.net/2020/implementing-raft-part-0-introduction/">Implementing Raft by Eli Bendersky</a></li>
</ul>
<h1 id="comprehensive-database-concept-books">Comprehensive Database Concept Books</h1>
<ul>
<li><a href="https://www.db-book.com/db7/index.html">Database System Concepts</a></li>
<li><a href="https://learning.oreilly.com/library/view/database-internals/9781492040330/">Database Internals</a></li>
<li><a href="https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/">Designing Data Intensive Applications</a></li>
</ul>
<h1 id="open-courses">Open Courses</h1>
<ul>
<li><a href="https://15445.courses.cs.cmu.edu/fall2019/">CMU 15-445 Database Systems</a>
covers basics of on-disk database system implementation</li>
<li><a href="https://15721.courses.cs.cmu.edu/spring2020/">CMU 15-721 Advanced Database Systems</a>
covers advanced topics in in-memory database implementation, with emphasis on concurrency
control and optimizations.</li>
</ul>
]]></content>
		</item>
		
		<item>
			<title>Make Is All You Need</title>
			<link>https://alexkyllo.com/posts/make-is-all-you-need/</link>
			<pubDate>Mon, 30 Dec 2019 18:03:52 -0800</pubDate>
			
			<guid>https://alexkyllo.com/posts/make-is-all-you-need/</guid>
			<description>As a data analyst mostly working with high level languages and GUI tools, I thought of GNU Make as a tool (from 1976!) for building Linux binaries from compiled languages, and it didn&amp;rsquo;t occur to me how it could be useful in my workflow. Make is usually part of any curriculum for software engineers learning C/C++, but I seldom if ever see it mentioned in data science courses or tutorials.</description>
			<content type="html"><![CDATA[<p>As a data analyst mostly working with high level languages and GUI tools,
I thought of <a href="https://www.gnu.org/software/make/">GNU Make</a> as a tool
(from 1976!) for building Linux binaries from compiled languages,
and it didn&rsquo;t occur to me how it could be useful in my workflow. Make is
usually part of any curriculum for software engineers learning C/C++, but
I seldom if ever see it mentioned in data science courses or tutorials.</p>
<p>Today, I use Make for all of my software and data science projects, and I
consider it the most important tool for reproducibility. It ships with
Linux and Mac, and there is a <a href="http://gnuwin32.sourceforge.net/packages/make.htm">Windows installer</a>
available. I love that it&rsquo;s language-agnostic, so it&rsquo;s perfect for teams
where different people use R, Python, or other languages for their analysis code.
It also gives you a &ldquo;single point of contact&rdquo; for all the scripts and commands
in your project, so you can just type <code>make</code> or
<code>make [task name]</code> instead of having to remember and re-type each command.</p>
<p>The <a href="https://www.gnu.org/software/make/manual/make.html">GNU Make Manual</a>
is quite approachable while also providing great detail, but in a nutshell,
you create a file called <code>Makefile</code> in your project directory and add to it
a set of &ldquo;rules&rdquo; for producing output files from input files, in this form:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-make" data-lang="make"><span style="color:#a6e22e">targets </span><span style="color:#f92672">:</span> prerequisites
    recipe
</code></pre></div><p>A &ldquo;target&rdquo; and a &ldquo;prerequisite&rdquo; are usually filenames, but can also be just
names for tasks (called &ldquo;phony&rdquo; targets).</p>
<p>So why is this so useful for data science projects?</p>
<p>Data analysis code typically involves five steps for processing data:</p>
<ol>
<li>Getting it</li>
<li>Cleaning it</li>
<li>Exploring it</li>
<li>Modeling it</li>
<li>Interpreting it</li>
</ol>
<p>Without a disciplined approach, a typical R or Python script or notebook file
will do all of these steps together, perhaps interleaved, making it difficult
to pull the steps apart so that they can be executed separately.</p>
<p>Why is it important to separate them? Some steps may process large quantities of
data, so you don&rsquo;t want to re-run everything each time you change your script.
Caching intermediate results saves you a lot of time as you&rsquo;re developing.</p>
<p>Also, if you have mutli-language scenarios, say you want to visualize your data
with <a href="https://ggplot2.tidyverse.org/">ggplot2</a>
before training a predictive model with <a href="https://keras.io/">Keras</a>,
it&rsquo;s easiest to separate these steps into
different scripts in different languages, rather than doing something
complicated like spawning a Python subprocess from within R or vice versa.</p>
<p>Here&rsquo;s an example of what a Makefile might look like, for generating a paper
written in LaTeX, containing two plots generated by an R script and a Python
script, from a CSV file created by a SQL database query:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-make" data-lang="make"><span style="color:#a6e22e">default </span><span style="color:#f92672">:</span> paper.pdf

<span style="color:#a6e22e">paper.pdf </span><span style="color:#f92672">:</span> paper.tex figure-01.png figure-02.png
    pdflatex paper.tex

<span style="color:#a6e22e">figure-01.png </span><span style="color:#f92672">:</span> plot.R data.csv
    Rscript plot.R

<span style="color:#a6e22e">figure-02.png </span><span style="color:#f92672">:</span> plot.py data.csv
    python plot.py

<span style="color:#a6e22e">data.csv</span><span style="color:#f92672">:</span> query.sql
    sql2csv --db <span style="color:#e6db74">&#34;sqlite:///database.db&#34;</span> query.sql
</code></pre></div><p>Given this Makefile, when you type <code>make</code>, it will execute each of these
recipes in dependency order, and then if you type <code>make</code> again, it will
do nothing, because the dependencies haven&rsquo;t changed. Make looks at the
file modified timestamps to see which steps actually need to be redone.
If you change the <code>query.sql</code> file and type <code>make</code> again, it will rerun
everything because that&rsquo;s the first item in the dependency chain, but if you
only change the <code>paper.tex</code> file, then only the
recipe for <code>paper.pdf</code> will be rerun the next time you type <code>make</code>, because
nothing upstream of it changed. This is an incredible time-saver when some of
your recipes take a long time to run, as is often the case with data science projects.
I&rsquo;ve seen many other tools designed to solve this problem, most of them
language-specific and much more complex to learn and use than Make.</p>
<p>As far as how to structure a new project, there isn&rsquo;t really an accepted standard
data science code project structure or framework.
The closest thing might be
<a href="https://drivendata.github.io/cookiecutter-data-science/#directory-structure">Cookiecutter Data Science</a>
(which also utilizes a Makefile), though it&rsquo;s Python-specific and I haven&rsquo;t seen
it used &ldquo;in the wild&rdquo; yet. If you use a Makefile in your project, though, it doesn&rsquo;t
really matter that much how you structure your directories or even what language you use,
as thinking in terms of Make recipes will encourage you to break up your analysis
scripts into a pipeline of small,
dependent, repeatable steps with cached results. Then when a colleague picks up
your project to review your work or enhance it, just looking at the Makefile will
give them a clear idea of how it&rsquo;s put together.
So use Make for your data science projects and encourage your coworkers to do the
same&ndash;you will thank each other!</p>
]]></content>
		</item>
		
		<item>
			<title>Reboot</title>
			<link>https://alexkyllo.com/posts/reboot/</link>
			<pubDate>Mon, 30 Dec 2019 17:16:15 -0800</pubDate>
			
			<guid>https://alexkyllo.com/posts/reboot/</guid>
			<description>I&amp;rsquo;ve read many times that the key to success is just to do things and tell people.
In the new decade, I am going to put much more effort into the tell people part.</description>
			<content type="html"><![CDATA[<p>I&rsquo;ve read many times that the key to success is just to <em><a href="http://carl.flax.ie/dothingstellpeople.html">do things and tell people</a>.</em></p>
<p>In the new decade, I am going to put much more effort into the <em>tell people</em> part.</p>
]]></content>
		</item>
		
	</channel>
</rss>
