Posts on Alex Kyllo

Questions to ask to keep your data science projects on track

Sun, 02 Jan 2022 22:16:24 -0800

Data science projects fail a lot. It’s highly speculative work, so as a data scientist, some crucial career skills are knowing how to assess the viability of a project, when to cut your losses, and why things didn’t work out, so that you can learn from it and adjust your approach. I’ve put together a list of questions to ask at each stage in a project’s lifecycle, to help you avoid wasting your precious time on projects that are not going to make it.

I’m assuming the general sort of projects that I have experience with: small business data science projects (think one to a few people over one to a few months), which we can roughly break down into a generic five-phase lifecycle:

Scoping
Analysis
Development
Validation
Release

This is not a prescriptive procedure but a rubric to help you assess progress and viability. If you’re not sure how to move your project forward or whether it’s even worth investing further time into, the answers to these questions may help you decide the best next steps.

Scoping

Have discussions with the proposed project’s customers and write a one-pager that answers the following questions:

What is the business decision or problem motivating the project?
Which type of deliverable (presentation, static report, dashboard etc.) is most appropriate?
Who will use the deliverables?
How will they use the deliverables to achieve business impact?
Have the necessary data been collected, or can they be collected in time to analyze them and act upon the analysis?
When the project is done, how and when will we know whether it was successful?
Who will agree to review my work for quality?
Where and when will I present or distribute my work to its intended users?

Review this document with a peer or your manager first if possible, and then with the customer, to align their expectations. Revise it as needed to achieve agreement, but don’t force it if the case isn’t compelling or you can’t get clarity.

Many, if not most, data science project ideas should be terminated at this stage, for reasons such as the lack of a clear problem-solution fit, the absence of a decisionmaker who can act upon the findings, or desired data that doesn’t exist and can’t be collected in time.

Beware of project ideas that constitute a post-hoc analysis to justify a decision that’s already been made or a program that’s already launched. These sorts of projects are not actionable, tend to get political, and are subject to outcome bias, where a decision is evaluated using information that was unknowable to the decisionmaker at the time.

Also beware of the “solution in search of a problem,” where you pick the kind of model you want to develop first and then try to find data, use cases and users for it. This mode of work is common in data science teams and it doesn’t work well when your goal is to influence business decisions rather than to conduct research.

Analysis

This is the exploratory data analysis (EDA) phase. Locate and study the data sources and determine their provenance. Start writing queries to profile and extract the data. Start an R or Python script for plotting the data distributions. Don’t spend a lot of time writing clean, maintainable code at this point; it’s likely to be throwaway work. Your goal here is to get familiar with the data and identify its problems and limitations.

Data provenance questions

Where are the necessary data for this project located?
How was the data collected? What, if any, sampling method was used?
Who owns the data sources, what level of support do they offer, and how will we contact each other if there is a problem?
How often will we need to refresh the data from the source, and is the source updated at least that often?

Data profiling questions

What is the user/customer ID grain?
What is the time grain?
What are the variables, their data types and business meanings?
What should be the dependent variable? Is it present in the data? How is it distributed?
What is the distribution of each ordinal and nominal categorical variable?
What are the distributions and moments of each integer or real variable?
What are the pairwise joint distributions and correlations among the variables?
Which variables have missing data and what are the reasons for missingness? Missing data in business are usually missing not at random (MNAR), so what is the pattern?
Can I tell if there are missing observations? Are they missing for certain classes or on certain dates?
Do any variables have extreme outliers? Can I find out how they happen?
Are the observations IID?
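To make a few of the profiling questions above concrete (grain, missingness, distribution moments, outliers), here is a minimal sketch using only the Python standard library. The column names (customer_id, date, spend) and the tiny inline dataset are hypothetical, made up purely for illustration; on a real project you would more likely do this in SQL, R, or pandas.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical sample data; column names are assumptions for illustration.
RAW = """customer_id,date,spend
a,2022-01-01,10.0
a,2022-01-02,
b,2022-01-01,12.5
b,2022-01-01,12.5
c,2022-01-03,900.0
"""


def profile(rows, key_cols, date_col, numeric_col):
    """Answer a few profiling questions about a list of dict rows."""
    # Grain check: is the combination of key columns unique per row?
    keys = Counter(tuple(r[c] for c in key_cols) for r in rows)
    duplicate_keys = sorted(k for k, n in keys.items() if n > 1)

    # Missingness: which rows lack a value, and on which dates?
    missing = [r for r in rows if r[numeric_col] == ""]
    values = [float(r[numeric_col]) for r in rows if r[numeric_col] != ""]

    return {
        "n_rows": len(rows),
        "duplicate_keys": duplicate_keys,
        "n_missing": len(missing),
        "missing_dates": sorted({r[date_col] for r in missing}),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "max": max(values),
    }


rows = list(csv.DictReader(io.StringIO(RAW)))
report = profile(
    rows,
    key_cols=("customer_id", "date"),
    date_col="date",
    numeric_col="spend",
)
for name, value in report.items():
    print(f"{name}: {value}")
```

In this made-up sample, the duplicated (customer_id, date) key suggests the assumed grain is wrong or rows were double-loaded, and a mean far above the median hints at an extreme outlier worth tracing back to its source.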

Reasons to cancel a project at this phase would be discovering that the data are inaccessible, unreliable, no longer maintained or supported, or are of insufficient quantity, quality or completeness to produce a deliverable data product that meets the goals of the project. If you’re working on an otherwise promising project and find a dealbreaker in the data, it behooves you to give specific, actionable feedback to the customer and data owner so that they have an opportunity to correct the problem for future work.

Development

The development stage of the project is where you build your deliverable data product–you write code, develop models, craft visualizations, and document your findings. Unlike the EDA phase where your code can be throwaway scripts or notebooks, here you’ll want to start introducing some structure (and comments, and documentation) to your project because it will allow you to change and re-run steps without manual rework, and will also help your peers who need to understand and review your work.

Modeling tasks

For many of us, building a predictive or causal model is the fun part of data science. But before jumping into modeling, ask yourself a few questions to determine if it’s really worth building a model at all:

Is the model attempting to automate a human decision? How often does that human decision need to be made, how much labor does it require, and what is the human error rate?
What is a simple heuristic baseline model for the task and how well does it perform?
Does the improved performance of a more complex model justify the increase in complexity and resource-intensiveness and decrease in interpretability?
How will the model outputs be consumed by a system, business process, or human decisionmaker?
What is the cost of a false positive (or overestimation) relative to a false negative (or underestimation)?
If the data is wholly or partially labeled, where do the labels come from? Can I detect any label noise?
Could a model predict the label before the point in time when it is needed in the business process?
What is the appropriate unit for random sampling or assignment?
What validation or model selection strategy is appropriate?
What biases and confounders could exist in the data and how will I control for them? (This is the paramount question for a causal model, but even for purely predictive models, biases can cause serious problems!)
How will I assess model residuals to understand how the model fails, and compare these to the errors produced by the existing business process?
Is there potential for unfair or disparate impact on human subjects? How can I address this?
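Two of the questions above, the heuristic baseline and the relative cost of false positives and false negatives, can be checked together with very little code. The following is a rough sketch, not a recommended evaluation framework: the labels, predictions, and cost values are all invented for illustration, and in practice the costs would have to come from the business.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def total_cost(y_true, y_pred, cost_fp, cost_fn):
    """Weight each kind of error by its (assumed) business cost."""
    _, fp, fn, _ = confusion_counts(y_true, y_pred)
    return fp * cost_fp + fn * cost_fn


# Hypothetical labels and predictions, for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
baseline_pred = [0] * len(y_true)              # heuristic: always predict the majority class
model_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]    # some candidate model's output

# Assumed costs: a missed positive is five times worse than a false alarm.
COST_FP = 1.0
COST_FN = 5.0

baseline_acc = accuracy(y_true, baseline_pred)
model_acc = accuracy(y_true, model_pred)
baseline_cost = total_cost(y_true, baseline_pred, COST_FP, COST_FN)
model_cost = total_cost(y_true, model_pred, COST_FP, COST_FN)

print(f"baseline: accuracy={baseline_acc:.2f} cost={baseline_cost}")
print(f"model:    accuracy={model_acc:.2f} cost={model_cost}")
```

In this toy example both approaches have the same accuracy, but the model costs less than half as much once errors are weighted, which is why the cost question matters more than a single headline metric.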

The field of machine learning is littered with failed projects where a predictive model sounded like a good idea but turned out not to be an improvement over the existing process. For a timely example, see Nature’s meta-analysis of COVID-19 detection and prognostication models that found none of the 62 studies reviewed produced anything usable in practice.

Visualization tasks

It’s easy to create clutter and confuse your audience when presenting data in plots. For each data visualization that will be included in a deck, report or dashboard, consider whether and why the plot is necessary:

Is the visual intended to:
- show change in a metric or progress toward a goal over time
- compare multiple groups by a metric
- characterize the distribution of a metric within a group
- compare distributions of a metric across two or more groups
- depict a relationship between two or more variables
- call attention to outliers
- show the performance of a model
Does the visual require interactivity? Does it, though?
How often should the viewer look at the visual? (Once? Weekly? Monthly?)
Would the plot be better as a table?

Metrics tasks

For some data science projects, the primary deliverable is a metric, typically visualized on some dashboard that’s refreshed every hour, day, week or month. If you are designing and implementing a new metric, consider:

How does the metric relate to customer success or other business objectives?
Is higher better or worse?
Is there a goal or target?
How is it flawed as a proxy of what you really want to measure?
Is it being used to measure the performance of a person or team?
How might it be “gamed” or introduce a perverse incentive or feedback loop?

You get what you measure (and reward), for better or for worse. A badly designed metric can do much more damage than not measuring at all, so Goodhart’s law and its many corollaries should be in the front of your mind when designing metrics, especially those that attempt to quantify the achievements of human workers, because the act of measuring will cause their behavior to change in unintended ways.

Presentation

For each slide or report section, ask yourself critical questions about why it belongs and what purpose it serves. Your audience is giving you their precious time and attention, so maximize the value they get in return.

What is my finding? (Lead with the punchline)
What is the evidence and how does it support the finding?
What are the limitations of the evidence? What untestable assumptions did I have to make? (Savvy audiences will ask!)
What is the business implication of the finding?
What am I asking the viewer to do with this information?
What are the obvious questions that a viewer would ask when they see this?

Validation

Validation is a chance to check if you made the right thing and made the thing right. Seek input from peers to determine if the deliverables need any quality improvements, and from stakeholders as appropriate to determine if the deliverables (still) meet the business need identified at the start of the project. Do this throughout the project, particularly if it’s on the larger side or if you’re working on something that makes sense to develop iteratively, but you should always have some sort of quality gate before you “ship.”

Does the original business problem or need for the project deliverable still exist? What changed in the meantime?
Are my code files source-controlled and my datasets backed up?
Can I reproduce the deliverable from the raw data in one step? Do the quantitative results turn out the same after I re-run the entire process?
Has a peer read over my queries, code and report to check them for errors?
Have I shown a draft or pre-release version to a customer to validate that the deliverable meets their needs?
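One cheap way to answer the reproducibility question is to run the whole process twice and compare hashes of the outputs. Here is a minimal sketch of that idea; the build function is a hypothetical stand-in for whatever single command regenerates your deliverable from raw data, and the fixed random seed is an assumption about how the pipeline controls randomness.

```python
import hashlib
import random
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build(output_path, seed=42):
    """Hypothetical stand-in for a one-step pipeline that writes a result file."""
    rng = random.Random(seed)  # fixed seed so repeated runs draw the same sample
    sample = [rng.gauss(0, 1) for _ in range(1000)]
    mean = sum(sample) / len(sample)
    Path(output_path).write_text(f"mean={mean:.6f}\n")


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "run1.txt"
    second = Path(tmp) / "run2.txt"
    build(first)
    build(second)
    reproducible = file_digest(first) == file_digest(second)

print("reproducible:", reproducible)
```

If the digests differ between runs, something in the process is nondeterministic (an unseeded random number generator, unordered query results, a data source that changed underneath you) and is worth tracking down before release.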

Release

The release stage is where you deliver the presentation, publish the paper, or deploy the model or dashboard to production.

For a production model or dashboard, someone will need to support it, so before you deploy it, ask questions like:

What are the ongoing maintenance responsibilities?
What computing resources are required to operate in production?
What monitoring and alerts are required?
How will I track user adoption and engagement?

For a report or presentation:

Did I successfully deliver my work to its intended consumer?
If so, what were the decisions and follow-up actions?
Did the stakeholder commit to taking any recommended action based on my work? Why or why not?

Finally, it’s a good idea to have a project retrospective discussion or written report that summarizes, without blaming any individuals or teams, what went right, what went wrong, and what lessons were learned that could be applied to future work.

While I’ve attempted to make this guide sufficiently general to apply to a broad swath of business data science projects, every project is a little different, so ask the critical questions that make the most sense for your project. So long as you start small, fail fast, and insist on actionable results, you should see an improved return on your time investment in your projects.

Fighting Fakes Fairly

Tue, 22 Dec 2020 21:42:53 -0800

I just wrapped up the first quarter of my master’s degree program. In my project-driven Machine Learning class, I chose to work on a fake face detection task. I picked this topic mostly because I had no experience with computer vision or convolutional neural networks (CNNs) and wanted to try something totally new to me, plus I knew that there was an obvious ethics component to facial recognition AI and wanted to feature that in my work.

I wasn’t expecting the topic to be quite so timely, but after Google fired Timnit Gebru for questioning both their AI ethics and their diversity, equality and inclusion practices, I saw a jaw-dropping thread pop up in my Twitter feed. Someone had created a fake account, using GAN-generated fake face profile images, claiming to be one of Dr. Gebru’s former colleagues, in order to smear her reputation.

Even worse, a professor emeritus from the UW CSE department, Pedro Domingos, engaged with and even retweeted this account, apparently not realizing it was a fake:

This made it very clear to me that “deep fakes” is not a toy problem or a hypothetical future problem, but that the technology is being deployed right now to spread disinformation online. So this points to the need for technology to detect and flag fakes, in order to keep up in the “arms race.”

My project team took the obvious approach to the problem–get a dataset consisting of real human face images and another dataset of fake ones generated by StyleGAN, and then train a CNN model to distinguish between the two datasets. We used Keras, and while it was a bit frustrating to get it configured and working on top of CUDA, because TensorFlow requires older versions of the NVIDIA libraries, we were able to get it working. I hacked together a little Python framework to run model training with a specified set of model hyperparameters and image preprocessing steps, saving the model weights and accuracy results to files to track our progress. We ended up going with a 10-layer network based on the VGG-16 architecture but simplified to avoid overfitting, because the problem is much less complex than ImageNet.

While we got promising results (>97% test accuracy) on the original datasets, we knew that face image datasets tend to have selection bias toward middle-aged, white faces, so we wanted to test whether our model performed equally well on subjects of different demographics.

In order to conduct that fairness assessment, we needed a demographically labeled dataset of both real and fake faces, so we took the FairFace dataset, then used pixel2style2pixel as an autoencoder, to embed each image into the latent space of the StyleGAN network and then extract it back out into an image.

The results of this process were pretty interesting. The pixel2style2pixel model was able to reconstruct faces very similar to the originals, removing things like hands partially covering faces and bruises and blemishes, though it sometimes made mistakes like misinterpreting head coverings as hair. Here are some samples:

Here’s my own face before and after running through the face falsifier network, which feels pretty eerie to look at:

What we found in the process, though, was that our model trained on the first dataset totally failed to generalize to the second dataset, with accuracy only in the mid 50s, only marginally better than a coin toss. So we decided not to bother with a fairness assessment on a clearly useless model. Instead, we retrained and scored the model on a combined dataset drawn from both datasets. We found that this new model performed better than our first model on both the original dataset and the combined dataset, probably due to the additional training examples, and it also performed with almost exactly the same metrics across demographic splits of male/female, white/non-white, black/non-black, elderly/non-elderly, and child/non-child.
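For readers who haven’t done this kind of check before, the comparison across demographic splits boils down to computing the same metric separately for each group and looking at the gaps. The sketch below is an illustration of that idea only, not the code from our project; the group names and the (group, true label, predicted label) records are invented.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy per group from (group, y_true, y_pred) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(group_accuracy):
    """Largest difference in accuracy between any two groups."""
    return max(group_accuracy.values()) - min(group_accuracy.values())


# Hypothetical predictions: 1 = fake, 0 = real. Group names are placeholders.
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 0), ("group_a", 0, 0),
    ("group_b", 1, 1), ("group_b", 0, 1), ("group_b", 1, 0), ("group_b", 0, 0),
]

group_accuracy = accuracy_by_group(records)
gap = max_accuracy_gap(group_accuracy)

for group, acc in sorted(group_accuracy.items()):
    print(f"{group}: accuracy={acc:.2f}")
print(f"largest gap: {gap:.2f}")
```

A large gap between groups is a signal to dig into the training data for that group; the same pattern extends to other metrics such as false positive rate.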

There’s a ton more work to be done in the space, and I still have doubts about whether the model really detected something intrinsic to fake faces that will generalize well across more datasets. But for now, I think this experience demonstrates the value of using a diverse, heterogeneous input dataset, testing the model on truly out-of-sample data, and considering and planning for a model fairness assessment up front.

Our code is available on GitHub: https://github.com/alexkyllo/fake-faces/

Winter quarter will be a little change of pace as I’m taking High Performance Computing, which will focus on GPU programming with CUDA. I’m pretty excited for it because I hope to brush up on my computational linear algebra and better understand what is actually happening when I train a neural network model on my GPU.

Database Learning Resources

Sat, 09 May 2020 11:04:29 -0700

Over the last few years working in analytics and data science, I’ve developed a strong interest in the internal workings of database engines, particularly distributed databases for analytical workloads. Here I’m going to build a list of computer science concepts and learning resources that I am finding helpful in building understanding of how these systems work and how to implement them.

File I/O Concepts

On Disk I/O: a series of blog posts by Alex Petrov explaining the different methods of file I/O in Linux (buffered, direct, memory-mapped, asynchronous, vectored) and their use cases in the context of database applications, as well as on-disk data structures (SSTables, B+ Trees, Log-Structured Merge Trees)

Probabilistic Data Structures

Bloom Filter: a hash bitmap for probabilistic answers to set membership queries (true/false with no false negatives)
HyperLogLog: a hash-based algorithm for fast approximate cardinality (distinct count) estimation
Count-Min Sketch: an approximate frequency table similar to a counting Bloom filter
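To show how small the core idea is, here is a toy Bloom filter in Python. It is a sketch for learning purposes under some arbitrary assumptions (1024 bits, 3 hash functions derived by salting SHA-256 with an index), not how any particular database implements it.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: set membership with no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions by salting a single hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # All k bits set means "possibly present"; any unset bit means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


bf = BloomFilter()
for name in ["alice", "bob", "carol"]:
    bf.add(name)

print("alice" in bf)    # True: items that were added are always reported present
print("mallory" in bf)  # almost certainly False, but a false positive is possible
```

The asymmetry is the whole point: a "no" answer lets a database skip reading a file from disk entirely, while a "yes" answer only means the file is worth checking.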

On-Disk Data Structures

LSM Tree Paper: Log-Structured Merge Trees, trees of immutable sorted runs of on-disk data that are periodically merged using external merge sort
Google BigTable Paper: an application of LSM Trees
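The write path of an LSM tree can be sketched in a few dozen lines. The toy below keeps everything in memory, so it only illustrates the shape of the idea (buffer writes, flush them as immutable sorted runs, merge runs so the newest value for a key wins); a real implementation writes runs to disk, uses a write-ahead log, and handles deletes with tombstones. The class name TinyLSM and the memtable limit are my own inventions for this example.

```python
import heapq


class TinyLSM:
    """In-memory toy illustrating memtable flushes and run compaction."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.runs = []  # immutable sorted runs, newest first

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into an immutable sorted run."""
        if self.memtable:
            self.runs.insert(0, tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # newest run first, so newer values shadow older ones
            for k, v in run:   # linear scan for brevity; runs are sorted, so a real one would search
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all sorted runs into one, keeping the newest value per key."""
        # Tag entries with the run's age (0 = newest) so ties on key sort newest first.
        tagged = [[(k, age, v) for k, v in run] for age, run in enumerate(self.runs)]
        merged = []
        for k, _, v in heapq.merge(*tagged):
            if not merged or merged[-1][0] != k:
                merged.append((k, v))
        self.runs = [tuple(merged)] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # memtable full: flushed as run (a=1, b=2)
db.put("a", 10)
db.put("c", 3)   # flushed as a newer run (a=10, c=3)
db.compact()
print(db.runs)
print(db.get("a"))
```

heapq.merge streams through the already-sorted runs without sorting them again, which is the same property that lets the on-disk version merge files much larger than memory.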

Distributed Consensus Algorithms

Comprehensive Database Concept Books

Open Courses

CMU 15-445 Database Systems: covers the basics of on-disk database system implementation
CMU 15-721 Advanced Database Systems: covers advanced topics in in-memory database implementation, with emphasis on concurrency control and optimizations

Make Is All You Need

Mon, 30 Dec 2019 18:03:52 -0800

As a data analyst mostly working with high level languages and GUI tools, I thought of GNU Make as a tool for building Linux binaries from compiled languages (the original Make dates from 1976!), and it didn’t occur to me how it could be useful in my workflow. Make is usually part of any curriculum for software engineers learning C/C++, but I seldom if ever see it mentioned in data science courses or tutorials.

Today, I use Make for all of my software and data science projects, and I consider it the most important tool for reproducibility. It ships with Linux and Mac, and there is a Windows installer available. I love that it’s language-agnostic, so it’s perfect for teams where different people use R, Python, or other languages for their analysis code. It also gives you a “single point of contact” for all the scripts and commands in your project, so you can just type make or make [task name] instead of having to remember and re-type each command.

The GNU Make Manual is quite approachable while also providing great detail, but in a nutshell, you create a file called Makefile in your project directory and add to it a set of “rules” for producing output files from input files, in this form:

targets : prerequisites
    recipe

A “target” and a “prerequisite” are usually filenames, but can also be just names for tasks (called “phony” targets).

So why is this so useful for data science projects?

Data analysis code typically involves five steps for processing data:

Getting it
Cleaning it
Exploring it
Modeling it
Interpreting it

Without a disciplined approach, a typical R or Python script or notebook file will do all of these steps together, perhaps interleaved, making it difficult to pull the steps apart so that they can be executed separately.

Why is it important to separate them? Some steps may process large quantities of data, so you don’t want to re-run everything each time you change your script. Caching intermediate results saves you a lot of time as you’re developing.

Also, if you have multi-language scenarios (say you want to visualize your data with ggplot2 before training a predictive model with Keras), it’s easiest to separate these steps into different scripts in different languages, rather than doing something complicated like spawning a Python subprocess from within R or vice versa.

Here’s an example of what a Makefile might look like, for generating a paper written in LaTeX, containing two plots generated by an R script and a Python script, from a CSV file created by a SQL database query:

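# "default" names a task rather than a file, so declare it phony.
.PHONY : default
default : paper.pdf

# Recipe lines must begin with a tab character, not spaces.
paper.pdf : paper.tex figure-01.png figure-02.png
	pdflatex paper.tex

# Assumes plot.R reads data.csv and writes figure-01.png.
figure-01.png : plot.R data.csv
	Rscript plot.R

# Assumes plot.py reads data.csv and writes figure-02.png.
figure-02.png : plot.py data.csv
	python plot.py

# sql2csv prints to standard output, so redirect it into the target file.
data.csv : query.sql
	sql2csv --db "sqlite:///database.db" query.sql > data.csv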

Given this Makefile, when you type make, it will execute each of these recipes in dependency order, and then if you type make again, it will do nothing, because the dependencies haven’t changed. Make looks at the file modified timestamps to see which steps actually need to be redone. If you change the query.sql file and type make again, it will rerun everything because that’s the first item in the dependency chain, but if you only change the paper.tex file, then only the recipe for paper.pdf will be rerun the next time you type make, because nothing upstream of it changed. This is an incredible time-saver when some of your recipes take a long time to run, as is often the case with data science projects. I’ve seen many other tools designed to solve this problem, most of them language-specific and much more complex to learn and use than Make.
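If the timestamp rule sounds mysterious, it fits in a few lines of Python. This is a simplified sketch of the decision Make applies to a single rule (rebuild if the target is missing or older than any prerequisite), not a reimplementation of Make: it ignores the dependency chain, phony targets, and everything else. The is_stale helper is my own name, and the timestamps are set explicitly so the demo doesn’t depend on filesystem clock resolution.

```python
import os
import tempfile
from pathlib import Path


def is_stale(target, prerequisites):
    """True if the target is missing or older than any prerequisite."""
    target = Path(target)
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime_ns
    return any(Path(p).stat().st_mtime_ns > target_mtime for p in prerequisites)


with tempfile.TemporaryDirectory() as tmp:
    query = Path(tmp) / "query.sql"
    data = Path(tmp) / "data.csv"

    query.write_text("select 1;\n")
    stale_before_build = is_stale(data, [query])  # target doesn't exist yet

    data.write_text("1\n")  # pretend the recipe ran
    os.utime(query, ns=(1_000_000_000, 1_000_000_000))
    os.utime(data, ns=(2_000_000_000, 2_000_000_000))
    stale_after_build = is_stale(data, [query])   # target is newer than its input

    os.utime(query, ns=(3_000_000_000, 3_000_000_000))  # simulate editing query.sql
    stale_after_edit = is_stale(data, [query])    # input is now newer than target

print(stale_before_build, stale_after_build, stale_after_edit)
```

Make applies this check rule by rule, working up from the prerequisites, which is why touching query.sql cascades through every downstream target while touching paper.tex only rebuilds the PDF.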

As far as how to structure a new project, there isn’t really an accepted standard data science code project structure or framework. The closest thing might be Cookiecutter Data Science (which also utilizes a Makefile), though it’s Python-specific and I haven’t seen it used “in the wild” yet. If you use a Makefile in your project, though, it doesn’t really matter that much how you structure your directories or even what language you use, as thinking in terms of Make recipes will encourage you to break up your analysis scripts into a pipeline of small, dependent, repeatable steps with cached results. Then when a colleague picks up your project to review your work or enhance it, just looking at the Makefile will give them a clear idea of how it’s put together. So use Make for your data science projects and encourage your coworkers to do the same–you will thank each other!

Reboot

Mon, 30 Dec 2019 17:16:15 -0800

I’ve read many times that the key to success is just to do things and tell people.

In the new decade, I am going to put much more effort into the tell people part.