Easy Python Package Publishing with Poetry

I just published my first Python package on PyPI: feature-grouper, a data science package for a simple form of dimensionality reduction.

The package itself is almost trivial: a couple of functions and a scikit-learn transformer class. But it’s something I anticipate reusing on future projects, so I wanted to be able to just import it rather than copy-pasting the code.

I thought I’d write a post to detail the end-to-end development process to serve as a reminder to myself the next time I want to write a package, and in case anyone else finds this useful. I used Poetry for package management, which made the experience nicer by unifying several different command line tools (pip, virtualenv, twine). While the overall process was pretty straightforward, there were a couple of parts that I felt were non-obvious and deserve explanation.

Installing the tools

Installing Python (on Ubuntu):

sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.8 python3-pip python3.8-dev python3.8-venv make
python3.8 -m pip install --upgrade pip setuptools wheel

Installing Poetry:

curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python3

Initializing the project

I used the poetry new command to create the project structure.

poetry new feature-grouper

Poetry includes pytest by default (a good choice) and gives you a nice, simple package project directory like this:

cd feature-grouper
tree

.
├── feature_grouper
│   ├── feature_grouper.py
│   └── __init__.py
├── pyproject.toml
├── README.rst
└── tests
    ├── __init__.py
    └── test_feature_grouper.py

The pyproject.toml file replaces setup.py for Poetry projects, so I opened that file and updated the package description field.
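
For context, the [tool.poetry] table that poetry new generates looks something like this (the field values below are illustrative, not my actual metadata):

# pyproject.toml (illustrative values)
[tool.poetry]
name = "feature-grouper"
version = "0.1.0"
description = "Group correlated features for a simple form of dimensionality reduction"
authors = ["..."]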

The first time I ran poetry install, it auto-created a virtual environment, installed the base dependencies, and generated a poetry.lock file that records the entire package dependency tree with pinned, compatible version numbers.

poetry shell activates the virtual environment and exit deactivates it.
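
In other words, the day-to-day loop is:

poetry shell  # spawn a subshell with the virtual environment activated
exit          # leave the subshell, deactivating the environment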

Installing dependencies

The next step was to add and install the dependencies I knew I would need to develop the package.

poetry add scipy numpy scikit-learn
poetry add --dev black pylint wrapt Sphinx sphinx-rtd-theme

The package itself depends on scipy, numpy, and scikit-learn. Meanwhile, I use black and pylint for code formatting and linting, and Sphinx for documentation, so I installed those as dev dependencies.

poetry add adds the dependencies to pyproject.toml, solves the dependency graph, updates poetry.lock, and installs the packages into the virtual environment, all in one step.
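
After those two commands, pyproject.toml has dependency tables along these lines (the exact version constraints are whatever Poetry resolved at the time; the ones below are just plausible examples):

[tool.poetry.dependencies]
python = "^3.8"
scipy = "^1.5"
numpy = "^1.19"
scikit-learn = "^0.23"

[tool.poetry.dev-dependencies]
black = "^19.10b0"
pylint = "^2.5"
wrapt = "^1.12"
Sphinx = "^3.1"
sphinx-rtd-theme = "^0.5"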

poetry export --without-hashes > requirements.txt generates a requirements.txt file, which can be used by other tools that need to install the dependencies.

Testing

Getting the test suite going was a snap: I just opened tests/test_feature_grouper.py, started writing functions with assert statements, and ran them with pytest. That low friction made it easy to do test-driven development.
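
For example, an early test might look something like this (a sketch only; the threshold argument and the exact shape check are assumptions about the API, not lines from the real suite):

import numpy as np

from feature_grouper.feature_grouper import FeatureGrouper


def test_correlated_features_are_grouped():
    # Two perfectly correlated columns plus one independent column
    # should collapse into two grouped features.
    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    X = np.column_stack([x, 2.0 * x, rng.normal(size=100)])

    model = FeatureGrouper(threshold=0.9)
    transformed = model.fit_transform(X)

    assert transformed.shape == (100, 2)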

Coding

The feature_grouper/feature_grouper.py file is only 131 lines including comments, so there isn’t a whole lot going on. The key point is that the FeatureGrouper class extends BaseEstimator and TransformerMixin from the sklearn.base module and implements its own fit, transform, and inverse_transform methods. That makes it essentially a drop-in replacement for an existing scikit-learn transformer class like sklearn.decomposition.PCA, so it can be used in a sklearn.pipeline.Pipeline.
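
Stripped to its skeleton, the pattern looks like the sketch below. To be clear, the clustering step here is a toy stand-in, not the package’s actual algorithm, and the threshold hyperparameter is an assumption:

import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from sklearn.base import BaseEstimator, TransformerMixin


class FeatureGrouper(BaseEstimator, TransformerMixin):
    """Group highly correlated features into averaged components."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # hypothetical correlation cutoff

    def fit(self, X, y=None):
        # Cluster the features by correlation distance (1 - |corr|).
        corr = np.corrcoef(X, rowvar=False)
        dist = squareform(np.clip(1.0 - np.abs(corr), 0.0, None), checks=False)
        links = hierarchy.linkage(dist, method="average")
        labels = hierarchy.fcluster(
            links, t=1.0 - self.threshold, criterion="distance"
        )
        # Build a loading matrix where row i averages the features in group i.
        self.components_ = np.zeros((labels.max(), X.shape[1]))
        for feature, group in enumerate(labels):
            self.components_[group - 1, feature] = 1.0
        self.components_ /= self.components_.sum(axis=1, keepdims=True)
        return self  # returning self is part of the scikit-learn contract

    def transform(self, X):
        # Project the original features onto the learned groups.
        return X @ self.components_.T

    def inverse_transform(self, Xt):
        # Map grouped values back to the original feature space.
        return Xt @ self.components_

Because the class honors the fit/transform contract, it slots straight into a pipeline, e.g. Pipeline([("scale", StandardScaler()), ("group", FeatureGrouper())]).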

I also want to mention that I find it better to write the class and function docstrings as I go, rather than writing all the code first and going back to document it later. The latter always feels like way more work. I went with Sphinx-style docstrings but for my next package I will probably try Numpy-style docstrings as they are a little more readable in plaintext and seem more popular in the Python community.
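
For reference, a Sphinx-style docstring looks like this (an illustrative fragment, not copied from the package):

def fit(X, y=None):
    """Learn the feature groups from the training data.

    :param X: Array of shape (n_samples, n_features) to fit.
    :param y: Ignored; accepted for scikit-learn API compatibility.
    :returns: The fitted transformer.
    """
    ...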

Documentation

Sphinx is the go-to for generating code documentation for Python packages. It includes a sphinx-quickstart CLI to help you get started.

mkdir docs
cd docs
sphinx-quickstart

Then, so that Sphinx could find my code and autogenerate documentation pages from the docstrings, I had to add these lines to conf.py:

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
import os
import sys

sys.path.insert(0, os.path.abspath(".."))

I like the Read the Docs theme for documentation, and I wanted autodoc to generate pages from my class and function docstrings, so I added the sphinx.ext.autodoc and sphinx_rtd_theme extensions in conf.py:

# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    "sphinx.ext.todo",
    "sphinx.ext.viewcode",
    "sphinx.ext.autodoc",
    "sphinx_rtd_theme",
]

Then, to provide the content pages for the docs site, I added two files, docs/overview.rst and docs/reference.rst. In the overview file I put a description of the package and a code sample, and in the reference file I put the following:

``feature_grouper`` API reference
=================================

.. automodule:: feature_grouper.feature_grouper
   :members:

The automodule directive tells Sphinx to read all the class and function docstrings in your Python module and generate API documentation from them.

Sphinx includes a Makefile, so you can build the docs site by just typing make html and preview the HTML output in your browser. Each time you change a docstring, run make html again to rebuild the docs.
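
The edit-build-preview loop, assuming sphinx-quickstart’s default _build output directory:

cd docs
make html
python3 -m http.server --directory _build/html  # browse at http://localhost:8000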

After pushing my package to GitHub, I published my documentation to Read the Docs, since they offer free hosting that is pretty easy to configure. RTD failed to build my docs the first time because it was looking for a root file called “contents” by default, while I had named mine “index”; I had to make a trip to StackOverflow for that one. It turned out I just needed to add one more setting, master_doc = "index", to docs/conf.py to point it at the right file. At that point, the documentation was done.
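
For reference, the fix is a single line in docs/conf.py:

# Tell Sphinx (and Read the Docs) which document is the root
master_doc = "index"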

Publishing to PyPI

Creating an account on PyPI is pretty straightforward: you just sign up through their web interface. I turned on Two-Factor Authentication and added an API token in Account settings.

One configuration step was needed to set up Poetry to publish my package, using the API token I obtained from the PyPI site:

poetry config pypi-token.pypi [api-token]

Once that was set, I just needed two more commands:

poetry build # to create the .tar.gz and .whl files in a dist/ folder
poetry publish # to upload the project to PyPI

and then my package was live on PyPI within seconds!

