ggplot in Python

August 6, 2024

Description of the image — the final product.

Introduction

I often find myself switching between R and Python for various stages of a project. ¹ I am particularly reliant on R for creating visual outputs with ggplot. So this week I will be testing out the plotnine library, which attempts to replicate ggplot’s grammar of graphics in Python.

This writeup is inspired by Nicola Rennie’s recent post on how to create annotated charts with plotnine. I would highly recommend reading her post first if you have not done so already. I was very impressed by the infographic-like chart she produces, and in this writeup I’m going to create a chart in the same style. As a result, the structure of this walkthrough is quite similar to Nicola’s. However, I will also discuss two additional points not covered in her tutorial:

How to use icons and Google fonts in plotnine plots.
How to work with matplotlib’s color palettes in plotnine.

Data Cleaning

To create our plot we are going to return to the 2015 FinAccess geospatial mapping dataset, which provides the locations and opening dates of over 60,000 mobile money agents in Kenya. We’re going to create a column chart displaying the number of new mobile money agents in each month from 2007 until 2015, and to do this we will need to do a little data processing. We begin by loading in the necessary libraries:

import os
import numpy as np
import pandas as pd
import plotnine as p9
import matplotlib.pyplot as plt
import cmasher as cmr
import colorsys
import matplotlib as mpl
import textwrap
import mpl_fontkit as fk
import highlight_text as ht

Now we can import and clean the data:

agents = pd.read_csv("data/agents.csv")

agents["date"] = pd.to_datetime(
    agents["In What Year Did the Establishment Begin Operations"]
)

agents["year"] = agents["date"].dt.year
agents["month"] = agents["date"].dt.month

agents.dropna(subset="date", inplace=True)
agents = agents[(agents["date"] >= pd.to_datetime("2007-01-01")) & (agents["date"] <= pd.to_datetime("2015-08-31"))]
# data collection ended in August 2015 (source: https://www.fsdkenya.org/research-and-publications/datasets/2016-finaccess-geospatial-mapping/)

agents_count = agents["date"].groupby(agents.date.dt.to_period("M").rename("month")).agg("count").reset_index(name="count")
agents_count["date"] = agents_count["month"].dt.to_timestamp()
agents_count["year"] = agents_count["date"].dt.year

Here we are:

importing the data as a pandas dataframe,
converting the opening date column to datetime format and extracting the year and month,
filtering the data to only include agents with opening dates between January 2007 and August 2015,
counting the number of agents with opening dates in each month and saving the result in agents_count,
and creating datetime columns from the month column (i.e. each row/month in agents_count is assigned a date corresponding to the first of that month).
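
To make the grouping step concrete, here is a minimal sketch of the same monthly count on a handful of made-up dates. The toy frame and its values are hypothetical and only stand in for the real agents data:

```python
import pandas as pd

# Hypothetical toy data: two openings in Jan 2007, one in Feb 2007,
# one missing date and one date before the 2007 cutoff.
toy = pd.DataFrame(
    {"date": pd.to_datetime(["2007-01-03", "2007-01-20", "2007-02-11", None, "2006-12-31"])}
)

# Drop missing dates and anything before January 2007.
toy = toy.dropna(subset=["date"])
toy = toy[toy["date"] >= pd.Timestamp("2007-01-01")]

# Count the rows that fall in each calendar month.
toy_count = (
    toy.groupby(toy["date"].dt.to_period("M").rename("month"))["date"]
    .count()
    .reset_index(name="count")
)

# Give each month a timestamp on the first day of that month.
toy_count["date"] = toy_count["month"].dt.to_timestamp()
print(toy_count)
```

The missing date and the December 2006 row are dropped, leaving one row per month with its count.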

The final dataframe, agents_count, looks like this:

agents_count.head()

	month	count	date	year
0	2007-01	437	2007-01-01 00:00:00	2007
1	2007-02	55	2007-02-01 00:00:00	2007
2	2007-03	75	2007-03-01 00:00:00	2007
3	2007-04	35	2007-04-01 00:00:00	2007
4	2007-05	36	2007-05-01 00:00:00	2007

The Basics

The plotnine library works almost identically to R’s ggplot2 package. At the most basic level, a plotnine object is composed of one or more layers, which take:

data,
a mapping between variables in the data and certain aesthetics (the rules for how those variables should be displayed),
a stat which transforms the raw data to a plottable format,
and a geom which takes the supplied aesthetics and produces a visual output (your plot)

Let’s go through how to do this in practice by creating a very simple column chart using agents_count:

boring = (
    p9.ggplot()
    + p9.geom_col(
        data = agents_count,
        mapping = p9.aes(x="date", y="count"),
    )
    + p9.scale_x_datetime(date_breaks="1 years", date_labels="%Y")
    + p9.labs(
        title="Mobile Money Agents in Kenya (2007 - 2015)", 
        caption="a bit boring ..."
    )
)
boring.draw()

In our construction of the plotnine object we create an empty ggplot object, and then add a layer with the column chart geometry, agents_count data, and a mapping that transforms the date and count variables into x and y coordinates. (We also format the x axis labels a little, and add a title/caption). This works … but the output is a bit plain. Let’s try and modify this to get something closer to the style of Nicola’s infographic-esque chart.

Advanced

Fonts and Icons

I am a big fan of Google Fonts. Incorporating them into ggplot objects in R is very simple with the sysfonts package (just use font_add_google), but the standard solutions for doing so in Python are less straightforward. After a lot of searching I eventually found mpl-fontkit, a Python library that allows you to load Google Fonts for matplotlib plots. Since plotnine uses matplotlib for the plotting backend, we can use mpl-fontkit in our plotnine object too. We can load a font as follows:

import mpl_fontkit as fk
fk.install("Inter") # the font family used in this website!

Let’s install IBM Plex Sans and Merriweather to use as our body and title font families:

fk.install("IBM Plex Sans")
fk.install("Merriweather")

bodyfont = "IBM Plex Sans"
titlefont = "Merriweather"

mpl-fontkit also allows the use of the free Font Awesome icons in matplotlib annotations. To use them we just need to install the Font Awesome fonts:

fk.install_fontawesome()

Icons can now be displayed by referencing their Unicode code point in a string (e.g. "\uf0ac" for the globe icon) and setting the fontfamily parameter to “Font Awesome 6 Brands” or “Font Awesome 6 Free” (depending on whether the icon is a brand icon or not). This will be explained in more detail below.
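
As a small aside on why the font family matters, the sketch below (standard library only) inspects the globe escape quoted above. It is a single character in Unicode's Private Use Area, so it has no standard glyph and only shows up as an icon when the text is rendered in a font that defines one:

```python
import unicodedata

globe = "\uf0ac"  # the globe icon's escape sequence, as quoted above

# The escape is one character; print its code point and Unicode category.
# Category "Co" means "private use": the glyph depends entirely on the font.
code_point = f"U+{ord(globe):04X}"
category = unicodedata.category(globe)
print(len(globe), code_point, category)
```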

Colors

Nicola defines her color palette using a list of pre-specified hex codes. If you don’t have a specific color palette in mind, it can be helpful to explore and generate palettes using colormaps in the CMasher library. All of the CMasher colormaps can be displayed using:

import cmasher as cmr
cmr.create_cmap_overview()

We can also access the matplotlib colormaps within CMasher. These colormaps can be displayed as follows:

# source: https://stackoverflow.com/questions/34314356/how-to-view-all-colormaps-available-in-matplotlib

import matplotlib as mpl
import matplotlib.pyplot as plt

def plot_colorMaps(cmap):

    fig, ax = plt.subplots(figsize=(4,0.4))
    col_map = plt.get_cmap(cmap)
    mpl.colorbar.ColorbarBase(ax, cmap=col_map, orientation = 'horizontal')
    plt.show()

for cmap_id in plt.colormaps():
    print(cmap_id)
    plot_colorMaps(cmap_id)

We’ll create a palette from the magma colormap with a user-defined function that extracts N hex codes from a specified range of a given colormap:

def get_col(cmap: str, N: int, start: float, end: float, rev: bool):
    # take N hex codes from the [start, end] portion (0 to 1) of the colormap
    colors = cmr.take_cmap_colors(
        cmap, N = N, cmap_range=(start, end), return_fmt="hex"
    )
    # optionally reverse the order of the colors
    return list(reversed(colors)) if rev else colors

palette = get_col(
    "magma",
    N = 9,
    start = 0.4,
    end = 0.93,
    rev = True
)

(We select N=9 since there are 9 years in our data, and we can apply a different color to each year.)
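
If you would rather not depend on CMasher for this step, a rough matplotlib-only sketch of the same idea is shown below. The function name sample_cmap is my own, and I am assuming evenly spaced sampling over the chosen range, so the hex codes may differ slightly from what take_cmap_colors returns:

```python
import numpy as np
import matplotlib as mpl

def sample_cmap(name, n, start, end, rev=False):
    """Return n hex codes evenly spaced over [start, end] of a colormap."""
    cmap = mpl.colormaps[name]
    # Evaluate the colormap at n evenly spaced positions and convert to hex.
    colors = [mpl.colors.to_hex(cmap(x)) for x in np.linspace(start, end, n)]
    return colors[::-1] if rev else colors

approx_palette = sample_cmap("magma", 9, 0.4, 0.93, rev=True)
print(approx_palette)
```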

We’ll choose a background and text color based on our palette by lightening the first palette color and darkening the last:

# adapted from https://stackoverflow.com/questions/37765197/darken-or-lighten-a-color-in-matplotlib
def adj_light(color, scale_l):
    rgb = mpl.colors.to_rgb(color)
    # convert rgb to hls
    h, l, s = colorsys.rgb_to_hls(*rgb)
    # scale the lightness (capped at 1) and return the result as a hex code
    return mpl.colors.to_hex(colorsys.hls_to_rgb(h, min(1, l * scale_l), s))

bgcol = adj_light(palette[0], 1.19)
fontcol = adj_light(palette[-1], 0.15)
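
To see what the lightness scaling does in isolation, here is a standard-library sketch of the same idea. The helper names (hex_to_rgb, rgb_to_hex, scale_lightness, lightness) are hypothetical, and the base color is just one of the hex codes used later in this post:

```python
import colorsys

def hex_to_rgb(color):
    """Convert '#rrggbb' to an (r, g, b) tuple of floats in [0, 1]."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) / 255 for i in (0, 2, 4))

def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of floats in [0, 1] to '#rrggbb'."""
    return "#" + "".join(f"{round(c * 255):02x}" for c in rgb)

def scale_lightness(color, scale):
    """Multiply the HLS lightness by scale (capped at 1), keeping hue and saturation."""
    h, l, s = colorsys.rgb_to_hls(*hex_to_rgb(color))
    return rgb_to_hex(colorsys.hls_to_rgb(h, min(1.0, l * scale), s))

def lightness(color):
    """Return the HLS lightness of a hex color."""
    return colorsys.rgb_to_hls(*hex_to_rgb(color))[1]

base = "#FA7F5E"
lighter = scale_lightness(base, 1.19)  # same factor as bgcol
darker = scale_lightness(base, 0.15)   # same factor as fontcol
print(base, lighter, darker)
```

A factor above 1 moves the color towards white and a factor below 1 moves it towards black, while the hue stays the same.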

Annotations

We’ll start by defining the annotations we’re going to use. By “annotations” I mean every bit of text that’s going to appear on our plot. Let’s begin with the axes labels:

X = pd.Series(np.arange(2007, 2016, 2)).to_frame(name="year")
X["date"] = pd.to_datetime(X["year"], format = "%Y")

min_count = 50
agents_count["count_adj"] = agents_count["count"] + min_count
Y = pd.DataFrame({
    "value": np.arange(0 + min_count, 5000 + min_count, 1000),
    "label": ["0", "1,000", "2,000", "3,000", "4,000\nAgents"]
})

Note that we shift all of the count values up by min_count and adjust the Y axis labels accordingly (this will help to space out the plot a little).
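
As a quick sanity check on the shift, here is a plain-Python sketch that uses range in place of np.arange with the same numbers as above. Each label still describes the unshifted count, while its position on the plot moves up by min_count:

```python
min_count = 50

# Positions of the y axis labels after the shift.
values = list(range(0 + min_count, 5000 + min_count, 1000))
labels = ["0", "1,000", "2,000", "3,000", "4,000"]

for value, label in zip(values, labels):
    print(f"label {label!r} is drawn at y = {value}")
```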

Here are the titles:

title = "Mobile Money Agents in Kenya"
subtitle = "In 2015 The Bill and Melinda Gates Foundation (along with several other organizations) collected data on the locations and opening dates of almost 100,000 financial touch-points across Kenya, including 68,000 active mobile money agents. Here we track the number of new agents each month until August 2015."
wrapped_st = '\n'.join(textwrap.wrap(subtitle, width=66))

We use textwrap.wrap to break our subtitle text into lines with a specified maximum width, and then join those lines with the \n delimiter to get a wrapped paragraph.
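
Here is the same wrap-and-join pattern on a shorter string and a narrower width, so the effect is easy to see (the width of 30 is arbitrary and chosen only for this illustration):

```python
import textwrap

text = "Here we track the number of new agents each month until August 2015."

# Split into lines of at most 30 characters, breaking only at whitespace here.
lines = textwrap.wrap(text, width=30)

# Join the lines back into one multi-line string.
wrapped = "\n".join(lines)
print(wrapped)
```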

Now we can move onto the slightly more complicated annotations.

Neither plotnine nor ggplot2 has native support for markdown/html-like rendering of text (e.g. if you want only some parts of a string to be displayed in bold, or in a different font). So to achieve this in Python we’re going to rely on the highlight-text library, which allows you to enter inline formatting dictionaries in your strings using matplotlib syntax (and later render these strings accordingly using the ax_text function):

# source: https://www.findevgateway.org/case-study/2010/06/community-level-economic-effects-m-pesa-kenya-initial-findings
text2010 = '<Jan 2010::{"fontweight": "600", "color": "#FA7F5E"}>: M-PESA, Kenya\'s first mobile\nmoney service, reaches 9 million users.'
where2010 = pd.to_datetime("2010-01-01")
where2010off = where2010 - pd.DateOffset(months=1)
where2010start = pd.to_datetime("2007-04-01")
val2010 = agents_count.loc[agents_count["date"] == where2010, "count_adj"].values[0]

text2013 = '<Jan 2013::{"fontweight": "600", "color": "#C43C75"}>: 4542 new agents\nbegin operations in Kenya.'
where2013 = pd.to_datetime("2013-01-01")
where2013off = where2013 + pd.DateOffset(months=1)
where2013start = pd.to_datetime("2014-11-15")
val2013 = agents_count.loc[agents_count["date"] == where2013, "count_adj"].values[0]

source_data = '<Data::{"fontweight": "600"}>: FinAccess geospatial mapping 2015'

github = '<\uf09b::{"fontfamily": "Font Awesome 6 Brands", "fontsize": "7.5"}> u-tkarshd'
website = '<\uf0ac::{"fontfamily": "Font Awesome 6 Free", "fontsize": "7", "fontweight": "900"}> u-tkarshd.com'

As explained earlier, to include icons we use their Unicode and set the fontfamily parameter to equal one of the Font Awesome fonts we installed. Here we are using icons to add attribution annotations to our plot. Unfortunately I wasn’t able to find a way to reference our palette in the formatting dictionaries for text2010 and text2013, so I had to manually extract the hex codes for those years from palette and then reference them directly:

print(palette[2010-2007])
print(palette[2013-2007])
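
One possible workaround, sketched here rather than taken from the highlight-text documentation: the inline formatting dictionary is just part of an ordinary Python string, so the string can be assembled programmatically. The helper name highlight is hypothetical, and year_colors is a stand-in that only contains the two hex codes quoted above (with the real palette you would pass palette[2010-2007] instead):

```python
import json

def highlight(text, **props):
    """Wrap text in the <text::{...}> inline syntax used in the strings above."""
    return f"<{text}::{json.dumps(props)}>"

# Stand-in for the real palette: only the two entries we need, keyed by year.
year_colors = {2010: "#FA7F5E", 2013: "#C43C75"}

text2010_alt = (
    highlight("Jan 2010", fontweight="600", color=year_colors[2010])
    + ": M-PESA, Kenya's first mobile\nmoney service, reaches 9 million users."
)
print(text2010_alt)
```

The resulting string is character-for-character the same as the hand-written text2010, so it should behave identically when rendered.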

Finally let’s define our axes bounds:

xstart = pd.to_datetime("2006-10-01")
xend = pd.to_datetime("2016-09-01")
ystart = -1500
yend = 5800

Plotting

Now we can construct our plotnine object! We begin with nothing:

p = (
    p9.ggplot()
)
p.draw()

Let’s first add our pre-defined x and y axis annotations:

p = (
    p
    # X AXIS TICKS
    + p9.geom_segment(
        data = X,
        mapping = p9.aes(x="date", xend="date", y=0, yend=-430),
        linetype = "dashed",
        color = fontcol,
        size = 0.3
    )
    # X AXIS LABELS
    + p9.geom_text(
        data = X,
        mapping = p9.aes(x="date", y = -570, label="year"),
        color = fontcol,
        family = bodyfont,
        fontweight = 300,
        size = 6,
        ha = "left"
    )
    # Y AXIS LABELS
    + p9.geom_text(
        data = Y,
        mapping = p9.aes(x = pd.to_datetime("2015-09-30"), y="value", label="label"),
        color = fontcol,
        family = bodyfont,
        fontweight = 300,
        size = 6,
        ha = "left",
        va = "top",
    )
)
p.draw()

⚠️

Notice that plotnine attempts to generate scales automatically using the mapping in our first geom_segment layer (i.e. x = "date", y = 0). We can ignore these for now and remove them later.

ℹ️

If you’re familiar with matplotlib you will notice that some of the static aesthetics applied to geom_text are actually matplotlib parameters (e.g. ha and va for horizontal/vertical alignment as opposed to hjust and vjust in ggplot2). This is where we begin to see some divergence between the ggplot2 and plotnine syntax.

Let’s add in our title, subtitle and in-plot line segments using the annotate geometry. Note that annotate is well suited to discrete, one-off annotations of various types (including "text", "rect" for rectangles and "segment" for line segments), while the geom_* equivalents (used for the axes annotations above) are better for translating variables in a dataframe into visual output using a specified mapping.

p = (
    p
    # TITLE
    + p9.annotate(
        "text",
        label = title,
        x = pd.to_datetime("2007-01-01"),
        y = 5650,
        color = fontcol,
        family = titlefont,
        fontweight = "bold",
        fontstyle = "italic",
        ha = "left",
        va = "top",
        size = 12
    )
    # SUBTITLE
    + p9.annotate(
        "text",
        label = wrapped_st,
        x = pd.to_datetime("2007-01-01"),
        y = 5080,
        color = fontcol,
        family = bodyfont,
        fontweight = 300,
        fontstyle = "normal",
        ha = "left",
        va = "top",
        lineheight = 1.4,
        size = 6
    )
    # 2010 ANNOTATION LINE 1
    + p9.annotate(
        "segment",
        y = val2010 - 100, yend = val2010 - 100,
        x = where2010, xend = where2010 - pd.DateOffset(months = 3),
        color = fontcol,
        size = 0.1
    )
    # 2010 ANNOTATION LINE 2
    + p9.annotate(
        "segment",
        y = val2010 - 100, yend = val2010 + 100,
        x = where2010 - pd.DateOffset(months = 3), xend = where2010 - pd.DateOffset(months = 3),
        color = fontcol,
        size = 0.1
    )
    # 2010 ANNOTATION LINE 3
    + p9.annotate(
        "segment",
        y = val2010 + 100, yend = val2010 + 100,
        x = where2010start, xend = where2010off,
        color = fontcol,
        size = 0.1
    )
    # 2013 ANNOTATION LINE 1
    + p9.annotate(
        "segment",
        y = val2013 - 100, yend = val2013 - 100,
        x = where2013, xend = where2013 + pd.DateOffset(months = 3),
        color = fontcol,
        size = 0.1
    )
    # 2013 ANNOTATION LINE 2
    + p9.annotate(
        "segment",
        y = val2013 - 100, yend = val2013 + 100,
        x = where2013 + pd.DateOffset(months = 3), xend = where2013 + pd.DateOffset(months = 3),
        color = fontcol,
        size = 0.1
    )
    # 2013 ANNOTATION LINE 3
    + p9.annotate(
        "segment",
        y = val2013 + 100, yend = val2013 + 100,
        x = where2013start, xend = where2013off,
        color = fontcol,
        size = 0.1
    )
)
p.draw()

Now we can (finally) plot our column chart.

p = (
    p
    # COLUMN PLOT
    + p9.geom_col(
        data = agents_count,
        mapping = p9.aes(x="date", y="count_adj", fill = "factor(year)"),
        width = 35
    )
    # SET COLUMN COLORS
    + p9.scale_fill_manual(
        values = palette
    )
)
p.draw()
p.save("output/column.svg")

ℹ️

One particularly useful feature of plotnine is that aesthetic mappings can contain expressions such as factor(year), mirroring R’s factor(). This makes it very easy to treat a variable as categorical so that we can apply discrete color or fill scales to it.
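
The pandas-only sketch below illustrates the kind of conversion involved. It is an illustration of the integer versus categorical distinction, not a description of plotnine’s internals, and the years series is made up:

```python
import pandas as pd

# Hypothetical year column: stored as integers, so it reads as a continuous variable.
years = pd.Series([2007, 2007, 2008, 2015])

# Converting to a categorical gives a fixed set of discrete levels instead.
as_category = years.astype("category")
levels = list(as_category.cat.categories)
print(years.dtype, "->", as_category.dtype)
print(levels)
```

With discrete levels, each year can be matched to exactly one entry of a manual fill palette.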

Let’s tidy up our plot by removing all the unnecessary elements (extra axes, grid etc). We can do this by applying a theme that automatically gets rid of these elements to make a “barebones” plot (theme_void). We can then add our own theme specifications (e.g. removing the legend, setting the background color) and set our axes bounds:

p = (
    p
    + p9.scale_x_datetime(limits = [xstart, xend])
    + p9.scale_y_continuous(limits=[ystart, yend])
    + p9.theme_void()
    + p9.theme(
        panel_background=p9.element_rect(fill = bgcol),
        legend_position="none"
    )
)
p.draw()
p.save("output/format.svg")

Nearly there! Now it’s time to add the more complex annotations using highlight-text. The highlight-text functions work with matplotlib plots, so we first need to extract and store the underlying matplotlib figure and axes from our plotnine object.

fig = p.draw()
ax = plt.gca()
fig.set_dpi(300)
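
A caveat worth knowing: plt.gca() returns the axes of whichever figure is currently active, which may not be the one you expect if another figure has been created in the meantime. A slightly more defensive option is to read the axes from the figure object itself. The sketch below uses two plain matplotlib figures as stand-ins (fig_a plays the role of the figure returned by p.draw()):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig_a, _ = plt.subplots()  # stand-in for the figure we care about
fig_b, _ = plt.subplots()  # some other figure created afterwards

# plt.gca() follows the most recently created figure ...
ax_current = plt.gca()

# ... whereas fig_a.axes always belongs to fig_a.
ax_a = fig_a.axes[0]
print(ax_current is ax_a)
```

If you go this route, ax_text accepts an ax argument, so the stored axes can be passed explicitly rather than relying on the current one.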

To apply our annotations using the axes as our coordinate system, we can use ax_text from highlight-text (to use the figure as your coordinate system you can use fig_text):

ht.ax_text(
    x = where2010off,
    y = val2010 + 150,
    s = text2010,
    color = fontcol,
    fontfamily = bodyfont,
    fontsize = 5,
    fontweight = 300,
    va = 'bottom',
    ha = 'right',
    textalign = "right",
    vsep = 1.8
)

ht.ax_text(
    x = where2013off,
    y = val2013 + 150,
    s = text2013,
    color = fontcol,
    fontfamily = bodyfont,
    fontsize = 5,
    fontweight = 300,
    va = 'bottom',
    ha = 'left',
    textalign = "left",
    vsep = 1.8
)

ht.ax_text(
    x = pd.to_datetime("2015-08-16"),
    y = -1000,
    s = github,
    color = fontcol,
    fontfamily = bodyfont,
    fontsize = 6,
    fontweight = 300,
    va = 'top',
    ha = 'right'
)

ht.ax_text(
    x = pd.to_datetime("2015-08-16"),
    y = -1300,
    s = website,
    color = fontcol,
    fontfamily = bodyfont,
    fontsize = 6,
    fontweight = 300,
    va = 'top',
    ha = 'right'
)

ht.ax_text(
    x = pd.to_datetime("2006-12-16"),
    y = -1100,
    s = source_data,
    color = fontcol,
    fontfamily = bodyfont,
    fontsize = 6,
    fontweight = 300,
    va = "top",
    ha = 'left'
)

plt.show()

ℹ️

ax_text will apply the “global” text formatting parameters passed in the ax_text function call to the entire string unless a given parameter is explicitly specified for some substring in an inline formatting dictionary. That is why, for example, in text2010 the “Jan 2010” substring is rendered with fontweight=600 and color="#FA7F5E" while the rest of the string is rendered with fontweight=300 and color=fontcol.
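
This precedence rule can be pictured as a dictionary merge in which inline values win. The sketch below only illustrates the behaviour described above and is not highlight-text’s actual implementation; the string "fontcol" is a placeholder for the real color value:

```python
# Parameters passed to the ax_text call apply to the whole string ...
global_props = {"color": "fontcol", "fontweight": 300, "fontsize": 5}

# ... while the inline dictionary applies to the highlighted substring only.
inline_props = {"fontweight": "600", "color": "#FA7F5E"}

# For the highlighted substring, inline values override the global ones.
effective = {**global_props, **inline_props}
print(effective)
```

Here fontsize is inherited from the call, while fontweight and color come from the inline dictionary.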

The Final Product

¹ Take the previous blog post as an example. It is much more straightforward to import and clean the nightlights data in Python using the Google Earth Engine API than to manually download the shapefiles and process them in R with sf.
