ggplot in Python
Introduction
I often find myself switching between R and Python for various stages of a project.[1] I am particularly reliant on R for creating visual outputs with ggplot, so this week I will be testing out the plotnine library, which attempts to replicate ggplot’s grammar of graphics in Python.
This writeup is inspired by Nicola Rennie’s recent post on how to create annotated charts with plotnine. I would highly recommend reading her post first if you have not done so already. I was very impressed by the infographic-like chart she produces, and in this writeup I’m going to create a chart in her style. As a result, the structure of this walkthrough is quite similar to Nicola’s, but I will also discuss two points not covered in her tutorial:
- How to use icons and Google Fonts in plotnine plots.
- How to work with matplotlib’s color palettes in plotnine.
Data Cleaning
To create our plot we are going to return to the 2015 FinAccess geospatial mapping dataset, which provides the locations and opening dates of over 60,000 mobile money agents in Kenya. We’re going to create a column chart displaying the number of new mobile money agents in each month from 2007 until 2015, and to do this we will need to do a little data processing. We begin by loading in the necessary libraries:
import os
import numpy as np
import pandas as pd
import plotnine as p9
import matplotlib.pyplot as plt
import cmasher as cmr
import colorsys
import matplotlib as mpl
import textwrap
import mpl_fontkit as fk
import highlight_text as ht
Now we can import and clean the data:
agents = pd.read_csv("data/agents.csv")
agents["date"] = pd.to_datetime(
agents["In What Year Did the Establishment Begin Operations"]
)
agents["year"] = agents["date"].dt.year
agents["month"] = agents["date"].dt.month
agents.dropna(subset="date", inplace=True)
agents = agents[(agents["date"] >= pd.to_datetime("2007-01-01")) & (agents["date"] <= pd.to_datetime("2015-08-31"))]
# data collection ended in August 2015 (source: https://www.fsdkenya.org/research-and-publications/datasets/2016-finaccess-geospatial-mapping/)
agents_count = agents["date"].groupby(agents.date.dt.to_period("M").rename("month")).agg("count").reset_index(name="count")
agents_count["date"] = agents_count["month"].dt.to_timestamp()
agents_count["year"] = agents_count["date"].dt.year
Here we are:
- importing the data as a pandas dataframe,
- converting the opening date column to datetime format and extracting the year and month,
- dropping the rows that have no opening date,
- filtering the data to only include agents with opening dates between January 2007 and August 2015,
- counting the number of agents with opening dates in each month and saving the result in agents_count,
- and creating datetime columns from the month column (i.e. each row/month in agents_count is assigned a date corresponding to the first of that month).
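To make the monthly counting step concrete, here is a minimal sketch on a made-up table. The toy frame and its dates are invented for illustration; only the to_period/groupby pattern mirrors the code above.

```python
import pandas as pd

# Hypothetical toy data: three opening dates plus one missing value.
toy = pd.DataFrame(
    {"date": pd.to_datetime(["2007-01-03", "2007-01-20", "2007-02-11", None])}
)
toy = toy.dropna(subset=["date"])

# Group by calendar month (a Period) and count the rows in each group.
toy_count = (
    toy["date"]
    .groupby(toy["date"].dt.to_period("M").rename("month"))
    .agg("count")
    .reset_index(name="count")
)

# Convert each monthly Period back to a timestamp on the first of the month.
toy_count["date"] = toy_count["month"].dt.to_timestamp()
print(toy_count)
```

The two January dates collapse into a single row, and the missing value is dropped before counting.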
The final dataframe, agents_count, looks like this:
agents_count.head()
| | month | count | date | year |
|---|---|---|---|---|
| 0 | 2007-01 | 437 | 2007-01-01 00:00:00 | 2007 |
| 1 | 2007-02 | 55 | 2007-02-01 00:00:00 | 2007 |
| 2 | 2007-03 | 75 | 2007-03-01 00:00:00 | 2007 |
| 3 | 2007-04 | 35 | 2007-04-01 00:00:00 | 2007 |
| 4 | 2007-05 | 36 | 2007-05-01 00:00:00 | 2007 |
The Basics
The plotnine library works almost identically to R’s ggplot2 package. At the most basic level, a plotnine object is composed of one or more layers, which take:
- data,
- a mapping between variables in the data and certain aesthetics (the rules for how those variables should be displayed),
- a stat which transforms the raw data to a plottable format,
- and a geom which takes the supplied aesthetics and produces a visual output (your plot)
Let’s go through how to do this in practice by creating a very simple column chart using agents_count:
boring = (
p9.ggplot()
+ p9.geom_col(
data = agents_count,
mapping = p9.aes(x="date", y="count"),
)
+ p9.scale_x_datetime(date_breaks="1 years", date_labels="%Y")
+ p9.labs(
title="Mobile Money Agents in Kenya (2007 - 2015)",
caption="a bit boring ..."
)
)
boring.draw()
Boring version
In constructing the plotnine object we create an empty ggplot object, and then add a layer with the column chart geometry, the agents_count data, and a mapping that assigns the date and count variables to the x and y coordinates. (We also format the x axis labels a little, and add a title and caption.) This works, but the output is a bit plain. Let’s try to modify this to get something closer to the style of Nicola’s infographic-esque chart.
Advanced
Fonts and Icons
I am a big fan of Google Fonts. Incorporating them into ggplot objects in R is very simple with the sysfonts package (just use font_add_google), but the standard solutions for doing so in Python are less straightforward. After a lot of searching I eventually found mpl-fontkit, a Python library that allows you to load Google Fonts for matplotlib plots. Since plotnine uses matplotlib as its plotting backend, we can use mpl-fontkit in our plotnine object too. We can load a font as follows:
import mpl_fontkit as fk
fk.install("Inter") # the font family used in this website!
Let’s install IBM Plex Sans and Merriweather to use as our body and title font families:
fk.install("IBM Plex Sans")
fk.install("Merriweather")
bodyfont = "IBM Plex Sans"
titlefont = "Merriweather"
mpl-fontkit also allows the use of the free Font Awesome icons in matplotlib annotations. To use them we just need to install the Font Awesome fonts using:
fk.install_fontawesome()
Icons can now be displayed by referencing their Unicode code point in a string (e.g. "\uf0ac" for globe), and setting the fontfamily parameter to “Font Awesome 6 Brands” or “Font Awesome 6 Free” (depending on whether the icon is a brand icon or not). This will be explained in more detail below.
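The font family matters because these code points sit in Unicode’s Private Use Area, so ordinary fonts have no glyph for them. A small standard-library check illustrates this (it is only an illustration, not part of the plotting code):

```python
import unicodedata

# Font Awesome code points quoted in this post: globe and GitHub.
icons = {"globe": "\uf0ac", "github": "\uf09b"}

for name, char in icons.items():
    # Category "Co" means "private use": Unicode assigns no glyph itself,
    # so what gets drawn depends entirely on the chosen font.
    print(name, hex(ord(char)), unicodedata.category(char))
```

If the fontfamily is left at a regular text font, such characters typically render as an empty box or a fallback glyph.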
Colors
Nicola defines her color palette using a list of pre-specified hex codes. If you don’t have a specific color palette in mind, it can be helpful to explore and generate palettes using colormaps in the CMasher library. All of the CMasher colormaps can be displayed using:
import cmasher as cmr
cmr.create_cmap_overview()
We can also access the matplotlib colormaps within CMasher. These colormaps can be displayed as follows:
# source: https://stackoverflow.com/questions/34314356/how-to-view-all-colormaps-available-in-matplotlib
import matplotlib as mpl
import matplotlib.pyplot as plt
def plot_colorMaps(cmap):
    fig, ax = plt.subplots(figsize=(4,0.4))
    col_map = plt.get_cmap(cmap)
    mpl.colorbar.ColorbarBase(ax, cmap=col_map, orientation = 'horizontal')
    plt.show()

for cmap_id in plt.colormaps():
    print(cmap_id)
    plot_colorMaps(cmap_id)
We’ll create a palette from the magma colormap with a user-defined function that extracts N hex codes from a specified range of a given colormap:
def get_col(cmap: str, N: int, start: float, end: float, rev: bool):
    # sample N hex codes from the [start, end] portion of the colormap
    colors = cmr.take_cmap_colors(
        cmap, N = N, cmap_range=(start, end), return_fmt="hex"
    )
    # optionally reverse the order of the sampled colors
    return list(reversed(colors)) if rev else colors
palette = get_col(
"magma",
N = 9,
start = 0.4,
end = 0.93,
rev = True
)
(We select N=9 since there are 9 years in our data, and we can apply a different color to each year.)
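For readers without CMasher installed, the same idea can be sketched with matplotlib alone. Here sample_cmap is a hypothetical helper written for this illustration, and it is not guaranteed to return exactly the same hex codes as cmr.take_cmap_colors.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex


def sample_cmap(cmap: str, n: int, start: float, end: float) -> list:
    """Return n hex codes evenly spaced over [start, end] of a colormap."""
    colormap = plt.get_cmap(cmap)
    step = (end - start) / (n - 1) if n > 1 else 0.0
    return [to_hex(colormap(start + i * step)) for i in range(n)]


# Reverse the list so that the lightest color comes first.
sketch_palette = sample_cmap("magma", n=9, start=0.4, end=0.93)[::-1]
print(sketch_palette)
```

Restricting the range to 0.4 to 0.93 skips the near-black and near-white ends of magma, which keeps every column color readable against both a light background and dark text.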
We’ll choose a background and text color based on our palette, by increasing the brightness of the first palette color and decreasing the brightness of the last one:
# adapted from https://stackoverflow.com/questions/37765197/darken-or-lighten-a-color-in-matplotlib
def adj_light(color, scale_l):
    rgb = mpl.colors.ColorConverter.to_rgb(color)
    # convert rgb to hls
    h, l, s = colorsys.rgb_to_hls(*rgb)
    # scale the lightness (capped at 1) and return the result as a hex code
    return mpl.colors.to_hex(colorsys.hls_to_rgb(h, min(1, l * scale_l), s = s))
bgcol = adj_light(palette[0], 1.19)
fontcol = adj_light(palette[-1], 0.15)
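To see what the lightness scaling does in isolation, here is a standard-library sketch of the same idea on a plain grey. The helper name scale_lightness and the test color are assumptions made for this illustration; it does not use matplotlib.

```python
import colorsys


def scale_lightness(hex_color: str, scale: float) -> str:
    """Multiply the HLS lightness of a '#rrggbb' color by scale, capped at 1."""
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (1, 3, 5))
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    r2, g2, b2 = colorsys.hls_to_rgb(h, min(1.0, l * scale), s)
    return "#{:02x}{:02x}{:02x}".format(*(round(c * 255) for c in (r2, g2, b2)))


print(scale_lightness("#808080", 0.5))  # a darker grey
print(scale_lightness("#808080", 3.0))  # lightness is capped at 1, so this is white
```

A scale below 1 darkens the color and a scale above 1 lightens it, while hue and saturation are left untouched.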
Annotations
We’ll start by defining the annotations we’re going to use. By “annotations” I mean every bit of text that’s going to appear on our plot. Let’s begin with the axes labels:
X = pd.Series(np.arange(2007, 2016, 2)).to_frame(name="year")
X["date"] = pd.to_datetime(X["year"], format = "%Y")
min_count = 50
agents_count["count_adj"] = agents_count["count"] + min_count
Y = pd.DataFrame({
"value": np.arange(0 + min_count, 5000 + min_count, 1000),
"label": ["0", "1,000", "2,000", "3,000", "4,000\nAgents"]
})
Note that we shift all of the count values up by min_count and adjust the Y axis labels accordingly (this will help to space out the plot a little).
Here are the titles:
title = "Mobile Money Agents in Kenya"
subtitle = "In 2015 The Bill and Melinda Gates Foundation (along with several other organizations) collected data on the locations and opening dates of almost 100,000 financial touch-points across Kenya, including 68,000 active mobile money agents. Here we track the number of new agents each month until August 2015."
wrapped_st = '\n'.join(textwrap.wrap(subtitle, width=66))
We use textwrap.wrap to break our subtitle text into lines with a specified maximum width, and then join them with the \n delimiter to get a wrapped paragraph.
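As a quick illustration of the wrap-then-join pattern, here is a sketch on a shorter sentence (the sentence and the width of 30 are chosen only for this example):

```python
import textwrap

# Hypothetical sentence, used only to show the wrapping behaviour.
sentence = "Here we track the number of new mobile money agents in each month until August 2015."

lines = textwrap.wrap(sentence, width=30)  # list of lines, each at most 30 characters
wrapped = "\n".join(lines)                 # a single string with explicit line breaks
print(wrapped)
```

Because the line breaks are now part of the string, the text keeps its shape regardless of where it is placed on the plot.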
Now we can move onto the slightly more complicated annotations.
Neither plotnine nor ggplot2 has native support for markdown/HTML-like text rendering (e.g. if you want only some parts of a string to be displayed in bold, or in a different font). So to achieve this in Python we’re going to rely on the highlight-text library, which allows you to embed inline formatting dictionaries in your strings using matplotlib text properties (and later render these strings accordingly using the ax_text function):
# source: https://www.findevgateway.org/case-study/2010/06/community-level-economic-effects-m-pesa-kenya-initial-findings
text2010 = '<Jan 2010::{"fontweight": "600", "color": "#FA7F5E"}>: M-PESA, Kenya\'s first mobile\nmoney service, reaches 9 million users.'
where2010 = pd.to_datetime("2010-01-01")
where2010off = where2010 - pd.DateOffset(months=1)
where2010start = pd.to_datetime("2007-04-01")
val2010 = agents_count.loc[agents_count["date"] == where2010, "count_adj"].values[0]
text2013 = '<Jan 2013::{"fontweight": "600", "color": "#C43C75"}>: 4542 new agents\nbegin operations in Kenya.'
where2013 = pd.to_datetime("2013-01-01")
where2013off = where2013 + pd.DateOffset(months=1)
where2013start = pd.to_datetime("2014-11-15")
val2013 = agents_count.loc[agents_count["date"] == where2013, "count_adj"].values[0]
source_data = '<Data::{"fontweight": "600"}>: FinAccess geospatial mapping 2015'
github = '<\uf09b::{"fontfamily": "Font Awesome 6 Brands", "fontsize": "7.5"}> u-tkarshd'
website = '<\uf0ac::{"fontfamily": "Font Awesome 6 Free", "fontsize": "7", "fontweight": "900"}> u-tkarshd.com'
As explained earlier, to include icons we use their Unicode code points and set the fontfamily parameter to one of the Font Awesome fonts we installed. Here we are using icons to add attribution annotations to our plot. Unfortunately I wasn’t able to find a way to reference our palette in the formatting dictionaries for text2010 and text2013, so I had to manually extract the hex codes for those years from palette and then reference them directly:
print(palette[2010-2007])
print(palette[2013-2007])
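One possible workaround is to build the annotation strings programmatically instead of typing the hex codes by hand. The sketch below assumes that the inline dictionary is plain JSON-style text, as in the literals above; highlight is a hypothetical helper and not part of highlight-text, and the only claim made here is that it reproduces the hand-written string.

```python
import json


def highlight(text: str, **props) -> str:
    """Wrap text in the <text::{...}> inline format used in this post.

    Hypothetical helper, not part of highlight-text itself.
    """
    return f"<{text}::{json.dumps(props)}>"


# Stand-in value; in the post this would be palette[2010 - 2007].
color_2010 = "#FA7F5E"

text2010_built = (
    highlight("Jan 2010", fontweight="600", color=color_2010)
    + ": M-PESA, Kenya's first mobile\nmoney service, reaches 9 million users."
)
print(text2010_built)
```

With this approach a change to the palette would carry through to the annotation colors without editing the strings.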
Finally let’s define our axes bounds:
xstart = pd.to_datetime("2006-10-01")
xend = pd.to_datetime("2016-09-01")
ystart = -1500
yend = 5800
Plotting
Now we can construct our plotnine object! We begin with nothing:
p = (
p9.ggplot()
)
p.draw()
nothing.
Let’s first add our pre-defined x and y axis annotations:
p = (
p
# X AXIS TICKS
+ p9.geom_segment(
data = X,
mapping = p9.aes(x="date", xend="date", y=0, yend=-430),
linetype = "dashed",
color = fontcol,
size = 0.3
)
# X AXIS LABELS
+ p9.geom_text(
data = X,
mapping = p9.aes(x="date", y = -570, label="year"),
color = fontcol,
family = bodyfont,
fontweight = 300,
size = 6,
ha = "left"
)
# Y AXIS LABELS
+ p9.geom_text(
data = Y,
mapping = p9.aes(x = pd.to_datetime("2015-09-30"), y="value", label="label"),
color = fontcol,
family = bodyfont,
fontweight = 300,
size = 6,
ha = "left",
va = "top",
)
)
p.draw()
with axes.
plotnine attempts to generate scales automatically using the mapping in our first geom_segment layer (i.e. x = "date", y = 0). We can ignore these for now and remove them later.
If you are familiar with matplotlib you will notice that some of the static aesthetics applied to geom_text are actually matplotlib parameters (e.g. ha and va for horizontal/vertical alignment as opposed to hjust and vjust in ggplot2). This is where we begin to see some divergence between the ggplot2 and plotnine syntax.
Let’s add in our title, subtitle and in-plot line segments using the annotate geometry. Note that annotate is great for adding discrete, one-off annotations of various types (including "text", "rect" for rectangles, and "segment" for line segments), while the geom_* equivalents (used for the axes annotations above) are good for translating variables in a dataframe into visual output using a specified mapping.
p = (
p
# TITLE
+ p9.annotate(
"text",
label = title,
x = pd.to_datetime("2007-01-01"),
y = 5650,
color = fontcol,
family = titlefont,
fontweight = "bold",
fontstyle = "italic",
ha = "left",
va = "top",
size = 12
)
# SUBTITLE
+ p9.annotate(
"text",
label = wrapped_st,
x = pd.to_datetime("2007-01-01"),
y = 5080,
color = fontcol,
family = bodyfont,
fontweight = 300,
fontstyle = "normal",
ha = "left",
va = "top",
lineheight = 1.4,
size = 6
)
# 2010 ANNOTATION LINE 1
+ p9.annotate(
"segment",
y = val2010 - 100, yend = val2010 - 100,
x = where2010, xend = where2010 - pd.DateOffset(months = 3),
color = fontcol,
size = 0.1
)
# 2010 ANNOTATION LINE 2
+ p9.annotate(
"segment",
y = val2010 - 100, yend = val2010 + 100,
x = where2010 - pd.DateOffset(months = 3), xend = where2010 - pd.DateOffset(months = 3),
color = fontcol,
size = 0.1
)
# 2010 ANNOTATION LINE 3
+ p9.annotate(
"segment",
y = val2010 + 100, yend = val2010 + 100,
x = where2010start, xend = where2010off,
color = fontcol,
size = 0.1
)
# 2013 ANNOTATION LINE 1
+ p9.annotate(
"segment",
y = val2013 - 100, yend = val2013 - 100,
x = where2013, xend = where2013 + pd.DateOffset(months = 3),
color = fontcol,
size = 0.1
)
# 2013 ANNOTATION LINE 2
+ p9.annotate(
"segment",
y = val2013 - 100, yend = val2013 + 100,
x = where2013 + pd.DateOffset(months = 3), xend = where2013 + pd.DateOffset(months = 3),
color = fontcol,
size = 0.1
)
# 2013 ANNOTATION LINE 3
+ p9.annotate(
"segment",
y = val2013 + 100, yend = val2013 + 100,
x = where2013start, xend = where2013off,
color = fontcol,
size = 0.1
)
)
p.draw()
with titles.
Now we can (finally) plot our column chart.
p = (
p
# COLUMN PLOT
+ p9.geom_col(
data = agents_count,
mapping = p9.aes(x="date", y="count_adj", fill = "factor(year)"),
width = 35
)
# SET COLUMN COLORS
+ p9.scale_fill_manual(
values = palette
)
)
p.draw()
p.save("output/column.svg")
with the column chart.
One feature of plotnine that is particularly useful is that it lets you use R-style factor() calls inside a mapping (e.g. fill = "factor(year)"). This makes it very easy to treat a variable as categorical so that we can apply color or fill scales to it.
Let’s tidy up our plot by removing all the unnecessary elements (extra axes, grid etc). We can do this by applying a theme that automatically gets rid of these elements to make a “barebones” plot (theme_void). We can then add our own theme specifications (e.g. removing the legend, setting the background color) and set our axes bounds:
p = (
p
+ p9.scale_x_datetime(limits = [xstart, xend])
+ p9.scale_y_continuous(limits=[ystart, yend])
+ p9.theme_void()
+ p9.theme(
panel_background=p9.element_rect(fill = bgcol),
legend_position="none"
)
)
p.draw()
p.save("output/format.svg")
after some formatting.
Nearly there! Now it’s time to add the more complex annotations using highlight-text. The highlight-text functions work with matplotlib plots, so we first need to extract and store the underlying matplotlib figure and axes from our plotnine object.
fig = p.draw()
ax = plt.gca()
fig.set_dpi(300)
To apply our annotations using the axes as our coordinate system, we can use ax_text from highlight-text (to use the figure as your coordinate system you can use fig_text):
ht.ax_text(
x = where2010off,
y = val2010 + 150,
s = text2010,
color = fontcol,
fontfamily = bodyfont,
fontsize = 5,
fontweight = 300,
va = 'bottom',
ha = 'right',
textalign = "right",
vsep = 1.8
)
ht.ax_text(
x = where2013off,
y = val2013 + 150,
s = text2013,
color = fontcol,
fontfamily = bodyfont,
fontsize = 5,
fontweight = 300,
va = 'bottom',
ha = 'left',
textalign = "left",
vsep = 1.8
)
ht.ax_text(
x = pd.to_datetime("2015-08-16"),
y = -1000,
s = github,
color = fontcol,
fontfamily = bodyfont,
fontsize = 6,
fontweight = 300,
va = 'top',
ha = 'right'
)
ht.ax_text(
x = pd.to_datetime("2015-08-16"),
y = -1300,
s = website,
color = fontcol,
fontfamily = bodyfont,
fontsize = 6,
fontweight = 300,
va = 'top',
ha = 'right'
)
ht.ax_text(
x = pd.to_datetime("2006-12-16"),
y = -1100,
s = source_data,
color = fontcol,
fontfamily = bodyfont,
fontsize = 6,
fontweight = 300,
va = "top",
ha = 'left'
)
plt.show()
ax_text will apply the “global” text formatting parameters passed in the ax_text function call to the entire string unless a given parameter is explicitly specified for some substring in an inline formatting dictionary. That is why, for example, in text2010 the “Jan 2010” substring is rendered with fontweight=600 and color="#FA7F5E" while the rest of the string is rendered with fontweight=300 and color=fontcol.
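A helpful mental model for this override rule is a dictionary merge in which inline keys win. The sketch below is only that mental model, not highlight-text’s actual implementation, and the string "fontcol" stands in for the hex code stored in that variable.

```python
# Global parameters passed in the ax_text call (values taken from the post;
# "fontcol" stands in for the actual hex code stored in that variable).
global_props = {"color": "fontcol", "fontweight": 300, "fontsize": 5}

# Inline dictionary attached to the "Jan 2010" substring.
inline_props = {"fontweight": "600", "color": "#FA7F5E"}

# Later keys win, so inline settings override the global ones.
highlighted = {**global_props, **inline_props}
print(highlighted)
```

The fontsize is not mentioned inline, so the highlighted substring keeps the global value of 5.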
The Final Product
the final product.
[1] Take the previous blog post as an example. It is much more straightforward to import and clean the nightlights data in Python using the Google Earth Engine API, rather than manually downloading the shapefiles and processing them in R with sf.