2.8. Tweaking plots#

In this section, we’ll look at how to adjust the appearance of plots to make them clearer, more informative, and easier to interpret.

We’ll cover how to:

set axis labels
set plot titles
create and edit a legend
arrange subplots to make a multi-panel figure
adjust the size and shape of figures

While Seaborn usually makes good choices automatically, these kinds of plot tweaks are often essential for producing meaningful, reader-friendly figures.

2.8.1. `Matplotlib`#

Seaborn is designed to produce nice-looking plots with minimal effort. However, it is built on top of an older, lower-level plotting library called Matplotlib, which provides the underlying functionality. Matplotlib (inspired by the plotting tools available in MATLAB, another scientific programming environment) contains functions for controlling almost every element of a plot, from axis limits and labels to subplot layouts and figure sizes. If there is something you would like to change in a plot, it is usually possible to do so using these functions.

If we want to manually set something like the axis labels, axis range, or figure layout, we often call Matplotlib functions directly. You may have noticed some of these calls, which start with plt., throughout this chapter:

Matplotlib functions are preceded by plt., for example plt.xlim() or plt.subplot().
In contrast, Seaborn functions are preceded by sns. (Samuel Norman Seaborn!), e.g. sns.histplot().

Set up Python libraries#

As usual, run the code cell below to import the relevant Python libraries:

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

Import the data#

We’ll use the Titanic data (data on 891 people who were onboard the Titanic when it sank). This includes some odd-sounding variables such as SibSp, but don’t worry about them for now.

titanic = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/titanic_2.csv')
display(titanic)

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...
886	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 11 columns

As an initial exercise, let’s plot a histogram of the age of the passengers:

sns.histplot(data=titanic, x='Age')
plt.show()

../_images/bff18fbf6681c4a1d7f49c1061749dda05df43961c1c591af68293dbf79c9c2c.png

We can see that there were a lot of young adults on the Titanic - possibly emigrating to start a new life in America. Let’s try tweaking this plot to improve its interpretability.

2.8.2. Axis Labels & Title#

plt.xlabel()
plt.ylabel()
plt.title()

Your axis labels should always convey what is being plotted. When using Seaborn with a Pandas DataFrame, the axis labels are usually taken automatically from the column names. These are often meaningful, but sometimes they are cryptic codes that wouldn’t make sense to a general reader. For example, in the Titanic dataset we are using, what do you think Pclass, SibSp, or Parch mean?

If the meaning isn’t immediately clear, you must edit the labels. Clear, descriptive labels are essential for an interpretable figure. You can also add a title to your plot to describe what the figure shows.

Just for fun, let’s label the \(x\)-axis “bananas”, the \(y\)-axis “fruitbats”, and give our plot the title “A load of nonsense”:

sns.histplot(data=titanic, x='Age')
plt.xlabel('bananas') #Change the x axis label from the default
plt.ylabel('fruitbats') #Change the y axis label from the default
plt.title('A load of nonsense', fontsize=18) # note I made the font size bigger!
plt.show()

../_images/317771cc770734303809bdeb3d2e152783a5ed49ee53bee0e7143c83b7ea8514.png

2.8.3. Legend#

plt.legend()

When your plot contains multiple groups or categories, it’s important to include a legend so that the reader can tell which colour or linetype corresponds to which group.

For example, let’s plot the age distributions of passengers in the different classes on the Titanic. We’ll use a KDE plot here rather than a histogram, as the histogram is a bit too cluttered with the three distributions (you can try changing it to a histogram to see what I mean).

By default, Seaborn adds a legend automatically when a hue variable is specified, but you can adjust it using Matplotlib’s plt.legend(). Here we edit the category labels, set the title, and reposition the legend on the plot. The new labels are matched to the plotted distributions by position, so check the result against the original legend to make sure each label lands on the right colour.

sns.kdeplot(data=titanic, x='Age', hue='Pclass', fill=True)
plt.legend(['third', 'second','first'], title='Class',loc = 'upper left')
plt.show()

../_images/0b6fab9000feda165baff0029ca2e4ca0fda144cad7cd9ed3631eb15aa0bf095.png

Note on the data: There were a lot of young adults in 3rd class, and almost no children in first class.

2.8.4. Ordering#

When plotting categorical data where the categories are defined by strings, Seaborn will usually plot them in the order they appear in the DataFrame — that is, from top to bottom in the dataset.

However, this default order is not always ideal or meaningful. For example, consider the Oxford weather dataset: the months might appear in the order they happen to occur in the data (e.g. starting in September) rather than in calendar order.

This can make your plot confusing or misleading.

weather = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/OxfordWeather2.csv')
sns.violinplot(data=weather, x='Month', y='Tmean')
plt.show()

../_images/3a23d07e2f24ab653a03f34501febb27bbce3b1a062a53400b9121b28c3fdf13.png

We can force the order in which the categories are presented using the argument order.

Note that the equivalent argument is hue_order if we are using the hue property to split the data into categories, for example in a KDE plot or histogram.

sns.violinplot(data=weather, x='Month', y='Tmean', order=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
plt.show()

../_images/da097177db371146d65498bfd3d60642d044db905e403b5d7b34719e6b86255d.png
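
If the same order is needed in several plots, another option is to store the order in the DataFrame by converting the column to an ordered Pandas categorical. Seaborn then follows the category order without an order argument. This is a sketch using made-up temperatures for three months only. The names toy and month_order are illustrative, not part of the weather dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# made-up temperatures, deliberately listed in a non-calendar order
toy = pd.DataFrame({
    'Month': ['Sep'] * 4 + ['Jan'] * 4 + ['May'] * 4,
    'Tmean': [14.1, 15.3, 13.8, 16.0, 3.2, 5.1, 4.4, 6.0, 11.9, 13.2, 12.5, 14.0],
})

# the order we want on the axis (only the months present in the toy data)
month_order = ['Jan', 'May', 'Sep']
toy['Month'] = pd.Categorical(toy['Month'], categories=month_order, ordered=True)

ax = sns.violinplot(data=toy, x='Month', y='Tmean')  # no order argument needed
plt.show()
```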

2.8.5. Subplots#

Often, we want to present multiple plots side by side to compare them directly or to illustrate a broader point.

Returning to the Titanic dataset, suppose we want to plot histograms of passenger age, shown separately for men and women. We may wish to display these plots next to each other as panels within a single figure.

This can be done using the function plt.subplot(), which allows us to create a figure containing multiple panels (subplots). Each subplot is addressed by its position within a grid, defined by three numbers inside plt.subplot(rows, columns, index):

rows — the total number of rows of subplots
columns — the total number of columns of subplots
index — the position of the current subplot (counted left to right, top to bottom)

plt.subplot(1,2,1)
sns.histplot(data=titanic.query('Sex == "male"'), x='Age', color='b', bins=range(0,80,5))

plt.subplot(1,2,2)
sns.histplot(data=titanic.query('Sex == "female"'), x='Age', color='r', bins=range(0,80,5))

plt.tight_layout() # shift the plots sideways so they don't overlap
plt.show()

../_images/822d1e2dd94dfd1c1fd912a60afa8be33cd7676f4f5cf47bba065596886054c2.png

In the example above we have one row and two columns; if we wanted two rows and one column, we would do this:

plt.subplot(2,1,1)
sns.histplot(data=titanic.query('Sex == "male"'), x='Age', color='b', bins=range(0,80,5))

plt.subplot(2,1,2)
sns.histplot(data=titanic.query('Sex == "female"'), x='Age', color='r', bins=range(0,80,5))

plt.tight_layout() # shift the plots so they don't overlap
plt.show()

../_images/bb89ea17df95f336897345a896c538953b008e9bf6ca22280d4bfe107f0d59a2.png

You’ll notice that it is actually a bit easier to compare the distributions when they are arranged vertically. For example we can see that the maximum age for women was higher, and the peak age for women was (slightly) lower. Good choice!

plt.subplot() syntax#

The function plt.subplot() has three arguments:

plt.subplot(rows, columns, index)

For example, to create a figure with one row and two columns (plots side by side), we use plt.subplot(1,2,i), where i is the location of the next plot, counting from left to right and top to bottom, starting at 1. This is perhaps best explained by a diagram.

In a grid with 3 rows and 2 columns, there are 6 possible places to put a plot. The red number on each plot is the index for that location.

In a grid with 2 rows and 4 columns, there are 8 possible places. Again, the red number on each plot is the index for that location.
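
A diagram of this kind can be drawn with a short loop. The sketch below is one possible way to do it (not necessarily how the original diagrams were made): it creates every panel of a 3-by-2 grid and writes each panel’s index in red. Changing rows and columns to 2 and 4 gives the second layout.

```python
import matplotlib.pyplot as plt

rows, columns = 3, 2

fig = plt.figure()
for index in range(1, rows * columns + 1):   # subplot indices start at 1, not 0
    ax = plt.subplot(rows, columns, index)
    # write the index in red in the middle of the otherwise empty panel
    ax.text(0.5, 0.5, str(index), color='red', fontsize=20,
            ha='center', va='center')
    ax.set_xticks([])   # hide the tick marks to keep the diagram uncluttered
    ax.set_yticks([])

plt.tight_layout()
plt.show()
```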

plt.tight_layout()#

Sometimes the axis labels or titles of one subplot can overlap with the adjacent subplot, especially when you have multiple panels in a single figure. The command plt.tight_layout() automatically adjusts the spacing between subplots so that all elements fit neatly within the figure.

2.8.6. Axis range#

plt.xlim()
plt.ylim()

It is often easier to compare across plots if the axis ranges are the same. Seaborn will automatically adjust the axes to fit the range of the data in each plot. This is convenient for individual plots, but it can make side-by-side comparisons misleading, e.g., it might not look like there is a difference until you notice one is scaled between 1 and 100 and the other from 1 to 1000!

We can set the axis range using the functions:

plt.ylim() (to set the limits in \(y\))
plt.xlim() (to set the limits in \(x\))

Let’s recreate our two side-by-side histograms of passenger age for men and women, but this time we’ll set the \(y\)-axis range to be the same in both plots:

plt.subplot(1,2,1)
sns.histplot(data=titanic.query('Sex == "male"'), x='Age', color='b', bins=range(0,80,5))
plt.ylim([0,80])

plt.subplot(1,2,2)
sns.histplot(data=titanic.query('Sex == "female"'), x='Age', color='r', bins=range(0,80,5))
plt.ylim([0,80])

plt.tight_layout() # shift the plots sideways so they don't overlap
plt.show()

../_images/1cc920cc1f18e84c608f6cd9a678d49543a55190c50636790ee14187b55f1090.png

Ooh, suddenly we can see that there were a lot more men than women on the Titanic!

Note: it is generally most relevant to match the \(y\)-axes when two subplots are side by side, and the \(x\)-axes when the subplots are arranged one above the other. However, it is often (but not always) best to match both the \(x\) and \(y\) axes!

2.8.7. Figure size#

plt.figure(figsize=(x,y))

Maybe the figures in your Jupyter notebook are too big or too small for your liking, or have the wrong aspect ratio.

You can change this by ‘pre-creating’ your figure at a certain size using plt.figure(figsize=(x,y)) before running the plotting command.

x and y are the desired width and height, nominally in inches, although the size you actually see will depend on your screen (!):

# create a low, wide figure
plt.figure(figsize=(8,2))
sns.kdeplot(data=titanic, x='Age', hue='Pclass', fill=True)
plt.show()

../_images/db681050b901bcbc7c2a1045b24999ec3469ffba6fe070f249d8de31a7f1f95f.png

# create a tall thin figure
plt.figure(figsize=(2,8))
sns.kdeplot(data=titanic, x='Age', hue='Pclass', fill=True)
plt.show()

../_images/c59d7ab8900ce7e25bb8ee1279108facaf02416b4b386839bdcba24aa79058b1.png
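
The reason the size is only nominally in inches is that Matplotlib converts inches to pixels using a dots-per-inch (dpi) setting, and your screen then decides how large those pixels appear. The sketch below sets the dpi explicitly to 100 (an arbitrary choice for illustration) and works out the size of the figure in pixels.

```python
import matplotlib.pyplot as plt

# 8 inches wide and 2 inches high, at 100 dots (pixels) per inch
fig = plt.figure(figsize=(8, 2), dpi=100)

width_in, height_in = fig.get_size_inches()   # the size we asked for, in inches
width_px = width_in * fig.dpi                 # pixels = inches x dpi
height_px = height_in * fig.dpi

print(width_in, height_in)
print(width_px, height_px)

plt.close(fig)   # tidy up: we only wanted the numbers, not a plot
```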

Note that this applies to the whole figure, which could be made up of several subplots, so this syntax can be useful to stop your subplots getting very stretched. Compare the default aspect ratio here:

plt.subplot(2,1,1)
sns.histplot(data=titanic.query('Sex == "male"'), x='Age', color='b', bins=range(0,80,5))

plt.subplot(2,1,2)
sns.histplot(data=titanic.query('Sex == "female"'), x='Age', color='r', bins=range(0,80,5))

plt.tight_layout() # shift the plots so they don't overlap
plt.show()

… with a tweaked one here:

plt.figure(figsize=(6,8))

plt.subplot(2,1,1)
sns.histplot(data=titanic.query('Sex == "male"'), x='Age', color='b', bins=range(0,80,5))

plt.subplot(2,1,2)
sns.histplot(data=titanic.query('Sex == "female"'), x='Age', color='r', bins=range(0,80,5))

plt.tight_layout() # shift the plots so they don't overlap
plt.show()

../_images/cafa34e8eddea113059216eb6c13f53ccbb0c4f4c2dc012d0935ca78a278b7e5.png