Tables of descriptive statistics

1.9. Tables of descriptive statistics#

When writing a report or analysing a dataset, you’ll often want to create a table of descriptive statistics. This can provide a clear summary that gathers all the key measures you need in one place. Additionally, you can Pandas makes this easy with several built-in functions that can automatically calculate and neatly display these statistics for your dataset.

Below we review a couple of useful functions for making tables:

df.describe(): quickly generates a summary table (count, mean, standard deviation, min, quartiles, max) for all numeric columns.
df.agg(): lets you choose exactly which statistics to compute (for example, just the mean and standard deviation).
df.groupby(): groups your data by a categorical variable (for example, month or year) and calculates summary statistics for each group.

1.9.1. Set up Python Libraries#

As usual you will need to run this code block to import the relevant Python libraries

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings 
warnings.simplefilter('ignore', category=FutureWarning)

1.9.2. Import a dataset to work with#

We will work with weather data from the Oxford weather station. This code block will read it automatically from the internet.

weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/OxfordWeather.csv")
display(weather)

	YYYY	Month	MM	DD	DD365	Tmax	Tmin	Tmean	Trange	Rainfall_mm
0	1827	Jan	1	1	1	8.3	5.6	7.0	2.7	0.0
1	1827	Jan	1	2	2	2.2	0.0	1.1	2.2	0.0
2	1827	Jan	1	3	3	-2.2	-8.3	-5.3	6.1	9.7
3	1827	Jan	1	4	4	-1.7	-7.8	-4.8	6.1	0.0
4	1827	Jan	1	5	5	0.0	-10.6	-5.3	10.6	0.0
...	...	...	...	...	...	...	...	...	...	...
71338	2022	Apr	4	26	116	15.2	4.1	9.7	11.1	0.0
71339	2022	Apr	4	27	117	10.7	2.6	6.7	8.1	0.0
71340	2022	Apr	4	28	118	12.7	3.9	8.3	8.8	0.0
71341	2022	Apr	4	29	119	11.7	6.7	9.2	5.0	0.0
71342	2022	Apr	4	30	120	17.6	1.0	9.3	16.6	0.0

71343 rows × 10 columns

1.9.3. Quick look: `df.describe()`#

df.describe()

We can output a standard set of descriptives for all the numerical columns in the dataframe by just using df.describe()

This command outputs a standard set of descriptive statistics for all the numerical columns in your DataFrame. By default, it shows:

count – how many non-missing values there are,
mean – the average value,
std – the standard deviation,
min and max – the smallest and largest values,
25%, 50%, 75% – the quartiles (useful for understanding data spread).

This gives you a quick way to check the range, central tendency, and variability of each variable all at once — a great first step whenever you start exploring a dataset.

Tip: If your DataFrame also contains text or categorical data, describe() will automatically skip those columns unless you specify otherwise. Here, Month is missing

weather.describe()

	YYYY	MM	DD	DD365	Tmax	Tmin	Tmean	Trange	Rainfall_mm
count	71343.000000	71343.000000	71343.000000	71343.000000	71342.000000	71341.000000	71341.000000	71338.000000	71343.000000
mean	1924.165174	6.516210	15.729154	182.917077	13.890153	6.156018	10.046208	7.734244	1.786964
std	56.387642	3.450063	8.799927	105.478891	6.490258	5.235341	5.632085	3.513996	3.976377
min	1827.000000	1.000000	1.000000	1.000000	-9.600000	-17.800000	-12.100000	0.000000	0.000000
25%	1875.000000	4.000000	8.000000	91.500000	9.200000	2.200000	5.900000	5.200000	0.000000
50%	1924.000000	7.000000	16.000000	183.000000	13.700000	6.400000	10.100000	7.400000	0.000000
75%	1973.000000	10.000000	23.000000	274.000000	18.900000	10.200000	14.500000	10.000000	1.700000
max	2022.000000	12.000000	31.000000	366.000000	36.500000	21.200000	27.400000	23.100000	87.900000

This can be good for a quick look of your dataset. However, for many practical applications, you most likely you will not need all metrics on all variables. For example, if you were producing a report for an assignment (or a client) you may want to customize the table.

1.9.4. Custom table: `df.agg()`#

You can create a tailored summary table using the df.agg() function (short for aggregate).

Here the syntax is as follows:

df.agg({'columname1':['stat1', 'stat2', 'stat3'], 'columnname2':['stat1']})

This function takes a dictionary which is a Python structure which takes keys and values:

The keys are the names of the columns you want to summarise.
The values (inside the lists) are the statistics you want to calculate for each column.

Example#

here’s how to produce a table showing the mean and standard deviation for Tmin and Tmax:

# get mean and sd for Tmin and Tmax
weather.agg({'Tmin':['mean', 'std'], 'Tmax':['mean', 'std']})

	Tmin	Tmax
mean	6.156018	13.890153
std	5.235341	6.490258

1.9.5. Custom table for a subset of data: `df.query('column == X').agg()`#

Finally, instead of creating a table for the entire DataFrame (weather), you can combine what you’ve learned about filtering rows with .query() and summarising data with .agg(). This would allow you to generate descriptive statistics for just a subset of your data

Example#

Get the average and standard deviation for the minimum and maximum temperature for June

# get mean and sd for Tmin and Tmax in June
weather.query('Month == "Jun"').agg({'Tmin':['mean', 'std'], 'Tmax':['mean', 'std']})

	Tmin	Tmax
mean	10.328291	20.011487
std	2.664211	3.585932

1.9.6. Disaggregate by category: df.groupby()#

The syntax df.groupby() allows you to split a DataFrame into groups based on the values in one or more categorical columns. Any function you apply afterwards will then be performed separately for each group. This provides a powerful way to summarise your data by category.

Example#

say we want to report the mean value of the peak daily temperature in each month. Remember to get the mean value of the peak daily temperature overall we did:

weather.Tmax.mean()

np.float64(13.890153065515406)

But if we want to see how this value changes from month to month, we can use groupby('MM') to disaggregate the data.

This tells Pandas to:

Split the DataFrame into groups based on the unique values in the MM column.
Calculate the mean of Tmax within each group.

The output will be a Series showing one mean value for each month:

weather.groupby('MM').Tmax.mean()

MM
    6.554444
    7.401048
    9.944914
   13.187517
   16.795252
   20.011487
   21.799007
   21.192936
   18.451043
  14.112639
   9.640041
   7.290571
Name: Tmax, dtype: float64

We can of course combine groupby() with agg() to make a custom table of statistics for each group in the data.

This lets you summarise several variables or calculate multiple descriptive statistics on a category by category basis.

Example#

If we want to show the mean and standard deviation for the minimum and maximum temperature, by month:

# get mean and sd for Tmin and Tmax in each month
weather.groupby('MM').agg({'Tmin':['mean', 'std'], 'Tmax':['mean', 'std']})

	Tmin		Tmax
	mean	std	mean	std
MM
1	1.319437	4.034070	6.554444	3.831624
2	1.470683	3.861644	7.401048	3.723290
3	2.396840	3.472544	9.944914	3.641816
4	4.301786	3.144024	13.187517	3.648047
5	7.165062	3.108733	16.795252	3.761523
6	10.328291	2.664211	20.011487	3.585932
7	12.238098	2.457236	21.799007	3.511055
8	11.965261	2.627916	21.192936	3.232944
9	9.824855	3.241090	18.451043	3.088003
10	6.874028	3.825368	14.112639	3.090256
11	3.692014	3.870762	9.640041	3.254649
12	2.067461	4.131333	7.290571	3.801597

1.9.7. Practice#

Try these quick practice questions:

# Make a table containing the mean rainfall in each month
# Check: the mean in October is 2.17mm

# Get the min and max temperature for each month in the year 2000
# Check: the min and max in May are 4.9 and 25.3 degrees respectively

Tables of descriptive statistics

Contents

1.9. Tables of descriptive statistics#

1.9.1. Set up Python Libraries#

1.9.2. Import a dataset to work with#

1.9.3. Quick look: df.describe()#

1.9.4. Custom table: df.agg()#

Example#

1.9.5. Custom table for a subset of data: df.query('column == X').agg()#

Example#

1.9.6. Disaggregate by category: df.groupby()#

Example#

Example#

1.9.7. Practice#

1.9.3. Quick look: `df.describe()`#

1.9.4. Custom table: `df.agg()`#

1.9.5. Custom table for a subset of data: `df.query('column == X').agg()`#