1.9. Tables of descriptive statistics#
When writing a report or analysing a dataset, you’ll often want to create a table of descriptive statistics. This can provide a clear summary that gathers all the key measures you need in one place. Additionally, you can Pandas makes this easy with several built-in functions that can automatically calculate and neatly display these statistics for your dataset.
Below we review a couple of useful functions for making tables:
df.describe(): quickly generates a summary table (count, mean, standard deviation, min, quartiles, max) for all numeric columns.df.agg(): lets you choose exactly which statistics to compute (for example, just the mean and standard deviation).df.groupby(): groups your data by a categorical variable (for example, month or year) and calculates summary statistics for each group.
1.9.1. Set up Python Libraries#
As usual you will need to run this code block to import the relevant Python libraries
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings
warnings.simplefilter('ignore', category=FutureWarning)
1.9.2. Import a dataset to work with#
We will work with weather data from the Oxford weather station. This code block will read it automatically from the internet.
weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/OxfordWeather.csv")
display(weather)
| YYYY | Month | MM | DD | DD365 | Tmax | Tmin | Tmean | Trange | Rainfall_mm | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1827 | Jan | 1 | 1 | 1 | 8.3 | 5.6 | 7.0 | 2.7 | 0.0 |
| 1 | 1827 | Jan | 1 | 2 | 2 | 2.2 | 0.0 | 1.1 | 2.2 | 0.0 |
| 2 | 1827 | Jan | 1 | 3 | 3 | -2.2 | -8.3 | -5.3 | 6.1 | 9.7 |
| 3 | 1827 | Jan | 1 | 4 | 4 | -1.7 | -7.8 | -4.8 | 6.1 | 0.0 |
| 4 | 1827 | Jan | 1 | 5 | 5 | 0.0 | -10.6 | -5.3 | 10.6 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71338 | 2022 | Apr | 4 | 26 | 116 | 15.2 | 4.1 | 9.7 | 11.1 | 0.0 |
| 71339 | 2022 | Apr | 4 | 27 | 117 | 10.7 | 2.6 | 6.7 | 8.1 | 0.0 |
| 71340 | 2022 | Apr | 4 | 28 | 118 | 12.7 | 3.9 | 8.3 | 8.8 | 0.0 |
| 71341 | 2022 | Apr | 4 | 29 | 119 | 11.7 | 6.7 | 9.2 | 5.0 | 0.0 |
| 71342 | 2022 | Apr | 4 | 30 | 120 | 17.6 | 1.0 | 9.3 | 16.6 | 0.0 |
71343 rows × 10 columns
1.9.3. Quick look: df.describe()#
df.describe()
We can output a standard set of descriptives for all the numerical columns in the dataframe by just using df.describe()
This command outputs a standard set of descriptive statistics for all the numerical columns in your DataFrame. By default, it shows:
count– how many non-missing values there are,mean– the average value,std– the standard deviation,minandmax– the smallest and largest values,25%,50%,75%– the quartiles (useful for understanding data spread).
This gives you a quick way to check the range, central tendency, and variability of each variable all at once — a great first step whenever you start exploring a dataset.
Tip: If your DataFrame also contains text or categorical data, describe() will automatically skip those columns unless you specify otherwise. Here, Month is missing
weather.describe()
| YYYY | MM | DD | DD365 | Tmax | Tmin | Tmean | Trange | Rainfall_mm | |
|---|---|---|---|---|---|---|---|---|---|
| count | 71343.000000 | 71343.000000 | 71343.000000 | 71343.000000 | 71342.000000 | 71341.000000 | 71341.000000 | 71338.000000 | 71343.000000 |
| mean | 1924.165174 | 6.516210 | 15.729154 | 182.917077 | 13.890153 | 6.156018 | 10.046208 | 7.734244 | 1.786964 |
| std | 56.387642 | 3.450063 | 8.799927 | 105.478891 | 6.490258 | 5.235341 | 5.632085 | 3.513996 | 3.976377 |
| min | 1827.000000 | 1.000000 | 1.000000 | 1.000000 | -9.600000 | -17.800000 | -12.100000 | 0.000000 | 0.000000 |
| 25% | 1875.000000 | 4.000000 | 8.000000 | 91.500000 | 9.200000 | 2.200000 | 5.900000 | 5.200000 | 0.000000 |
| 50% | 1924.000000 | 7.000000 | 16.000000 | 183.000000 | 13.700000 | 6.400000 | 10.100000 | 7.400000 | 0.000000 |
| 75% | 1973.000000 | 10.000000 | 23.000000 | 274.000000 | 18.900000 | 10.200000 | 14.500000 | 10.000000 | 1.700000 |
| max | 2022.000000 | 12.000000 | 31.000000 | 366.000000 | 36.500000 | 21.200000 | 27.400000 | 23.100000 | 87.900000 |
This can be good for a quick look of your dataset. However, for many practical applications, you most likely you will not need all metrics on all variables. For example, if you were producing a report for an assignment (or a client) you may want to customize the table.
1.9.4. Custom table: df.agg()#
You can create a tailored summary table using the df.agg() function (short for aggregate).
Here the syntax is as follows:
df.agg({'columname1':['stat1', 'stat2', 'stat3'], 'columnname2':['stat1']})
This function takes a dictionary which is a Python structure which takes keys and values:
The keys are the names of the columns you want to summarise.
The values (inside the lists) are the statistics you want to calculate for each column.
Example#
here’s how to produce a table showing the mean and standard deviation for Tmin and Tmax:
# get mean and sd for Tmin and Tmax
weather.agg({'Tmin':['mean', 'std'], 'Tmax':['mean', 'std']})
| Tmin | Tmax | |
|---|---|---|
| mean | 6.156018 | 13.890153 |
| std | 5.235341 | 6.490258 |
1.9.5. Custom table for a subset of data: df.query('column == X').agg()#
Finally, instead of creating a table for the entire DataFrame (weather), you can combine what you’ve learned about filtering rows with .query() and summarising data with .agg(). This would allow you to generate descriptive statistics for just a subset of your data
Example#
Get the average and standard deviation for the minimum and maximum temperature for June
# get mean and sd for Tmin and Tmax in June
weather.query('Month == "Jun"').agg({'Tmin':['mean', 'std'], 'Tmax':['mean', 'std']})
| Tmin | Tmax | |
|---|---|---|
| mean | 10.328291 | 20.011487 |
| std | 2.664211 | 3.585932 |
1.9.6. Disaggregate by category: df.groupby()#
The syntax df.groupby() allows you to split a DataFrame into groups based on the values in one or more categorical columns.
Any function you apply afterwards will then be performed separately for each group. This provides a powerful way to summarise your data by category.
Example#
say we want to report the mean value of the peak daily temperature in each month. Remember to get the mean value of the peak daily temperature overall we did:
weather.Tmax.mean()
np.float64(13.890153065515406)
But if we want to see how this value changes from month to month, we can use groupby('MM') to disaggregate the data.
This tells Pandas to:
Split the DataFrame into groups based on the unique values in the MM column.
Calculate the mean of Tmax within each group.
The output will be a Series showing one mean value for each month:
weather.groupby('MM').Tmax.mean()
MM
1 6.554444
2 7.401048
3 9.944914
4 13.187517
5 16.795252
6 20.011487
7 21.799007
8 21.192936
9 18.451043
10 14.112639
11 9.640041
12 7.290571
Name: Tmax, dtype: float64
We can of course combine groupby() with agg() to make a custom table of statistics for each group in the data.
This lets you summarise several variables or calculate multiple descriptive statistics on a category by category basis.
Example#
If we want to show the mean and standard deviation for the minimum and maximum temperature, by month:
# get mean and sd for Tmin and Tmax in each month
weather.groupby('MM').agg({'Tmin':['mean', 'std'], 'Tmax':['mean', 'std']})
| Tmin | Tmax | |||
|---|---|---|---|---|
| mean | std | mean | std | |
| MM | ||||
| 1 | 1.319437 | 4.034070 | 6.554444 | 3.831624 |
| 2 | 1.470683 | 3.861644 | 7.401048 | 3.723290 |
| 3 | 2.396840 | 3.472544 | 9.944914 | 3.641816 |
| 4 | 4.301786 | 3.144024 | 13.187517 | 3.648047 |
| 5 | 7.165062 | 3.108733 | 16.795252 | 3.761523 |
| 6 | 10.328291 | 2.664211 | 20.011487 | 3.585932 |
| 7 | 12.238098 | 2.457236 | 21.799007 | 3.511055 |
| 8 | 11.965261 | 2.627916 | 21.192936 | 3.232944 |
| 9 | 9.824855 | 3.241090 | 18.451043 | 3.088003 |
| 10 | 6.874028 | 3.825368 | 14.112639 | 3.090256 |
| 11 | 3.692014 | 3.870762 | 9.640041 | 3.254649 |
| 12 | 2.067461 | 4.131333 | 7.290571 | 3.801597 |
1.9.7. Practice#
Try these quick practice questions:
# Make a table containing the mean rainfall in each month
# Check: the mean in October is 2.17mm
# Get the min and max temperature for each month in the year 2000
# Check: the min and max in May are 4.9 and 25.3 degrees respectively