1.8. Python: Descriptives and Indexing#
In this notebook we cover some key features of the Pandas library which will be one of the libraries we will be using most often when working with data in Python.
Pandas provides efficient and flexible ways to store, clean, explore, and analyse data. Its core structure, the DataFrame, is like a digital spreadsheet (i.e., like excel): it has rows and columns, each of which can hold numerical or categorical information. Almost every dataset you’ll work with in psychology, biomedical, or human sciences can be handled easily with Pandas.
Here we will focus on two fundamental skills:
Using
Pandassyntax to calculate descriptive statistics (such as means, medians, and standard deviations).Learning how to index a DataFrame (i.e., how to select specific rows and columns you want to analyse)
Indexing is crucial because it lets you apply statistical functions only to the parts of your data you’re interested in.
We’ll meet indexing here in the context of descriptive statistics, but you’ll use it every single time you analyse data in Python. In fact, incorrect indexing is the most common source of bugs students encounter in this course, so it’s well worth mastering early!
Please Work through all the exercises in this notebook carefully before your tutorial.
1.8.1. Set up Python Libraries#
As usual, start by running the code block below to import the Python libraries you’ll need for this notebook.
NOTE: Here we are importing the library pandas using the alias pd. That means whenever you call a function from the pandas library you’ll use the shorthand pd which conveniently saving yourself 4 extra keystrokes!
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings
warnings.simplefilter('ignore', category=FutureWarning)
1.8.2. Import a dataset to work with#
We’ll be working with weather data recorded at the Oxford Weather Station. The code block below will automatically download the dataset from the internet using the read_csv() function from the pandas library.
The read_csv() function loads the data from a comma-separated values (CSV) file and stores it in a DataFrame. Here we’ve decided to name that DataFrame weather but we could have called it whatever we want as this is just a variable name.
We’ll then use the display() function to show a quick preview of the DataFrame so you can see what the data look like (rows, columns, and headings).
weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/OxfordWeather.csv")
display(weather)
| YYYY | Month | MM | DD | DD365 | Tmax | Tmin | Tmean | Trange | Rainfall_mm | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1827 | Jan | 1 | 1 | 1 | 8.3 | 5.6 | 7.0 | 2.7 | 0.0 |
| 1 | 1827 | Jan | 1 | 2 | 2 | 2.2 | 0.0 | 1.1 | 2.2 | 0.0 |
| 2 | 1827 | Jan | 1 | 3 | 3 | -2.2 | -8.3 | -5.3 | 6.1 | 9.7 |
| 3 | 1827 | Jan | 1 | 4 | 4 | -1.7 | -7.8 | -4.8 | 6.1 | 0.0 |
| 4 | 1827 | Jan | 1 | 5 | 5 | 0.0 | -10.6 | -5.3 | 10.6 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71338 | 2022 | Apr | 4 | 26 | 116 | 15.2 | 4.1 | 9.7 | 11.1 | 0.0 |
| 71339 | 2022 | Apr | 4 | 27 | 117 | 10.7 | 2.6 | 6.7 | 8.1 | 0.0 |
| 71340 | 2022 | Apr | 4 | 28 | 118 | 12.7 | 3.9 | 8.3 | 8.8 | 0.0 |
| 71341 | 2022 | Apr | 4 | 29 | 119 | 11.7 | 6.7 | 9.2 | 5.0 | 0.0 |
| 71342 | 2022 | Apr | 4 | 30 | 120 | 17.6 | 1.0 | 9.3 | 16.6 | 0.0 |
71343 rows × 10 columns
What do you think each of the columns is referring to?
What does each row represent?
1.8.3. Get descriptive statistics#
In this section, we’ll explore indexing in the context of descripive statistics. You might want to calculate these statistics for different variables (columns) and/or for a specific subset of rows.
The Pandas library includes a range of built-in functions that make this quick and straightforward:
df.mean()- gets the meandf.median()- gets the mediandf.var()- gets the variancedf.std()- gets the standard deviationdf.min()- gets the minimum valuedf.max()- gets the maximum valuedf.corr()- gets the correlation coefficient (Pearson or Spearman)
NOTE: In these examples, df is simply a placeholder for your DataFrame so you should be sure to replace it with the actual name of your dataset (e.g., weather).
You can replace df with the name of your dataframe (and we need to tell the function to use only columns containing numerical data)
weather.mean(numeric_only=True)
YYYY 1924.165174
MM 6.516210
DD 15.729154
DD365 182.917077
Tmax 13.890153
Tmin 6.156018
Tmean 10.046208
Trange 7.734244
Rainfall_mm 1.786964
dtype: float64
Which columns from the dataframe are missing from the table of means?
Month is missing. Pandas will only calculate statistics for columns containing numerical data. If your DataFrame includes text (strings) or categorical variables, you may need to specify only the numeric columns within the function call.
1.8.4. Decriptives for a single column#
Sometimes you’ll want to calculate a descriptive statistic (like the mean) for just one variable rather than for the whole DataFrame.
The general format is::
df.column_name.mean()
For example, in our dataframe (weather) you may want to know the mean of the column which shows the max temperature (Tmax), you can write:
weather.Tmax.mean()
np.float64(13.890153065515406)
Instead of a whole table of means, many of which were irrelevant, we now just have the single number we wanted - the mean of Tmax
Think check that the value obtained this way matches the one in the table above.
Why does it work?#
The commmand weather.Tmax.mean() actually involves two steps.
Step 1:
The syntax weather.Tmax (or more generally, df.column) returns just the one column of the dataframe as a Pandas series. A series is essentially a one-dimensional version of a DataFrame that holds a single column of data along with its index (the row labels)
Let’s take a Look at the result of the call weather.Tmax
weather.Tmax
0 8.3
1 2.2
2 -2.2
3 -1.7
4 0.0
...
71338 15.2
71339 10.7
71340 12.7
71341 11.7
71342 17.6
Name: Tmax, Length: 71343, dtype: float64
What does the first number in this list (it should be 8.3) refer to ? how about the second number?
If we liked, we could give this single-column series a name a ‘save’ it into a variable with a name of our choosing:
DailyMaxima = weather.Tmax
print(DailyMaxima)
0 8.3
1 2.2
2 -2.2
3 -1.7
4 0.0
...
71338 15.2
71339 10.7
71340 12.7
71341 11.7
71342 17.6
Name: Tmax, Length: 71343, dtype: float64
From there, we could use the mean function and apply it to our new variable:
DailyMaxima.mean()
np.float64(13.890153065515406)
However, we don’t need to do that as we can just tack the .mean() function onto it directly as follows:
weather.Tmax.mean()
np.float64(13.890153065515406)
Step 2:
Here, the syntax .mean() gets the mean of the thing before the dot (i.e. the series you created with the command weather.Tmax)
Exercises#
Complete the following quick practice questions.
Note that the numerical answers are provided so you can check that you have done it correctly.
# get the mean daily rainfall
# answer = 1.79
# get the median daily rainfall
# answer = 0.0
# think - why??
# get the standard deviation of the maximum daily temperature
# answer = 6.49
# find the minimum value of "year"
# answer = 1827
1.8.5. Get descriptives for a subset of#
Sometimes you don’t want descriptives for the entire dataset — you just want to look at a subset of the rows that meet certain conditions (for example, all data from a particular year). Pandas provides an elegant way to do this using the query function.
What does .query() do?
The query function lets you filter a DataFrame using a simple, readable condition written as a string. It checks each row to see whether it meets your specified criteria and then returns only the rows that do.
Some basic use cases are:
df.query('columnname == number')df.query('columnname == "string"')df.query('columnname1 == value1 and columnname2 == value2')
Example#
Say we want to know the mean daily maximum temperature for the year 1921.
Step 1: find the relevant rows#
We first need to pull out the rows where the column YYYY equals 1921.
Notes:
The entire query is enclosed in quotation marks (’ ‘).
The double equals sign (==) is used to check if two values are equal, this is different from a single equals (=), which is used for assigning values (e.g.,
a = 15, would assign the value of 15 to the variable a;a == 15would check if the variable a is equal to 15 and produce either a True or a False)
# get all the rows where year is 1921
weather.query('YYYY == 1921')
| YYYY | Month | MM | DD | DD365 | Tmax | Tmin | Tmean | Trange | Rainfall_mm | |
|---|---|---|---|---|---|---|---|---|---|---|
| 34333 | 1921 | Dec | 12 | 31 | 1 | 9.1 | 3.3 | 6.2 | 5.8 | 0.0 |
| 34334 | 1921 | Jan | 1 | 1 | 2 | 12.2 | 7.1 | 9.7 | 5.1 | 9.7 |
| 34335 | 1921 | Jan | 1 | 2 | 3 | 11.9 | 8.7 | 10.3 | 3.2 | 6.1 |
| 34336 | 1921 | Jan | 1 | 3 | 4 | 9.9 | 5.9 | 7.9 | 4.0 | 0.0 |
| 34337 | 1921 | Jan | 1 | 4 | 5 | 12.3 | 9.8 | 11.1 | 2.5 | 5.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 34693 | 1921 | Dec | 12 | 26 | 361 | 10.8 | -3.1 | 3.9 | 13.9 | 5.6 |
| 34694 | 1921 | Dec | 12 | 27 | 362 | 11.2 | 5.4 | 8.3 | 5.8 | 1.3 |
| 34695 | 1921 | Dec | 12 | 28 | 363 | 13.9 | 3.9 | 8.9 | 10.0 | 0.0 |
| 34696 | 1921 | Dec | 12 | 29 | 364 | 7.6 | 1.7 | 4.7 | 5.9 | 0.8 |
| 34697 | 1921 | Dec | 12 | 30 | 365 | 11.7 | 1.9 | 6.8 | 9.8 | 4.5 |
365 rows × 10 columns
Note the size of this table - there are 365 rows. Why?
In the original table
weather, there is one row per day365 of those rows (days) match the criterion ‘YYYY = 1921’ because there are 365 days in a year.
If I wanted to, I could give this table a name and save it for later use, note the use of the single vs. double equal sign:
weather1921 = weather.query('YYYY == 1921') # create a new dataframe for just 1921
weather1921 # look at my new dataframe
| YYYY | Month | MM | DD | DD365 | Tmax | Tmin | Tmean | Trange | Rainfall_mm | |
|---|---|---|---|---|---|---|---|---|---|---|
| 34333 | 1921 | Dec | 12 | 31 | 1 | 9.1 | 3.3 | 6.2 | 5.8 | 0.0 |
| 34334 | 1921 | Jan | 1 | 1 | 2 | 12.2 | 7.1 | 9.7 | 5.1 | 9.7 |
| 34335 | 1921 | Jan | 1 | 2 | 3 | 11.9 | 8.7 | 10.3 | 3.2 | 6.1 |
| 34336 | 1921 | Jan | 1 | 3 | 4 | 9.9 | 5.9 | 7.9 | 4.0 | 0.0 |
| 34337 | 1921 | Jan | 1 | 4 | 5 | 12.3 | 9.8 | 11.1 | 2.5 | 5.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 34693 | 1921 | Dec | 12 | 26 | 361 | 10.8 | -3.1 | 3.9 | 13.9 | 5.6 |
| 34694 | 1921 | Dec | 12 | 27 | 362 | 11.2 | 5.4 | 8.3 | 5.8 | 1.3 |
| 34695 | 1921 | Dec | 12 | 28 | 363 | 13.9 | 3.9 | 8.9 | 10.0 | 0.0 |
| 34696 | 1921 | Dec | 12 | 29 | 364 | 7.6 | 1.7 | 4.7 | 5.9 | 0.8 |
| 34697 | 1921 | Dec | 12 | 30 | 365 | 11.7 | 1.9 | 6.8 | 9.8 | 4.5 |
365 rows × 10 columns
… but I don’t need to do this as I can tack on more commands to complete my goal in a single line of code.
Step 2: find the relevant column#
Now that I have grabbed the relevant rows, I need to access the relevant column Tmax
weather.query('YYYY == 1921').Tmax
34333 9.1
34334 12.2
34335 11.9
34336 9.9
34337 12.3
...
34693 10.8
34694 11.2
34695 13.9
34696 7.6
34697 11.7
Name: Tmax, Length: 365, dtype: float64
Step 3: add in the actual function#
Finally, I can use a function to complete some set of operations on the values I’ve obtained from Steps 1 & 2. Above I mentioned that we want to know the mean daily maximum temperature for 1921. So the final step is to include the function mean():
weather.query('YYYY == 1921').Tmax.mean()
np.float64(15.651232876712331)
Exercises#
Complete the following quick practice questions:
# Get the mean daily maximum temperature in 2006
# answer = 15.7
# Get the mean daily minimum temperature in 1947
# answer = 6.16
1.8.6. Get descriptives for one category#
So far, we’ve filtered rows based on numeric values (like the year 1921). But what if we want to calculate descriptives for a specific category, such as a particular month?
Example#
say we want to know the mean daily maximum temperature for October. We need to pull out all the rows where the column ‘Month’ matches the string ‘Oct’.
The syntax is almost the same as before, but since we’re matching a string rather than a number, the category name must be enclosed in quotation marks inside the query:
weather.query('Month == "Oct"').Tmax.mean()
np.float64(14.112638544251446)
Some Notes:
you can actually use single and double quotes interchangeably in Python, but in cases like this where you have quotes within quotes, it is better to use both types to avoid the computer getting confused about which opening and closing quotes to pair up.
so you could do either of these
weather.query('Month == "Oct"').Tmax.mean()weather.query("Month == 'Oct'").Tmax.mean()
but not these
weather.query('Month == 'Oct'').Tmax.mean()weather.query("Month == "Oct"").Tmax.mean()
if you are not sure how the months are coded you can try running
weather.Month.unique()which will print a list of the unique values in the column Month
weather.Month.unique()
array(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep',
'Oct', 'Nov', 'Dec'], dtype=object)
Exercises#
Complete the following quick practice questions:
# Get the mean daily rainfall in January
# answer = 1.77
# Get the mean daily rainfall in June
# answer = 1.81
# Is this surprising? why?
1.8.7. Match multiple conditions#
What if we want to know the mean daily maximum temperature for June 1921. That is, only for rows where the year is 1921 and the month is “Jun”?
We can simply pass multiple conditions to df.query() by connecting them with and (or its symbol form &). Each condition should be written as you would on its own, and you combine them inside the same set of quotes.
# mean daily maximumm temperature in 1921
weather.query('YYYY == 1921 & Month == "Jun"').Tmax.mean()
np.float64(21.443333333333335)
Think hopefully the value for June is higher than the average for the whole year? If not something has gone wrong!
weather.query('YYYY == 1921').Tmax.mean()
np.float64(15.651232876712331)
Exercises#
Complete the following quick practice questions:
# Get the mean daily rainfall in June 2007
# answer = 2.62
# Get the minimum temperature in January 1947
# answer = -12.4. This was an unusually cold winter.
# note that you were asked for the min temperature in that month, not the mean of the daily minimum Tmin)
1.8.8. Syntax for correlation#
When we calculate the correlation between two variables we will need to access two columns - specifically, the two variables we are comparing.
Example#
Say we want to find the correlation between the minimum and maximum daily temperatures, stored in the columns Tmin and Tmax.
Correlation is a symmetrical measure which means that it doesn’t matter which variable is treated as \(x\) and which as \(y\). However, in Python we still need to choose one column to start from when using the pandas function.
So we will begin by selecting one column (again, it doesn’t matter which one):
weather.Tminwill simply print out the contents of the column
weather.Tmin
0 5.6
1 0.0
2 -8.3
3 -7.8
4 -10.6
...
71338 4.1
71339 2.6
71340 3.9
71341 6.7
71342 1.0
Name: Tmin, Length: 71343, dtype: float64
Now we can call the pandas function .corr() to calculate the correlation. If you are every unsure about how a function works you can use Python’s built in help() function.
From this we learn that .corr() takes an argument "other" of type Series. This means that we need to specify another column (or Series) to correlate with our first one.
help(weather.Tmin.corr)
Help on method corr in module pandas.core.series:
corr(other: 'Series', method: 'CorrelationMethod' = 'pearson', min_periods: 'int | None' = None) -> 'float' method of pandas.core.series.Series instance
Compute correlation with `other` Series, excluding missing values.
The two `Series` objects are not required to be the same length and will be
aligned internally before the correlation function is applied.
Parameters
----------
other : Series
Series with which to compute the correlation.
method : {'pearson', 'kendall', 'spearman'} or callable
Method used to compute correlation:
- pearson : Standard correlation coefficient
- kendall : Kendall Tau correlation coefficient
- spearman : Spearman rank correlation
- callable: Callable with input two 1d ndarrays and returning a float.
.. warning::
Note that the returned matrix from corr will have 1 along the
diagonals and will be symmetric regardless of the callable's
behavior.
min_periods : int, optional
Minimum number of observations needed to have a valid result.
Returns
-------
float
Correlation with other.
See Also
--------
DataFrame.corr : Compute pairwise correlation between columns.
DataFrame.corrwith : Compute pairwise correlation with another
DataFrame or Series.
Notes
-----
Pearson, Kendall and Spearman correlation are currently computed using pairwise complete observations.
* `Pearson correlation coefficient <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`_
* `Kendall rank correlation coefficient <https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient>`_
* `Spearman's rank correlation coefficient <https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient>`_
Automatic data alignment: as with all pandas operations, automatic data alignment is performed for this method.
``corr()`` automatically considers values with matching indices.
Examples
--------
>>> def histogram_intersection(a, b):
... v = np.minimum(a, b).sum().round(decimals=1)
... return v
>>> s1 = pd.Series([.2, .0, .6, .2])
>>> s2 = pd.Series([.3, .6, .0, .1])
>>> s1.corr(s2, method=histogram_intersection)
0.3
Pandas auto-aligns the values with matching indices
>>> s1 = pd.Series([1, 2, 3], index=[0, 1, 2])
>>> s2 = pd.Series([1, 2, 3], index=[2, 1, 0])
>>> s1.corr(s2)
-1.0
So in order to calculate the correlation between Tmin and Tmax we write:
# get the correlation between columns Tmin and Tmax of dataframe weather
weather.Tmin.corr(weather.Tmax)
np.float64(0.8414800914062079)
Concept Check
Is this a high correlation?
Would you expect the sign to change if you swap the order of Tmin and Tmax?
Why or Why Not?
We could change what we correlate against Tmin by changing what is inside the brackets, eg:
# get the correlation between columns Tmin and Tmean of dataframe weather
weather.Tmin.corr(weather.Tmean)
np.float64(0.9502479355058373)
Since correlation is ‘symmetrical’ (it doesn’t matter which variable is \(x\) and which is \(y\)) we can also switch the variables around:
# get the correlation between columns Tmin and Tmean of dataframe weather
weather.Tmean.corr(weather.Tmin)
np.float64(0.9502479355058373)
… we just need one column specified outside the brackets and one inside
Correlation type#
We can set the type of correlation to Pearson or Spearman as appropriate (the default is Pearson’s \(r\)), Have a look above at the help for this function and see if you can find out where this argument is specified:
# get the correlation between columns Tmin and Tmax of dataframe weather
weather.Tmin.corr(weather.Tmax, method='spearman')
np.float64(0.8438796074978867)
Exercises#
Complete the following practice questions to help you get to grips with the syntax for correlation:
# Find the Spearman correlation between daily mean temperature Tmean, and rainfall
# answer = -0.019
# Find the correlation between year and daily mean temperature
# answer = 0.083
Think - are these strong or weak correlations? Is this surprising, did you expect something different?
All-by-all correlation matrix#
We can also quickly output all the correlations between all possible pairs of columns in a dataframe by simply not giving any column index (but we do have to tell it to only use the numeric columns):
weather.corr(numeric_only=True)
| YYYY | MM | DD | DD365 | Tmax | Tmin | Tmean | Trange | Rainfall_mm | |
|---|---|---|---|---|---|---|---|---|---|
| YYYY | 1.000000 | -0.003411 | -0.000059 | -0.003372 | 0.071631 | 0.089683 | 0.083044 | -0.001257 | 0.008117 |
| MM | -0.003411 | 1.000000 | 0.010567 | 0.995580 | 0.179681 | 0.235401 | 0.213082 | -0.018820 | 0.043672 |
| DD | -0.000059 | 0.010567 | 1.000000 | 0.092771 | 0.001217 | 0.002876 | 0.002035 | -0.002055 | 0.005315 |
| DD365 | -0.003372 | 0.995580 | 0.092771 | 1.000000 | 0.177016 | 0.233248 | 0.210545 | -0.020536 | 0.043925 |
| Tmax | 0.071631 | 0.179681 | 0.001217 | 0.177016 | 1.000000 | 0.841480 | 0.967881 | 0.593339 | -0.008807 |
| Tmin | 0.089683 | 0.235401 | 0.002876 | 0.233248 | 0.841480 | 1.000000 | 0.950248 | 0.064379 | 0.086181 |
| Tmean | 0.083044 | 0.213082 | 0.002035 | 0.210545 | 0.967881 | 0.950248 | 1.000000 | 0.371965 | 0.035037 |
| Trange | -0.001257 | -0.018820 | -0.002055 | -0.020536 | 0.593339 | 0.064379 | 0.371965 | 1.000000 | -0.144654 |
| Rainfall_mm | 0.008117 | 0.043672 | 0.005315 | 0.043925 | -0.008807 | 0.086181 | 0.035037 | -0.144654 | 1.000000 |
The problem with this is that the figure we want is often buried amongst a lot of irrelevant correlation coefficients.
Correlation using scipy.stats#
Note that here we are using the pandas function df.corr()
Later in the course we will use two functions from a library called scipy.stats:
stats.pearsonr()stats.spearmanr()
These are actually a bit more useful as they give us a \(p\)-value for the correlation as well as the correlation coefficient
don’t worry if you don’t know what a \(p\)-value is yet - this will be covered later in the course
scipy.stats will be used extensively in the module on hypothesis testing