1.8. Python: Descriptives and Indexing#

In this notebook we cover some key features of the Pandas library which will be one of the libraries we will be using most often when working with data in Python.

Pandas provides efficient and flexible ways to store, clean, explore, and analyse data. Its core structure, the DataFrame, is like a digital spreadsheet (i.e., like excel): it has rows and columns, each of which can hold numerical or categorical information. Almost every dataset you’ll work with in psychology, biomedical, or human sciences can be handled easily with Pandas.

Here we will focus on two fundamental skills:

  • Using Pandas syntax to calculate descriptive statistics (such as means, medians, and standard deviations).

  • Learning how to index a DataFrame (i.e., how to select specific rows and columns you want to analyse)

    • Indexing is crucial because it lets you apply statistical functions only to the parts of your data you’re interested in.

We’ll meet indexing here in the context of descriptive statistics, but you’ll use it every single time you analyse data in Python. In fact, incorrect indexing is the most common source of bugs students encounter in this course, so it’s well worth mastering early!

Please Work through all the exercises in this notebook carefully before your tutorial.

1.8.1. Set up Python Libraries#

As usual, start by running the code block below to import the Python libraries you’ll need for this notebook.

NOTE: Here we are importing the library pandas using the alias pd. That means whenever you call a function from the pandas library you’ll use the shorthand pd which conveniently saving yourself 4 extra keystrokes!

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd 
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings 
warnings.simplefilter('ignore', category=FutureWarning)

1.8.2. Import a dataset to work with#

We’ll be working with weather data recorded at the Oxford Weather Station. The code block below will automatically download the dataset from the internet using the read_csv() function from the pandas library.

The read_csv() function loads the data from a comma-separated values (CSV) file and stores it in a DataFrame. Here we’ve decided to name that DataFrame weather but we could have called it whatever we want as this is just a variable name.

We’ll then use the display() function to show a quick preview of the DataFrame so you can see what the data look like (rows, columns, and headings).

weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/OxfordWeather.csv")
display(weather)
YYYY Month MM DD DD365 Tmax Tmin Tmean Trange Rainfall_mm
0 1827 Jan 1 1 1 8.3 5.6 7.0 2.7 0.0
1 1827 Jan 1 2 2 2.2 0.0 1.1 2.2 0.0
2 1827 Jan 1 3 3 -2.2 -8.3 -5.3 6.1 9.7
3 1827 Jan 1 4 4 -1.7 -7.8 -4.8 6.1 0.0
4 1827 Jan 1 5 5 0.0 -10.6 -5.3 10.6 0.0
... ... ... ... ... ... ... ... ... ... ...
71338 2022 Apr 4 26 116 15.2 4.1 9.7 11.1 0.0
71339 2022 Apr 4 27 117 10.7 2.6 6.7 8.1 0.0
71340 2022 Apr 4 28 118 12.7 3.9 8.3 8.8 0.0
71341 2022 Apr 4 29 119 11.7 6.7 9.2 5.0 0.0
71342 2022 Apr 4 30 120 17.6 1.0 9.3 16.6 0.0

71343 rows × 10 columns

  • What do you think each of the columns is referring to?

  • What does each row represent?

1.8.3. Get descriptive statistics#

In this section, we’ll explore indexing in the context of descripive statistics. You might want to calculate these statistics for different variables (columns) and/or for a specific subset of rows.

The Pandas library includes a range of built-in functions that make this quick and straightforward:

  • df.mean() - gets the mean

  • df.median() - gets the median

  • df.var() - gets the variance

  • df.std() - gets the standard deviation

  • df.min() - gets the minimum value

  • df.max() - gets the maximum value

  • df.corr() - gets the correlation coefficient (Pearson or Spearman)

NOTE: In these examples, df is simply a placeholder for your DataFrame so you should be sure to replace it with the actual name of your dataset (e.g., weather).

You can replace df with the name of your dataframe (and we need to tell the function to use only columns containing numerical data)

weather.mean(numeric_only=True)
YYYY           1924.165174
MM                6.516210
DD               15.729154
DD365           182.917077
Tmax             13.890153
Tmin              6.156018
Tmean            10.046208
Trange            7.734244
Rainfall_mm       1.786964
dtype: float64

Which columns from the dataframe are missing from the table of means?

  • Month is missing. Pandas will only calculate statistics for columns containing numerical data. If your DataFrame includes text (strings) or categorical variables, you may need to specify only the numeric columns within the function call.

1.8.4. Decriptives for a single column#

Sometimes you’ll want to calculate a descriptive statistic (like the mean) for just one variable rather than for the whole DataFrame.

The general format is::

  • df.column_name.mean()

For example, in our dataframe (weather) you may want to know the mean of the column which shows the max temperature (Tmax), you can write:

weather.Tmax.mean()
np.float64(13.890153065515406)

Instead of a whole table of means, many of which were irrelevant, we now just have the single number we wanted - the mean of Tmax

  • Think check that the value obtained this way matches the one in the table above.

Why does it work?#

The commmand weather.Tmax.mean() actually involves two steps.

Step 1:

The syntax weather.Tmax (or more generally, df.column) returns just the one column of the dataframe as a Pandas series. A series is essentially a one-dimensional version of a DataFrame that holds a single column of data along with its index (the row labels)

Let’s take a Look at the result of the call weather.Tmax

weather.Tmax
0         8.3
1         2.2
2        -2.2
3        -1.7
4         0.0
         ... 
71338    15.2
71339    10.7
71340    12.7
71341    11.7
71342    17.6
Name: Tmax, Length: 71343, dtype: float64
  • What does the first number in this list (it should be 8.3) refer to ? how about the second number?

If we liked, we could give this single-column series a name a ‘save’ it into a variable with a name of our choosing:

DailyMaxima  = weather.Tmax
print(DailyMaxima)
0         8.3
1         2.2
2        -2.2
3        -1.7
4         0.0
         ... 
71338    15.2
71339    10.7
71340    12.7
71341    11.7
71342    17.6
Name: Tmax, Length: 71343, dtype: float64

From there, we could use the mean function and apply it to our new variable:

DailyMaxima.mean()
np.float64(13.890153065515406)

However, we don’t need to do that as we can just tack the .mean() function onto it directly as follows:

weather.Tmax.mean()
np.float64(13.890153065515406)

Step 2:

Here, the syntax .mean() gets the mean of the thing before the dot (i.e. the series you created with the command weather.Tmax)

Exercises#

Complete the following quick practice questions.

Note that the numerical answers are provided so you can check that you have done it correctly.

# get the mean daily rainfall
# answer = 1.79
# get the median daily rainfall
# answer = 0.0
# think - why??
# get the standard deviation of the maximum daily temperature
# answer = 6.49
# find the minimum value of "year" 
# answer = 1827

1.8.5. Get descriptives for a subset of#

Sometimes you don’t want descriptives for the entire dataset — you just want to look at a subset of the rows that meet certain conditions (for example, all data from a particular year). Pandas provides an elegant way to do this using the query function.

What does .query() do?

The query function lets you filter a DataFrame using a simple, readable condition written as a string. It checks each row to see whether it meets your specified criteria and then returns only the rows that do.

Some basic use cases are:

  • df.query('columnname == number')

  • df.query('columnname == "string"')

  • df.query('columnname1 == value1 and columnname2 == value2')

Example#

Say we want to know the mean daily maximum temperature for the year 1921.

Step 1: find the relevant rows#

We first need to pull out the rows where the column YYYY equals 1921.

Notes:

  • The entire query is enclosed in quotation marks (’ ‘).

  • The double equals sign (==) is used to check if two values are equal, this is different from a single equals (=), which is used for assigning values (e.g., a = 15, would assign the value of 15 to the variable a; a == 15 would check if the variable a is equal to 15 and produce either a True or a False)

# get all the rows where year is 1921
weather.query('YYYY == 1921')
YYYY Month MM DD DD365 Tmax Tmin Tmean Trange Rainfall_mm
34333 1921 Dec 12 31 1 9.1 3.3 6.2 5.8 0.0
34334 1921 Jan 1 1 2 12.2 7.1 9.7 5.1 9.7
34335 1921 Jan 1 2 3 11.9 8.7 10.3 3.2 6.1
34336 1921 Jan 1 3 4 9.9 5.9 7.9 4.0 0.0
34337 1921 Jan 1 4 5 12.3 9.8 11.1 2.5 5.5
... ... ... ... ... ... ... ... ... ... ...
34693 1921 Dec 12 26 361 10.8 -3.1 3.9 13.9 5.6
34694 1921 Dec 12 27 362 11.2 5.4 8.3 5.8 1.3
34695 1921 Dec 12 28 363 13.9 3.9 8.9 10.0 0.0
34696 1921 Dec 12 29 364 7.6 1.7 4.7 5.9 0.8
34697 1921 Dec 12 30 365 11.7 1.9 6.8 9.8 4.5

365 rows × 10 columns

Note the size of this table - there are 365 rows. Why?

  • In the original table weather, there is one row per day

  • 365 of those rows (days) match the criterion ‘YYYY = 1921’ because there are 365 days in a year.

If I wanted to, I could give this table a name and save it for later use, note the use of the single vs. double equal sign:

weather1921 = weather.query('YYYY == 1921') # create a new dataframe for just 1921
weather1921 # look at my new dataframe
YYYY Month MM DD DD365 Tmax Tmin Tmean Trange Rainfall_mm
34333 1921 Dec 12 31 1 9.1 3.3 6.2 5.8 0.0
34334 1921 Jan 1 1 2 12.2 7.1 9.7 5.1 9.7
34335 1921 Jan 1 2 3 11.9 8.7 10.3 3.2 6.1
34336 1921 Jan 1 3 4 9.9 5.9 7.9 4.0 0.0
34337 1921 Jan 1 4 5 12.3 9.8 11.1 2.5 5.5
... ... ... ... ... ... ... ... ... ... ...
34693 1921 Dec 12 26 361 10.8 -3.1 3.9 13.9 5.6
34694 1921 Dec 12 27 362 11.2 5.4 8.3 5.8 1.3
34695 1921 Dec 12 28 363 13.9 3.9 8.9 10.0 0.0
34696 1921 Dec 12 29 364 7.6 1.7 4.7 5.9 0.8
34697 1921 Dec 12 30 365 11.7 1.9 6.8 9.8 4.5

365 rows × 10 columns

… but I don’t need to do this as I can tack on more commands to complete my goal in a single line of code.

Step 2: find the relevant column#

Now that I have grabbed the relevant rows, I need to access the relevant column Tmax

weather.query('YYYY == 1921').Tmax
34333     9.1
34334    12.2
34335    11.9
34336     9.9
34337    12.3
         ... 
34693    10.8
34694    11.2
34695    13.9
34696     7.6
34697    11.7
Name: Tmax, Length: 365, dtype: float64

Step 3: add in the actual function#

Finally, I can use a function to complete some set of operations on the values I’ve obtained from Steps 1 & 2. Above I mentioned that we want to know the mean daily maximum temperature for 1921. So the final step is to include the function mean():

weather.query('YYYY == 1921').Tmax.mean()
np.float64(15.651232876712331)

Exercises#

Complete the following quick practice questions:

# Get the mean daily maximum temperature in 2006
# answer = 15.7
# Get the mean daily minimum temperature in 1947
# answer = 6.16

1.8.6. Get descriptives for one category#

So far, we’ve filtered rows based on numeric values (like the year 1921). But what if we want to calculate descriptives for a specific category, such as a particular month?

Example#

say we want to know the mean daily maximum temperature for October. We need to pull out all the rows where the column ‘Month’ matches the string ‘Oct’.

The syntax is almost the same as before, but since we’re matching a string rather than a number, the category name must be enclosed in quotation marks inside the query:

weather.query('Month == "Oct"').Tmax.mean()
np.float64(14.112638544251446)

Some Notes:

  1. you can actually use single and double quotes interchangeably in Python, but in cases like this where you have quotes within quotes, it is better to use both types to avoid the computer getting confused about which opening and closing quotes to pair up.

    so you could do either of these

    • weather.query('Month == "Oct"').Tmax.mean()

    • weather.query("Month == 'Oct'").Tmax.mean()

    but not these

    • weather.query('Month == 'Oct'').Tmax.mean()

    • weather.query("Month == "Oct"").Tmax.mean()

  2. if you are not sure how the months are coded you can try running weather.Month.unique() which will print a list of the unique values in the column Month

weather.Month.unique()
array(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep',
       'Oct', 'Nov', 'Dec'], dtype=object)

Exercises#

Complete the following quick practice questions:

# Get the mean daily rainfall in January
# answer = 1.77
# Get the mean daily rainfall in June
# answer = 1.81
# Is this surprising? why?

1.8.7. Match multiple conditions#

What if we want to know the mean daily maximum temperature for June 1921. That is, only for rows where the year is 1921 and the month is “Jun”?

We can simply pass multiple conditions to df.query() by connecting them with and (or its symbol form &). Each condition should be written as you would on its own, and you combine them inside the same set of quotes.

# mean daily maximumm temperature in 1921
weather.query('YYYY == 1921 & Month == "Jun"').Tmax.mean()
np.float64(21.443333333333335)
  • Think hopefully the value for June is higher than the average for the whole year? If not something has gone wrong!

weather.query('YYYY == 1921').Tmax.mean()
np.float64(15.651232876712331)

Exercises#

Complete the following quick practice questions:

# Get the mean daily rainfall in June 2007

# answer = 2.62
# Get the minimum temperature in January 1947

# answer = -12.4. This was an unusually cold winter.
# note that you were asked for the min temperature in that month, not the mean of the daily minimum Tmin)

1.8.8. Syntax for correlation#

When we calculate the correlation between two variables we will need to access two columns - specifically, the two variables we are comparing.

Example#

Say we want to find the correlation between the minimum and maximum daily temperatures, stored in the columns Tmin and Tmax.

Correlation is a symmetrical measure which means that it doesn’t matter which variable is treated as \(x\) and which as \(y\). However, in Python we still need to choose one column to start from when using the pandas function.

So we will begin by selecting one column (again, it doesn’t matter which one):

  • weather.Tmin will simply print out the contents of the column

weather.Tmin
0         5.6
1         0.0
2        -8.3
3        -7.8
4       -10.6
         ... 
71338     4.1
71339     2.6
71340     3.9
71341     6.7
71342     1.0
Name: Tmin, Length: 71343, dtype: float64

Now we can call the pandas function .corr() to calculate the correlation. If you are every unsure about how a function works you can use Python’s built in help() function.

From this we learn that .corr() takes an argument "other" of type Series. This means that we need to specify another column (or Series) to correlate with our first one.

help(weather.Tmin.corr)
Help on method corr in module pandas.core.series:

corr(other: 'Series', method: 'CorrelationMethod' = 'pearson', min_periods: 'int | None' = None) -> 'float' method of pandas.core.series.Series instance
    Compute correlation with `other` Series, excluding missing values.
    
    The two `Series` objects are not required to be the same length and will be
    aligned internally before the correlation function is applied.
    
    Parameters
    ----------
    other : Series
        Series with which to compute the correlation.
    method : {'pearson', 'kendall', 'spearman'} or callable
        Method used to compute correlation:
    
        - pearson : Standard correlation coefficient
        - kendall : Kendall Tau correlation coefficient
        - spearman : Spearman rank correlation
        - callable: Callable with input two 1d ndarrays and returning a float.
    
        .. warning::
            Note that the returned matrix from corr will have 1 along the
            diagonals and will be symmetric regardless of the callable's
            behavior.
    min_periods : int, optional
        Minimum number of observations needed to have a valid result.
    
    Returns
    -------
    float
        Correlation with other.
    
    See Also
    --------
    DataFrame.corr : Compute pairwise correlation between columns.
    DataFrame.corrwith : Compute pairwise correlation with another
        DataFrame or Series.
    
    Notes
    -----
    Pearson, Kendall and Spearman correlation are currently computed using pairwise complete observations.
    
    * `Pearson correlation coefficient <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`_
    * `Kendall rank correlation coefficient <https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient>`_
    * `Spearman's rank correlation coefficient <https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient>`_
    
    Automatic data alignment: as with all pandas operations, automatic data alignment is performed for this method.
    ``corr()`` automatically considers values with matching indices.
    
    Examples
    --------
    >>> def histogram_intersection(a, b):
    ...     v = np.minimum(a, b).sum().round(decimals=1)
    ...     return v
    >>> s1 = pd.Series([.2, .0, .6, .2])
    >>> s2 = pd.Series([.3, .6, .0, .1])
    >>> s1.corr(s2, method=histogram_intersection)
    0.3
    
    Pandas auto-aligns the values with matching indices
    
    >>> s1 = pd.Series([1, 2, 3], index=[0, 1, 2])
    >>> s2 = pd.Series([1, 2, 3], index=[2, 1, 0])
    >>> s1.corr(s2)
    -1.0

So in order to calculate the correlation between Tmin and Tmax we write:

# get the correlation between columns Tmin and Tmax of dataframe weather
weather.Tmin.corr(weather.Tmax)
np.float64(0.8414800914062079)

Concept Check

  • Is this a high correlation?

  • Would you expect the sign to change if you swap the order of Tmin and Tmax?

  • Why or Why Not?

We could change what we correlate against Tmin by changing what is inside the brackets, eg:

# get the correlation between columns Tmin and Tmean of dataframe weather
weather.Tmin.corr(weather.Tmean)
np.float64(0.9502479355058373)

Since correlation is ‘symmetrical’ (it doesn’t matter which variable is \(x\) and which is \(y\)) we can also switch the variables around:

# get the correlation between columns Tmin and Tmean of dataframe weather
weather.Tmean.corr(weather.Tmin)
np.float64(0.9502479355058373)

… we just need one column specified outside the brackets and one inside

Correlation type#

We can set the type of correlation to Pearson or Spearman as appropriate (the default is Pearson’s \(r\)), Have a look above at the help for this function and see if you can find out where this argument is specified:

# get the correlation between columns Tmin and Tmax of dataframe weather
weather.Tmin.corr(weather.Tmax, method='spearman')
np.float64(0.8438796074978867)

Exercises#

Complete the following practice questions to help you get to grips with the syntax for correlation:

# Find the Spearman correlation between daily mean temperature Tmean, and rainfall
# answer = -0.019
# Find the correlation between year and daily mean temperature
# answer = 0.083

Think - are these strong or weak correlations? Is this surprising, did you expect something different?

All-by-all correlation matrix#

We can also quickly output all the correlations between all possible pairs of columns in a dataframe by simply not giving any column index (but we do have to tell it to only use the numeric columns):

weather.corr(numeric_only=True)
YYYY MM DD DD365 Tmax Tmin Tmean Trange Rainfall_mm
YYYY 1.000000 -0.003411 -0.000059 -0.003372 0.071631 0.089683 0.083044 -0.001257 0.008117
MM -0.003411 1.000000 0.010567 0.995580 0.179681 0.235401 0.213082 -0.018820 0.043672
DD -0.000059 0.010567 1.000000 0.092771 0.001217 0.002876 0.002035 -0.002055 0.005315
DD365 -0.003372 0.995580 0.092771 1.000000 0.177016 0.233248 0.210545 -0.020536 0.043925
Tmax 0.071631 0.179681 0.001217 0.177016 1.000000 0.841480 0.967881 0.593339 -0.008807
Tmin 0.089683 0.235401 0.002876 0.233248 0.841480 1.000000 0.950248 0.064379 0.086181
Tmean 0.083044 0.213082 0.002035 0.210545 0.967881 0.950248 1.000000 0.371965 0.035037
Trange -0.001257 -0.018820 -0.002055 -0.020536 0.593339 0.064379 0.371965 1.000000 -0.144654
Rainfall_mm 0.008117 0.043672 0.005315 0.043925 -0.008807 0.086181 0.035037 -0.144654 1.000000

The problem with this is that the figure we want is often buried amongst a lot of irrelevant correlation coefficients.

Correlation using scipy.stats#

Note that here we are using the pandas function df.corr()

Later in the course we will use two functions from a library called scipy.stats:

  • stats.pearsonr()

  • stats.spearmanr()

These are actually a bit more useful as they give us a \(p\)-value for the correlation as well as the correlation coefficient

  • don’t worry if you don’t know what a \(p\)-value is yet - this will be covered later in the course

scipy.stats will be used extensively in the module on hypothesis testing