1.6. Python Functions
This week we’ll need to write our own Python functions as part of running a permutation test. Before we do that, we will review what functions are and how to create them in Python.
This is a quick Python tangent to our main statistics objective for the week and is meant to give you the minimal, practical knowledge you need to implement permutation tests cleanly.
1.6.1. Set up Python libraries
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings
warnings.simplefilter('ignore', category=FutureWarning)
1.6.2. Import the data
We need some data to work with. Let’s use the good old Oxford Weather dataset.
weather = pd.read_csv("https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/OxfordWeather.csv")
display(weather)
| | YYYY | Month | MM | DD | DD365 | Tmax | Tmin | Tmean | Trange | Rainfall_mm |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1827 | Jan | 1 | 1 | 1 | 8.3 | 5.6 | 7.0 | 2.7 | 0.0 |
| 1 | 1827 | Jan | 1 | 2 | 2 | 2.2 | 0.0 | 1.1 | 2.2 | 0.0 |
| 2 | 1827 | Jan | 1 | 3 | 3 | -2.2 | -8.3 | -5.3 | 6.1 | 9.7 |
| 3 | 1827 | Jan | 1 | 4 | 4 | -1.7 | -7.8 | -4.8 | 6.1 | 0.0 |
| 4 | 1827 | Jan | 1 | 5 | 5 | 0.0 | -10.6 | -5.3 | 10.6 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71338 | 2022 | Apr | 4 | 26 | 116 | 15.2 | 4.1 | 9.7 | 11.1 | 0.0 |
| 71339 | 2022 | Apr | 4 | 27 | 117 | 10.7 | 2.6 | 6.7 | 8.1 | 0.0 |
| 71340 | 2022 | Apr | 4 | 28 | 118 | 12.7 | 3.9 | 8.3 | 8.8 | 0.0 |
| 71341 | 2022 | Apr | 4 | 29 | 119 | 11.7 | 6.7 | 9.2 | 5.0 | 0.0 |
| 71342 | 2022 | Apr | 4 | 30 | 120 | 17.6 | 1.0 | 9.3 | 16.6 | 0.0 |
71343 rows × 10 columns
1.6.3. What is a function?
A function is a named block of code that performs a single, well-defined task. Specifically, a function takes in some information (an input), does something to it, and then returns the final output. Once you have written a function, you can reuse it many times with different inputs. This helps to keep your notebook tidy and makes your analysis easier to read and debug.
Functions were introduced in DataCamp, and you may find it helpful to review that material after reading this section.
We’ve actually been using Python functions for several weeks already. Most of these have come from the libraries pandas, seaborn, or numpy. As an example, the function df.mean() calculates the mean of each column in a dataframe.
Functions in Python code can be easy to spot because the function name is followed by a pair of parentheses (). These parentheses might be empty, or they might contain one or more input parameters that the function needs in order to do its job.
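To make the point about parentheses concrete, here is a minimal sketch using a tiny made-up dataframe (the name `toy` and its values are assumptions for illustration, not the Oxford weather data). It calls one function with empty parentheses and one with inputs.

```python
import pandas as pd

# Toy dataframe with made-up numbers (not the Oxford weather data)
toy = pd.DataFrame({"Tmax": [8.0, 2.0, -2.0], "Rainfall_mm": [0.0, 0.0, 9.0]})

# Empty parentheses: mean() runs with its default settings
# and returns the mean of each column
col_means = toy.mean()

# Parentheses with inputs: round() is given a number and
# the number of decimal places to keep
rounded = round(toy["Rainfall_mm"].mean(), 1)

print(col_means)
print(rounded)
```

In both cases the parentheses are what tell Python to actually run the function.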
To understand how functions work, let’s create our own simple (potentially redundant) function. Here we will create a function that calculates the mean of a single column in a dataframe.
def myMean(x):
    m = sum(x)/len(x)
    return m
This simple function has three parts:
Function definition: The keyword def tells Python you are defining a function. The name of the function is myMean, and x is the input the function expects.
Function body: This tells Python what to do with the input. Here, we calculate the mean by dividing the sum of all values in x by the number of values in x.
Return statement: The return keyword specifies what the function should output. When myMean is called, it will send back the value of m (which was calculated on the basis of x).
You may have noticed that when you ran the code block above, nothing seemed to happen. That’s because we only defined the function: we told Python how to do something, but we didn’t actually ask it to do it yet.
Once a function is created, we can run (or call) it by using its name followed by parentheses, and putting the input inside the parentheses. For example:
myMean(weather.Rainfall_mm)
1.7869643833314295
What happened?
We called the function by writing myMean(...)
We gave it an input (inside the brackets): weather.Rainfall_mm
The function calculated the mean by adding up the values in the input column and dividing by the number of values (the length of the column)
The function returned an output (shown below the code box) of ~1.79 mm, which is the mean daily rainfall
Let’s double-check this using the built-in pandas function we’re familiar with, df.mean().
print(weather.Rainfall_mm.mean())
1.7869643833312312
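Yep, same answer. (The two values differ only in the last few decimal places, which is down to floating-point rounding.)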
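If you want Python to confirm that two such results agree, rather than comparing the digits by eye, one option is `np.isclose`, which treats numbers as equal when they differ only by a tiny rounding error. The sketch below uses a short made-up Series (`values` is a hypothetical name) in place of the rainfall column.

```python
import numpy as np
import pandas as pd

def myMean(x):
    m = sum(x) / len(x)
    return m

# Made-up values standing in for a dataframe column
values = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])

a = myMean(values)   # our own function
b = values.mean()    # the built-in pandas method

# True if a and b agree to within a small tolerance
same = np.isclose(a, b)
print(a, b, same)
```

Testing with `a == b` instead could report a mismatch even though the two calculations agree for all practical purposes.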
Note: You have to run the code block defining the function before you can call it, otherwise it won’t have been created and won’t exist!
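As a small illustration of this note, the sketch below calls a function that has deliberately never been defined (`notYetDefined` is a made-up name) and catches the resulting error so that the message can be printed.

```python
# notYetDefined is a hypothetical name that we deliberately never define
try:
    result = notYetDefined([1, 2, 3])
except NameError as err:
    message = str(err)
    print("Python says:", message)
```

If you see a NameError mentioning one of your own functions, the usual fix is to go back and run the code block that contains its def statement.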
1.6.4. Difference of means
As another example, let’s define a function that takes two inputs and returns the difference between their means. This is the kind of calculation we’ll use later when we compare two groups (for example, height of men vs. women, etc.).
def dMeans(x,y):
    mx = sum(x)/len(x)
    my = sum(y)/len(y)
    diff = mx-my
    return diff
Note that this function now has two inputs: x and y.
The function does the following:
calculate the mean for x as mx
calculate the mean for y as my
calculate the difference between mx and my
return that difference (diff)
Let’s use it to calculate the difference in mean rainfall between November and May. We start by selecting the relevant rows and column from the dataframe and giving them clear names, and then pass those into the function we just defined:
# find the relevant rows and column in the dataframe and give them a name
nov = weather.query('Month == "Nov"').Rainfall_mm
may = weather.query('Month == "May"').Rainfall_mm
# run the function dMeans
dMeans(nov,may)
# note: we could have done the same thing in a single line (see below), but the version above is arguably easier to read
# dMeans(weather.query('Month == "Nov"').Rainfall_mm, weather.query('Month == "May"').Rainfall_mm)
0.37674993107251487
Apparently it rains more in November than in May, which is unsurprising; the mean daily rainfall is about 0.38 mm greater in November.
Note that which input (nov or may) gets treated as x or y inside the function is determined entirely by the order in which we pass them into the function call.
In the function definition we have:
def dMeans(x,y):
meaning that when we call the function, whatever is first in the brackets becomes x and whatever is second becomes y. So when we call
dMeans(nov,may)
nov becomes x and may becomes y.
The function returns mean(x) - mean(y), so this is mean rainfall in November minus mean rainfall in May; a positive output means that there was more rain in November than in May.
If we called dMeans(may,nov) we would get May minus November instead: the same number with the opposite sign, which here is negative because rainfall is higher in November.
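The effect of argument order can be checked on a small example. The sketch below uses two made-up lists (`wet` and `dry` are hypothetical names) so that the means are easy to verify by hand. It also shows that, as an alternative to relying on position, Python lets you name the inputs explicitly when calling a function.

```python
def dMeans(x, y):
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    diff = mx - my
    return diff

# Made-up values chosen so the means are easy to check
wet = [3.0, 5.0, 4.0]   # mean is 4.0
dry = [1.0, 2.0, 3.0]   # mean is 2.0

forward = dMeans(wet, dry)        # wet becomes x, dry becomes y
backward = dMeans(dry, wet)       # dry becomes x, wet becomes y
by_name = dMeans(y=dry, x=wet)    # inputs matched by name, so order no longer matters

print(forward)    # 2.0
print(backward)   # -2.0
print(by_name)    # 2.0
```

Swapping the inputs flips the sign of the result but leaves its size unchanged.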
1.6.5. Mean difference
Finally, let’s define a function that takes two matched (paired) inputs and finds the mean of the within-pair differences (e.g., heights of brothers vs. their sisters).
In our weather dataset, suppose we want to know whether the weather was warmer in 2001 than in 1901.
Instead of simply comparing the overall average temperature for all 365 days in 1901 to the overall average for all 365 days in 2001, we can make a more precise comparison by matching each day with the same date in the other year. That is, we can compute the temperature difference for:
Jan 1st 1901 vs. Jan 1st 2001
Jan 2nd 1901 vs. Jan 2nd 2001
…and so on, for all 365 days
Naturally this will only work if the data are actually matched and hence the two samples have the same \(n\).
def mDiff(x,y):
    diff = x-y
    meanDiff = sum(diff)/len(diff)
    return meanDiff
# Note: convert the columns to NumPy arrays (np.asarray) first, or else you will get a value of nan.
# Otherwise we would be subtracting two pandas Series with different indices, which pandas aligns by index label rather than by position
temp_1901 = np.asarray(weather.query('YYYY == 1901').Tmean)
temp_2001 = np.asarray(weather.query('YYYY == 2001').Tmean)
print(mDiff(temp_1901, temp_2001))
print(dMeans(temp_1901, temp_2001))
-1.2043835616438354
-1.2043835616438425
Eagle-eyed readers may notice that, when applied to the same data, these two functions produce the same numerical result. However, we will see later in this chapter that once we start randomly shuffling the data (as in a permutation test), the difference of means and the mean difference behave quite differently.
The two approaches answer subtly different questions, and their behaviour under permutation becomes an important part of interpreting the test correctly.
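Returning to the note about np.asarray in the mDiff example, the sketch below reproduces the nan problem on a small scale. The two Series hold made-up values, and their index labels are chosen (as an assumption for illustration) to mimic rows taken from different parts of a dataframe.

```python
import numpy as np
import pandas as pd

def mDiff(x, y):
    diff = x - y
    meanDiff = sum(diff) / len(diff)
    return meanDiff

# Made-up values; the two Series share no index labels
year_a = pd.Series([5.0, 6.0, 7.0], index=[0, 1, 2])
year_b = pd.Series([6.0, 8.0, 7.5], index=[100, 101, 102])

# pandas pairs values by index label; no labels match, so every difference is nan
misaligned = mDiff(year_a, year_b)

# NumPy arrays have no index, so values are paired by position
by_position = mDiff(np.asarray(year_a), np.asarray(year_b))

print(misaligned)
print(by_position)
```

Converting to arrays is one way around this; the important point is that the first value of x must be paired with the first value of y, the second with the second, and so on.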