1.12. Extra Practice#

This section is meant to help you practise the same core skills you developed in the previous exercises. Completing these exercises is optional; they are only here to provide a little extra practice if you want it.

1.12.1. Set up Python Libraries#

As usual, you will need to run this code block to import the relevant Python libraries.

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

1.12.2. Import the data#

We need some data to work with. We will import a dataset on student achievement in secondary education at two schools. The data attributes include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires.

Here’s a list and description of the variables:

  • school: student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)

  • sex: student’s sex (‘F’ - female or ‘M’ - male)

  • age: student’s age (integer 15-22)

  • Mjob: mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

  • Fjob: father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

  • traveltime: home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

  • studytime: weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

  • nursery: attended nursery school (binary: yes or no)

  • famrel: quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

  • freetime: free time after school (numeric: from 1 - very low to 5 - very high)

  • absences: number of school absences (numeric: from 0 to 93)

  • G1: first term grade (numeric: from 0 to 20)

  • G2: mid term grade (numeric: from 0 to 20)

  • G3: final term grade (numeric: from 0 to 20, output target)

student = pd.read_csv("https://raw.githubusercontent.com/SageBoettcher/StatsCourseBook_2026/main/data/student.csv")
display(student)
school sex age Mjob Fjob traveltime studytime nursery famrel freetime absences G1 G2 G3
0 GP F 18 at_home teacher 2 2 yes 4 3 6 5 6 6
1 GP F 17 at_home other 1 2 no 5 3 4 5 5 6
2 GP F 15 at_home other 1 2 yes 4 3 10 7 8 10
3 GP F 15 health services 1 3 yes 3 2 2 15 14 15
4 GP F 16 other other 1 2 yes 4 3 4 6 10 10
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
390 MS M 20 services services 1 2 yes 5 5 11 9 9 9
391 MS M 17 services services 2 1 no 2 4 3 14 16 16
392 MS M 21 other other 1 1 no 5 5 3 10 8 7
393 MS M 18 services other 3 1 no 4 4 0 11 12 10
394 MS M 19 other at_home 1 1 yes 3 2 5 8 9 9

395 rows × 14 columns

1.12.3. Part 1: Paired Data#

Let’s start by considering the paired nature of the data. Here we have a repeated measures design, where each student has 3 exam scores: G1 (first term grade), G2 (mid-term grade) and G3 (final grade).

In this case we might want to ask whether students’ performance in the final exam differed from their performance in the first term.

Things you need to decide:#

  • what is our null hypothesis?

  • what is our alternative hypothesis?

Is it a paired or unpaired test?

  • therefore which permutation_type is needed, samples, pairings or independent?

  • are we testing for the mean difference (within pairs) mean(x-y), or the difference of means mean(x)-mean(y)

  • what actually are x and y in this case?

Is it a one- or two-tailed test?

  • therefore which alternative hypothesis type is needed, two-sided, greater or less?

What \(\alpha\) value will you use?

  • what value must \(p\) be smaller than, to reject the null hypothesis?

  • this is the experimenter’s choice but usually 0.05 is used (sometimes 0.01 or 0.001)
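As a rough illustration of how these decisions map onto code, here is a minimal sketch of a paired permutation test with `scipy.stats.permutation_test`. The arrays `first` and `final` and the helper `mean_diff` are made up for illustration and are not taken from the student data; the choices of `permutation_type` and `alternative` shown are only one possibility, so substitute your own answers.

```python
import numpy as np
from scipy import stats

# Made-up paired scores for 8 hypothetical students (NOT the student data)
first = np.array([10, 12, 9, 14, 11, 8, 15, 13])
final = np.array([11, 12, 10, 16, 10, 9, 17, 14])

def mean_diff(x, y):
    # Mean difference within pairs: mean(x - y)
    return np.mean(x - y)

result = stats.permutation_test(
    (final, first),               # x and y, in that order
    mean_diff,
    permutation_type='samples',   # paired data: labels are swapped within each pair
    alternative='two-sided',      # or 'greater' / 'less' for a one-tailed test
    n_resamples=10000,
)

print("observed mean difference:", result.statistic)
print("p-value:", result.pvalue)
```

You would then compare `result.pvalue` against the \(\alpha\) value you chose before running the test.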

Your answers here

After you’ve made some decisions about the type of tests you’d like to run, you will want to start by plotting your data.

What plot makes sense for this dataset?

#Your Code Here

Maybe you already have a sense of whether or not there might be a difference in the scores.

What are the test statistic and the descriptive statistics?

#Your Code Here
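One possible way to get descriptive statistics and a paired test statistic is sketched below. To keep the example self-contained it uses only the G1 and G3 values from the first five rows displayed above, stored in a hypothetical dataframe called `toy`, so the numbers it produces say nothing about the full dataset.

```python
import pandas as pd

# G1 and G3 for the first five rows shown in the table above
# (`toy` is a hypothetical name; use the full `student` dataframe in your own answer)
toy = pd.DataFrame({'G1': [5, 5, 7, 15, 6],
                    'G3': [6, 6, 10, 15, 10]})

# Descriptive statistics for each exam separately
print(toy[['G1', 'G3']].describe())

# For paired data, the test statistic is the mean of the within-student differences
diff = toy['G3'] - toy['G1']
mean_diff = diff.mean()
print("mean difference (G3 - G1):", mean_diff)
```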

Now perform your permutation test

#Your code here

Report conclusion

Your answer here

Write up

Your answer here

1.12.4. Part 2: Unpaired Data#

We might want to next consider whether overall performance differs across different groups.

We may want to add an additional variable to our dataset that captures each student’s overall performance (i.e., the average of their performance across the three tests).

student['avgPerf'] = (student.G1 + student.G2 + student.G3)/3
student
school sex age Mjob Fjob traveltime studytime nursery famrel freetime absences G1 G2 G3 avgPerf
0 GP F 18 at_home teacher 2 2 yes 4 3 6 5 6 6 5.666667
1 GP F 17 at_home other 1 2 no 5 3 4 5 5 6 5.333333
2 GP F 15 at_home other 1 2 yes 4 3 10 7 8 10 8.333333
3 GP F 15 health services 1 3 yes 3 2 2 15 14 15 14.666667
4 GP F 16 other other 1 2 yes 4 3 4 6 10 10 8.666667
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
390 MS M 20 services services 1 2 yes 5 5 11 9 9 9 9.000000
391 MS M 17 services services 2 1 no 2 4 3 14 16 16 15.333333
392 MS M 21 other other 1 1 no 5 5 3 10 8 7 8.333333
393 MS M 18 services other 3 1 no 4 4 0 11 12 10 11.000000
394 MS M 19 other at_home 1 1 yes 3 2 5 8 9 9 8.666667

395 rows × 15 columns
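The same column could also be built with a row-wise mean. The sketch below checks that the two approaches agree on three rows copied from the table above; `toy` and `manual` are hypothetical names used only for this illustration. One difference worth knowing is that `mean(axis=1)` skips missing values by default, whereas adding the columns gives NaN if any grade is missing.

```python
import numpy as np
import pandas as pd

# Three rows of grades copied from the table above (rows 0, 3 and 392)
toy = pd.DataFrame({'G1': [5, 15, 10],
                    'G2': [6, 14, 8],
                    'G3': [6, 15, 7]})

# Row-wise mean across the three grade columns
toy['avgPerf'] = toy[['G1', 'G2', 'G3']].mean(axis=1)

# The manual version used above, for comparison
manual = (toy.G1 + toy.G2 + toy.G3) / 3

print(toy)
print("same result:", np.allclose(toy['avgPerf'], manual))
```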

There are several groupings whose average performance you might want to compare:

  • Does performance differ by school (GP & MS)?

  • Does performance differ by sex?

  • Does performance differ by whether the student attended nursery?

Decide which factor you would like to test and determine your hypotheses:

Things you need to decide:#

  • what is our null hypothesis?

  • what is our alternative hypothesis?

Is it a paired or unpaired test?

  • therefore which permutation_type is needed, samples, pairings or independent?

  • are we testing for the mean difference (within pairs) mean(x-y), or the difference of means mean(x)-mean(y)

  • what actually are x and y in this case?

Is it a one- or two-tailed test?

  • therefore which alternative hypothesis type is needed, two-sided, greater or less?

What \(\alpha\) value will you use?

  • what value must \(p\) be smaller than, to reject the null hypothesis?

  • this is the experimenter’s choice but usually 0.05 is used (sometimes 0.01 or 0.001)
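For two separate groups, the call to `scipy.stats.permutation_test` looks slightly different from the paired case. The sketch below uses made-up scores for two hypothetical groups (`group_a`, `group_b`) of unequal size and a helper `diff_of_means` invented for illustration; treat the settings as one possible set of answers rather than the intended ones.

```python
import numpy as np
from scipy import stats

# Made-up scores for two independent groups (NOT the student data)
group_a = np.array([12.0, 9.5, 14.0, 11.0, 10.5, 13.0])
group_b = np.array([10.0, 8.5, 11.5, 9.0, 12.0])

def diff_of_means(x, y):
    # Difference of group means: mean(x) - mean(y)
    return np.mean(x) - np.mean(y)

result = stats.permutation_test(
    (group_a, group_b),
    diff_of_means,
    permutation_type='independent',  # unpaired: observations are shuffled between groups
    alternative='two-sided',
    n_resamples=10000,
)

print("observed difference of means:", result.statistic)
print("p-value:", result.pvalue)
```

Because the groups are unpaired they do not need to be the same size, which is why the statistic is a difference of means rather than a mean of differences.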

After you’ve made some decisions about the type of tests you’d like to run, you will want to start by plotting your data.

What plot makes sense for this dataset?

#Your code here

Maybe you already have a sense of whether or not there might be a difference in the scores.

What are the test statistic and the descriptive statistics?

#Your code here
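If you chose to compare schools, one way to summarise each group is with `groupby`. This sketch uses only the ten rows displayed in the table above (five from each school), stored in a hypothetical dataframe `toy`, so the result is purely illustrative and will not match the full dataset.

```python
import pandas as pd

# The ten rows displayed above: first five (GP) and last five (MS)
toy = pd.DataFrame({
    'school': ['GP'] * 5 + ['MS'] * 5,
    'G1': [5, 5, 7, 15, 6, 9, 14, 10, 11, 8],
    'G2': [6, 5, 8, 14, 10, 9, 16, 8, 12, 9],
    'G3': [6, 6, 10, 15, 10, 9, 16, 7, 10, 9],
})
toy['avgPerf'] = (toy.G1 + toy.G2 + toy.G3) / 3

# Descriptive statistics per group
summary = toy.groupby('school')['avgPerf'].agg(['count', 'mean', 'std'])
print(summary)

# Test statistic for unpaired data: difference of the group means
observed = summary.loc['GP', 'mean'] - summary.loc['MS', 'mean']
print("difference of means (GP - MS):", observed)
```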

Now perform your permutation test

#Your code here

Report conclusion

Your answer here

Write up

Your answer here

1.12.5. Part 3: Correlation#

Finally, you might be interested in whether any of the variables in our dataset are related. Here you could consider how any of our numeric variables (traveltime, famrel, freetime, absences, G1, G2, G3, avgPerf, etc.) might relate to one another.

Decide which relationship you’d like to test:

Things you need to decide:#

  • what is our null hypothesis?

  • what is our alternative hypothesis?

Is it a paired or unpaired test?

  • therefore which permutation_type is needed, samples, pairings or independent?

  • are we testing for the mean difference (within pairs) mean(x-y), or the difference of means mean(x)-mean(y)

  • what actually are x and y in this case?

Is it a one- or two-tailed test?

  • therefore which alternative hypothesis type is needed, two-sided, greater or less?

What \(\alpha\) value will you use?

  • what value must \(p\) be smaller than, to reject the null hypothesis?

  • this is the experimenter’s choice but usually 0.05 is used (sometimes 0.01 or 0.001)
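For a relationship between two variables, the test statistic is a correlation rather than a difference of means, and the permutation scheme shuffles which x goes with which y. The sketch below uses made-up values and a hypothetical helper `correlate` built on `np.corrcoef` (Pearson's r); whether Pearson's or a rank-based correlation is more appropriate for your chosen variables is left for you to decide.

```python
import numpy as np
from scipy import stats

# Made-up paired measurements on 10 hypothetical students (NOT the student data)
x = np.array([0, 2, 4, 6, 3, 10, 1, 8, 5, 7])
y = np.array([15, 14, 12, 10, 13, 6, 16, 9, 11, 8])

def correlate(x, y):
    # Pearson correlation between x and y
    return np.corrcoef(x, y)[0, 1]

result = stats.permutation_test(
    (x, y),
    correlate,
    permutation_type='pairings',  # shuffle which x value is paired with which y value
    alternative='two-sided',
    n_resamples=10000,
)

print("observed correlation:", result.statistic)
print("p-value:", result.pvalue)
```

With this many observations the permutations are sampled at random, so the p-value can vary slightly from run to run.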

After you’ve made some decisions about the type of tests you’d like to run, you will want to start by plotting your data.

What plot makes sense for this dataset?

#Your code here

Maybe you already have a sense of whether or not there might be a relationship between the variables.

What are the test statistic and the descriptive statistics?

#Your code here

Now perform your permutation test

#Your code here

Report conclusion

Your answer here

Write up

Your answer here