3.4. Sampling with and without replacement

This notebook introduces the idea of sampling and the pandas function df.sample().

When we sample from a population or parent distribution, we can do so with or without replacement.

Sampling without replacement is what we usually do when running an experiment or survey. For example, imagine we give 100 students a wellbeing questionnaire. Each student in our sample is a member of a larger population or parent distribution (for example, all students, or all students in the college we sampled from). Because we only want to survey each person once, we sample without replacement: once a student has been selected, they cannot be selected again.

Sampling with replacement is often a good way to model random events. A classic example is rolling a die. Each roll yields an outcome (1-6), but if you roll a 3 on the first round, this does not remove the possibility of rolling another 3 the next time. Each roll is an independent sample, meaning the die is effectively “reset” before each roll.
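To make the die example concrete, here is a minimal sketch using Python's standard random module (not one of the libraries imported later in this notebook). The variable name rolls is just for illustration.

```python
import random

# Roll a six-sided die ten times. Every roll can be any value from 1 to 6,
# whatever came before, which is what sampling with replacement means.
rolls = [random.randint(1, 6) for _ in range(10)]
print(rolls)
```

If you run this a few times you should see that repeated values are common: rolling a 3 never "uses up" the 3.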

A direct comparison between the two types of sampling is drawing cards from a deck:

  • Without replacement, each card once drawn is set aside, so it is impossible to draw the same card twice.

  • With replacement, each card is tucked back into the deck after being drawn, so it can be drawn again.
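The card comparison can be tried out directly. The sketch below is only an illustration: it uses Python's standard random module, and the deck and variable names are made up for this example.

```python
import random

# Build a standard 52-card deck (names are illustrative only)
ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = [f'{rank} of {suit}' for suit in suits for rank in ranks]

# Without replacement: random.sample never returns the same card twice
hand_without = random.sample(deck, k=5)

# With replacement: random.choices behaves as if each card were put back
# before the next draw, so the same card can appear more than once
hand_with = random.choices(deck, k=5)

print(hand_without)
print(hand_with)
```

With only 5 draws from 52 cards, repeats in the second hand are possible but not guaranteed, so you may need to run the cell several times to see one.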

3.4.1. Set up Python libraries

As usual, run the code cell below to import the relevant Python libraries.

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
import warnings 
warnings.simplefilter('ignore', category=FutureWarning)

3.4.2. Toy example

Let’s explore the idea of sampling with and without replacement using a very simple example.

Note: a simple example designed just to illustrate a point is sometimes called a toy example.

Say I have a dataset listing four pets belonging to four different children:

[cat, dog, cat, rabbit]

Let’s call this our “population”. If I sample from this dataset, I get a new list of pets. For example, if I draw a sample of size \(n=2\), my sample might be [cat, cat].

Without replacement

If I sample without replacement, after I have ‘drawn’ my first sample pet from the original dataset, I cannot draw it again; my next sample pet will be drawn from the remaining three. So if I first draw one of the cats, I remove it from the original list before I sample again, leaving me with only [dog, cat, rabbit] to choose from.

The consequence of this is that every sample of size \(n=4\) contains all four of the original pets, possibly in a different order:

[cat, cat, dog, rabbit]

[rabbit, cat, dog, cat]

[cat, dog, rabbit, cat]

etc
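We can check this claim with a short sketch. It uses Python's standard random.sample (which samples without replacement) rather than pandas, and the variable names are just for illustration.

```python
import random

population = ['cat', 'dog', 'cat', 'rabbit']

# Draw three samples of size 4 without replacement
samples = [random.sample(population, k=4) for _ in range(3)]

for s in samples:
    # Sorting ignores the order, so this comparison shows that every
    # sample contains exactly the same four pets as the population
    print(s, sorted(s) == sorted(population))
```

Each line should end in True: only the order of the pets changes from sample to sample.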

With replacement

If I sample with replacement, each ‘draw’ can be any of the four animals. Remember, it’s like pulling a card from a deck, checking which animal is on it, and then replacing the card in the deck before the next sample is drawn.

So I could get:

[cat, cat, cat, cat]

or, more likely, a mixture of animals such as:

[cat, dog, cat, cat]

[rabbit, dog, cat, rabbit]

… etc
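How unusual is an all-cat sample? Two of the four pets are cats, so each draw is a cat with probability 2/4, and four cats in a row has probability \((2/4)^4 = 1/16\). The sketch below works this out and checks it by simulation, using Python's standard random.choices (which samples with replacement). The number of repetitions is an arbitrary choice.

```python
import random

population = ['cat', 'dog', 'cat', 'rabbit']

# Exact probability that all four draws are cats: (2/4) ** 4
p_cat = population.count('cat') / len(population)
p_all_cats = p_cat ** 4
print(p_all_cats)  # 0.0625

# Check by simulation: draw many samples of size 4 with replacement
n_reps = 10_000
all_cat_count = 0
for _ in range(n_reps):
    sample = random.choices(population, k=4)
    if sample.count('cat') == 4:
        all_cat_count += 1

# The simulated proportion should be close to (not exactly) 0.0625
print(all_cat_count / n_reps)
```

So roughly 1 sample in 16 is all cats, and the remaining samples contain at least one dog or rabbit.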

3.4.3. Sampling from a Pandas dataframe

Pandas has a handy built-in sampling function called df.sample().

Let’s see it at work. First, load a simple dataframe.

pets = pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/pets.csv')
pets
Child Pet
0 Anna cat
1 Betty cat
2 Charley cat
3 David dog
4 Egbert cat
5 Freddie rabbit
6 Georgia dog
7 Henrietta cat

The .sample() function takes several arguments; the two we need here are n (the sample size) and replace (whether to sample with replacement).

# draw a sample of size n=3 without replacement
pets.sample(3, replace=False)
Child Pet
2 Charley cat
5 Freddie rabbit
6 Georgia dog

# draw a sample of size n=12 with replacement
pets.sample(12, replace=True)
Child Pet
3 David dog
2 Charley cat
6 Georgia dog
3 David dog
0 Anna cat
0 Anna cat
1 Betty cat
4 Egbert cat
0 Anna cat
0 Anna cat
6 Georgia dog
6 Georgia dog

Try running each of the cells above several times to confirm that the sampling is indeed random.
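Two further details of df.sample() are worth trying out. The sketch below rebuilds the pets dataframe by hand (copied from the table above) so that it runs on its own; the seed value 1 and the variable names are arbitrary choices for this illustration.

```python
import pandas as pd

# Rebuild the pets dataframe shown above so this cell runs on its own
pets = pd.DataFrame({
    'Child': ['Anna', 'Betty', 'Charley', 'David',
              'Egbert', 'Freddie', 'Georgia', 'Henrietta'],
    'Pet': ['cat', 'cat', 'cat', 'dog', 'cat', 'rabbit', 'dog', 'cat'],
})

# 1. Fixing random_state makes the "random" sample repeatable
first = pets.sample(3, replace=False, random_state=1)
second = pets.sample(3, replace=False, random_state=1)
print(first.equals(second))  # True: same seed, same sample

# 2. Without replacement we cannot draw more rows than the dataframe has
try:
    pets.sample(12, replace=False)
    too_large_failed = False
except ValueError as err:
    too_large_failed = True
    print('ValueError:', err)
```

This is why the sample of size 12 above was only possible with replace=True: the dataframe has just 8 rows.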

Summarizing samples

Often we are not interested in the exact contents of the sample, but in some summary value, for example how many cats it contains.

# Draw a new sample, keep only the 'Pet' column, and save it in a new variable
pets_sample = pets.sample(8, replace=True).Pet
print(pets_sample)
5    rabbit
6       dog
4       cat
6       dog
6       dog
6       dog
1       cat
3       dog
Name: Pet, dtype: object

# Count the cats in the sample drawn above
print((pets_sample == 'cat').sum())
2
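A single count like this varies from sample to sample. One way to see how much is to repeat the sampling many times and tabulate the counts. The sketch below is an illustration only: it rebuilds the pets dataframe by hand from the table above so that it runs on its own, and the number of repetitions (1000) is an arbitrary choice.

```python
import pandas as pd

# Rebuild the pets dataframe shown above so this cell runs on its own
pets = pd.DataFrame({
    'Child': ['Anna', 'Betty', 'Charley', 'David',
              'Egbert', 'Freddie', 'Georgia', 'Henrietta'],
    'Pet': ['cat', 'cat', 'cat', 'dog', 'cat', 'rabbit', 'dog', 'cat'],
})

# Draw 1000 samples of size 8 with replacement and count the cats in each
n_samples = 1000
cat_counts = pd.Series([
    (pets.sample(8, replace=True)['Pet'] == 'cat').sum()
    for _ in range(n_samples)
])

# How often did each number of cats occur?
print(cat_counts.value_counts().sort_index())

# Average number of cats per sample
print(cat_counts.mean())
```

Because 5 of the 8 pets in the dataframe are cats, the average should come out close to 5, although individual samples can contain anything from 0 to 8 cats.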