3.4. Neutralizing bad datapoints#

Once you have found dummy values (such as 9999), bad datapoints or outliers, you will want to remove their influence from your analysis.

To do this, you have three options:

  1. Replace the bad values with a dummy value (such as NaN)

    • This retains as much information as possible

  2. Replace the whole record (row of the dataframe) with a dummy value (such as NaN)

    • Useful if the bad value for one variable casts doubt on the data quality for the whole record, or for some reason you need to retain only complete records

  3. Delete the whole record (row of the dataframe)

    • Generally not recommended, as it could be seen as dishonest (see below)

    • Exception would be obvious duplicate records or completely blank records

Most often, the first option is best, as it retains the most information possible without distorting your data. To understand how this method works, please watch this short video introducing the NaN (Not a Number) placeholder:

%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/nxhUlpsuj10?si=jUq6W-NtbNoVv0Ev" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

3.4.1. Set up Python Libraries#

As usual you will need to run this code block to import the relevant Python libraries

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

3.4.2. Import a dataset to work with#

The data will be automatically loaded from the internet when you run this code block:

hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')
display(hospital)
CHARGES LOS AGE SEX DRG DIED
0 4752.00 10 79.0 F 122.0 0.0
1 3941.00 6 34.0 F 122.0 0.0
2 3657.00 5 76.0 F 122.0 0.0
3 1481.00 2 80.0 F 122.0 0.0
4 1681.00 1 55.0 M 122.0 0.0
... ... ... ... ... ... ...
12839 22603.57 14 79.0 F 121.0 0.0
12840 NaN 7 91.0 F 121.0 0.0
12841 14359.14 9 79.0 F 121.0 0.0
12842 12986.00 5 70.0 M 121.0 0.0
12843 NaN 1 81.0 M 123.0 1.0

12844 rows × 6 columns

3.4.3. 1. Replace the bad values with NaN#

In most cases the best option is to replace only the bad values, while keeping the rest of the record intact.

We’ll replace these bad values with NaN which is a special placeholder that indicates missing or invalid data. Most Python and Pandas functions automatically ignore NaN values when performing calculations. For example, if we compute the mean of a column that contains NaNs, Pandas will simply calculate the mean of the non-NaN values.
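As a minimal illustration of this behaviour (using made-up toy numbers, not the hospital data), pandas skips NaN values when computing summary statistics:

```python
import numpy as np
import pandas as pd

# a toy Series with one missing value
s = pd.Series([10.0, 20.0, np.nan, 30.0])

# .mean() ignores the NaN: (10 + 20 + 30) / 3 = 20
print(s.mean())

# .count() also counts only the non-NaN values
print(s.count())
```

So the NaN entry is neither treated as zero nor as an error; it is simply left out of the calculation.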

I. Replace a dummy value with df.replace()#

If all the bad datapoints have the same value, for example 9999, we can easily replace them using the following syntax. Here you are setting the column in your dataframe to be the exact same column, but with the values replaced.

  • df.ColName = df.ColName.replace(old_value, new_value)

# note make sure you replace the original column!

hospital.LOS = hospital.LOS.replace(9999, np.nan) 
hospital.AGE = hospital.AGE.replace(9999, np.nan) 

# check they have gone - max values for LOS and AGE should no longer be 9999
hospital.describe()
CHARGES LOS AGE DRG DIED
count 12145.000000 12843.000000 12840.000000 12841.000000 12841.000000
mean 9879.087615 7.567858 66.288162 121.690523 0.109805
std 6558.399650 5.114357 13.654237 0.658289 0.312658
min 3.000000 0.000000 20.000000 121.000000 0.000000
25% 5422.200000 4.000000 57.000000 121.000000 0.000000
50% 8445.000000 7.000000 67.000000 122.000000 0.000000
75% 12569.040000 10.000000 77.000000 122.000000 0.000000
max 47910.120000 38.000000 103.000000 123.000000 1.000000

II. Replace values over a cutoff with df.loc[]#

In some cases we want to replace a range of values rather than a specific value. Let’s say for example we decided to remove charges over 30000 dollars.

To do this we will now learn slightly different syntax compared to our usual df.query(). This is because we are now going to set values in the dataframe rather than simply selecting them. For this we will use df.loc[], which accesses specific locations in the dataframe defined by row and column labels, and then we can use = to assign a new value.

  • df.loc[row_index, column_index] = new_value

As an example, here is how to replace the value of SEX in the first row with 'bananas':

hospital.loc[0,'SEX']='bananas' # remember row zero is the first row in Python!
hospital.head() # check it worked
CHARGES LOS AGE SEX DRG DIED
0 4752.0 10.0 79.0 bananas 122.0 0.0
1 3941.0 6.0 34.0 F 122.0 0.0
2 3657.0 5.0 76.0 F 122.0 0.0
3 1481.0 2.0 80.0 F 122.0 0.0
4 1681.0 1.0 55.0 M 122.0 0.0

Now, instead of giving a row number, I can use a condition or a logical criterion to identify which rows we want to edit. This will automatically select the rows where the condition is True.

For example, let's replace the value of CHARGES with NaN for all patients where the charge is greater than 30000:

hospital.loc[(hospital.CHARGES > 30000),'CHARGES']=np.nan # set all CHARGES over 30000 to NaN
hospital.describe() # check it worked
CHARGES LOS AGE DRG DIED
count 11928.000000 12843.000000 12840.000000 12841.000000 12841.000000
mean 9402.119925 7.567858 66.288162 121.690523 0.109805
std 5537.499427 5.114357 13.654237 0.658289 0.312658
min 3.000000 0.000000 20.000000 121.000000 0.000000
25% 5369.150000 4.000000 57.000000 121.000000 0.000000
50% 8329.775000 7.000000 67.000000 122.000000 0.000000
75% 12278.375000 10.000000 77.000000 122.000000 0.000000
max 30000.000000 38.000000 103.000000 123.000000 1.000000

Note the maximum value of CHARGES should now be 30000 dollars

2. Replace the whole record#

Occasionally we decide a whole record or row of the data table (often corresponding to an individual) should be replaced with NaN.

For example, we may decide to take this approach when:

  • We realise after the fact that a particular participant didn’t meet inclusion criteria for our study (e.g., we are studying unmedicated patients with a certain condition, and they disclosed after data collection that they are already on medication)

  • They have bad values for several variables

  • We wish to only retain records that are complete (e.g., a participant left many questions on a survey blank)

We can replace the whole record with NaN using df.loc[] and simply not specifying a column. For example, let’s make the whole second row of the dataframe become NaN:

hospital.loc[1] = np.nan # remember we count from zero in Python so second row = row 1 !
hospital.head() # check it worked
CHARGES LOS AGE SEX DRG DIED
0 4752.0 10.0 79.0 bananas 122.0 0.0
1 NaN NaN NaN NaN NaN NaN
2 3657.0 5.0 76.0 F 122.0 0.0
3 1481.0 2.0 80.0 F 122.0 0.0
4 1681.0 1.0 55.0 M 122.0 0.0

Or say we want to replace the whole record for anyone with CHARGES under 100 dollars with NaN:

hospital.loc[(hospital.CHARGES<100)] = np.nan 
hospital.describe() # check it worked - min charge should now be >= $100
CHARGES LOS AGE DRG DIED
count 11919.000000 12834.000000 12831.000000 12832.000000 12832.000000
mean 9408.858857 7.568568 66.294053 121.689994 0.109570
std 5534.051880 5.113899 13.651050 0.658085 0.312365
min 101.000000 0.000000 20.000000 121.000000 0.000000
25% 5374.960000 4.000000 57.000000 121.000000 0.000000
50% 8334.000000 7.000000 67.000000 122.000000 0.000000
75% 12285.550000 10.000000 77.000000 122.000000 0.000000
max 30000.000000 38.000000 103.000000 123.000000 1.000000

3. Drop the record from the dataframe#

NOTE: THIS IS GENERALLY NOT RECOMMENDED - in general we replace missing data with NaN so that it is transparent how many bad datapoints were collected.

Imagine if we conducted a survey including a question “have you ever committed a crime”. Perhaps some people would not answer this question (and indeed this may be more likely if the answer is ‘yes’).

If we deleted all the records for people who skipped the question, this could be misleading.

However, sometimes we need to just remove rows. An example would be if a pile of paper forms were automatically scanned, and it turned out a lot of them were completely blank (spare forms). We would want to delete the records corresponding to those spare forms.

Another example would be if a record was an exact duplicate of another (if the same form was scanned twice, for example).

To remove records (rows) we use df.drop().

  • df.drop(row_index)

As an example, perhaps we want to remove row 0 (the first row), as we suspect it is a prank entry (SEX == ‘bananas’):

hospital.drop(0, inplace=True)
hospital # check it's gone
CHARGES LOS AGE SEX DRG DIED
1 NaN NaN NaN NaN NaN NaN
2 3657.00 5.0 76.0 F 122.0 0.0
3 1481.00 2.0 80.0 F 122.0 0.0
4 1681.00 1.0 55.0 M 122.0 0.0
5 6378.64 9.0 84.0 M 121.0 0.0
... ... ... ... ... ... ...
12839 22603.57 14.0 79.0 F 121.0 0.0
12840 NaN 7.0 91.0 F 121.0 0.0
12841 14359.14 9.0 79.0 F 121.0 0.0
12842 12986.00 5.0 70.0 M 121.0 0.0
12843 NaN 1.0 81.0 M 123.0 1.0

12843 rows × 6 columns
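A side note on the `inplace=True` argument used above: like `replace()`, `drop()` returns a new dataframe by default, so you either reassign the result or pass `inplace=True` to modify the original directly. A toy example (made-up data):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

# drop() returns a new dataframe by default...
df2 = df.drop(0)
print(len(df))    # 3 - the original is unchanged
print(len(df2))   # 2

# ...while inplace=True modifies the original directly
df.drop(0, inplace=True)
print(len(df))    # 2
```

Note also that the remaining rows keep their original index labels (the output above starts at row 1), which is why `df.drop(row_index)` uses labels rather than positions.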

Exercises#

Complete the following exercises to get used to the syntax for replacing values. It’s boring but if you master it now it will save some stress later on.

# 1. For all cases where AGE is greater than 120, replace the value of AGE with `NaN`
# Reload the data:
hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')

# YOUR CODE HERE

# uncomment the next line to check it worked
#hospital.sort_values(by='AGE')
# 2. For all cases where AGE is greater than 120, replace the entire record with `NaN`
# Reload the data:
hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')

# YOUR CODE HERE

# uncomment the next line to check it worked
#hospital.sort_values(by='AGE')
# 3. For all values 9999 in all columns, replace the 9999 with NaN
# Reload the data:
hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')

# YOUR CODE HERE

# uncomment the next line to check it worked
# hospital.sort_values(by='LOS')
# 4. Replace the code DRG = 123 with the string 'DIED'
# Reload the data:
hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')

# YOUR CODE HERE

# uncomment the next line to check it worked
# hospital