Handling NaNs

3.5. Handling NaNs

NaN (Not a Number) is a special value used to represent missing data in many scientific programming languages, including Python.

Using NaN instead of a numerical dummy value (like 9999 or 0) is helpful because most Python functions either ignore NaNs by default, or can be set to ignore NaNs using an optional function argument. This is useful, for example, if you are trying to calculate the mean of a column: the mean() function would include a dummy value of 9999 in its calculation, but would ignore NaN.

In this section we will review:

Why NaN is better than a numerical dummy value
How to check for NaNs in a dataframe
Setting the NaN-handling in Python functions

Set up Python Libraries

As usual, you will need to run this code block to import the relevant Python libraries.

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

3.5.1. Import a dataset to work with

We again work with the NYC heart attack dataset.

The data will be automatically loaded from the internet when you run this code block:

hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')
display(hospital)

	CHARGES	LOS	AGE	SEX	DRG	DIED
0	4752.00	10	79.0	F	122.0	0.0
1	3941.00	6	34.0	F	122.0	0.0
2	3657.00	5	76.0	F	122.0	0.0
3	1481.00	2	80.0	F	122.0	0.0
4	1681.00	1	55.0	M	122.0	0.0
...	...	...	...	...	...	...
12839	22603.57	14	79.0	F	121.0	0.0
12840	NaN	7	91.0	F	121.0	0.0
12841	14359.14	9	79.0	F	121.0	0.0
12842	12986.00	5	70.0	M	121.0	0.0
12843	NaN	1	81.0	M	123.0	1.0

12844 rows × 6 columns

3.5.2. `NaN` is not a number!

Humans may recognise dummy values like 9999 for what they are, but the computer will treat them as numbers.

Say we want to find the mean and standard deviation of the age of patients in our hospital dataset (remembering that missing data were coded as 9999):

print(hospital.AGE.mean())
print(hospital.AGE.std())

67.83507241862638
124.700553618833

Is the value for standard deviation realistic?

These values include the 9999s just as if there were really people 9999 years old in the sample. This is why the standard deviation (a measure of how spread out the data are) is implausibly large.

If we replace the 9999s with NaN we get the correct mean and standard deviation for the data set, excluding the missing data. The mean has changed slightly, and the standard deviation is now much more reasonable.

hospital.AGE = hospital.AGE.replace(9999, np.nan)
print(hospital.AGE.mean())
print(hospital.AGE.std())

66.28816199376946
13.654236726825275
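The same effect can be checked by hand on a tiny example. The sketch below uses made-up ages (not taken from the hospital dataset) so that the arithmetic is easy to follow: the dummy value inflates the mean until it is recoded as NaN.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): three real ages and one 9999 dummy value
ages = pd.Series([20, 30, 40, 9999])

# The dummy value is treated as a real age and drags the mean upwards
mean_with_dummy = ages.mean()

# Recode the dummy value as NaN; mean() then skips it by default
ages_clean = ages.replace(9999, np.nan)
mean_without_dummy = ages_clean.mean()

print(mean_with_dummy)     # 2522.25
print(mean_without_dummy)  # 30.0
```

With the dummy value included, the mean is (20 + 30 + 40 + 9999) / 4; once it is recoded as NaN, the mean is taken over the three real ages only.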

3.5.3. Creating `NaN`s

If we want to set a value to NaN, we can’t just type NaN or “NaN”, as these would be interpreted as a variable name or as text, respectively. Instead, we can ‘create’ the value NaN using the numpy constant np.nan.

For example, to set the value of CHARGES in the second row to NaN (the second row has index label 1, because row labels start at 0), you would use this:

hospital.loc[1, 'CHARGES']=np.nan
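As a minimal sketch of the same idea on a small made-up dataframe (the name toy and its values are hypothetical, chosen only for illustration), the code below sets one value to NaN and then confirms it. Note that NaN is not equal to anything, including itself, so a check with == will not find it; pd.isna() is used instead.

```python
import numpy as np
import pandas as pd

# Toy dataframe (made up for illustration)
toy = pd.DataFrame({'CHARGES': [4752.0, 3941.0, 3657.0]})

# Row labels start at 0, so label 1 is the second row
toy.loc[1, 'CHARGES'] = np.nan

# NaN is never equal to anything, including itself,
# so use pd.isna() rather than == to test for it
print(np.nan == np.nan)                # False
print(pd.isna(toy.loc[1, 'CHARGES']))  # True
```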

3.5.4. Check for `NaN`s

Although NaNs are ignored by many Python functions, you sometimes may want to know if a particular dataset contains NaNs. This can be useful to determine how much data loss you have and if the values you are calculating are to be trusted. To do this you can use the following code:

df.isna() - checks whether each value is NaN and returns True or False for each one
df.isna().sum() - returns the total number of Trues from the df.isna() call

df.isna() will return a column with True or False for each value of a specified data frame or column. For example here we will check which values in AGE are NaN:

hospital.AGE.isna()

0        False
1        False
2        False
3        False
4        False
         ...  
12839    False
12840    False
12841    False
12842    False
12843    False
Name: AGE, Length: 12844, dtype: bool

df.isna() returned a column with True or False for each value of AGE - True for people where the age is coded as NaN and False otherwise.

This isn’t very readable, particularly since the dataset contains more than 12000 rows. If we want to know how many NaNs there are in the column, we can use a trick: Python treats True as 1 and False as 0. So if we just take the sum of the column, we get the total number of NaNs:

hospital.AGE.isna().sum()

np.int64(4)

Four people’s age was coded as NaN.
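The same trick can be applied to a whole dataframe at once, in which case the sum is taken separately for each column. The sketch below uses a small made-up dataframe (hypothetical values, not the hospital data) and also computes the proportion of missing values per column, which is one way to judge how much data loss you have.

```python
import numpy as np
import pandas as pd

# Toy dataframe (made up for illustration) with some missing values
toy = pd.DataFrame({
    'AGE': [79.0, np.nan, 76.0, 80.0],
    'CHARGES': [4752.0, 3941.0, np.nan, np.nan],
})

nan_counts = toy.isna().sum()     # number of NaNs in each column
nan_fraction = toy.isna().mean()  # proportion of NaNs in each column

print(nan_counts)    # AGE has 1 NaN, CHARGES has 2
print(nan_fraction)  # AGE 0.25, CHARGES 0.5
```

Taking the mean of the True/False values works for the same reason as the sum: True counts as 1 and False as 0.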

3.5.5. NaN handling by Python functions

Many Python functions automatically ignore NaNs.

These include

df.mean()
df.std()
df.quantile() ... and most other descriptive statistics
sns.histplot()
sns.scatterplot() ... and most other Seaborn and Matplotlib functions
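For the pandas descriptive statistics, skipping NaNs is a default setting that can be switched off with the skipna argument. The sketch below, using made-up ages, shows both settings, and also shows that the skipping is a feature of the pandas methods rather than of Python in general: applied to a plain numpy array, np.mean() returns NaN, while np.nanmean() skips the missing value.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration) with one missing value
ages = pd.Series([20.0, 30.0, np.nan, 40.0])

default_mean = ages.mean()             # NaN is skipped: 30.0
strict_mean = ages.mean(skipna=False)  # NaN is not skipped: result is NaN

# The same values as a plain numpy array
values = ages.to_numpy()
numpy_mean = np.mean(values)           # NaN: np.mean does not skip NaNs
numpy_nanmean = np.nanmean(values)     # 30.0: np.nanmean does skip them

print(default_mean, strict_mean, numpy_mean, numpy_nanmean)
```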

However, some functions do not automatically ignore NaNs, and instead will give an error message, or return the value NaN, if the input data contains NaNs.

This includes a lot of functions from the library scipy.stats, which we will use later in the course. For example, say I want to use a \(t\)-test to ask if the male patients are older than the females.

Don’t worry if you don’t yet know what a \(t\)-test is - this will make sense when you return to it for revision. The important thing to note here is the NaN values in the output!

stats.ttest_ind(hospital.query('SEX == "M"').AGE, hospital.query('SEX == "F"').AGE)

TtestResult(statistic=np.float64(nan), pvalue=np.float64(nan), df=np.float64(nan))

The function stats.ttest_ind() performs an independent samples \(t\)-test between the two samples we gave it (the ages of male and female patients) and should return a \(t\)-value (statistic), a \(p\)-value (pvalue), and the degrees of freedom (df).

Right now all three of these are NaN because the NaNs in the input were not ignored.

We can tell the function stats.ttest_ind() to ignore NaNs, using the argument nan_policy='omit':

stats.ttest_ind(hospital.query('SEX == "M"').AGE, hospital.query('SEX == "F"').AGE, nan_policy='omit')

TtestResult(statistic=np.float64(-35.41617555682539), pvalue=np.float64(3.1864909732541125e-262), df=np.float64(12836.0))

Now we have actual values instead of NaN: \(t = -35.4\) and \(p = 3.2 \times 10^{-262}\) (a very small number).

If you run a Python function and the output is NaN, you very probably need to change how the function handles NaNs using an argument. Check the function’s help page online to get the correct syntax.
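If a function has no NaN-handling argument, an alternative is to remove the NaNs yourself before passing the data in. The sketch below uses two small made-up samples (the names men and women and their values are hypothetical) to compare the two approaches for stats.ttest_ind(); under these assumptions they should give the same \(t\)-value.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Toy samples (made up for illustration), each with one missing value
men = pd.Series([61.0, 58.0, np.nan, 70.0, 66.0])
women = pd.Series([72.0, 75.0, 69.0, np.nan, 80.0])

# Option 1: ask the function to omit NaNs
omit_result = stats.ttest_ind(men, women, nan_policy='omit')

# Option 2: drop the NaNs yourself before calling the function
dropna_result = stats.ttest_ind(men.dropna(), women.dropna())

# Both options run the test on the same non-missing values
print(omit_result.statistic, dropna_result.statistic)
```

Dropping the NaNs by hand like this is fine for an independent samples test, where each group is a separate column of values. Take more care when rows need to stay matched up across columns, since dropping values from one column alone would break the pairing.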
