3.5. Handling NaNs

NaN (Not a Number) is a special value used to represent missing data in many scientific programming languages, including Python.

Using NaN instead of a numerical dummy value (like 9999 or 0) is helpful because most Python functions either ignore NaNs by default, or can be set to ignore NaNs using an optional function argument. This is useful, for example, if you are trying to calculate the mean of a column: the mean() function would include 9999 in its calculation but ignore NaN.
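As a minimal sketch of this point, the toy example below (made-up ages, not the hospital data; the variable names are illustrative only) compares the mean of a column where the missing value is coded as 9999 with the same column where it is coded as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: five patients, the last one has a missing age
ages_dummy = pd.Series([30, 40, 50, 60, 9999])    # missing value coded as 9999
ages_nan = pd.Series([30, 40, 50, 60, np.nan])    # missing value coded as NaN

mean_dummy = ages_dummy.mean()   # 9999 is treated as a real age
mean_nan = ages_nan.mean()       # NaN is skipped by default

print(mean_dummy)   # 2035.8
print(mean_nan)     # 45.0
```

The first mean is pulled far away from any plausible age by the single dummy value, whereas the second is the mean of the four ages that are actually known.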

In this section we will review:

  • Why NaN is better than a numerical dummy value

  • How to check for NaNs in a dataframe

  • Setting the NaN-handling in Python functions

Set up Python Libraries

As usual, you will need to run this code block to import the relevant Python libraries.

# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf

3.5.1. Import a dataset to work with

We again work with the NYC heart attack dataset.

The data will be automatically loaded from the internet when you run this code block:

hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')
display(hospital)
CHARGES LOS AGE SEX DRG DIED
0 4752.00 10 79.0 F 122.0 0.0
1 3941.00 6 34.0 F 122.0 0.0
2 3657.00 5 76.0 F 122.0 0.0
3 1481.00 2 80.0 F 122.0 0.0
4 1681.00 1 55.0 M 122.0 0.0
... ... ... ... ... ... ...
12839 22603.57 14 79.0 F 121.0 0.0
12840 NaN 7 91.0 F 121.0 0.0
12841 14359.14 9 79.0 F 121.0 0.0
12842 12986.00 5 70.0 M 121.0 0.0
12843 NaN 1 81.0 M 123.0 1.0

12844 rows × 6 columns

3.5.2. NaN is not a number!

Humans may recognise dummy values like 9999 for what they are, but the computer will treat them as numbers.

Say we want to find the mean and standard deviation of the age of patients in our hospital dataset (remembering that missing data were coded as 9999):

print(hospital.AGE.mean())
print(hospital.AGE.std())
67.83507241862638
124.700553618833
  • Is the value for standard deviation realistic?

These values include the 9999s just as if there really were people 9999 years old in the sample. This is why the standard deviation (a measure of the spread of the data) is unrealistically large.

If we replace the 9999s with NaN we get the correct mean and standard deviation for the data set, excluding the missing data. The mean has changed slightly, and the standard deviation is now much more reasonable.

hospital.AGE = hospital.AGE.replace(9999, np.nan)
print(hospital.AGE.mean())
print(hospital.AGE.std())
66.28816199376946
13.654236726825275

3.5.3. Creating NaNs

If we want to set a value to NaN, we can’t just type NaN or “NaN”, as these would be interpreted as a variable name and a text string respectively. Instead, we can ‘create’ the value NaN using the NumPy constant np.nan.

For example, to set the value of CHARGES in the second row to NaN (row index 1, because Python counts from 0), you would use this:

hospital.loc[1, 'CHARGES']=np.nan
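The sketch below repeats this step on a small made-up dataframe (the name toy and its values are hypothetical) so that you can see the result without scrolling through the full dataset. It also illustrates one quirk that is worth knowing: NaN is not equal to anything, not even to itself, so a comparison with == cannot be used to find it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe, not the hospital data
toy = pd.DataFrame({'CHARGES': [4752.0, 3941.0, 3657.0]})

# Set the value in the second row (index label 1) to NaN
toy.loc[1, 'CHARGES'] = np.nan
print(toy)

# NaN is not equal to anything, including itself
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)        # False

# pd.isna() is a reliable way to test whether a single value is NaN
second_row_is_nan = pd.isna(toy.loc[1, 'CHARGES'])
print(second_row_is_nan)        # True
```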

3.5.4. Check for NaNs

Although NaNs are ignored by many Python functions, you may sometimes want to know whether a particular dataset contains NaNs. This helps you judge how much data is missing and whether the values you calculate can be trusted. To do this you can use the following code:

  • df.isna() - checks each value and returns True if it is NaN and False otherwise

  • df.isna().sum() - returns the total number of Trues from the df.isna() call

df.isna() will return a column with True or False for each value of a specified dataframe or column. For example, here we will check which values in AGE are NaN:

hospital.AGE.isna()
0        False
1        False
2        False
3        False
4        False
         ...  
12839    False
12840    False
12841    False
12842    False
12843    False
Name: AGE, Length: 12844, dtype: bool

df.isna() returned a column with True or False for each value of AGE - True for people where the age is coded as NaN and False otherwise.

This isn’t very readable, particularly since the dataset contains more than 12000 rows. If we want to know how many NaNs there are in the column, we can use a trick: Python treats True as 1 and False as 0. So if we just take the sum of the column, we get the total number of NaNs:

hospital.AGE.isna().sum()
np.int64(4)

Four people’s age was coded as NaN.
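The same trick works on a whole dataframe at once. The sketch below uses a small made-up dataframe (the name toy and its values are hypothetical) to count the NaNs in every column, and takes the mean instead of the sum to get the proportion of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe with some missing values
toy = pd.DataFrame({
    'AGE': [79.0, np.nan, 76.0, 80.0],
    'CHARGES': [4752.0, 3941.0, np.nan, np.nan],
})

nan_counts = toy.isna().sum()      # number of NaNs in each column
nan_fraction = toy.isna().mean()   # proportion of NaNs in each column

print(nan_counts)      # AGE 1, CHARGES 2
print(nan_fraction)    # AGE 0.25, CHARGES 0.50
```

Taking the mean works for the same reason as taking the sum: each True counts as 1 and each False as 0, so the mean is the fraction of values that are NaN.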

3.5.5. NaN handling by Python functions

Many Python functions automatically ignore NaNs.

These include

  • df.mean()

  • df.std()

  • df.quantile() … and most other descriptive statistics

  • sns.histplot()

  • sns.scatterplot() … and most other Seaborn and Matplotlib functions

However, some functions do not automatically ignore NaNs, and instead will give an error message, or return the value NaN, if the input data contains NaNs.

This includes a lot of functions from the library scipy.stats, which we will use later in the course. For example, say I want to use a \(t\)-test to ask if the male patients are older than the females:

  • don’t worry if you don’t yet know what a \(t\)-test is; this will make sense when you return to it for revision. The important thing to note here is the NaN values in the output!

stats.ttest_ind(hospital.query('SEX == "M"').AGE, hospital.query('SEX == "F"').AGE)
TtestResult(statistic=np.float64(nan), pvalue=np.float64(nan), df=np.float64(nan))

The function stats.ttest_ind() performs an independent samples \(t\)-test between the two samples we gave it (the ages of male and female patients) and should return a \(t\)-value (statistic), a \(p\)-value (pvalue), and the degrees of freedom (df).

Right now all three of these are NaN because the NaNs in the input were not ignored.

We can tell the function stats.ttest_ind() to ignore NaNs, using the argument nan_policy='omit':

stats.ttest_ind(hospital.query('SEX == "M"').AGE, hospital.query('SEX == "F"').AGE, nan_policy='omit')
TtestResult(statistic=np.float64(-35.41617555682539), pvalue=np.float64(3.1864909732541125e-262), df=np.float64(12836.0))

Now we have actual values instead of NaN: \(t = -35.4\) and \(p = 3.2 \times 10^{-262}\) (a very small number).
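An alternative that is sometimes useful, sketched here on made-up ages rather than the hospital data (the names males and females are hypothetical), is to remove the NaNs yourself with dropna() before passing the data to the function. For an independent samples \(t\)-test this should give the same answer as nan_policy='omit', because in both cases the missing values are simply left out of each sample.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Hypothetical toy samples, each with one missing age
males = pd.Series([60.0, 65.0, np.nan, 70.0])
females = pd.Series([72.0, np.nan, 78.0, 80.0, 75.0])

# Default behaviour: the NaNs propagate into the result
result_default = stats.ttest_ind(males, females)

# Option 1: ask the function to omit NaNs
result_omit = stats.ttest_ind(males, females, nan_policy='omit')

# Option 2: drop the NaNs ourselves before calling the function
result_drop = stats.ttest_ind(males.dropna(), females.dropna())

print(result_default.statistic)
print(result_omit.statistic)
print(result_drop.statistic)
```

Dropping NaNs by hand works with any function, including those that have no NaN-handling argument, but the optional argument is usually tidier when it is available.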

If you run a Python function and the output is NaN, you very probably need to change how the function handles NaNs using an argument. Check the function’s help page online to get the correct syntax.
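The name of the relevant argument differs between libraries, which is why checking the help page matters. As a rough illustration on made-up data (the variable names are hypothetical), pandas uses the argument skipna, while NumPy provides separate NaN-ignoring versions of its functions such as np.nanmean().

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
ages = pd.Series([30.0, 40.0, 50.0, np.nan])

pandas_default = ages.mean()               # pandas skips NaNs by default
pandas_keep = ages.mean(skipna=False)      # ask pandas NOT to skip NaNs
numpy_default = np.mean(ages.to_numpy())   # NumPy does not skip NaNs
numpy_nan = np.nanmean(ages.to_numpy())    # NaN-ignoring version of np.mean

print(pandas_default)   # 40.0
print(pandas_keep)      # nan
print(numpy_default)    # nan
print(numpy_nan)        # 40.0
```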