3.5. Handling NaNs
NaN (Not a Number) is a special value used to represent missing data in many scientific programming languages, including Python.
Using NaN instead of a numerical dummy value (like 9999 or 0) is helpful because most Python functions either ignore NaNs by default, or can be set to ignore NaNs using an optional function argument. This is useful, for example, if you are trying to calculate the mean of a column: the mean() function would include 9999 in its calculation but ignore NaN.
In this section we will review:
- Why NaN is better than a numerical dummy value
- How to check for NaNs in a dataframe
- Setting the NaN-handling in Python functions
Set up Python Libraries
As usual, you will need to run this code block to import the relevant Python libraries.
# Set-up Python libraries - you need to run this but you don't need to change it
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import seaborn as sns
sns.set_theme(style='white')
import statsmodels.api as sm
import statsmodels.formula.api as smf
3.5.1. Import a dataset to work with
We again work with the NYC heart attack dataset.
The data will be automatically loaded from the internet when you run this code block:
hospital=pd.read_csv('https://raw.githubusercontent.com/jillxoreilly/StatsCourseBook_2024/main/data/heartAttack.csv')
display(hospital)
| | CHARGES | LOS | AGE | SEX | DRG | DIED |
|---|---|---|---|---|---|---|
| 0 | 4752.00 | 10 | 79.0 | F | 122.0 | 0.0 |
| 1 | 3941.00 | 6 | 34.0 | F | 122.0 | 0.0 |
| 2 | 3657.00 | 5 | 76.0 | F | 122.0 | 0.0 |
| 3 | 1481.00 | 2 | 80.0 | F | 122.0 | 0.0 |
| 4 | 1681.00 | 1 | 55.0 | M | 122.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 12839 | 22603.57 | 14 | 79.0 | F | 121.0 | 0.0 |
| 12840 | NaN | 7 | 91.0 | F | 121.0 | 0.0 |
| 12841 | 14359.14 | 9 | 79.0 | F | 121.0 | 0.0 |
| 12842 | 12986.00 | 5 | 70.0 | M | 121.0 | 0.0 |
| 12843 | NaN | 1 | 81.0 | M | 123.0 | 1.0 |
12844 rows × 6 columns
3.5.2. NaN is not a number!
Humans may recognise dummy values like 9999 for what they are, but the computer will treat them as numbers.
Say we want to find the mean and standard deviation of the age of patients in our hospital dataset (remembering that missing data were coded as 9999):
print(hospital.AGE.mean())
print(hospital.AGE.std())
67.83507241862638
124.700553618833
Is the value for standard deviation realistic?
These values include the 9999s just as if there were really people 9999 years old in the sample. This is why the standard deviation (a measure of how spread out the data are) is unrealistically large.
If we replace the 9999s with NaN we get the correct mean and standard deviation for the data set, excluding the missing data. The mean has changed slightly, and the standard deviation is now much more reasonable.
hospital.AGE = hospital.AGE.replace(9999, np.nan)
print(hospital.AGE.mean())
print(hospital.AGE.std())
66.28816199376946
13.654236726825275
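An alternative to replacing dummy values after loading is to convert them to NaN while the file is being read. The sketch below assumes a made-up CSV (the `csv_text` string and its values are hypothetical, not the hospital file) and uses the `na_values` argument of `pd.read_csv`. Note that a list passed to `na_values` applies to every column, so this only makes sense if 9999 is never a genuine value anywhere in the file.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file where missing ages are coded as 9999
csv_text = "AGE,SEX\n79,F\n9999,M\n55,M\n"

# na_values lists extra values that read_csv should convert to NaN while loading
toy = pd.read_csv(io.StringIO(csv_text), na_values=[9999])

n_missing = toy["AGE"].isna().sum()  # how many ages became NaN
mean_age = toy["AGE"].mean()         # NaN is skipped, so this is the mean of 79 and 55

print(n_missing)  # 1
print(mean_age)   # 67.0
```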
3.5.3. Creating NaNs
If we want to set a value to NaN, we can’t just type NaN or “NaN”, as these would be interpreted as a variable name or as text, respectively. Instead, we can ‘create’ the value NaN using the NumPy constant np.nan.
For example, to set the value of CHARGES in the second row to NaN (the second row has index 1, because Python counts from 0), you would use this:
hospital.loc[1, 'CHARGES']=np.nan
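One property of NaN worth knowing when creating it: NaN is not equal to anything, including itself, so a comparison with `==` cannot be used to find it. The following is a minimal sketch on a made-up dataframe (the `toy` dataframe and its values are hypothetical, not the hospital data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the hospital dataset
toy = pd.DataFrame({"CHARGES": [4752.0, 3941.0, 3657.0]})

# Set CHARGES in the second row (index label 1) to NaN
toy.loc[1, "CHARGES"] = np.nan

# NaN is never equal to anything, not even itself
equal_to_itself = (np.nan == np.nan)

# pd.isna() is the reliable way to test whether a value is NaN
found_by_isna = pd.isna(toy.loc[1, "CHARGES"])

print(equal_to_itself)  # False
print(found_by_isna)    # True
```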
3.5.4. Check for NaNs
Although NaNs are ignored by many Python functions, you sometimes may want to know if a particular dataset contains NaNs. This can be useful to determine how much data loss you have and if the values you are calculating are to be trusted. To do this you can use the following code:
- df.isna() - checks if a particular value is NaN and returns True or False
- df.isna().sum() - returns the total number of Trues from the df.isna() call
df.isna() will return a column with True or False for each value of a specified dataframe or column. For example, here we will check which values in AGE are NaN:
hospital.AGE.isna()
0 False
1 False
2 False
3 False
4 False
...
12839 False
12840 False
12841 False
12842 False
12843 False
Name: AGE, Length: 12844, dtype: bool
df.isna() returned a column with True or False for each value of AGE - True for people where the age is coded as NaN and False otherwise.
This isn’t very readable, particularly since the dataset contains more than 12000 rows. If we want to know how many NaNs were in the column, we can use a trick: Python treats True as 1 and False as 0. So if we just take the sum of the column, we get the total number of NaNs:
hospital.AGE.isna().sum()
np.int64(4)
Four people’s age was coded as NaN.
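The same approach works on a whole dataframe rather than a single column, which gives one count per column. The sketch below uses a small made-up dataframe (the `toy` values are hypothetical); taking the mean instead of the sum gives the proportion of missing values, which can be a quick way to judge how much data has been lost.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the hospital dataset
toy = pd.DataFrame({
    "CHARGES": [4752.0, np.nan, 3657.0, np.nan],
    "AGE": [79.0, 34.0, np.nan, 80.0],
    "SEX": ["F", "F", "M", "M"],
})

nan_counts = toy.isna().sum()     # number of NaNs in each column
nan_fraction = toy.isna().mean()  # proportion of NaNs in each column

print(nan_counts)
print(nan_fraction)
```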
3.5.5. NaN handling by Python functions
Many Python functions automatically ignore NaNs.
These include
- df.mean(), df.std(), df.quantile() … and most other descriptive statistics
- sns.histplot(), sns.scatterplot() … and most other Seaborn and Matplotlib functions
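For the pandas descriptive statistics, this default is controlled by an optional argument, `skipna`, which is `True` unless you say otherwise. A minimal sketch on made-up values (the `scores` series is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy values with one missing entry
scores = pd.Series([1.0, 2.0, np.nan, 3.0])

mean_default = scores.mean()             # skipna=True by default, so the NaN is ignored
mean_strict = scores.mean(skipna=False)  # the NaN is included, so the result is NaN

print(mean_default)  # 2.0
print(mean_strict)   # nan
```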
However, some functions do not automatically ignore NaNs, and instead will give an error message, or return the value NaN, if the input data contains NaNs.
This includes a lot of functions from the library scipy.stats, which we will use later in the course. For example, say I want to use a \(t\)-test to ask if the male patients are older than the females.
Don’t worry if you don’t yet know what a \(t\)-test is; this will make sense when you return to it for revision. The important thing to note here is the NaN values in the output!
stats.ttest_ind(hospital.query('SEX == "M"').AGE, hospital.query('SEX == "F"').AGE)
TtestResult(statistic=np.float64(nan), pvalue=np.float64(nan), df=np.float64(nan))
The function stats.ttest_ind() performs an independent samples \(t\)-test between the two samples we gave it (the ages of male and female patients) and should return a \(t\)-value (statistic), a \(p\)-value (pvalue), and the degrees of freedom (df).
Right now all three of these are NaN because the NaNs in the input were not ignored.
We can tell the function stats.ttest_ind() to ignore NaNs, using the argument nan_policy='omit':
stats.ttest_ind(hospital.query('SEX == "M"').AGE, hospital.query('SEX == "F"').AGE, nan_policy='omit')
TtestResult(statistic=np.float64(-35.41617555682539), pvalue=np.float64(3.1864909732541125e-262), df=np.float64(12836.0))
Now we have actual values instead of NaN: \(t = -35.4\) and \(p = 3.2 \times 10^{-262}\) (a very small number).
If you run a Python function and the output is NaN, you very probably need to change how the function handles NaNs using an argument. Check the function’s help page online to get the correct syntax.
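To see what nan_policy='omit' is doing, it can help to compare it with removing the NaNs by hand. The sketch below uses two small made-up samples (`group_a` and `group_b` are hypothetical, not the hospital data) and checks that both routes give the same result.

```python
import numpy as np
from scipy import stats

# Hypothetical toy samples, each containing one missing value
group_a = np.array([61.0, 70.0, 58.0, np.nan, 66.0])
group_b = np.array([72.0, 75.0, np.nan, 69.0, 80.0])

# Default behaviour (nan_policy='propagate'): NaNs in the input give NaN in the output
res_default = stats.ttest_ind(group_a, group_b)

# nan_policy='omit': NaNs are left out of the calculation
res_omit = stats.ttest_ind(group_a, group_b, nan_policy='omit')

# Removing the NaNs by hand before running the test
clean_a = group_a[~np.isnan(group_a)]
clean_b = group_b[~np.isnan(group_b)]
res_manual = stats.ttest_ind(clean_a, clean_b)

print(res_default.statistic)  # nan
print(np.isclose(res_omit.statistic, res_manual.statistic))  # True
```

If you would rather be warned than get a silent NaN, stats.ttest_ind() also accepts nan_policy='raise', which stops with an error when the input contains NaNs.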