URL: https://www.progressiverobot.com/pandas-dropna-drop-null-na-values-from-dataframe/
Introduction
In this tutorial, you'll learn how to use the pandas DataFrame dropna() function.
NA stands for "Not Available". In pandas, this covers missing (null) values such as None, pandas.NaT, and numpy.nan. Using dropna() will drop the rows or columns that contain these values, which is useful when you need to work with complete data only.
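As a quick check of which values count as NA, the sketch below uses pd.isna() on a few individual values. The two lists (missing and not_missing) are illustrative names chosen for this example, not part of the tutorial's script.

```python
import numpy as np
import pandas as pd

# Values that pandas treats as missing
missing = [None, np.nan, pd.NaT]

# Values that may look "empty" but are not treated as NA
not_missing = ["", 0, "NA"]

missing_flags = [bool(pd.isna(v)) for v in missing]
not_missing_flags = [bool(pd.isna(v)) for v in not_missing]

print(missing_flags)
print(not_missing_flags)
```

Empty strings, zeros, and the literal text "NA" are ordinary values, so dropna() leaves them in place unless you convert them to a real NA value first.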
By default, this function returns a new DataFrame and the source DataFrame remains unchanged.
This tutorial was verified with Python 3.10.9, pandas 1.5.2, and NumPy 1.24.1.
Syntax
dropna() takes the following parameters:
dropna(self, axis=0, how="any", thresh=None, subset=None, inplace=False)
- axis: {0 (or 'index'), 1 (or 'columns')}, default 0
    - If 0, drop rows with missing values.
    - If 1, drop columns with missing values.
- how: {'any', 'all'}, default 'any'
    - If 'any', drop the row or column if any of the values is NA.
    - If 'all', drop the row or column if all of the values are NA.
- thresh: (optional) an int value that sets the minimum number of non-NA values a row or column must have to be kept.
- subset: (optional) label or sequence of labels along the other axis to check for NA values. When dropping rows, these are column labels; when dropping columns, these are row labels.
- inplace: (optional) a bool value.
    - If True, the source DataFrame is changed and None is returned.
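The axis parameter accepts either the number or the name. The following is a minimal sketch on a small throwaway DataFrame (the column names a and b are made up for illustration) showing that both spellings select the same rows or columns.

```python
import numpy as np
import pandas as pd

# Small illustrative frame: one NA value in column "a"
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Dropping rows: axis=0 and axis="index" are interchangeable
rows_by_number = df.dropna(axis=0)
rows_by_name = df.dropna(axis="index")

# Dropping columns: axis=1 and axis="columns" are interchangeable
cols_by_number = df.dropna(axis=1)
cols_by_name = df.dropna(axis="columns")

print(rows_by_name)
print(cols_by_name)
```

Using the names 'index' and 'columns' can make the intent easier to read than 0 and 1.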
Constructing Sample DataFrames
Construct a sample DataFrame that contains both valid and missing values:
[label dropnaExample.py]
import pandas as pd
import numpy as np
d1 = {
'Name': ['Shark', 'Whale', 'Jellyfish', 'Starfish'],
'ID': [1, 2, 3, 4],
'Population': [100, 200, np.nan, pd.NaT],
'Regions': [1, None, pd.NaT, pd.NaT]
}
df1 = pd.DataFrame(d1)
print(df1)
This code will print out the DataFrame:
[secondary_label Output]
Name ID Population Regions
0 Shark 1 100 1
1 Whale 2 200 None
2 Jellyfish 3 NaN NaT
3 Starfish 4 NaT NaT
Then add a second DataFrame with additional rows and columns with NA values:
d2 = {
'Name': ['Shark', 'Whale', 'Jellyfish', 'Starfish', pd.NaT],
'ID': [1, 2, 3, 4, pd.NaT],
'Population': [100, 200, np.nan, pd.NaT, pd.NaT],
'Regions': [1, None, pd.NaT, pd.NaT, pd.NaT],
'Endangered': [pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT]
}
df2 = pd.DataFrame(d2)
print(df2)
This will output a new DataFrame:
[secondary_label Output]
Name ID Population Regions Endangered
0 Shark 1 100 1 NaT
1 Whale 2 200 None NaT
2 Jellyfish 3 NaN NaT NaT
3 Starfish 4 NaT NaT NaT
4 NaT NaT NaT NaT NaT
You will use the preceding DataFrames in the examples that follow.
Dropping All Rows with Missing Values
Use dropna() to remove rows with any None, NaN, or NaT values:
[label dropnaExample.py]
dfresult = df1.dropna()
print(dfresult)
This will output:
[secondary_label Output]
Name ID Population Regions
0 Shark 1 100 1
The result is a new DataFrame containing the single row that had no NA values.
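If you want to see which rows would be removed before dropping them, one option is to build a boolean mask with isna(). This is a sketch that rebuilds the first sample DataFrame so it can run on its own; the names rows_with_na and kept are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Shark', 'Whale', 'Jellyfish', 'Starfish'],
    'ID': [1, 2, 3, 4],
    'Population': [100, 200, np.nan, pd.NaT],
    'Regions': [1, None, pd.NaT, pd.NaT],
})

# True for each row that holds at least one NA value
rows_with_na = df1.isna().any(axis=1)
print(rows_with_na)

# Selecting the complement keeps the same rows that dropna() keeps
kept = df1[~rows_with_na]
print(kept)
```

The mask is also handy for counting how many rows a dropna() call would remove, for example with rows_with_na.sum().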
Dropping All Columns with Missing Values
Use dropna() with axis=1 to remove columns with any None, NaN, or NaT values:
dfresult = df1.dropna(axis=1)
print(dfresult)
The columns with any None, NaN, or NaT values will be dropped:
[secondary_label Output]
Name ID
0 Shark 1
1 Whale 2
2 Jellyfish 3
3 Starfish 4
The result is a new DataFrame containing the two columns, Name and ID, that had no NA values.
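To list the columns that hold NA values before dropping them, a similar check can be run per column. The sketch below rebuilds the first sample DataFrame; cols_with_na is an illustrative name.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Shark', 'Whale', 'Jellyfish', 'Starfish'],
    'ID': [1, 2, 3, 4],
    'Population': [100, 200, np.nan, pd.NaT],
    'Regions': [1, None, pd.NaT, pd.NaT],
})

# isna().any() reports, per column, whether at least one value is NA
cols_with_na = df1.columns[df1.isna().any()].tolist()
print(cols_with_na)

# These are the columns that dropna(axis=1) removes
remaining = df1.dropna(axis=1).columns.tolist()
print(remaining)
```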
Dropping Rows or Columns if all the Values are Null with how
Use the second DataFrame with how='all' to drop only the rows in which every value is NA:
[label dropnaExample.py]
dfresult = df2.dropna(how='all')
print(dfresult)
The rows with all values equal to NA will be dropped:
[secondary_label Output]
Name ID Population Regions Endangered
0 Shark 1 100 1 NaT
1 Whale 2 200 None NaT
2 Jellyfish 3 NaN NaT NaT
3 Starfish 4 NaT NaT NaT
The fifth row was dropped.
Next, use how and specify the axis:
[label dropnaExample.py]
dfresult = df2.dropna(how='all', axis=1)
print(dfresult)
The columns with all values equal to NA will be dropped:
[secondary_label Output]
Name ID Population Regions
0 Shark 1 100 1
1 Whale 2 200 None
2 Jellyfish 3 NaN NaT
3 Starfish 4 NaT NaT
4 NaT NaT NaT NaT
The fifth column was dropped.
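The two calls can also be chained to remove both the empty row and the empty column in one expression. This is a sketch that rebuilds the second sample DataFrame so it runs independently; the variable name cleaned is illustrative.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'Name': ['Shark', 'Whale', 'Jellyfish', 'Starfish', pd.NaT],
    'ID': [1, 2, 3, 4, pd.NaT],
    'Population': [100, 200, np.nan, pd.NaT, pd.NaT],
    'Regions': [1, None, pd.NaT, pd.NaT, pd.NaT],
    'Endangered': [pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
})

# First drop rows that are entirely NA, then columns that are entirely NA
cleaned = df2.dropna(how='all').dropna(how='all', axis=1)

print(cleaned.shape)
print(cleaned)
```

Rows and columns that are only partly missing are left untouched, so this is a conservative first cleaning step.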
Dropping Rows or Columns if a Threshold is Crossed with thresh
Use the second DataFrame with thresh to drop rows that do not meet the threshold of at least 3 non-NA values:
[label dropnaExample.py]
dfresult = df2.dropna(thresh=3)
print(dfresult)
Rows that do not have at least 3 non-NA values will be dropped:
[secondary_label Output]
Name ID Population Regions Endangered
0 Shark 1 100 1 NaT
1 Whale 2 200 None NaT
The third, fourth, and fifth rows were dropped.
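To see why those rows were dropped, you can count the non-NA values in each row and compare the counts with the threshold. The sketch below rebuilds the second sample DataFrame; non_na_counts is an illustrative name.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'Name': ['Shark', 'Whale', 'Jellyfish', 'Starfish', pd.NaT],
    'ID': [1, 2, 3, 4, pd.NaT],
    'Population': [100, 200, np.nan, pd.NaT, pd.NaT],
    'Regions': [1, None, pd.NaT, pd.NaT, pd.NaT],
    'Endangered': [pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
})

# Number of non-NA values in each row
non_na_counts = df2.notna().sum(axis=1)
print(non_na_counts)

# thresh=3 keeps the rows whose count is 3 or more
kept_by_thresh = df2.dropna(thresh=3)
kept_by_mask = df2[non_na_counts >= 3]
print(kept_by_thresh.index.tolist())
```

Only the first two rows reach a count of 3 or more, which matches the output above.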
Dropping Rows or Columns for Specific subsets
Use the second DataFrame with subset to drop rows with NA values in the Population column:
[label dropnaExample.py]
dfresult = df2.dropna(subset=['Population'])
print(dfresult)
The rows that have Population with NA values will be dropped:
[secondary_label Output]
Name ID Population Regions Endangered
0 Shark 1 100 1 NaT
1 Whale 2 200 None NaT
The third, fourth, and fifth rows were dropped.
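subset also accepts several labels and can be combined with how. The following sketch, which rebuilds the second sample DataFrame, compares the default how='any' with how='all' when checking the Population and Regions columns; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'Name': ['Shark', 'Whale', 'Jellyfish', 'Starfish', pd.NaT],
    'ID': [1, 2, 3, 4, pd.NaT],
    'Population': [100, 200, np.nan, pd.NaT, pd.NaT],
    'Regions': [1, None, pd.NaT, pd.NaT, pd.NaT],
    'Endangered': [pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
})

cols = ['Population', 'Regions']

# Default how='any': drop a row if either checked column is NA
any_missing = df2.dropna(subset=cols)

# how='all': drop a row only if both checked columns are NA
all_missing = df2.dropna(subset=cols, how='all')

print(any_missing.index.tolist())
print(all_missing.index.tolist())
```

The Whale row has a Population but no Regions value, so it is dropped with how='any' and kept with how='all'.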
You can also specify the index values in the subset when dropping columns from the DataFrame:
[label dropnaExample.py]
dfresult = df2.dropna(subset=[1, 2], axis=1)
print(dfresult)
Columns that contain NA values in rows 1 or 2 will be dropped:
[secondary_label Output]
Name ID
0 Shark 1
1 Whale 2
2 Jellyfish 3
3 Starfish 4
4 NaT NaT
The third, fourth, and fifth columns were dropped.
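With axis=1, the labels in subset refer to rows rather than columns. As a rough cross-check, the sketch below rebuilds the second sample DataFrame and computes the surviving columns by hand from rows 1 and 2; complete_in_rows and kept_columns are illustrative names.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'Name': ['Shark', 'Whale', 'Jellyfish', 'Starfish', pd.NaT],
    'ID': [1, 2, 3, 4, pd.NaT],
    'Population': [100, 200, np.nan, pd.NaT, pd.NaT],
    'Regions': [1, None, pd.NaT, pd.NaT, pd.NaT],
    'Endangered': [pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT],
})

# For each column: are both row 1 and row 2 non-NA?
complete_in_rows = df2.loc[[1, 2]].notna().all()
kept_columns = complete_in_rows[complete_in_rows].index.tolist()
print(kept_columns)

# dropna() with a row subset keeps the same columns
dropped = df2.dropna(subset=[1, 2], axis=1)
print(dropped.columns.tolist())
```

NA values outside the listed rows, such as the NaT entries in row 4, do not affect which columns are kept.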
Changing the source DataFrame after Dropping Rows or Columns with inplace
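By default, dropna() does not modify the source DataFrame. However, in some cases, you may wish to change the source DataFrame directly instead of assigning the result to a new variable. You can do this with inplace=True. This does not necessarily save memory, because pandas may still build an intermediate copy internally.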
[label dropnaExample.py]
df1.dropna(inplace=True)
print(df1)
This code does not use a dfresult variable, because dropna() returns None when inplace is True.
This will output:
[secondary_label Output]
Name ID Population Regions
0 Shark 1 100 1
The original DataFrame has been modified.
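A common mistake is to assign the return value of an inplace call. The sketch below uses a small throwaway DataFrame (the column name a is made up for illustration) to contrast the default behavior with inplace=True.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

# Default: a new DataFrame is returned and df keeps all of its rows
copy_result = df.dropna()
length_after_default_call = len(df)

# inplace=True: df itself is modified and the call returns None
returned = df.dropna(inplace=True)
length_after_inplace_call = len(df)

print(length_after_default_call, length_after_inplace_call)
print(returned)
```

Writing df = df.dropna(inplace=True) would therefore replace the DataFrame with None. Reassigning the default result, as in df = df.dropna(), is an alternative that avoids this pitfall.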
Conclusion
In this article, you used the dropna() function to remove rows and columns with NA values.
Continue your learning with more Python and pandas tutorials – Python pandas Module Tutorial, pandas Drop Duplicate Rows.