Why Data Cleaning Matters More Than You Think

Data professionals often cite data cleaning as the most time-consuming part of their work — and for good reason. Raw data is almost never analysis-ready. It contains missing values, duplicates, inconsistent formatting, outliers, and encoding errors. The quality of your analysis is only as good as the quality of the data going into it.

Python's pandas library is the go-to tool for data cleaning. This guide walks you through the most common cleaning tasks with practical examples.

Setting Up

If you haven't already, install pandas and load your data:

pip install pandas

import pandas as pd

df = pd.read_csv('your_data.csv')
print(df.head())
print(df.info())

The df.info() call gives you column names, data types, and a count of non-null values — your first diagnostic snapshot of data health.

Handling Missing Values

Missing values are almost universal in real datasets. Your approach depends on the column and context.

Detecting Missing Values

# Count missing values per column
df.isnull().sum()

# Percentage missing
df.isnull().mean() * 100

Your Options

  • Drop rows: df = df.dropna(subset=['column_name']) — use when missingness is rare and random.
  • Fill with a value: df['column'] = df['column'].fillna(0) or df['column'] = df['column'].fillna('Unknown')
  • Fill with median/mean: df['age'] = df['age'].fillna(df['age'].median()) — safer than the mean for skewed distributions.
  • Forward fill: df['price'] = df['price'].ffill() — useful for time series data.

Note that these methods return a new object by default — without the reassignment, the original DataFrame is left unchanged.
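As a sketch of how these options combine, here is a small illustrative frame (the column names and values are made up for the example):

```python
import pandas as pd

# Illustrative frame with gaps in several columns
df = pd.DataFrame({
    'age': [25, None, 40, 35],
    'city': ['Boston', None, 'Austin', 'Denver'],
    'price': [10.0, None, None, 12.0],
})

df['age'] = df['age'].fillna(df['age'].median())   # median fill, robust to skew
df['city'] = df['city'].fillna('Unknown')          # sentinel value for text
df['price'] = df['price'].ffill()                  # forward fill, as in time series

print(df.isnull().sum().sum())  # 0 — no missing values remain
```

The right mix is per-column: a sentinel like 'Unknown' keeps rows countable, while a median fill preserves the distribution's center better than the mean when the data is skewed.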

Removing Duplicates

Duplicate rows silently inflate counts and aggregations:

# Check for duplicates
df.duplicated().sum()

# Remove duplicates, keep the first occurrence
df = df.drop_duplicates()

# Check duplicates based on specific columns
df = df.drop_duplicates(subset=['customer_id', 'order_date'])
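To make the "silent inflation" concrete, here is a tiny sketch with made-up order data, where one order was accidentally loaded twice:

```python
import pandas as pd

# The same order appears twice (illustrative data)
orders = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'order_date': ['2024-01-05', '2024-01-05', '2024-01-06'],
    'amount': [100.0, 100.0, 50.0],
})

print(orders['amount'].sum())  # 250.0 — inflated by the duplicate row

deduped = orders.drop_duplicates(subset=['customer_id', 'order_date'])
print(deduped['amount'].sum())  # 150.0 — the correct total
```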

Fixing Data Types

Pandas sometimes infers the wrong type — a date column loaded as a string, or a numeric column stored as an object. Fix this early:

# Convert to datetime
df['purchase_date'] = pd.to_datetime(df['purchase_date'])

# Convert to numeric (coerce turns errors into NaN)
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')

# Convert to categorical (saves memory on low-cardinality columns)
df['status'] = df['status'].astype('category')
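You can verify the memory claim for the categorical conversion directly with memory_usage (the status values below are illustrative):

```python
import pandas as pd

# Low-cardinality column repeated many times
df = pd.DataFrame({'status': ['active', 'inactive', 'pending'] * 10_000})

bytes_object = df['status'].memory_usage(deep=True)
df['status'] = df['status'].astype('category')
bytes_category = df['status'].memory_usage(deep=True)

# Categorical storage keeps one copy of each label plus small integer codes
print(bytes_category < bytes_object)  # True
```

The savings come from storing each distinct label once and replacing the repeated strings with compact integer codes.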

Standardizing String Columns

Inconsistent text is a frequent problem — "New York", "new york", "NEW YORK" are logically the same but treated as different values:

# Strip whitespace and lowercase
df['city'] = df['city'].str.strip().str.lower()

# Replace inconsistent values
df['status'] = df['status'].replace({'Completed': 'completed', 'DONE': 'completed'})

# Remove special characters
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)
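A quick sketch showing how normalization collapses logically identical values (the city list is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'city': ['New York', ' new york ', 'NEW YORK', 'Boston']})

print(df['city'].nunique())  # 4 — three spellings of the same city

# Strip whitespace and lowercase, as above
df['city'] = df['city'].str.strip().str.lower()

print(df['city'].nunique())  # 2 — 'new york' and 'boston'
```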

Dealing With Outliers

Outliers can distort statistical analysis. The IQR method is a robust way to detect them:

Q1 = df['amount'].quantile(0.25)
Q3 = df['amount'].quantile(0.75)
IQR = Q3 - Q1

# Filter out extreme outliers
df_clean = df[(df['amount'] >= Q1 - 1.5 * IQR) & 
              (df['amount'] <= Q3 + 1.5 * IQR)]

Whether you remove, cap, or transform outliers depends on context. Always investigate before deleting — an "outlier" might be a genuinely important data point or a sign of a data pipeline error.
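If capping fits the context better than removal, Series.clip applies the same IQR fences without dropping any rows. A sketch with an illustrative amount column:

```python
import pandas as pd

df = pd.DataFrame({'amount': [10, 12, 11, 13, 500]})  # 500 is an extreme value

Q1 = df['amount'].quantile(0.25)
Q3 = df['amount'].quantile(0.75)
IQR = Q3 - Q1

# Cap values at the IQR fences instead of removing rows
df['amount_capped'] = df['amount'].clip(lower=Q1 - 1.5 * IQR,
                                        upper=Q3 + 1.5 * IQR)

print(len(df))  # 5 — every row is retained
```

Capping preserves row counts and downstream joins, at the cost of distorting the tail of the distribution — which is exactly why the investigate-first advice above still applies.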

Building a Repeatable Cleaning Pipeline

Once you've identified your cleaning steps, wrap them in a function so the process is reproducible:

def clean_orders(df):
    df = df.drop_duplicates()
    df['order_date'] = pd.to_datetime(df['order_date'])
    df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
    df['customer_name'] = df['customer_name'].str.strip().str.title()
    df = df.dropna(subset=['amount', 'order_date'])
    return df

clean_df = clean_orders(raw_df)

This makes your work auditable, testable, and easy to reapply when new data arrives.
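Because the pipeline is a plain function, you can sanity-check it with simple assertions. A minimal sketch, using an illustrative raw frame:

```python
import pandas as pd

def clean_orders(df):
    df = df.drop_duplicates()
    df['order_date'] = pd.to_datetime(df['order_date'])
    df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
    df['customer_name'] = df['customer_name'].str.strip().str.title()
    df = df.dropna(subset=['amount', 'order_date'])
    return df

# Illustrative raw data: one duplicate row, one unparseable amount
raw_df = pd.DataFrame({
    'order_date': ['2024-01-05', '2024-01-05', '2024-01-06'],
    'amount': ['100', '100', 'oops'],
    'customer_name': [' alice ', ' alice ', 'bob'],
})

clean_df = clean_orders(raw_df)

# Invariants the cleaned frame should satisfy
assert not clean_df.duplicated().any()
assert clean_df['amount'].notna().all()
print(len(clean_df))  # 1 — duplicate dropped, unparseable amount dropped
```

Checks like these can live in a test file and run automatically whenever the pipeline or the incoming data changes.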