ZMedia Purwodadi

How to Clean Messy Data in Python with Pandas (Beginner Tutorial)

Table of Contents

The Dirty Secret About Real-World Data

Nobody tells beginners this upfront: most real datasets are messy. Column names are inconsistent. Some rows have missing values. Dates are formatted three different ways in the same column. Numbers are stored as text.

Data cleaning is not machine learning, not fancy visualisations is where a data analyst spends the majority of their working hours. If you can clean data confidently, you're already more valuable than most people calling themselves "data analysts" in Nigeria right now.

This tutorial teaches you the five most important data cleaning operations in Python, with code you can run today.

How to Clean Messy Data in Python with Pandas (Beginner Tutorial)

What We're Working With

We'll use a messy version of a student dataset. Create a file called messy_students.csv and paste this in:

name,state,course,score,enrollment_date
Adeola Bello,lagos,Data Analysis,87,15/01/2024
Emeka Okonkwo,ANAMBRA,python programming,,2024-02-20
Fatima Aliyu,Kano,Data Analysis,78,20-03-2024
Chidi Nwosu,Rivers,  Python Programming  ,85,10/04/2024
Ngozi Eze,Enugu,data analysis,91,05/05/2024
Emeka Okonkwo,ANAMBRA,python programming,,2024-02-20

Notice what's wrong here:

  • State names are not consistent (lagos, ANAMBRA, Kano)
  • Course names have inconsistent capitalisation and extra spaces
  • Emeka's score is missing
  • Dates are formatted three different ways
  • Emeka's row is duplicated

This is mild compared to what real data looks like. Let's clean it.

Step 1: Load the Data and Inspect It

import pandas as pd

df = pd.read_csv("messy_students.csv")
print(df)
print("\nShape:", df.shape)
print("\nMissing values:\n", df.isnull().sum())

Before touching anything, always look at your data and check for missing values. This tells you exactly what you're dealing with.

Step 2: Remove Duplicate Rows

# See duplicates first
print(df.duplicated())

# Remove them
df = df.drop_duplicates()
print("After removing duplicates:", df.shape)

The drop_duplicates() function removes any row that is an exact copy of another row. After this, Emeka's duplicate entry is gone.

Step 3: Fix Inconsistent Text Capitalisation and Whitespace

# Fix state names: make them all Title Case
df["state"] = df["state"].str.title()

# Fix course names: remove extra spaces, then make Title Case
df["course"] = df["course"].str.strip().str.title()

print(df[["state", "course"]])

What these methods do:

  • .str.title(): Capitalises the first letter of each word (Lagos, Anambra, Data Analysis)
  • .str.strip(): Removes leading and trailing spaces from a text value
  • .str.lower() and .str.upper() also exist if you prefer all lowercase or all uppercase

Step 4: Handle Missing Values

Emeka's score is missing. You have three options:

Option A: Drop the row entirely

df_clean = df.dropna(subset=["score"])

Option B: Fill with the average score

mean_score = df["score"].mean()
df["score"] = df["score"].fillna(mean_score)

Option C: Fill with a fixed value (like 0)

df["score"] = df["score"].fillna(0)

Which option is right depends on the context. For academic scores, using the average is often the safest choice. For financial data, dropping the row is usually safer. Always document what you chose and why.

Step 5: Fix Data Types Especially Dates

The enrollment_date column has three different date formats. Pandas can handle this with a single command:

df["enrollment_date"] = pd.to_datetime(df["enrollment_date"], dayfirst=True)
print(df["enrollment_date"])
print(df["enrollment_date"].dtype)

The dayfirst=True argument tells pandas that when it sees something like 15/01/2024, the 15 is the day not the month. This matters if your dates are in the Nigerian/European format (DD/MM/YYYY) rather than the American format (MM/DD/YYYY).

After this, pandas stores the column as a proper datetime type, which means you can now extract the month, calculate date differences, and sort by date correctly.

Step 6: Rename Columns to Be Cleaner (Optional but Good Practice)

df = df.rename(columns={
    "name": "student_name",
    "enrollment_date": "date_enrolled"
})
print(df.columns)

Clean column names with underscores instead of spaces are easier to work with in code. This is a small habit that saves headaches later.

Your Cleaned Dataset

To save your cleaned data as a new CSV file:

df.to_csv("clean_students.csv", index=False)
print("Saved!")

The index=False part prevents pandas from adding an extra numbering column to your file.

Quick Reference: Data Cleaning Cheat Sheet

TaskCode
Remove duplicatesdf.drop_duplicates()
Standardise text casedf["col"].str.title()
Remove whitespacedf["col"].str.strip()
Fill missing valuesdf["col"].fillna(value)
Drop rows with missing valuesdf.dropna(subset=["col"])
Convert to datetimepd.to_datetime(df["col"])
Rename columnsdf.rename(columns={"old": "new"})
Save to CSVdf.to_csv("file.csv", index=False)

Practice Challenge

Find any dataset online with at least 100 rows NIMC demographic data, NBS economic data, or any CSV you can download from Kaggle and apply all five cleaning steps to it. Then answer:

  1. How many duplicates did you find?
  2. Which columns had missing values?
  3. Were any columns the wrong data type?

Share what you find in the comments. Real practice beats theory every time.

What's Next

Once your data is clean, you're ready to start analysing it and finding insights. The next post in this series covers filtering, grouping, and summarising data with pandas the skills that let you answer actual business questions from a dataset.

If you want more structured practice, the Python Exercise Library includes 50+ real-data exercises built specifically for Nigerian learners available on Selar.

Jacob Isah is the founder of JacobIsah Programming Hub and NEXODE Academy. Follow @jacobisah on all platforms for daily Python and data tips.

Post a Comment