How to Clean Messy Data in Python with Pandas (Beginner Tutorial)
The Dirty Secret About Real-World Data
Nobody tells beginners this upfront: most real datasets are messy. Column names are inconsistent. Some rows have missing values. Dates are formatted three different ways in the same column. Numbers are stored as text.
Data cleaning is not machine learning, not fancy visualisations is where a data analyst spends the majority of their working hours. If you can clean data confidently, you're already more valuable than most people calling themselves "data analysts" in Nigeria right now.
This tutorial teaches you the five most important data cleaning operations in Python, with code you can run today.
What We're Working With
We'll use a messy version of a student dataset. Create a file called messy_students.csv and paste this in:
name,state,course,score,enrollment_date
Adeola Bello,lagos,Data Analysis,87,15/01/2024
Emeka Okonkwo,ANAMBRA,python programming,,2024-02-20
Fatima Aliyu,Kano,Data Analysis,78,20-03-2024
Chidi Nwosu,Rivers, Python Programming ,85,10/04/2024
Ngozi Eze,Enugu,data analysis,91,05/05/2024
Emeka Okonkwo,ANAMBRA,python programming,,2024-02-20
Notice what's wrong here:
- State names are not consistent (lagos, ANAMBRA, Kano)
- Course names have inconsistent capitalisation and extra spaces
- Emeka's score is missing
- Dates are formatted three different ways
- Emeka's row is duplicated
This is mild compared to what real data looks like. Let's clean it.
Step 1: Load the Data and Inspect It
import pandas as pd
df = pd.read_csv("messy_students.csv")
print(df)
print("\nShape:", df.shape)
print("\nMissing values:\n", df.isnull().sum())
Before touching anything, always look at your data and check for missing values. This tells you exactly what you're dealing with.
Step 2: Remove Duplicate Rows
# See duplicates first
print(df.duplicated())
# Remove them
df = df.drop_duplicates()
print("After removing duplicates:", df.shape)
The drop_duplicates() function removes any row that is an exact copy of another row. After this, Emeka's duplicate entry is gone.
Step 3: Fix Inconsistent Text Capitalisation and Whitespace
# Fix state names: make them all Title Case
df["state"] = df["state"].str.title()
# Fix course names: remove extra spaces, then make Title Case
df["course"] = df["course"].str.strip().str.title()
print(df[["state", "course"]])
What these methods do:
.str.title(): Capitalises the first letter of each word (Lagos, Anambra, Data Analysis).str.strip(): Removes leading and trailing spaces from a text value.str.lower()and.str.upper()also exist if you prefer all lowercase or all uppercase
Step 4: Handle Missing Values
Emeka's score is missing. You have three options:
Option A: Drop the row entirely
df_clean = df.dropna(subset=["score"])
Option B: Fill with the average score
mean_score = df["score"].mean()
df["score"] = df["score"].fillna(mean_score)
Option C: Fill with a fixed value (like 0)
df["score"] = df["score"].fillna(0)
Which option is right depends on the context. For academic scores, using the average is often the safest choice. For financial data, dropping the row is usually safer. Always document what you chose and why.
Step 5: Fix Data Types Especially Dates
The enrollment_date column has three different date formats. Pandas can handle this with a single command:
df["enrollment_date"] = pd.to_datetime(df["enrollment_date"], dayfirst=True)
print(df["enrollment_date"])
print(df["enrollment_date"].dtype)
The dayfirst=True argument tells pandas that when it sees something like 15/01/2024, the 15 is the day not the month. This matters if your dates are in the Nigerian/European format (DD/MM/YYYY) rather than the American format (MM/DD/YYYY).
After this, pandas stores the column as a proper datetime type, which means you can now extract the month, calculate date differences, and sort by date correctly.
Step 6: Rename Columns to Be Cleaner (Optional but Good Practice)
df = df.rename(columns={
"name": "student_name",
"enrollment_date": "date_enrolled"
})
print(df.columns)
Clean column names with underscores instead of spaces are easier to work with in code. This is a small habit that saves headaches later.
Your Cleaned Dataset
To save your cleaned data as a new CSV file:
df.to_csv("clean_students.csv", index=False)
print("Saved!")
The index=False part prevents pandas from adding an extra numbering column to your file.
Quick Reference: Data Cleaning Cheat Sheet
| Task | Code |
|---|---|
| Remove duplicates | df.drop_duplicates() |
| Standardise text case | df["col"].str.title() |
| Remove whitespace | df["col"].str.strip() |
| Fill missing values | df["col"].fillna(value) |
| Drop rows with missing values | df.dropna(subset=["col"]) |
| Convert to datetime | pd.to_datetime(df["col"]) |
| Rename columns | df.rename(columns={"old": "new"}) |
| Save to CSV | df.to_csv("file.csv", index=False) |
Practice Challenge
Find any dataset online with at least 100 rows NIMC demographic data, NBS economic data, or any CSV you can download from Kaggle and apply all five cleaning steps to it. Then answer:
- How many duplicates did you find?
- Which columns had missing values?
- Were any columns the wrong data type?
Share what you find in the comments. Real practice beats theory every time.
What's Next
Once your data is clean, you're ready to start analysing it and finding insights. The next post in this series covers filtering, grouping, and summarising data with pandas the skills that let you answer actual business questions from a dataset.
If you want more structured practice, the Python Exercise Library includes 50+ real-data exercises built specifically for Nigerian learners available on Selar.
Jacob Isah is the founder of JacobIsah Programming Hub and NEXODE Academy. Follow @jacobisah on all platforms for daily Python and data tips.

Post a Comment