5 Steps to Perform Exploratory Data Analysis (EDA) in Python (With Code Examples)
You have a dataset in front of you. Maybe it is a CSV file from your manager, a download from a government database, or data you collected yourself. Now what?
This is the moment most beginners get stuck. They open the file, see hundreds of rows and dozens of columns, and have no idea where to begin.
The answer is Exploratory Data Analysis, commonly known as EDA.
EDA is the process of investigating a dataset to summarise its key characteristics, such as its size, data types, averages, and spread. It also helps you catch problems like missing values, outliers, and duplicates. Think of it as your first conversation with the data. Before you run any models, publish any charts, or draw any conclusions, EDA helps you understand what you are actually working with.
By analysing and visualising data through EDA, you can get a true sense of what the data looks like, discover trends and patterns, spot outliers and other anomalies, and answer key research questions.
In this article, you will learn the 5 steps to perform EDA in Python using Pandas, NumPy, and Seaborn, with code examples you can follow along with today.
Before you start: This article connects directly to our article on 5 NumPy Functions Every Data Analyst Should Know and the 5 Best Python Libraries for Data Visualization. Reading those first will make this much easier to follow.
The Dataset We Will Use
Throughout this article, we will use a simple student performance dataset. It contains information about 10 students, including their names, scores in three subjects, attendance percentage, and study hours per week.
You can create it yourself by running this code at the top of your notebook:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Create a sample student dataset
data = {
    "Student": ["Ada", "Emeka", "Fatima", "Chidi", "Ngozi",
                "Tunde", "Amara", "Bello", "Grace", "Uche"],
    "Maths": [85, 92, 78, None, 88, 95, 60, 73, 90, 55],
    "English": [76, 88, 82, 91, 70, 84, 65, None, 87, 72],
    "Science": [90, 85, 79, 88, 76, 92, 68, 80, 85, 62],
    "Attendance": [95, 98, 88, 75, 92, 99, 70, 85, 96, 65],
    "Study_Hours": [5, 7, 6, 4, 6, 8, 3, 5, 7, 2]
}
df = pd.DataFrame(data)
print(df)
This dataset has something important built in: two missing values, one in the Maths column and one in English. This is intentional because real datasets very often have missing values, and EDA is how you find them.
Quick Summary: The 5 EDA Steps
| Step | What You Do | Tools Used |
|---|---|---|
| 1. Load and Inspect | Look at size, columns, and data types | df.head(), df.info(), df.shape |
| 2. Check Data Quality | Find missing values and duplicates | df.isnull(), df.duplicated() |
| 3. Summarise Numerically | Get statistics for each column | df.describe(), NumPy functions |
| 4. Explore Distributions | See how values are spread | Histograms, value counts |
| 5. Find Relationships | See how columns relate to each other | Correlation, scatter plots, heatmaps |
Step 1: Load and Inspect Your Data
Why this step matters: Before you do anything else, you need to answer four basic questions about your dataset:
- How many rows and columns does it have?
- What are the column names?
- What type of data is in each column?
- What do the first few rows look like?
These questions sound simple, but skipping this step is how analysts end up wasting hours working on data they did not fully understand.
Code example:
# How many rows and columns?
print("Shape:", df.shape)
# Output: Shape: (10, 6)
# Column names and data types
# (df.info() prints its report itself; wrapping it in print() would also print "None")
df.info()
# Output (abridged):
# RangeIndex: 10 entries, 0 to 9
# Data columns (total 6 columns):
#  #   Column       Non-Null Count  Dtype
# ---  ------       --------------  -----
#  0   Student      10 non-null     object
#  1   Maths        9 non-null      float64
#  2   English      9 non-null      float64
#  3   Science      10 non-null     int64
#  4   Attendance   10 non-null     int64
#  5   Study_Hours  10 non-null     int64
# Preview the first 5 rows
print(df.head())
# Preview the last 3 rows
print(df.tail(3))
What to look for in df.info():
Look at the "Non-Null Count" column. Any column where this number is less than your total row count has missing values. In our dataset, Maths shows 9 non-null values (out of 10 rows) and English also shows 9. Those are the two missing values we built in.
Also check the "Dtype" column. If a column that should contain numbers shows up as "object" (the Pandas type used for text), it means something in that column is stored as text and will need to be fixed before analysis.
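As a rough illustration of that fix, the sketch below uses a small made-up column (the values, including the stray text "absent", are assumptions and not part of the student dataset). It shows how pd.to_numeric with errors="coerce" turns text into numbers and converts anything it cannot parse into a missing value.

```python
import pandas as pd

# Hypothetical column where scores were read in as text
raw_scores = pd.Series(["85", "92", "absent", "78"], name="Maths")

# errors="coerce" converts values that cannot be parsed into NaN
clean_scores = pd.to_numeric(raw_scores, errors="coerce")

print(clean_scores.dtype)           # float64
print(clean_scores.isnull().sum())  # 1
```

After a conversion like this, the unparseable entries show up as missing values, so they are picked up by the checks in Step 2.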
Step 2: Check for Missing Values and Duplicates
Why this step matters: Missing values and duplicate rows are two of the most common data quality problems. If you ignore them, later calculations can be quietly distorted (Pandas skips missing values by default, and duplicate rows get counted twice) or fail with an error.
EDA is a critical early step to understand the data you are working with, detect patterns or anomalies, and form hypotheses. Catching quality issues early saves you from discovering them after you have already shared your analysis with someone.
Code example (finding missing values):
# Count missing values in each column
print(df.isnull().sum())
# Output:
# Student 0
# Maths 1
# English 1
# Science 0
# Attendance 0
# Study_Hours 0
# See missing values as a percentage of total rows
missing_percent = (df.isnull().sum() / len(df)) * 100
print(missing_percent.round(1))
# Output:
# Student         0.0
# Maths          10.0
# English        10.0
# Science         0.0
# Attendance      0.0
# Study_Hours     0.0
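Counts tell you how much is missing, but it often helps to see which rows are affected. The sketch below is one way to do that, using four of the ten students from the dataset so that it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Four rows taken from the student dataset, including both gaps
scores = pd.DataFrame({
    "Student": ["Ada", "Chidi", "Bello", "Grace"],
    "Maths": [85, None, 73, 90],
    "English": [76, 91, None, 87],
})

# any(axis=1) is True for a row if at least one of its columns is missing
rows_with_gaps = scores[scores.isnull().any(axis=1)]
print(rows_with_gaps)
```

Here the result contains the rows for Chidi and Bello, the two students with a missing score.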
Code example (finding duplicate rows):
# Check for duplicate rows
print("Duplicate rows:", df.duplicated().sum())
# Output: Duplicate rows: 0
# If duplicates exist, remove them like this:
# df = df.drop_duplicates()
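The student dataset has no duplicates, so here is a small sketch with a made-up frame in which one record was entered twice (the data is hypothetical). It shows how to inspect the duplicated rows before deciding to drop them.

```python
import pandas as pd

# Hypothetical frame in which Ada's record was entered twice
records = pd.DataFrame({
    "Student": ["Ada", "Emeka", "Ada"],
    "Maths": [85, 92, 85],
})

# keep=False marks every copy of a duplicated row, not just the later ones
print(records[records.duplicated(keep=False)])

# drop_duplicates keeps the first occurrence by default
deduplicated = records.drop_duplicates()
print(len(deduplicated))  # 2
```

Looking at the duplicates first is a useful habit, because two identical rows are sometimes legitimate separate records rather than a data entry mistake.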
What to do after finding missing values:
You have three main options, and the right choice depends on your situation:
# Option 1: Drop rows with missing values (use when few rows are affected)
df_dropped = df.dropna()
# Option 2: Fill missing values with the column average (use for numerical data)
df_filled = df.fillna(df.mean(numeric_only=True))
# Option 3: Fill with a specific value
df_filled2 = df.fillna(0)
# For this article, we will fill with the column mean
df = df.fillna(df.mean(numeric_only=True))
print("Missing values remaining:", df.isnull().sum().sum())
# Output: Missing values remaining: 0
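Filling with the mean is a common, simple approach for numerical data. It replaces the missing value with something reasonable rather than removing the entire row. The trade-off is that it slightly understates how spread out the column is, and when a column is heavily skewed the median is often a safer fill value.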
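To make the choice of fill value concrete, the sketch below compares a mean fill with a median fill on the Maths column and checks what the mean fill does to the standard deviation. It is an illustration under the assumption that you are deciding per column, not a recommendation for every dataset.

```python
import pandas as pd

# Maths scores from the student dataset, including the one missing value
maths = pd.Series([85, 92, 78, None, 88, 95, 60, 73, 90, 55], name="Maths")

mean_filled = maths.fillna(maths.mean())
median_filled = maths.fillna(maths.median())

print(f"Value used by mean fill:   {maths.mean():.1f}")    # 79.6
print(f"Value used by median fill: {maths.median():.1f}")  # 85.0

# Adding a value exactly at the mean pulls the spread inwards
print(f"Std before filling:  {maths.std():.2f}")
print(f"Std after mean fill: {mean_filled.std():.2f}")
```

The second pair of numbers shows the standard deviation shrinking after the fill, which is worth remembering when many values in a column are missing.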
Step 3: Summarise Your Data Numerically
Why this step matters: Now that your data is clean, you need to understand what the numbers actually look like. What is the highest score? The lowest? The average? Are the values tightly packed together or spread out widely?
This is where df.describe() and a few NumPy functions become very useful.
Code example:
# Get statistical summary of all numerical columns
print(df.describe().round(1))
# Output (after the mean fill in Step 2):
#        Maths  English  Science  Attendance  Study_Hours
# count   10.0     10.0     10.0        10.0         10.0
# mean    79.6     79.4     80.5        86.3          5.3
# std     13.5      8.5      9.7        12.3          1.9
# min     55.0     65.0     62.0        65.0          2.0
# 25%     74.2     73.0     76.8        77.5          4.2
# 50%     82.3     80.7     82.5        90.0          5.5
# 75%     89.5     86.2     87.2        95.8          6.8
# max     95.0     91.0     92.0        99.0          8.0
# (exact halves such as 74.25 round to the nearest even digit, so they appear as 74.2)
Reading the output:
- count tells you how many valid (non-missing) values are in each column
- mean is the average, which is useful but sensitive to extreme values
- std (standard deviation) tells you how spread out the values are. A high std means wide variation; a low std means values cluster near the mean
- min and max show the range
- 25%, 50%, and 75% are the quartiles; the 50% value is the median
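One detail about std is easy to miss: Pandas and NumPy use different default formulas. Pandas divides by n - 1 (the sample standard deviation), while NumPy divides by n (the population standard deviation). The sketch below shows the difference on the Science column; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

science = pd.Series([90, 85, 79, 88, 76, 92, 68, 80, 85, 62], name="Science")

sample_std = science.std()                        # Pandas default: ddof=1
population_std = np.std(science.to_numpy())       # NumPy default: ddof=0
matched_std = np.std(science.to_numpy(), ddof=1)  # NumPy using the sample formula

print(f"Pandas std:          {sample_std:.2f}")      # 9.66
print(f"NumPy std (default): {population_std:.2f}")  # 9.17
print(f"NumPy std (ddof=1):  {matched_std:.2f}")     # 9.66
```

With only 10 rows the gap is noticeable, so it is worth stating which version you report.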
Using NumPy for a deeper look:
import numpy as np
# Focus on Maths scores
maths_scores = df["Maths"].values
print(f"Highest score: {np.max(maths_scores):.0f}")
print(f"Lowest score: {np.min(maths_scores):.0f}")
print(f"Average score: {np.mean(maths_scores):.1f}")
print(f"Median score: {np.median(maths_scores):.1f}")
print(f"Std deviation: {np.std(maths_scores):.1f}")
# Output:
# Highest score: 95
# Lowest score: 55
# Average score: 79.6
# Median score: 82.3
# Std deviation: 12.8
# Note: np.std() divides by n by default, while df.describe() divides by n - 1,
# so the two std figures differ slightly. Pass ddof=1 to np.std() to match.
Notice that the average (79.6) is lower than the median (82.3). This tells you the data is slightly skewed: a few low scores are pulling the average down. In a real analysis, this would prompt you to investigate the low-scoring students further.
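If you want a single number for that lopsidedness, Pandas offers a skew() method. The sketch below applies it to the nine recorded Maths scores (the missing value is left out). A negative result indicates a longer tail of low values, and a positive result a longer tail of high values.

```python
import pandas as pd

# The nine recorded Maths scores from the student dataset
maths = pd.Series([85, 92, 78, 88, 95, 60, 73, 90, 55], name="Maths")

skewness = maths.skew()
print(f"Mean: {maths.mean():.1f}, median: {maths.median():.1f}")
print(f"Skewness: {skewness:.2f}")  # negative: the low scores form the longer tail
```

With so few rows, read the value as a direction rather than a precise measurement.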
Step 4: Explore Distributions
Why this step matters: Numbers in a table do not tell the whole story. A distribution chart shows you where most values fall, whether there are any extreme values (outliers), and whether the data is balanced or heavily skewed to one side.
This is where your data visualization skills come in.
Code example (histogram):
import matplotlib.pyplot as plt
import seaborn as sns
# Plot distribution of Maths scores
plt.figure(figsize=(8, 5))
sns.histplot(df["Maths"], bins=6, color="steelblue", kde=True)
plt.title("Distribution of Maths Scores")
plt.xlabel("Score")
plt.ylabel("Number of Students")
plt.show()
The kde=True parameter adds a smooth curve over the histogram that estimates the overall shape of the distribution. If the curve is roughly bell-shaped, the data is approximately normal. If it leans heavily to one side, the data is skewed. With only 10 students, treat the shape as a rough hint rather than firm evidence.
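A histogram can also be read as a table of counts. The sketch below groups the Science scores into bands with pd.cut and counts each band with value_counts. The band edges are an assumption chosen for illustration, so pick edges that suit your own grading scheme.

```python
import pandas as pd

science = pd.Series([90, 85, 79, 88, 76, 92, 68, 80, 85, 62], name="Science")

# Hypothetical score bands; each band includes its upper edge, e.g. (80, 90]
bands = pd.cut(science, bins=[50, 60, 70, 80, 90, 100])

# sort_index() lists the bands from lowest to highest instead of by frequency
band_counts = bands.value_counts().sort_index()
print(band_counts)
```

Most students land in the 80 to 90 band, which matches what the histogram shows visually.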
Code example (checking all score columns at once):
# Plot distributions for all three subjects side by side
fig, axes = plt.subplots(1, 3, figsize=(14, 5))
subjects = ["Maths", "English", "Science"]
colors = ["steelblue", "coral", "mediumseagreen"]
for ax, subject, color in zip(axes, subjects, colors):
    sns.histplot(df[subject], bins=5, color=color, kde=True, ax=ax)
    ax.set_title(f"Distribution of {subject} Scores")
    ax.set_xlabel("Score")
plt.tight_layout()
plt.show()
Code example (box plot to spot outliers):
# Box plot to check for outliers across all subjects
plt.figure(figsize=(8, 5))
sns.boxplot(data=df[["Maths", "English", "Science"]], palette="Set2")
plt.title("Score Distribution by Subject")
plt.ylabel("Score")
plt.show()
In a box plot, the box shows where the middle 50% of values fall. The line inside the box is the median. By default, each whisker extends to the most extreme value within 1.5 times the interquartile range (the height of the box) of the box edge. Any dots beyond the whiskers are flagged as outliers: values that sit far outside the typical range and often need special attention.
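The same check can be done numerically. The sketch below applies the 1.5 × IQR convention that box plot whiskers use by default; the helper name iqr_outliers and the extra score of 20 are assumptions added for illustration.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

science = pd.Series([90, 85, 79, 88, 76, 92, 68, 80, 85, 62], name="Science")
print(iqr_outliers(science))  # empty: no Science score is flagged

# Hypothetical extra score of 20 to show what a flagged value looks like
with_low_score = pd.concat([science, pd.Series([20])], ignore_index=True)
print(iqr_outliers(with_low_score))  # only the score of 20 is returned
```

A flagged value is a prompt to investigate, not an instruction to delete. It may be a typing error or a genuine result.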
Step 5: Find Relationships Between Variables
Why this step matters: The most valuable insights in data often come from understanding how two or more columns relate to each other. Does studying more lead to higher scores? Do students with better attendance perform better? This step answers those questions.
Code example (correlation matrix):
# Calculate how strongly each pair of columns is related
correlation = df[["Maths", "English", "Science", "Attendance", "Study_Hours"]].corr()
print(correlation.round(2))
Correlation values range from -1 to 1:
- A value close to 1 means the two columns move together — when one goes up, the other goes up
- A value close to -1 means they move in opposite directions
- A value close to 0 means there is no clear relationship
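When the matrix has many columns, it can help to rank the correlations against the one column you care about. The sketch below does that for Maths using three columns from the dataset, and also computes a Spearman (rank-based) version for comparison. The variable names are illustrative.

```python
import pandas as pd

df_scores = pd.DataFrame({
    "Maths": [85, 92, 78, None, 88, 95, 60, 73, 90, 55],
    "Attendance": [95, 98, 88, 75, 92, 99, 70, 85, 96, 65],
    "Study_Hours": [5, 7, 6, 4, 6, 8, 3, 5, 7, 2],
})

# corr() uses Pearson by default and skips pairs where either value is missing
pearson = df_scores.corr()

# Rank the other columns by how strongly they correlate with Maths
maths_corr = pearson["Maths"].drop("Maths").sort_values(ascending=False)
print(maths_corr.round(2))

# Spearman compares ranks, so it is less sensitive to extreme values
spearman = df_scores.corr(method="spearman")
print(spearman["Maths"].drop("Maths").round(2))
```

If the Pearson and Spearman figures disagree sharply, that is often a sign that a few extreme values or a curved relationship are driving the result.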
Visualising the correlation with a heatmap:
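plt.figure(figsize=(8, 6))
sns.heatmap(
    correlation,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    vmin=-1,  # fix the colour scale to the full correlation range
    vmax=1,
    linewidths=0.5
)
plt.title("Correlation Between Student Performance Variables")
plt.show()
The heatmap makes the correlation matrix easy to read at a glance. Because the colour scale is fixed from -1 to 1, dark red cells show strong positive relationships, dark blue cells show strong negative relationships, and pale cells show little to no relationship. Without vmin and vmax, Seaborn scales the colours to the values present, so blue would only mean "the weakest correlation in this table".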
Code example (scatter plot to explore one relationship):
# Do students who study more get higher Maths scores?
plt.figure(figsize=(7, 5))
sns.scatterplot(
    data=df,
    x="Study_Hours",
    y="Maths",
    hue="Attendance",
    size="Attendance",
    palette="viridis",
    sizes=(50, 200)
)
plt.title("Study Hours vs Maths Score")
plt.xlabel("Weekly Study Hours")
plt.ylabel("Maths Score")
plt.show()
This chart shows each student as a dot. The position tells you their study hours and Maths score. The colour and size tell you their attendance. In one chart, you can see the relationship between three variables at once.
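If you see a clear upward trend from left to right, it suggests that students who study more tend to score higher. On its own, the chart shows an association, not proof that extra study hours cause the higher scores.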
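To put a number on the trend, one simple option is a straight-line fit. The sketch below uses np.polyfit with degree 1 on the students who have a recorded Maths score. Treat it as a rough description of the pattern in these rows, not a predictive model.

```python
import numpy as np
import pandas as pd

pairs = pd.DataFrame({
    "Study_Hours": [5, 7, 6, 4, 6, 8, 3, 5, 7, 2],
    "Maths": [85, 92, 78, None, 88, 95, 60, 73, 90, 55],
}).dropna()  # drop the student with no Maths score

# A degree-1 polyfit returns the slope and intercept of the least-squares line
slope, intercept = np.polyfit(pairs["Study_Hours"], pairs["Maths"], 1)
print(f"Each extra study hour is associated with about {slope:.1f} more Maths points")
```

A positive slope matches the upward trend in the scatter plot. Seaborn's regplot draws the same kind of line directly on a chart if you prefer a visual version.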
Putting the Full EDA Workflow Together
Here is the complete EDA workflow in one place so you can save it and reuse it on any dataset:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# ── STEP 1: LOAD AND INSPECT ────
# Load your own data first, for example: df = pd.read_csv("your_file.csv")  (hypothetical file name)
print("Shape:", df.shape)
df.info()
print(df.head())
# ── STEP 2: CHECK DATA QUALITY ────
print("Missing values:\n", df.isnull().sum())
print("Duplicate rows:", df.duplicated().sum())
df = df.fillna(df.mean(numeric_only=True))
# ── STEP 3: NUMERICAL SUMMARY ────
print(df.describe().round(1))
# ── STEP 4: DISTRIBUTIONS ────────
df[["Maths", "English", "Science"]].hist(figsize=(12, 4), bins=5, color="steelblue")
plt.tight_layout()
plt.show()
# ── STEP 5: RELATIONSHIPS ──────
correlation = df[["Maths", "English", "Science", "Attendance", "Study_Hours"]].corr()
sns.heatmap(correlation, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
Copy this template. Swap in your own dataset. Run it. You will have a solid first look at any data in under five minutes.
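If you would rather not edit column names each time, the text-based checks can be wrapped in a function. The sketch below is one possible version: the name quick_eda and the shape of the returned dictionary are assumptions made for this example, and it leaves the plots out so it can run anywhere.

```python
import pandas as pd

def quick_eda(frame: pd.DataFrame) -> dict:
    """Collect the text-based checks from the workflow for any DataFrame."""
    numeric = frame.select_dtypes(include="number")
    return {
        "shape": frame.shape,
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "missing": {col: int(n) for col, n in frame.isnull().sum().items()},
        "duplicate_rows": int(frame.duplicated().sum()),
        "summary": numeric.describe().round(1),
        "correlation": numeric.corr().round(2),
    }

# Small made-up frame (hypothetical values) to show the function in use
example = pd.DataFrame({
    "Student": ["Ada", "Emeka", "Fatima"],
    "Maths": [85, 92, None],
    "Science": [90, 85, 79],
})

report = quick_eda(example)
print(report["shape"])           # (3, 3)
print(report["missing"])         # {'Student': 0, 'Maths': 1, 'Science': 0}
print(report["duplicate_rows"])  # 0
```

Selecting numeric columns automatically means the same function works on a dataset with different column names.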
Conclusions
EDA is not a one-time task you complete and move on from. It is a mindset. The habit of asking "what does this data actually look like before I start working with it?" is what separates reliable analysts from those who build conclusions on faulty foundations.
The five steps covered in this article are:
- Load and inspect — understand the size and structure of your data
- Check data quality — find missing values and duplicates early
- Summarise numerically — use statistics to understand your columns
- Explore distributions — visualise how values are spread
- Find relationships — discover how variables connect to each other
Once you are comfortable with these steps, you will be ready to move into building machine learning models, creating dashboards, and performing deeper statistical analysis.
Next, put your EDA skills to work visually. Revisit our guide on the 5 best Python data visualization libraries to learn how to present your findings in polished charts. And if you want to go deeper on numerical operations, check out our guide on 5 NumPy functions every data analyst should know.
How to Install the Libraries Used in This Article
pip install pandas numpy seaborn matplotlib
Published on JacobIsah Programming Hub | enemzy.blogspot.com