5 Steps to Perform Exploratory Data Analysis (EDA) in Python (With Code Examples)

You have a dataset in front of you. Maybe it is a CSV file from your manager, a download from a government database, or data you collected yourself. Now what?

This is the moment most beginners get stuck. They open the file, see hundreds of rows and dozens of columns, and have no idea where to begin.

The answer is Exploratory Data Analysis, commonly known as EDA.

EDA is the process of investigating a dataset to summarise its key characteristics, such as data types, means, and medians. It also surfaces data quality problems like missing values and duplicates, and flags outliers that deserve a closer look. Think of it as your first conversation with the data. Before you build any dashboards, run any models, or draw any conclusions, EDA helps you understand what you are actually working with.

By analysing and visualising data through EDA, you can get a true sense of what the data looks like, discover trends and patterns, spot outliers and other anomalies, and answer key research questions.

In this article, you will learn the 5 steps to perform EDA in Python using Pandas, NumPy, and Seaborn, with code examples you can follow along with today.

Before you start: This article connects directly to our article on 5 NumPy Functions Every Data Analyst Should Know and the 5 Best Python Libraries for Data Visualization. Reading those first will make this much easier to follow.

The Dataset We Will Use

Throughout this article, we will use a simple student performance dataset. It contains information about 10 students, including their names, scores in three subjects, attendance percentage, and study hours per week.

You can create it yourself by running this code at the top of your notebook:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Create a sample student dataset
data = {
    "Student": ["Ada", "Emeka", "Fatima", "Chidi", "Ngozi",
                 "Tunde", "Amara", "Bello", "Grace", "Uche"],
    "Maths": [85, 92, 78, None, 88, 95, 60, 73, 90, 55],
    "English": [76, 88, 82, 91, 70, 84, 65, None, 87, 72],
    "Science": [90, 85, 79, 88, 76, 92, 68, 80, 85, 62],
    "Attendance": [95, 98, 88, 75, 92, 99, 70, 85, 96, 65],
    "Study_Hours": [5, 7, 6, 4, 6, 8, 3, 5, 7, 2]
}

df = pd.DataFrame(data)
print(df)

This dataset has something important built in: two missing values, one in the Maths column and one in English. This is intentional, because real datasets very often have missing values, and EDA is how you find them.

Quick Summary: The 5 EDA Steps

Step | What You Do | Tools Used
1. Load and Inspect | Look at size, columns, and data types | df.head(), df.info(), df.shape
2. Check Data Quality | Find missing values and duplicates | df.isnull(), df.duplicated()
3. Summarise Numerically | Get statistics for each column | df.describe(), NumPy functions
4. Explore Distributions | See how values are spread | Histograms, value counts
5. Find Relationships | See how columns relate to each other | Correlation, scatter plots, heatmaps

Step 1: Load and Inspect Your Data

Why this step matters: Before you do anything else, you need to answer four basic questions about your dataset:

  • How many rows and columns does it have?
  • What are the column names?
  • What type of data is in each column?
  • What do the first few rows look like?

These questions sound simple, but skipping this step is how analysts end up wasting hours working on data they did not fully understand.

Code example:

# How many rows and columns?
print("Shape:", df.shape)
# Output: Shape: (10, 6)

# Column names and data types
# (df.info() prints its report directly, so it does not need print())
df.info()
# Output (abridged):
# RangeIndex: 10 entries, 0 to 9
# Data columns (total 6 columns):
#  #   Column       Non-Null Count  Dtype
# ---  ------       --------------  -----
#  0   Student      10 non-null     object
#  1   Maths        9 non-null      float64
#  2   English      9 non-null      float64
#  3   Science      10 non-null     int64
#  4   Attendance   10 non-null     int64
#  5   Study_Hours  10 non-null     int64

# Preview the first 5 rows
print(df.head())

# Preview the last 3 rows
print(df.tail(3))

What to look for in df.info():

Look at the "Non-Null Count" column. Any column where this number is less than your total row count has missing values. In our dataset, Maths shows 9 non-null values (out of 10 rows) and English also shows 9. Those are the two missing values we built in.

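Also check the "Dtype" column. If a column that should contain numbers shows up as "object" (the dtype Pandas typically uses for text), at least one value in that column is stored as text, and the column will need to be converted before analysis. Notice too that Maths and English are float64 rather than int64: a missing value is stored as NaN, which forces the column to a floating-point type.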
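
As a small illustration of the dtype check, the sketch below reads a tiny CSV from an in-memory string and converts a text column back to numbers. The CSV content and the "absent" entry are made up for this example and are not part of the article's dataset.

```python
import io
import pandas as pd

# Hypothetical CSV text: the "absent" entry forces the Maths column to be read as text
csv_text = """Student,Maths
Ada,85
Emeka,92
Chidi,absent
"""

raw = pd.read_csv(io.StringIO(csv_text))
maths_is_numeric_before = pd.api.types.is_numeric_dtype(raw["Maths"])
print("Numeric before conversion:", maths_is_numeric_before)

# errors="coerce" turns anything that cannot be parsed as a number into NaN
raw["Maths"] = pd.to_numeric(raw["Maths"], errors="coerce")
print("Dtype after conversion:", raw["Maths"].dtype)
print("Missing after conversion:", raw["Maths"].isna().sum())
```

The conversion trades one problem for another: the column becomes numeric, but the unparseable entry turns into a missing value that Step 2 will then pick up.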

Step 2: Check for Missing Values and Duplicates

Why this step matters: Missing values and duplicate rows are two of the most common data quality problems. If you ignore them, later calculations can be quietly skewed (Pandas skips missing values by default, and duplicated rows are counted more than once) or fail with an error.

Catching quality issues early saves you from discovering them after you have already shared your analysis with someone.

Code example (finding missing values):

# Count missing values in each column
print(df.isnull().sum())
# Output:
# Student        0
# Maths          1
# English        1
# Science        0
# Attendance     0
# Study_Hours    0

# See missing values as a percentage of total rows
missing_percent = (df.isnull().sum() / len(df)) * 100
print(missing_percent.round(1))
# Output:
# Student         0.0
# Maths          10.0
# English        10.0
# Science         0.0
# Attendance      0.0
# Study_Hours     0.0

Code example (finding duplicate rows):

# Check for duplicate rows
print("Duplicate rows:", df.duplicated().sum())
# Output: Duplicate rows: 0

# If duplicates exist, remove them like this:
# df = df.drop_duplicates()
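
Our dataset has no duplicates, so here is a minimal sketch on a made-up three-row frame (the `records` data is hypothetical) showing how you might inspect duplicates before dropping them.

```python
import pandas as pd

# Hypothetical records: the second "Ada" row is an exact copy of the first
records = pd.DataFrame({
    "Student": ["Ada", "Emeka", "Ada"],
    "Maths": [85, 92, 85],
})

# keep=False marks every copy, so you can inspect all rows involved
all_copies = records[records.duplicated(keep=False)]
print(all_copies)

# drop_duplicates keeps the first occurrence by default
deduped = records.drop_duplicates().reset_index(drop=True)
print(len(records), "->", len(deduped))
```

Looking at the copies first is a cautious habit: two identical rows are sometimes legitimate (for example, two students with the same scores once the name column is removed), so it helps to check before deleting.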

What to do after finding missing values:

You have three main options, and the right choice depends on your situation:

# Option 1: Drop rows with missing values (use when few rows are affected)
df_dropped = df.dropna()

# Option 2: Fill missing values with the column average (use for numerical data)
df_filled = df.fillna(df.mean(numeric_only=True))

# Option 3: Fill with a specific value
df_filled2 = df.fillna(0)

# For this article, we will fill with the column mean
df = df.fillna(df.mean(numeric_only=True))
print("Missing values remaining:", df.isnull().sum().sum())
# Output: Missing values remaining: 0

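Filling with the mean is a common, simple approach for numerical data. It replaces the missing value with something reasonable rather than removing the entire row. The trade-off is that the filled value sits exactly at the centre of the column, so it slightly understates the spread and can shift the median. The statistics in the following steps include the filled values.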
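
To see what a fill actually does, the sketch below uses the Maths column from this article and compares the mean and median of the known scores, then checks how the spread changes after a mean fill. Using the median as an alternative fill value is an option added here for comparison, not something the rest of the article relies on.

```python
import pandas as pd

# The Maths column from this article, including its one missing value
maths = pd.Series([85, 92, 78, None, 88, 95, 60, 73, 90, 55], name="Maths")

mean_value = maths.mean()      # NaN is skipped by default
median_value = maths.median()

print(f"Mean of known scores:   {mean_value:.1f}")
print(f"Median of known scores: {median_value:.1f}")

filled_mean = maths.fillna(mean_value)
filled_median = maths.fillna(median_value)

# A mean fill adds a value at the centre, so the spread shrinks slightly
print(f"Std before fill:     {maths.std():.1f}")
print(f"Std after mean fill: {filled_mean.std():.1f}")

# Output:
# Mean of known scores:   79.6
# Median of known scores: 85.0
# Std before fill:     14.3
# Std after mean fill: 13.5
```

The two candidate fill values differ by more than five points here, which is a reminder that the choice of fill can matter on a small dataset.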

Step 3: Summarise Your Data Numerically

Why this step matters: Now that your data is clean, you need to understand what the numbers actually look like. What is the highest score? The lowest? The average? Are the values tightly packed together or spread out widely?

This is where df.describe() and a few NumPy functions become very useful.

Code example:

# Get statistical summary of all numerical columns
print(df.describe().round(1))

# Output:
#        Maths  English  Science  Attendance  Study_Hours
# count   10.0     10.0     10.0        10.0         10.0
# mean    79.6     79.4     80.5        86.3          5.3
# std     13.5      8.5      9.7        12.3          1.9
# min     55.0     65.0     62.0        65.0          2.0
# 25%     74.2     73.0     76.8        77.5          4.2
# 50%     82.3     80.7     82.5        90.0          5.5
# 75%     89.5     86.2     87.2        95.8          6.8
# max     95.0     91.0     92.0        99.0          8.0

Reading the output:

  • count tells you how many valid (non-missing) values are in each column
  • mean is the average, useful but sensitive to extreme values
  • std (standard deviation) tells you how spread out the values are. A high std means wide variation; a low std means values cluster near the mean
  • min and max show the range
  • 25%, 50%, and 75% are the quartiles; the 50% value is the median

Using NumPy for a deeper look:

import numpy as np

# Focus on Maths scores (after the mean fill from Step 2)
maths_scores = df["Maths"].values

print(f"Highest score:  {np.max(maths_scores):.0f}")
print(f"Lowest score:   {np.min(maths_scores):.0f}")
print(f"Average score:  {np.mean(maths_scores):.1f}")
print(f"Median score:   {np.median(maths_scores):.1f}")
# np.std defaults to the population formula (ddof=0);
# ddof=1 gives the sample standard deviation that df.describe() reports
print(f"Std deviation:  {np.std(maths_scores, ddof=1):.1f}")

# Output:
# Highest score:  95
# Lowest score:   55
# Average score:  79.6
# Median score:   82.3
# Std deviation:  13.5

Notice that the average (79.6) is lower than the median (82.3). This suggests the data is slightly left-skewed: a few low scores are pulling the average down. In a real analysis, this would prompt you to investigate the low-scoring students further.
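
If you want a number for that skew rather than an eyeball comparison, one option is the skew() method on a Pandas Series. The sketch below rebuilds the mean-filled Maths column by hand (716 / 9 is the mean of the nine known scores) so that it runs on its own; a negative result points to a longer tail on the low side.

```python
import pandas as pd

# Maths scores after the mean fill used in this article
# (716 / 9 is the mean of the nine known scores)
maths = pd.Series([85, 92, 78, 716 / 9, 88, 95, 60, 73, 90, 55])

gap = maths.mean() - maths.median()
skewness = maths.skew()

print(f"Mean minus median: {gap:.1f}")
print(f"Skewness:          {skewness:.2f}")
```

With only 10 values the skewness estimate is noisy, so treat the sign as the useful part and the exact figure as a rough indication.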

Step 4: Explore Distributions

Why this step matters: Numbers in a table do not tell the whole story. A distribution chart shows you where most values fall, whether there are any extreme values (outliers), and whether the data is balanced or heavily skewed to one side.

This is where your data visualization skills come in.

Code example (histogram):

import matplotlib.pyplot as plt
import seaborn as sns

# Plot distribution of Maths scores
plt.figure(figsize=(8, 5))
sns.histplot(df["Maths"], bins=6, color="steelblue", kde=True)
plt.title("Distribution of Maths Scores")
plt.xlabel("Score")
plt.ylabel("Number of Students")
plt.show()

The kde=True parameter adds a smooth curve over the histogram that shows the overall shape of the distribution. If the curve is roughly bell-shaped, the data may be close to normal. If it leans heavily to one side, the data is skewed. With only 10 rows, treat the curve as a rough guide rather than firm evidence.
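
A histogram is just a count of values per bin, and you can get the same information as a table with value counts. The sketch below bins the Attendance column from this article; the band edges and labels are arbitrary choices made for illustration.

```python
import pandas as pd

# Attendance column from this article's dataset
attendance = pd.Series([95, 98, 88, 75, 92, 99, 70, 85, 96, 65], name="Attendance")

# Hypothetical bands chosen for illustration;
# right=False makes each bin include its lower edge and exclude its upper edge
bands = pd.cut(
    attendance,
    bins=[60, 70, 80, 90, 100],
    right=False,
    labels=["60-69", "70-79", "80-89", "90-99"],
)

# sort_index() lists the bands in order instead of by frequency
band_counts = bands.value_counts().sort_index()
print(band_counts)
```

Half of the students fall in the top band, which is the same left-leaning shape a histogram of this column would show.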

Code example (checking all score columns at once):

# Plot distributions for all three subjects side by side
fig, axes = plt.subplots(1, 3, figsize=(14, 5))

subjects = ["Maths", "English", "Science"]
colors = ["steelblue", "coral", "mediumseagreen"]

for ax, subject, color in zip(axes, subjects, colors):
    sns.histplot(df[subject], bins=5, color=color, kde=True, ax=ax)
    ax.set_title(f"Distribution of {subject} Scores")
    ax.set_xlabel("Score")

plt.tight_layout()
plt.show()

Code example (box plot to spot outliers):

# Box plot to check for outliers across all subjects
plt.figure(figsize=(8, 5))
sns.boxplot(data=df[["Maths", "English", "Science"]], palette="Set2")
plt.title("Score Distribution by Subject")
plt.ylabel("Score")
plt.show()

In a box plot, the box shows where the middle 50% of values fall. The line inside the box is the median. By default, the whiskers extend to the most extreme values that lie within 1.5 times the interquartile range (IQR) of the box, and any dots beyond the whiskers are flagged as outliers. These values sit far from the rest of the data and often deserve a closer look.
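
The outlier rule a box plot draws can also be checked in code. The sketch below applies the conventional 1.5 × IQR fences to the Science column from this article; the variable names are illustrative, and the fences are a rule of thumb rather than a formal test.

```python
import pandas as pd

# Science scores from this article's dataset (this column has no missing values)
science = pd.Series([90, 85, 79, 88, 76, 92, 68, 80, 85, 62], name="Science")

q1 = science.quantile(0.25)
q3 = science.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = science[(science < lower_fence) | (science > upper_fence)]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: {lower_fence} to {upper_fence}")
print("Outliers:", outliers.tolist())

# Output:
# Q1=76.75, Q3=87.25, IQR=10.5
# Fences: 61.0 to 103.0
# Outliers: []
```

The lowest Science score (62) sits just inside the lower fence, so nothing is flagged, but it is close enough that a slightly lower score would be.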

Step 5: Find Relationships Between Variables

Why this step matters: The most valuable insights in data often come from understanding how two or more columns relate to each other. Does studying more lead to higher scores? Do students with better attendance perform better? This step answers those questions.

Code example (correlation matrix):

# Calculate how strongly each pair of columns is related
correlation = df[["Maths", "English", "Science", "Attendance", "Study_Hours"]].corr()
print(correlation.round(2))

Correlation values range from -1 to 1:

  • A value close to 1 means the two columns move together — when one goes up, the other goes up
  • A value close to -1 means they move in opposite directions
  • A value close to 0 means there is no clear linear relationship
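
To make the number less of a black box, here is a sketch that computes the Pearson correlation by hand for two complete columns from this article (Study_Hours and Attendance) and compares it with what Pandas returns. The manual formula is the standard Pearson definition; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Two complete columns from this article's dataset
hours = pd.Series([5, 7, 6, 4, 6, 8, 3, 5, 7, 2], name="Study_Hours")
attendance = pd.Series([95, 98, 88, 75, 92, 99, 70, 85, 96, 65], name="Attendance")

# Pearson r: how the two columns vary together, scaled by how much each varies alone
hours_dev = hours - hours.mean()
attendance_dev = attendance - attendance.mean()
r_manual = (hours_dev * attendance_dev).sum() / np.sqrt(
    (hours_dev ** 2).sum() * (attendance_dev ** 2).sum()
)

r_pandas = hours.corr(attendance)  # Pearson is the default method
print(f"Manual r: {r_manual:.3f}")
print(f"pandas r: {r_pandas:.3f}")
```

Both lines print the same value, and it is strongly positive: students who attend more also tend to report more study hours.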

Visualising the correlation with a heatmap:

plt.figure(figsize=(8, 6))
sns.heatmap(
    correlation,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    linewidths=0.5
)
plt.title("Correlation Between Student Performance Variables")
plt.show()

The heatmap makes the correlation matrix easy to read at a glance. Dark red cells show strong positive relationships. Dark blue cells show strong negative relationships. Light cells show little to no relationship.

Code example (scatter plot to explore one relationship):

# Do students who study more get higher Maths scores?
plt.figure(figsize=(7, 5))
sns.scatterplot(
    data=df,
    x="Study_Hours",
    y="Maths",
    hue="Attendance",
    size="Attendance",
    palette="viridis",
    sizes=(50, 200)
)
plt.title("Study Hours vs Maths Score")
plt.xlabel("Weekly Study Hours")
plt.ylabel("Maths Score")
plt.show()

This chart shows each student as a dot. The position tells you their study hours and Maths score. The colour and size tell you their attendance. In one chart, you can see the relationship between three variables at once.

If you see a clear upward trend from left to right, it suggests that more study hours are associated with higher scores. On its own, that pattern does not prove that studying more causes the higher scores.
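
One way to put a number on an upward trend is to fit a straight line through the points. The sketch below does this with np.polyfit, using Science instead of Maths so that no filled-in value is involved; that swap is a choice made for this example.

```python
import numpy as np

# Study hours and Science scores from this article's dataset (both columns are complete)
study_hours = np.array([5, 7, 6, 4, 6, 8, 3, 5, 7, 2])
science = np.array([90, 85, 79, 88, 76, 92, 68, 80, 85, 62])

# Fit a straight line: science is approximately slope * study_hours + intercept
slope, intercept = np.polyfit(study_hours, science, deg=1)
r = np.corrcoef(study_hours, science)[0, 1]

print(f"Slope: {slope:.2f} points per extra weekly study hour")
print(f"Pearson r: {r:.2f}")
```

A positive slope matches what the scatter plot shows, but with 10 students the estimate is rough and describes an association, not a cause.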

Putting the Full EDA Workflow Together

Here is the complete EDA workflow in one place so you can save it and reuse it on any dataset:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

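# ── STEP 1: LOAD AND INSPECT ────
# df = pd.read_csv("your_file.csv")  # placeholder path: load your own data here
print("Shape:", df.shape)
df.info()  # prints its report directly, so print() is not needed
print(df.head())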

# ── STEP 2: CHECK DATA QUALITY ────
print("Missing values:\n", df.isnull().sum())
print("Duplicate rows:", df.duplicated().sum())
df = df.fillna(df.mean(numeric_only=True))

# ── STEP 3: NUMERICAL SUMMARY ────
print(df.describe().round(1))

# ── STEP 4: DISTRIBUTIONS ────────
df[["Maths", "English", "Science"]].hist(figsize=(12, 4), bins=5, color="steelblue")
plt.tight_layout()
plt.show()

# ── STEP 5: RELATIONSHIPS ──────
correlation = df[["Maths", "English", "Science", "Attendance", "Study_Hours"]].corr()
sns.heatmap(correlation, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()

Copy this template, swap in your own dataset and column names, and run it. You will have a solid first look at any data in under five minutes.
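
If you would rather not edit column names each time, one possible variation is to wrap the non-plotting checks in a function that picks out the numeric columns automatically. The quick_eda name and the shape of its return value are illustrative choices, not an established API.

```python
import pandas as pd

def quick_eda(frame: pd.DataFrame) -> dict:
    """Collect first-look facts about a DataFrame without modifying it."""
    numeric = frame.select_dtypes(include="number")
    return {
        "shape": frame.shape,
        "missing": frame.isnull().sum().to_dict(),
        "duplicates": int(frame.duplicated().sum()),
        "summary": numeric.describe().round(1),
        "correlation": numeric.corr().round(2),
    }

# Small made-up frame to show the call
sample = pd.DataFrame({
    "Student": ["Ada", "Emeka", "Fatima"],
    "Maths": [85, 92, None],
    "Study_Hours": [5, 7, 6],
})

report = quick_eda(sample)
print("Shape:", report["shape"])
print("Missing:", report["missing"])
print("Duplicate rows:", report["duplicates"])
```

Returning a dictionary instead of printing everything keeps the function reusable: you can print the parts you need or pass them on to later steps.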

Conclusions

EDA is not a one-time task you complete and move on from. It is a mindset. The habit of asking "what does this data actually look like before I start working with it?" is what separates reliable analysts from those who build conclusions on faulty foundations.

The five steps covered in this article are:

  1. Load and inspect — understand the size and structure of your data
  2. Check data quality — find missing values and duplicates early
  3. Summarise numerically — use statistics to understand your columns
  4. Explore distributions — visualise how values are spread
  5. Find relationships — discover how variables connect to each other

Once you are comfortable with these steps, you will be ready to move into building machine learning models, creating dashboards, and performing deeper statistical analysis.

Next, put your EDA skills to work visually. Revisit our guide on the 5 best Python data visualization libraries to learn how to present your findings in polished charts. And if you want to go deeper on numerical operations, check out our guide on 5 NumPy functions every data analyst should know.

How to Install the Libraries Used in This Article

pip install pandas numpy seaborn matplotlib

Published on JacobIsah Programming Hub | enemzy.blogspot.com
