How to Filter, Group, and Summarise Data in Python with Pandas
This Is Where Python Starts Feeling Powerful
Reading data. Cleaning data. Those are setup steps. This is where data analysis actually begins.
Filtering, grouping, and summarising data is how you go from a raw spreadsheet to real insights. It's how you answer questions like: Which states had the highest average scores? How many students enrolled per month? What's the pass rate for each course?
These are the questions employers ask. These are the questions clients pay for. And in this tutorial, you'll learn exactly how to answer them with Python.
The Dataset We'll Use
We'll continue with the cleaned student data from the last tutorial, How to Clean Messy Data in Python with Pandas (Beginner Tutorial). If you don't have it, create a file called students.csv with this content:
student_name,state,course,score,date_enrolled
Adeola Bello,Lagos,Data Analysis,87,2024-01-15
Emeka Okonkwo,Anambra,Python Programming,83,2024-02-20
Fatima Aliyu,Kano,Data Analysis,78,2024-03-20
Chidi Nwosu,Rivers,Python Programming,85,2024-04-10
Ngozi Eze,Enugu,Data Analysis,91,2024-05-05
Bola Adeyemi,Lagos,Python Programming,72,2024-05-18
Uche Okeke,Anambra,Data Analysis,65,2024-06-02
Amina Suleiman,Kano,Python Programming,88,2024-06-14
Tunde Lawal,Lagos,Data Analysis,79,2024-07-01
Chioma Ibe,Imo,Python Programming,94,2024-07-22
Load it with:
import pandas as pd
df = pd.read_csv("students.csv")
print(df)
Part 1: Filtering — Asking "Show Me Only…"
Filtering means selecting only the rows that meet a condition.
Example 1: Show only students from Lagos
lagos_students = df[df["state"] == "Lagos"]
print(lagos_students)
Example 2: Show only students who scored above 80
high_scorers = df[df["score"] > 80]
print(high_scorers)
Example 3: Multiple conditions — students from Lagos who scored above 80
lagos_high = df[(df["state"] == "Lagos") & (df["score"] > 80)]
print(lagos_high)
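Note the & symbol (and) between conditions, and that each condition is wrapped in its own parentheses. The parentheses are required because & binds more tightly than == and >, so leaving them out usually raises an error. Python's plain and keyword does not work here either, because pandas needs to compare row by row. This is easy to forget when starting out.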
You can also use | for "or":
# Students from Lagos OR Kano
df[(df["state"] == "Lagos") | (df["state"] == "Kano")]
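If it helps to see what is happening under the hood, here is a small sketch. It rebuilds four rows of the student data in memory (so it runs without students.csv) and prints the True/False mask that a condition produces. It also shows isin(), a pandas method not used elsewhere in this tutorial, as one possible shorthand for several "or" conditions on the same column.

```python
import pandas as pd

# A small slice of the tutorial's data, built in memory so the block runs on its own
df = pd.DataFrame({
    "student_name": ["Adeola Bello", "Emeka Okonkwo", "Fatima Aliyu", "Bola Adeyemi"],
    "state": ["Lagos", "Anambra", "Kano", "Lagos"],
    "score": [87, 83, 78, 72],
})

# The condition on its own is a Series of True/False values, one per row
mask = df["state"] == "Lagos"
print(mask.tolist())  # [True, False, False, True]

# Passing the mask to df[...] keeps only the rows marked True
print(df[mask])

# isin() checks each value against a list, like chaining several | conditions
lagos_or_kano = df[df["state"].isin(["Lagos", "Kano"])]
print(lagos_or_kano["student_name"].tolist())
```

Once the list of states grows past two or three, isin() tends to be easier to read than a long chain of | conditions.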
Part 2: Grouping — Asking "For Each… Show Me…"
Grouping lets you split your data into categories and calculate something for each category. This is one of the most powerful tools in data analysis.
Example 1: Average score by course
avg_by_course = df.groupby("course")["score"].mean()
print(avg_by_course)
Example 2: Number of students enrolled per state
students_per_state = df.groupby("state")["student_name"].count()
print(students_per_state)
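The same counting pattern can answer the "how many students enrolled per month?" question from the introduction. The sketch below is one way to do it, not the only one. It assumes the date_enrolled column is parsed as dates (the parse_dates argument and the to_period("M") step are additions that the tutorial's own loading code does not use), and it reads the data from an in-memory string so it runs without students.csv.

```python
import io
import pandas as pd

# The tutorial's CSV content, held in a string so the block is self-contained
csv_text = """student_name,state,course,score,date_enrolled
Adeola Bello,Lagos,Data Analysis,87,2024-01-15
Emeka Okonkwo,Anambra,Python Programming,83,2024-02-20
Fatima Aliyu,Kano,Data Analysis,78,2024-03-20
Chidi Nwosu,Rivers,Python Programming,85,2024-04-10
Ngozi Eze,Enugu,Data Analysis,91,2024-05-05
Bola Adeyemi,Lagos,Python Programming,72,2024-05-18
Uche Okeke,Anambra,Data Analysis,65,2024-06-02
Amina Suleiman,Kano,Python Programming,88,2024-06-14
Tunde Lawal,Lagos,Data Analysis,79,2024-07-01
Chioma Ibe,Imo,Python Programming,94,2024-07-22
"""

# parse_dates turns the text dates into real datetime values
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date_enrolled"])

# Reduce each date to its year and month, then count students in each month
enrol_month = df["date_enrolled"].dt.to_period("M")
students_per_month = df.groupby(enrol_month)["student_name"].count()
print(students_per_month)
```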
Example 3: Multiple statistics at once
summary = df.groupby("course")["score"].agg(["mean", "min", "max", "count"])
print(summary)
The agg() method lets you apply several aggregation functions in one step. Here we get the mean, minimum, maximum, and count of scores for each course, with one column per statistic.
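If you would like friendlier column names than "mean" or "count", pandas also supports named aggregation. The sketch below assumes the same ten students as the dataset above, rebuilt in memory; the column names avg_score, lowest, highest, and students are made up for illustration.

```python
import pandas as pd

# The tutorial's ten students, with only the columns this example needs
df = pd.DataFrame({
    "student_name": [
        "Adeola Bello", "Emeka Okonkwo", "Fatima Aliyu", "Chidi Nwosu", "Ngozi Eze",
        "Bola Adeyemi", "Uche Okeke", "Amina Suleiman", "Tunde Lawal", "Chioma Ibe",
    ],
    "course": ["Data Analysis", "Python Programming"] * 5,
    "score": [87, 83, 78, 85, 91, 72, 65, 88, 79, 94],
})

# Each keyword is an output column name; each tuple is (source column, function)
summary = df.groupby("course").agg(
    avg_score=("score", "mean"),
    lowest=("score", "min"),
    highest=("score", "max"),
    students=("student_name", "count"),
)
print(summary.round(1))
```

This form is handy when the summary table is going into a report, because the headers already say what each number means.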
Part 3: Sorting — Putting Results in Order
After grouping, you often want to sort your results.
# Top scorers — highest score first
df.sort_values("score", ascending=False).head()
# Sort average scores from highest to lowest
avg_by_course.sort_values(ascending=False)
The .head() at the end of the first example shows only the first 5 rows of the sorted result. You can put a number inside, like .head(3), for the top 3.
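Two variations on sorting that may be useful, sketched on the same ten students rebuilt in memory. Sorting by a list of columns and using nlargest() are both additions to what the tutorial covers, shown here as options rather than the required approach.

```python
import pandas as pd

# The tutorial's ten students, with only the columns this example needs
df = pd.DataFrame({
    "student_name": [
        "Adeola Bello", "Emeka Okonkwo", "Fatima Aliyu", "Chidi Nwosu", "Ngozi Eze",
        "Bola Adeyemi", "Uche Okeke", "Amina Suleiman", "Tunde Lawal", "Chioma Ibe",
    ],
    "course": ["Data Analysis", "Python Programming"] * 5,
    "score": [87, 83, 78, 85, 91, 72, 65, 88, 79, 94],
})

# Sort by course A to Z, then by score from highest to lowest within each course
ranked = df.sort_values(["course", "score"], ascending=[True, False])
print(ranked)

# nlargest() is a shortcut for sort_values(..., ascending=False).head(n)
top3 = df.nlargest(3, "score")
print(top3["student_name"].tolist())
```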
Part 4: Summarising the Whole Dataset
Sometimes you want a quick overview of numbers across the entire dataset, not broken down by category.
# Overall statistics for all numeric columns
print(df.describe())
# Total number of students
print("Total students:", len(df))
# Average score across all students
print("Overall average:", df["score"].mean().round(2))
# Pass rate (assuming 70 is the pass mark)
pass_rate = (df["score"] >= 70).mean() * 100
print(f"Pass rate: {pass_rate:.1f}%")
That last line uses a neat pandas trick: when you write df["score"] >= 70, it returns True or False for each row. Taking the .mean() of True/False values gives you the proportion that are True — which is exactly the pass rate.
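To see the trick in isolation, and to extend it to the "pass rate for each course" question from the introduction, here is a minimal sketch. It assumes the same ten students and the same pass mark of 70; grouping the True/False values by course is one possible approach, not the only one.

```python
import pandas as pd

# True counts as 1 and False as 0, so the mean is the share of True values
print(pd.Series([True, False, True, True]).mean())  # 0.75

# The tutorial's ten students, with only the columns this example needs
df = pd.DataFrame({
    "course": ["Data Analysis", "Python Programming"] * 5,
    "score": [87, 83, 78, 85, 91, 72, 65, 88, 79, 94],
})

# One True/False value per student: did they reach the assumed pass mark of 70?
passed = df["score"] >= 70

# Group the True/False values by course and average each group
pass_rate_by_course = passed.groupby(df["course"]).mean() * 100
print(pass_rate_by_course.round(1))
```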
Putting It All Together: A Mini Analysis
Here's a short analysis that uses everything you've learned:
import pandas as pd
df = pd.read_csv("students.csv")
print("=== OVERALL SUMMARY ===")
print(f"Total students: {len(df)}")
print(f"Average score: {df['score'].mean():.1f}")
print(f"Pass rate: {(df['score'] >= 70).mean() * 100:.1f}%")
print("\n=== PERFORMANCE BY COURSE ===")
print(df.groupby("course")["score"].agg(["mean", "count"]).round(1))
print("\n=== TOP 3 STATES BY AVERAGE SCORE ===")
top_states = df.groupby("state")["score"].mean().sort_values(ascending=False).head(3)
print(top_states.round(1))
print("\n=== STUDENTS WHO NEED SUPPORT (score below 75) ===")
struggling = df[df["score"] < 75][["student_name", "state", "course", "score"]]
print(struggling)
Run this in Jupyter and look at what it produces. This is real data analysis — not just "running code," but answering specific questions about a group of people using data.
Practice Challenge
Using the student dataset (or any dataset you have), answer these three questions with code:
- Which course has the higher average score?
- Which state has the most enrolled students?
- What percentage of students scored 85 or above?
Write your answers — and your code — in the comments below. I read every one.
What's Next in the Series
You now know how to load data, clean it, and analyse it. The next step is visualising your findings — turning numbers into charts that tell a story. That's coming up in the next post: How to Create Charts and Graphs in Python with Matplotlib and Seaborn.
If you're ready to go deeper with structured exercises, the Python Exercise Library has 50+ hands-on data tasks built around real Nigerian contexts — with solutions included. Available now on Selar.
Jacob Isah teaches practical Python and data skills at JacobIsah Programming Hub. From beginner to builder — one real skill at a time.
