5 NumPy Functions Every Data Analyst Should Know in 2026 (With Code Examples)
Before you can truly work with data in Python, you need to understand the tool that powers almost everything underneath: NumPy.
NumPy, short for 'Numerical Python', is the foundation on which Pandas, Matplotlib, Scikit-learn, and many other Python data science libraries are built. When you process a column in Pandas or create a chart in Matplotlib, NumPy is quietly working behind the scenes.
Yet many beginners skip NumPy entirely and jump straight into Pandas. This works for a while, but eventually you hit a wall. Calculations become slow. Errors become confusing. Certain data operations seem impossible.
Learning even a handful of key NumPy functions will make you a faster and more confident data analyst.
In this article, you will learn the 5 most important NumPy functions that data analysts use regularly, with simple examples you can practice today.
Prerequisite: Basic Python knowledge is all you need. If you want to see how NumPy connects to visualization, check out our article on the 5 best Python libraries for data visualization in 2026.
What Is NumPy and Why Does It Matter?
A NumPy array is like a Python list, but much faster and more powerful. While a Python list can hold different types of data at once (numbers, strings, booleans), a NumPy array holds one type of data. This makes it extremely efficient for numerical calculations.
Here is a quick comparison:
import numpy as np
# A regular Python list
python_list = [10, 20, 30, 40, 50]
# A NumPy array
numpy_array = np.array([10, 20, 30, 40, 50])
# Multiply every element by 2
print([x * 2 for x in python_list]) # The Python way (slower)
print(numpy_array * 2) # The NumPy way (faster)
Both give you [20, 40, 60, 80, 100], but NumPy does it significantly faster when your dataset is large. For small datasets the difference is invisible. For datasets with millions of rows, it is massive.
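You can check the speed claim yourself. The sketch below uses the standard-library timeit module on one million values; the dataset size, the number of runs, and the variable names are illustrative choices, and the exact timings will differ from machine to machine.

```python
import timeit

import numpy as np

# One million values: large enough for the difference to show (size is an arbitrary choice)
n = 1_000_000
python_list = list(range(n))
numpy_array = np.arange(n)

# Time 5 runs of each approach
list_time = timeit.timeit(lambda: [x * 2 for x in python_list], number=5)
numpy_time = timeit.timeit(lambda: numpy_array * 2, number=5)

print(f"List comprehension: {list_time:.3f} s")
print(f"NumPy array:        {numpy_time:.3f} s")

# Both approaches produce the same values
doubled_list = [x * 2 for x in python_list]
doubled_array = numpy_array * 2
print(doubled_list[:5], doubled_array[:5])
```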
Now let us get into the five functions you need to know.
Quick Summary Table
| Function | What It Does | Common Use Case |
|---|---|---|
| np.array() | Creates a NumPy array | Converting lists or data into arrays |
| np.mean(), np.median(), np.std() | Calculates statistics | Summarising numerical columns |
| np.where() | Applies conditions to arrays | Categorising or flagging data |
| np.reshape() | Changes the shape of an array | Preparing data for machine learning |
| np.unique() | Finds distinct values in an array | Counting categories in your data |
np.array(): The Building Block of NumPy
What it does:
np.array() converts a Python list (or another sequence, such as a tuple or a list of lists) into a NumPy array. It is usually the first NumPy function you reach for.
Understanding this function well means you understand the core object that all other NumPy operations work on.
Code example:
import numpy as np
# Create a 1D array (like a single column of data)
scores = np.array([85, 90, 78, 92, 88, 76, 95])
print(scores)
# Output: [85 90 78 92 88 76 95]
# Create a 2D array (like a table with rows and columns)
student_data = np.array([
    [1, 85, 90],
    [2, 78, 88],
    [3, 92, 95]
])
print(student_data)
# Output:
# [[ 1 85 90]
#  [ 2 78 88]
#  [ 3 92 95]]
# Check the shape of your array
print(student_data.shape)
# Output: (3, 3) — 3 rows, 3 columns
Why the shape matters:
The .shape attribute tells you how many rows and columns your array has. You will use this constantly when preparing data for machine learning models, which are very strict about the shape of input data.
When to use it:
Use np.array() whenever you need to convert data from Python lists into a format that NumPy, Pandas, Matplotlib, or Scikit-learn can work with efficiently.
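Because an array holds a single data type, np.array() picks one type that can represent every value you give it. The short sketch below shows that behaviour; the variable names are only for illustration.

```python
import numpy as np

whole_numbers = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, 3])        # one float turns every value into a float
numbers_and_text = np.array([1, "two", 3])   # one string turns every value into a string

print(whole_numbers.dtype.kind)     # 'i' (integer)
print(mixed_numbers.dtype)          # float64
print(numbers_and_text.dtype.kind)  # 'U' (Unicode string)
print(numbers_and_text)             # ['1' 'two' '3']
```

Checking .dtype right after creating an array is a quick way to catch numbers that were accidentally stored as text.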
np.mean(), np.median(), np.std(): Descriptive Statistics in One Line
What they do: These three functions let you calculate the most common statistical summaries of your data instantly:
- np.mean(): the average value
- np.median(): the middle value (resistant to outliers)
- np.std(): the standard deviation (how spread out the values are)
If you have worked with Excel formulas like AVERAGE() or MEDIAN(), these do the same thing but on large datasets in milliseconds.
Code example:
import numpy as np
# Monthly revenue data for a business (in thousands of Naira)
revenue = np.array([520, 610, 480, 730, 990, 450, 1200, 820, 770, 640, 580, 910])
average_revenue = np.mean(revenue)
median_revenue = np.median(revenue)
std_revenue = np.std(revenue)
print(f"Average monthly revenue: ₦{average_revenue:.0f}k")
print(f"Median monthly revenue: ₦{median_revenue:.0f}k")
print(f"Standard deviation: ₦{std_revenue:.0f}k")
# Output:
# Average monthly revenue: ₦725k
# Median monthly revenue: ₦685k
# Standard deviation: ₦216k
Mean vs Median: which one should you use?
This is a common question for beginners. The mean is great for data that is evenly spread out. But if your data has extreme values (called outliers), the mean gets pulled towards those extremes and gives a misleading picture.
The median is more honest in that situation. For example, if you are analysing salaries in a company where one executive earns significantly more than everyone else, the median salary is the more accurate reflection of what most employees earn.
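To make that concrete, here is a small sketch with made-up salary figures (in thousands). The numbers are invented purely for illustration.

```python
import numpy as np

# Hypothetical monthly salaries: nine staff members and one executive
salaries = np.array([150, 160, 170, 180, 190, 200, 210, 220, 230, 2500])

print(f"Mean salary:   {np.mean(salaries):.0f}k")
print(f"Median salary: {np.median(salaries):.0f}k")
# Output:
# Mean salary:   421k
# Median salary: 195k
```

A single large salary more than doubles the mean, while the median stays close to what a typical employee earns.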
When to use these functions: Use them during your first look at any new dataset to quickly understand the range and distribution of numerical columns before you start deeper analysis.
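Two optional arguments are worth knowing early. The axis argument summarises a 2D array column by column or row by row. The ddof argument switches np.std() from the population formula (its default) to the sample formula, which is what Pandas uses by default. A minimal sketch with made-up scores:

```python
import numpy as np

# Hypothetical test scores: each row is a student, each column is a subject
scores = np.array([
    [80, 70],
    [90, 60],
    [70, 80],
])

print(np.mean(scores, axis=0))   # average per column: [80. 70.]
print(np.mean(scores, axis=1))   # average per row:    [75. 75. 75.]

first_subject = scores[:, 0]
population_std = np.std(first_subject)        # divides by n (ddof=0, the default)
sample_std = np.std(first_subject, ddof=1)    # divides by n - 1
print(f"{population_std:.2f}")   # 8.16
print(f"{sample_std:.2f}")       # 10.00
```

If your NumPy result does not match a spreadsheet or Pandas, a different ddof is a common reason.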
np.where(): The "IF" Statement of NumPy
What it does:
np.where() checks a condition across every element in an array and returns one value if the condition is true and another value if it is false.
If you have ever written an IF() formula in Excel, this is the NumPy equivalent but it runs across thousands of rows in an instant.
The syntax:
np.where(condition, value_if_true, value_if_false)
Code example:
import numpy as np
# Student exam scores
scores = np.array([45, 72, 58, 88, 34, 91, 65, 50, 79, 42])
# Label each score as Pass or Fail (passing mark is 50)
results = np.where(scores >= 50, "Pass", "Fail")
print(results)
# Output: ['Fail' 'Pass' 'Pass' 'Pass' 'Fail' 'Pass' 'Pass' 'Pass' 'Pass' 'Fail']
# Count how many students passed
passed = np.sum(scores >= 50)
print(f"{passed} out of {len(scores)} students passed")
# Output: 7 out of 10 students passed
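np.where() also works with the condition alone. In that form it returns the positions where the condition is true, which is handy when you need to look up the matching rows. A short sketch reusing the same scores:

```python
import numpy as np

scores = np.array([45, 72, 58, 88, 34, 91, 65, 50, 79, 42])

# With only a condition, np.where() returns a tuple of index arrays (one per dimension)
failed_positions = np.where(scores < 50)[0]
print(failed_positions)          # [0 4 9]
print(scores[failed_positions])  # [45 34 42]
```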
A more practical example with multiple conditions:
import numpy as np
# Product stock levels
stock = np.array([120, 5, 80, 2, 45, 300, 18, 0])
# Categorise stock as: Out of Stock, Critical, Low, or OK
status = np.where(stock == 0, "Out of Stock",
         np.where(stock < 10, "Critical",
         np.where(stock < 50, "Low", "OK")))
print(status)
# Output: ['OK' 'Critical' 'OK' 'Critical' 'Low' 'OK' 'Low' 'Out of Stock']
This is the kind of logic you would normally write as a loop containing a chain of if/elif/else statements. NumPy handles it in a single expression.
When to use np.where():
Use it whenever you need to label, categorise, or flag values in a column based on a condition. It replaces slow Python loops and is much cleaner than nested if statements.
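Once you have more than two or three categories, nested np.where() calls become hard to read. NumPy's np.select() is one alternative: it takes a list of conditions and a matching list of labels, and the first condition that is true decides the label. A sketch using the same stock levels:

```python
import numpy as np

stock = np.array([120, 5, 80, 2, 45, 300, 18, 0])

# Conditions are checked in order; the first match decides the label
conditions = [stock == 0, stock < 10, stock < 50]
labels = ["Out of Stock", "Critical", "Low"]

# default is the label used when none of the conditions is true
status = np.select(conditions, labels, default="OK")
print(status)
# Output: ['OK' 'Critical' 'OK' 'Critical' 'Low' 'OK' 'Low' 'Out of Stock']
```

The result matches the nested version, but adding a new category only means adding one condition and one label.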
np.reshape(): Change the Shape of Your Data
What it does:
np.reshape() changes how your array is organised without changing the actual data inside it. You are simply rearranging the same values into a different structure.
This might sound abstract, but it has a very practical use: machine learning models are strict about the shape of the data you feed them. A lot of beginner errors in machine learning come down to a shape mismatch, and np.reshape() is the fix.
Code example:
import numpy as np
# A flat array of 12 values
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
print("Original shape:", data.shape)
# Output: Original shape: (12,)
# Reshape into 3 rows and 4 columns
reshaped = data.reshape(3, 4)
print("Reshaped (3x4):\n", reshaped)
# Output:
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
# Reshape into 4 rows and 3 columns
reshaped2 = data.reshape(4, 3)
print("Reshaped (4x3):\n", reshaped2)
# Output:
# [[ 1  2  3]
#  [ 4  5  6]
#  [ 7  8  9]
#  [10 11 12]]
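One detail worth knowing: for a simple array like this one, reshape() returns a view of the same data rather than a copy, so changing a value through the reshaped array also changes the original. In the sketch below, np.shares_memory() is used only to confirm that behaviour.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
reshaped = data.reshape(3, 4)

# Both names point at the same underlying values
print(np.shares_memory(data, reshaped))  # True

# Editing the reshaped array is visible in the original
reshaped[0, 0] = 100
print(data[0])  # 100

# Call .copy() when you need an independent array
independent = data.reshape(3, 4).copy()
independent[0, 0] = -1
print(data[0])  # still 100
```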
A common real-world use case:
import numpy as np
# A single feature column of house prices (flat, 1D array)
house_prices = np.array([250000, 320000, 180000, 410000, 290000])
print("Shape before reshape:", house_prices.shape)
# Output: Shape before reshape: (5,)
# Scikit-learn models expect features as a 2D array: (n_samples, n_features)
house_prices_2d = house_prices.reshape(-1, 1)
print("Shape after reshape:", house_prices_2d.shape)
# Output: Shape after reshape: (5, 1)
The -1 in reshape(-1, 1) tells NumPy to figure out the number of rows automatically. This is a shortcut you will see constantly in data science code.
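The -1 placeholder works for any one dimension, as long as the total number of values divides evenly. If it does not, NumPy raises a ValueError, which is the shape error you will often meet in practice. A brief sketch:

```python
import numpy as np

data = np.arange(12)  # the values 0 to 11

print(data.reshape(-1, 4).shape)  # (3, 4): NumPy works out the 3 rows
print(data.reshape(2, -1).shape)  # (2, 6): NumPy works out the 6 columns

# 12 values cannot be split into rows of 5, so this fails
try:
    data.reshape(-1, 5)
except ValueError as error:
    print("Reshape failed:", error)
```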
When to use np.reshape(): Use it when a machine learning model throws a shape error, when you need to convert a flat list into a 2D table, or when you want to restructure data before passing it into a function.
np.unique(): Find and Count Distinct Values
What it does:
np.unique() returns all the distinct values in your array, sorted in order. With one extra argument, it can also tell you how many times each value appears.
This is one of the most useful functions for early data exploration because it answers two questions immediately: "What categories exist in this column?" and "How common is each one?"
Code example:
import numpy as np
# Survey responses about preferred programming language
responses = np.array([
    "Python", "SQL", "Python", "R", "Python",
    "SQL", "Java", "Python", "R", "SQL",
    "Python", "Java", "SQL", "Python", "R"
])
# Get all unique values
unique_languages = np.unique(responses)
print("Languages mentioned:", unique_languages)
# Output: Languages mentioned: ['Java' 'Python' 'R' 'SQL']
# Get unique values AND how many times each appears
unique_languages, counts = np.unique(responses, return_counts=True)
for language, count in zip(unique_languages, counts):
    print(f"{language}: {count} responses")
# Output:
# Java: 2 responses
# Python: 6 responses
# R: 3 responses
# SQL: 4 responses
Another practical example: finding duplicate entries
import numpy as np
# Transaction IDs (some are duplicates)
transaction_ids = np.array([1001, 1002, 1003, 1001, 1004, 1002, 1005])
unique_ids, counts = np.unique(transaction_ids, return_counts=True)
# Find IDs that appear more than once
duplicates = unique_ids[counts > 1]
print("Duplicate transaction IDs:", duplicates)
# Output: Duplicate transaction IDs: [1001 1002]
Finding duplicates is a common data cleaning task and np.unique() makes it fast and readable.
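If you also need to know where the duplicates sit in the original array, np.isin() can build a mask of every entry whose ID appears more than once. A sketch continuing the transaction example:

```python
import numpy as np

transaction_ids = np.array([1001, 1002, 1003, 1001, 1004, 1002, 1005])

unique_ids, counts = np.unique(transaction_ids, return_counts=True)
duplicates = unique_ids[counts > 1]

# True wherever the ID is one of the duplicated values
is_duplicate = np.isin(transaction_ids, duplicates)
print(is_duplicate.tolist())
# Output: [True, True, False, True, False, True, False]

# Positions of the duplicated entries in the original array
duplicate_positions = np.where(is_duplicate)[0]
print(duplicate_positions)
# Output: [0 1 3 5]
```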
When to use np.unique(): Use it at the start of any analysis to understand the categories in your data. It is especially useful for checking data quality, finding duplicates, and building frequency tables.
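np.unique() returns values in sorted order, not in order of frequency. To rank categories from most to least common, one option is to sort the counts with np.argsort(). A sketch with a shortened, made-up version of the survey data:

```python
import numpy as np

responses = np.array(["Python", "SQL", "Python", "R", "Python", "SQL"])

languages, counts = np.unique(responses, return_counts=True)

# argsort gives ascending order; reversing it puts the largest count first
order = np.argsort(counts)[::-1]

for language, count in zip(languages[order], counts[order]):
    share = count / counts.sum() * 100
    print(f"{language}: {count} ({share:.0f}%)")
# Output:
# Python: 3 (50%)
# SQL: 2 (33%)
# R: 1 (17%)
```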
Putting It All Together: A Mini Analysis
Here is a short example that uses four of the five functions (all except np.reshape()) together on a single dataset:
import numpy as np
# 10 sales records across 4 products
product_names = np.array(["Rice", "Beans", "Garri", "Yam", "Rice",
                          "Beans", "Garri", "Rice", "Yam", "Garri"])
units_sold = np.array([120, 85, 200, 60, 140, 95, 175, 110, 55, 190])
# 1. Basic statistics
print(f"Average units sold: {np.mean(units_sold):.0f}")
print(f"Median units sold: {np.median(units_sold):.0f}")
# 2. Flag high sellers (above 150 units)
performance = np.where(units_sold > 150, "High Seller", "Regular")
print("Performance:", performance)
# 3. Count how many sales records are high sellers
print(f"High sellers: {np.sum(units_sold > 150)}")
# 4. Find unique product names and their frequency
products, freq = np.unique(product_names, return_counts=True)
print("\nProduct frequency:")
for p, f in zip(products, freq):
    print(f" {p}: {f} entries")
In just a few lines, you have performed a complete basic analysis: summary statistics, conditional labelling, and frequency counting.
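A natural next step is to summarise units sold per product. One way to do that with plain NumPy is a boolean mask for each product name; in Pandas the same idea is a groupby. A sketch using the same data:

```python
import numpy as np

product_names = np.array(["Rice", "Beans", "Garri", "Yam", "Rice",
                          "Beans", "Garri", "Rice", "Yam", "Garri"])
units_sold = np.array([120, 85, 200, 60, 140, 95, 175, 110, 55, 190])

for product in np.unique(product_names):
    # Keep only the rows that belong to this product
    product_units = units_sold[product_names == product]
    print(f"{product}: total {product_units.sum()}, average {product_units.mean():.1f}")
# Output:
# Beans: total 180, average 90.0
# Garri: total 565, average 188.3
# Rice: total 370, average 123.3
# Yam: total 115, average 57.5
```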
Conclusion
NumPy is not glamorous. It does not produce beautiful charts like Matplotlib or give you a neat table like Pandas. But it is the engine running under all of those tools.
Learning these five functions gives you a solid foundation to work faster, debug errors more easily, and understand what is happening inside your Python data analysis code.
The five functions to remember are:
- np.array(): Create and understand arrays
- np.mean(), np.median(), np.std(): Summarise your data instantly
- np.where(): Apply conditions without writing loops
- np.reshape(): Prepare data for machine learning
- np.unique(): Explore categories and find duplicates
Start by practising each one on data from your own projects. The more you use them, the more natural they become.
Ready to visualise the data you just analysed? Check out our guide on the 5 best Python libraries for data visualization in 2026.
How to Install NumPy
If you do not have NumPy installed, open your terminal or command prompt and run:
pip install numpy
Then import it at the top of your script with:
import numpy as np
The as np alias is a widely followed convention in the Python community. You will see it in most tutorials, textbooks, and courses.
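To confirm the installation worked, you can print the installed version. This is a minimal check; the version string you see depends on your environment.

```python
import numpy as np

# Prints the installed NumPy version string
installed_version = np.__version__
print(installed_version)
```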
Published on JacobIsah Programming Hub | enemzy.blogspot.com