Python is a powerful tool for data professionals, largely due to its extensive collection of open-source libraries and packages. One of the most widely used of these is pandas. In this post, we'll explore the fundamentals of pandas, with a special focus on its core data structures: Series and DataFrame.
The Building Blocks: Series and DataFrame
Series: The One-Dimensional Powerhouse
A Series in pandas is akin to a column in a spreadsheet or a one-dimensional NumPy array. It’s a labeled array that can hold any data type, with each element having an associated label; together, these labels form the index. The index lets you look up, align, and slice values by label rather than by position alone.
Here’s a quick look at how to create and use a Series:
import pandas as pd
# Creating a Series
s = pd.Series([1, 2, 3, 4, 5])
print(s)
Output:
0    1
1    2
2    3
3    4
4    5
dtype: int64
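The index does not have to be the default 0, 1, 2, ... sequence. As a minimal sketch of label-based lookup, the example below builds a Series with string labels; the weekday labels and temperature values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: temperatures keyed by weekday labels
temps = pd.Series([21.5, 23.0, 19.8], index=["mon", "tue", "wed"])

# Look up a single element by its label
tue_temp = temps["tue"]

# Select several labels at once; the result is another Series
subset = temps[["mon", "wed"]]

print(tue_temp)
print(subset)
```

Passing a list of labels returns a smaller Series that keeps those labels as its index.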
DataFrame: The Two-Dimensional Workhorse
A DataFrame is the heart of pandas, representing a two-dimensional labeled data structure with columns and rows, similar to a table or spreadsheet. Each column in a DataFrame is a Series.
Here’s how to create a DataFrame:
# Creating a DataFrame from a dictionary
data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data)
print(df)
Output:
   col1  col2
0     1     3
1     2     4
# Creating a DataFrame from a NumPy array
import numpy as np
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
print(df2)
Output:
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
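The statement that each column in a DataFrame is a Series can be checked directly. The short sketch below rebuilds the dictionary-based DataFrame under the name frame (a name chosen here only to keep the example self-contained) and inspects one column.

```python
import pandas as pd

frame = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Selecting one column with bracket notation returns a Series
col = frame["col1"]

print(type(col))                       # the pandas Series class
print(col.name)                        # the Series keeps the column label as its name
print(col.index.equals(frame.index))   # and it carries the DataFrame's row index
```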
Unlocking the Potential of DataFrame Attributes and Methods
DataFrames come with a plethora of built-in attributes and methods that simplify common data analysis tasks. Here are some of the most commonly used ones:
Essential Attributes
columns: Returns the column labels of the DataFrame.
dtypes: Returns the data types of the columns.
iloc: Accesses rows and columns using integer-based indexing.
loc: Accesses rows and columns by labels or Boolean arrays.
shape: Returns a tuple representing the dimensionality.
values: Returns a NumPy representation of the DataFrame.
Examples:
print(df2.columns)
print(df2.dtypes)
print(df2.shape)
print(df2.values)
Output:
Index(['a', 'b', 'c'], dtype='object')
a    int64
b    int64
c    int64
dtype: object
(3, 3)
[[1 2 3]
 [4 5 6]
 [7 8 9]]
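A related note on values: the pandas documentation recommends the to_numpy() method over the values attribute for new code. The sketch below rebuilds df2 so it runs on its own and shows that both give the same array for this example.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=["a", "b", "c"])

# to_numpy() returns the underlying data as a NumPy array
arr = df2.to_numpy()

print(type(arr))
print(arr.shape)
print((arr == df2.values).all())  # both approaches hold the same numbers here
```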
Vital Methods
apply(): Applies a function along an axis.
copy(): Creates a copy of the DataFrame.
describe(): Provides descriptive statistics.
drop(): Drops specified labels from rows or columns.
groupby(): Groups DataFrame using a mapper or by a Series of columns.
head(): Returns the first n rows.
info(): Prints a concise summary of the DataFrame.
isna(): Detects missing values.
sort_values(): Sorts by the values along an axis.
value_counts(): Returns a Series containing counts of unique values.
Example:
print(df2.describe())
Output:
              a         b         c
count  3.000000  3.000000  3.000000
mean   4.000000  5.000000  6.000000
std    3.000000  3.000000  3.000000
min    1.000000  2.000000  3.000000
25%    2.500000  3.500000  4.500000
50%    4.000000  5.000000  6.000000
75%    5.500000  6.500000  7.500000
max    7.000000  8.000000  9.000000
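Several of the other methods listed above are easier to understand on a small dataset. The following is a rough sketch rather than a complete tour; the sales DataFrame, its column names, and its values are all invented for illustration.

```python
import pandas as pd

# Hypothetical sales records, invented for illustration
sales = pd.DataFrame({
    "region": ["north", "south", "north", "east", "south"],
    "units": [10, 4, 7, None, 12],
})

# isna(): flag missing values, then count them per column
missing_per_column = sales.isna().sum()

# drop(): remove a row by its index label (row 3 holds the missing value)
clean = sales.drop(index=3)

# sort_values(): order rows by a column, largest first
ranked = clean.sort_values("units", ascending=False)

# head(): keep only the first n rows of the sorted result
top_two = ranked.head(2)

# groupby(): total units per region
totals = clean.groupby("region")["units"].sum()

# value_counts(): how often each region appears in the original data
counts = sales["region"].value_counts()

# apply(): run a function on each column (here, max minus min)
ranges = clean[["units"]].apply(lambda col: col.max() - col.min())

print(missing_per_column)
print(top_two)
print(totals)
print(counts)
print(ranges)
```

Note that drop() and sort_values() return new DataFrames here, so the original sales data is left unchanged.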
Selecting Data with Pandas
Selecting and manipulating data within a DataFrame is a crucial skill. Pandas offers several ways to achieve this.
Row Selection
Using loc[]:
Select rows by label:
df = pd.DataFrame({
'A': ['alpha', 'apple', 'arsenic', 'angel', 'android'],
'B': [1, 2, 3, 4, 5],
'C': ['coconut', 'curse', 'cassava', 'cuckoo', 'clarinet'],
'D': [6, 7, 8, 9, 10]
}, index=[0, 1, 2, 3, 4])
print(df.loc[1])
print(df.loc[[1, 3]])
Output:
A    apple
B        2
C    curse
D        7
Name: 1, dtype: object
       A  B       C  D
1  apple  2   curse  7
3  angel  4  cuckoo  9
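As mentioned in the attribute list, loc[] also accepts Boolean arrays. A minimal sketch, using a two-column version of the same DataFrame and an arbitrary threshold of 3 chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['alpha', 'apple', 'arsenic', 'angel', 'android'],
    'B': [1, 2, 3, 4, 5],
})

# Comparing a column to a value yields a Boolean Series, one entry per row
mask = df['B'] > 3

# loc[] keeps only the rows where the mask is True
selected = df.loc[mask]
print(selected)
```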
Using iloc[]:
Select rows by position:
print(df.iloc[1])
print(df.iloc[[0, 2]])
Output:
A    apple
B        2
C    curse
D        7
Name: 1, dtype: object
         A  B        C  D
0    alpha  1  coconut  6
2  arsenic  3  cassava  8
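In the examples above, loc[1] and iloc[1] return the same row only because the index labels happen to match the row positions. The sketch below uses a small DataFrame with a deliberately shuffled index (invented for illustration) to show where the two diverge.

```python
import pandas as pd

# The index labels are intentionally out of order
shuffled = pd.DataFrame({'value': [10, 20, 30]}, index=[2, 0, 1])

by_label = shuffled.loc[0]      # the row whose label is 0
by_position = shuffled.iloc[0]  # the first row, whatever its label

print(by_label['value'])     # 20
print(by_position['value'])  # 10
```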
Column Selection
Bracket Notation:
print(df['C'])
print(df[['A', 'C']])
Output:
0     coconut
1       curse
2     cassava
3      cuckoo
4    clarinet
Name: C, dtype: object
         A         C
0    alpha   coconut
1    apple     curse
2  arsenic   cassava
3    angel    cuckoo
4  android  clarinet
Dot Notation:
print(df.A)
Output:
0      alpha
1      apple
2    arsenic
3      angel
4    android
Name: A, dtype: object
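Dot notation is convenient, but it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. The sketch below uses a hypothetical inventory DataFrame to show both limitations; bracket notation works in every case.

```python
import pandas as pd

# Hypothetical columns chosen to show where dot notation falls short
inventory = pd.DataFrame({'count': [3, 5], 'unit price': [1.5, 2.0]})

# 'count' is also a DataFrame method, so dot access returns the method
print(callable(inventory.count))

# Bracket notation always returns the column
print(inventory['count'])

# A name containing a space cannot be written with dot notation at all
print(inventory['unit price'])
```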
Selecting Rows and Columns Together
Using loc[]:
print(df.loc[0:2, ['A', 'C']])
Output:
         A        C
0    alpha  coconut
1    apple    curse
2  arsenic  cassava
Using iloc[]:
print(df.iloc[[2, 4], 0:3])
Output:
         A  B         C
2  arsenic  3   cassava
4  android  5  clarinet
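One detail in the two examples above is easy to miss: a loc[] slice includes its end label, while an iloc[] slice excludes its end position, as ordinary Python slicing does. A minimal sketch on a one-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'B': [1, 2, 3, 4, 5]})

label_slice = df.loc[0:2]      # labels 0, 1, and 2
position_slice = df.iloc[0:2]  # positions 0 and 1 only

print(len(label_slice))     # 3
print(len(position_slice))  # 2
```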
Key Takeaways
Pandas DataFrames are an essential tool for working with tabular data. Each column in a DataFrame is a pandas Series, and selecting a single row also returns a Series, which keeps data manipulation consistent across both axes. Attributes and methods such as describe(), loc[], and iloc[] handle common data operations with minimal code. As you gain experience with pandas, you’ll find it an invaluable tool in your data science toolkit.
Additional Resources
For more detailed information, refer to the official pandas documentation.
By mastering the fundamentals of pandas, you'll be on your way to becoming a proficient and effective data professional.