Mastering the Fundamentals of Pandas

Python is a powerful tool for data professionals, largely due to its extensive collection of open-source libraries and packages. pandas is one of the most widely used of these libraries for working with tabular data. In this post, we'll explore the fundamentals of pandas, with a special focus on its core data structures: Series and DataFrame.

The Building Blocks: Series and DataFrame

Series: The One-Dimensional Powerhouse

A Series in pandas is akin to a column in a spreadsheet or a one-dimensional NumPy array. It’s a labeled array that can hold any data type, with each element having an associated label called an index. This indexing feature allows for efficient and intuitive data manipulation.

Here’s a quick look at how to create and use a Series:

import pandas as pd
# Creating a Series
s = pd.Series([1, 2, 3, 4, 5])
print(s)

Output:

0    1
1    2
2    3
3    4
4    5
dtype: int64
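
The Series above uses the default integer index. As a minimal sketch of the labeling idea described earlier, the example below builds a Series with custom labels; the fruit names and prices are hypothetical values invented for illustration.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
prices = pd.Series([1.5, 2.0, 0.75], index=["apple", "banana", "cherry"])

# Look up a single element by its label
banana_price = prices["banana"]

# Select several elements by passing a list of labels
subset = prices[["apple", "cherry"]]

# Arithmetic is applied element-wise and the labels are preserved
doubled = prices * 2

print(banana_price)
print(subset)
print(doubled)
```

Because every value carries its label, operations such as the multiplication above return a new Series with the same index.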

DataFrame: The Two-Dimensional Workhorse

A DataFrame is the heart of pandas, representing a two-dimensional labeled data structure with columns and rows, similar to a table or spreadsheet. Each column in a DataFrame is a Series.

Here’s how to create a DataFrame:

# Creating a DataFrame from a dictionary
data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data)
print(df)

Output:

   col1  col2
0     1     3
1     2     4

# Creating a DataFrame from a NumPy array
import numpy as np
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
print(df2)

Output:

   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
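
To make the statement that each column is a Series concrete, here is a small sketch that pulls one column out of the dictionary-based DataFrame from above and inspects it.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})

# Selecting a single column returns a Series
column = df["col1"]

print(type(column))                    # a pandas Series
print(column.name)                     # col1
print(column.index.equals(df.index))   # True: the column shares the DataFrame's row index
```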

Unlocking the Potential of DataFrame Attributes and Methods

DataFrames come with a plethora of built-in attributes and methods that simplify common data analysis tasks. Here are some of the most commonly used ones:

Essential Attributes

  • columns: Returns the column labels of the DataFrame.

  • dtypes: Returns the data types of the columns.

  • iloc: Accesses rows and columns using integer-based indexing.

  • loc: Accesses rows and columns by labels or Boolean arrays.

  • shape: Returns a tuple of the form (number of rows, number of columns).

  • values: Returns a NumPy representation of the DataFrame. The pandas documentation recommends the to_numpy() method for new code.

Examples:

print(df2.columns)
print(df2.dtypes)
print(df2.shape)
print(df2.values)

Output:

Index(['a', 'b', 'c'], dtype='object')

a    int64
b    int64
c    int64
dtype: object

(3, 3)

[[1 2 3]
 [4 5 6]
 [7 8 9]]

Vital Methods

  • apply(): Applies a function along an axis.

  • copy(): Creates a copy of the DataFrame.

  • describe(): Provides descriptive statistics.

  • drop(): Drops specified labels from rows or columns.

  • groupby(): Groups rows by one or more columns (or by a mapper) so that each group can be aggregated or transformed.

  • head(): Returns the first n rows.

  • info(): Prints a concise summary of the DataFrame.

  • isna(): Detects missing values.

  • sort_values(): Sorts by the values along an axis.

  • value_counts(): Returns a Series containing counts of unique values.

Example:

print(df2.describe())

Output:

              a         b         c
count  3.000000  3.000000  3.000000
mean   4.000000  5.000000  6.000000
std    3.000000  3.000000  3.000000
min    1.000000  2.000000  3.000000
25%    2.500000  3.500000  4.500000
50%    4.000000  5.000000  6.000000
75%    5.500000  6.500000  7.500000
max    7.000000  8.000000  9.000000
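
The other methods in the list follow the same call pattern. The sketch below exercises several of them on a small DataFrame; the sales data, column names, and variable names are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data, invented for this example
sales = pd.DataFrame({
    "region": ["north", "south", "north", "east", "south"],
    "units": [10, 3, 7, np.nan, 5],
})

# head(): the first two rows
first_rows = sales.head(2)

# isna(): a Boolean DataFrame; summing it counts missing values per column
missing_per_column = sales.isna().sum()

# sort_values(): largest first; rows with NaN are placed last by default
by_units = sales.sort_values("units", ascending=False)

# value_counts(): how often each region appears
region_counts = sales["region"].value_counts()

# groupby(): total units per region (NaN values are skipped in the sum)
units_by_region = sales.groupby("region")["units"].sum()

# drop(): returns a new DataFrame without the named column
without_units = sales.drop(columns=["units"])

# apply(): runs a function on each column (axis=0 is the default)
spread = sales[["units"]].apply(lambda col: col.max() - col.min())

# copy(): changes to the copy do not affect the original
backup = sales.copy()
backup.loc[0, "units"] = 99

print(units_by_region)
print(region_counts)
```

Each of these calls returns a new object and leaves sales itself unchanged.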

Selecting Data with Pandas

Selecting and manipulating data within a DataFrame is a crucial skill. Pandas offers several ways to achieve this.

Row Selection

Using loc[]:

Select rows by label:

df = pd.DataFrame({
    'A': ['alpha', 'apple', 'arsenic', 'angel', 'android'],
    'B': [1, 2, 3, 4, 5],
    'C': ['coconut', 'curse', 'cassava', 'cuckoo', 'clarinet'],
    'D': [6, 7, 8, 9, 10]
}, index=[0, 1, 2, 3, 4])

print(df.loc[1])
print(df.loc[[1, 3]])

Output:

A    apple
B        2
C    curse
D        7
Name: 1, dtype: object

       A  B       C  D
1  apple  2   curse  7
3  angel  4  cuckoo  9

Using iloc[]:

Select rows by position:

print(df.iloc[1])
print(df.iloc[[0, 2]])

Output:

A    apple
B        2
C    curse
D        7
Name: 1, dtype: object

         A  B        C  D
0    alpha  1  coconut  6
2  arsenic  3  cassava  8
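
As noted in the attribute list, loc[] also accepts Boolean arrays. The following is a minimal sketch of row filtering with a Boolean mask, using the same example data; the variable names mask, filtered, and both are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["alpha", "apple", "arsenic", "angel", "android"],
    "B": [1, 2, 3, 4, 5],
    "C": ["coconut", "curse", "cassava", "cuckoo", "clarinet"],
    "D": [6, 7, 8, 9, 10],
})

# A comparison on a column produces a Boolean Series, one value per row
mask = df["B"] > 3

# Passing the mask to loc[] keeps only the rows where it is True
filtered = df.loc[mask]
print(filtered)

# Combine conditions with & (and) or | (or), wrapping each one in parentheses
both = df.loc[(df["B"] > 1) & (df["D"] < 9)]
print(both)
```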

Column Selection

Bracket Notation:

print(df['C'])
print(df[['A', 'C']])

Output:

0     coconut
1       curse
2     cassava
3      cuckoo
4    clarinet
Name: C, dtype: object

         A         C
0    alpha   coconut
1    apple     curse
2  arsenic   cassava
3    angel    cuckoo
4  android  clarinet

Dot Notation:

print(df.A)

Output:

0      alpha
1      apple
2    arsenic
3      angel
4    android
Name: A, dtype: object
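
Dot notation is convenient, but it only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method. The sketch below uses hypothetical column names picked to show both failure cases; bracket notation works in each.

```python
import pandas as pd

# Hypothetical column names chosen to show where dot notation breaks down
df = pd.DataFrame({"unit price": [1.5, 2.0], "count": [3, 4]})

# A name containing a space is not a valid identifier, so only brackets work
unit_price = df["unit price"]

# "count" collides with the DataFrame.count() method:
# df.count refers to the method, while df["count"] returns the column
count_column = df["count"]

print(callable(df.count))      # True, because df.count is the method
print(count_column.tolist())   # [3, 4]
```

For this reason, bracket notation is the safer default in scripts where column names are not known in advance.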

Selecting Rows and Columns Together

Using loc[]:

print(df.loc[0:2, ['A', 'C']])

Output:

         A        C
0    alpha  coconut
1    apple    curse
2  arsenic  cassava

Using iloc[]:

print(df.iloc[[2, 4], 0:3])

Output:

         A  B         C
2  arsenic  3   cassava
4  android  5  clarinet
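
One difference between the two examples above is easy to miss: a slice in loc[] includes its end label, while a slice in iloc[] excludes its end position, as ordinary Python slicing does. A short sketch, with a hypothetical string-indexed DataFrame named letters added to make the contrast explicit:

```python
import pandas as pd

df = pd.DataFrame({"B": [1, 2, 3, 4, 5]}, index=[0, 1, 2, 3, 4])

by_label = df.loc[0:2]       # labels 0, 1 and 2: the end label is included
by_position = df.iloc[0:2]   # positions 0 and 1: the end position is excluded

print(len(by_label))      # 3
print(len(by_position))   # 2

# Hypothetical example with string labels, where the two cannot be confused
letters = pd.DataFrame({"B": [1, 2, 3]}, index=["x", "y", "z"])
label_slice = letters.loc["x":"y"]   # rows "x" and "y"
position_slice = letters.iloc[0:1]   # row "x" only
```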

Key Takeaways

Pandas DataFrames are an essential tool for working with tabular data. Selecting a single row or column from a DataFrame returns a pandas Series, so the same labeled-array operations apply at both levels. The built-in attributes and methods cover common tasks such as summarizing, sorting, grouping, and selecting data with minimal code. As you gain experience with pandas, you’ll find it an invaluable part of your data science toolkit.

Additional Resources

For more detailed information, refer to the official pandas documentation.

By mastering the fundamentals of pandas, you'll be on your way to becoming a proficient and effective data professional.