Pandas Basics Mastery Guide

Python is a powerful tool for data professionals, largely due to its extensive collection of open-source libraries and packages. In this post, we'll explore the fundamentals of pandas, one of the most widely used of those libraries, with a special focus on its core data structures: Series and DataFrame.

The Building Blocks: Series and DataFrame

Series: The One-Dimensional Powerhouse

A Series in pandas is akin to a column in a spreadsheet or a one-dimensional NumPy array. It’s a labeled array that can hold any data type, with each element having an associated label called an index. This indexing feature allows for efficient and intuitive data manipulation.

Here’s a quick look at how to create and use a Series:

import pandas as pd
# Creating a Series
s = pd.Series([1, 2, 3, 4, 5])
print(s)

Output:

0    1
1    2
2    3
3    4
4    5
dtype: int64
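To make the role of the index more concrete, here is a minimal sketch of a Series with custom labels. The fruit names and numbers are hypothetical values chosen only for illustration; the point is that lookups and arithmetic follow the labels rather than the positions.

```python
import pandas as pd

# Hypothetical labels and values, chosen only for illustration
prices = pd.Series([1.50, 0.25, 3.00], index=["apple", "banana", "cherry"])

# Look up a single value by its label rather than its position
apple_price = prices["apple"]

# Select several labels at once; the result is again a Series
subset = prices[["cherry", "apple"]]

# Arithmetic aligns on labels, not on position.
# 'cherry' has no matching label in stock, so its result is NaN.
stock = pd.Series([10, 20], index=["banana", "apple"])
value = prices * stock

print(apple_price)
print(subset)
print(value)
```

Because the two Series are matched by label, the order in which `stock` lists its entries does not matter.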

DataFrame: The Two-Dimensional Workhorse

A DataFrame is the heart of pandas, representing a two-dimensional labeled data structure with columns and rows, similar to a table or spreadsheet. Each column in a DataFrame is a Series.

Here’s how to create a DataFrame:

# Creating a DataFrame from a dictionary
data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data)
print(df)

Output:

   col1  col2
0     1     3
1     2     4
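The claim that each column is a Series can be checked directly. The short sketch below rebuilds the same two-column DataFrame and inspects one column; the variable name `col` is just an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Pulling out one column gives back a Series
col = df['col1']

print(type(col).__name__)          # class name of the returned object
print(col.name)                    # the Series keeps the column label as its name
print(col.index.equals(df.index))  # and shares the DataFrame's row index
```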

# Creating a DataFrame from a NumPy array
import numpy as np
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
print(df2)

Output:

   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Unlocking the Potential of DataFrame Attributes and Methods

DataFrames come with a plethora of built-in attributes and methods that simplify common data analysis tasks. Here are some of the most commonly used ones:

Essential Attributes

columns: Returns the column labels of the DataFrame.
dtypes: Returns the data types of the columns.
iloc: Accesses rows and columns using integer-based indexing.
loc: Accesses rows and columns by labels or Boolean arrays.
shape: Returns a tuple representing the dimensionality.
values: Returns a NumPy representation of the DataFrame.

Examples:

print(df2.columns)
print(df2.dtypes)
print(df2.shape)
print(df2.values)

Output:

Index(['a', 'b', 'c'], dtype='object')

a    int64
b    int64
c    int64
dtype: object

(3, 3)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
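As a side note on `values`: the pandas documentation recommends the `to_numpy()` method for the same purpose. The sketch below, which assumes a small hypothetical mixed-type frame named `mixed`, also shows that a DataFrame with columns of different types is converted to a single array with the general-purpose `object` dtype.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])

# to_numpy() returns the same data as the values attribute
arr = df2.to_numpy()
print(arr.dtype)

# Hypothetical frame mixing integers and strings
mixed = pd.DataFrame({'n': [1, 2], 's': ['x', 'y']})
mixed_arr = mixed.to_numpy()
print(mixed_arr.dtype)  # a single array needs one dtype that fits every column
```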

Vital Methods

apply(): Applies a function along an axis.
copy(): Creates a copy of the DataFrame.
describe(): Provides descriptive statistics.
drop(): Drops specified labels from rows or columns.
groupby(): Groups DataFrame using a mapper or by a Series of columns.
head(): Returns the first n rows.
info(): Prints a concise summary of the DataFrame.
isna(): Detects missing values.
sort_values(): Sorts by the values along an axis.
value_counts(): Returns a Series containing counts of unique values.

Example:

print(df2.describe())

Output:

              a         b         c
count  3.000000  3.000000  3.000000
mean   4.000000  5.000000  6.000000
std    3.000000  3.000000  3.000000
min    1.000000  2.000000  3.000000
25%    2.500000  3.500000  4.500000
50%    4.000000  5.000000  6.000000
75%    5.500000  6.500000  7.500000
max    7.000000  8.000000  9.000000
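The other methods in the list can be tried on a small example as well. The following is a rough sketch using a hypothetical `sales` table (the column names and numbers are made up for illustration), with one line per method.

```python
import pandas as pd

# Hypothetical data; None becomes a missing value (NaN) in the numeric column
sales = pd.DataFrame({
    'region': ['north', 'south', 'north', 'east', 'south'],
    'units': [5, 3, 8, None, 2],
})

first_rows = sales.head(2)                             # first two rows
missing = sales['units'].isna()                        # True where a value is missing
by_units = sales.sort_values('units', ascending=False) # largest first, NaN placed last
no_region = sales.drop(columns=['region'])             # new frame without 'region'
region_counts = sales['region'].value_counts()         # rows per region
totals = sales.groupby('region')['units'].sum()        # units summed per region

# apply() runs a function on each column (the default axis)
spread = sales[['units']].apply(lambda col: col.max() - col.min())

# copy() gives an independent frame; editing it leaves the original untouched
backup = sales.copy()
backup.loc[0, 'units'] = 100

sales.info()  # prints column names, non-null counts, and dtypes
print(by_units)
print(totals)
```

Most of these methods return a new object rather than modifying `sales` in place, which is why each result is assigned to its own name here.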

Selecting Data with Pandas

Selecting and manipulating data within a DataFrame is a crucial skill. Pandas offers several ways to achieve this.

Row Selection

Using `loc[]`:

Select rows by label:

df = pd.DataFrame({
   'A': ['alpha', 'apple', 'arsenic', 'angel', 'android'],
   'B': [1, 2, 3, 4, 5],
   'C': ['coconut', 'curse', 'cassava', 'cuckoo', 'clarinet'],
   'D': [6, 7, 8, 9, 10]
}, index=[0, 1, 2, 3, 4])

print(df.loc[1])
print(df.loc[[1, 3]])

Output:

A    apple
B        2
C    curse
D        7
Name: 1, dtype: object

       A  B       C  D
1  apple  2   curse  7
3  angel  4  cuckoo  9
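`loc[]` also accepts a Boolean array, which is the usual way to filter rows by a condition. Here is a minimal sketch on the same data; the thresholds are arbitrary and the names `mask`, `large_b`, and `combined` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['alpha', 'apple', 'arsenic', 'angel', 'android'],
    'B': [1, 2, 3, 4, 5],
    'C': ['coconut', 'curse', 'cassava', 'cuckoo', 'clarinet'],
    'D': [6, 7, 8, 9, 10]
})

# A comparison on a column produces a Boolean Series, one entry per row
mask = df['B'] > 3

# Rows where the mask is True are kept
large_b = df.loc[mask]

# Conditions combine with & (and) and | (or); each needs its own parentheses.
# A column list can be passed at the same time.
combined = df.loc[(df['B'] > 1) & (df['D'] < 10), ['A', 'B']]

print(large_b)
print(combined)
```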

Using `iloc[]`:

Select rows by position:

print(df.iloc[1])
print(df.iloc[[0, 2]])

Output:

A    apple
B        2
C    curse
D        7
Name: 1, dtype: object

         A  B        C  D
0    alpha  1  coconut  6
2  arsenic  3  cassava  8
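With the default index 0 to 4, labels and positions happen to coincide, so `loc[1]` and `iloc[1]` return the same row. The difference shows up once the labels are something else. The sketch below assumes a hypothetical index of 10, 20, 30 to illustrate this.

```python
import pandas as pd

# Hypothetical frame whose row labels differ from the row positions
labelled = pd.DataFrame(
    {'A': ['alpha', 'apple', 'arsenic'], 'B': [1, 2, 3]},
    index=[10, 20, 30],
)

by_label = labelled.loc[20]     # the row whose label is 20
by_position = labelled.iloc[0]  # the first row, whatever its label

# There is no row labelled 0, so a label lookup for 0 fails
try:
    labelled.loc[0]
    label_found = True
except KeyError:
    label_found = False

print(by_label['A'])
print(by_position['A'])
print(label_found)
```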

Column Selection

Bracket Notation:

print(df['C'])
print(df[['A', 'C']])

Output:

0     coconut
1       curse
2     cassava
3      cuckoo
4    clarinet
Name: C, dtype: object

         A         C
0    alpha   coconut
1    apple     curse
2  arsenic   cassava
3    angel    cuckoo
4  android  clarinet

Dot Notation:

print(df.A)

Output:

0      alpha
1      apple
2    arsenic
3      angel
4    android
Name: A, dtype: object
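Dot notation is convenient, but it only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. A small sketch, using hypothetical column names picked to trigger both cases:

```python
import pandas as pd

# Hypothetical columns: one with a space, one named like a DataFrame attribute
items = pd.DataFrame({
    'A': [1, 2],
    'unit price': [9.5, 4.0],
    'shape': ['round', 'square'],
})

# For a simple name, both notations return the same column
same = items.A.equals(items['A'])

# A name containing a space cannot be written with a dot; brackets are required
price = items['unit price']

# items.shape is the built-in attribute (rows, columns), not the 'shape' column
shape_attr = items.shape
shape_col = items['shape']

print(same)
print(shape_attr)
print(shape_col)
```

Bracket notation works in every case, which makes it the safer default.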

Selecting Rows and Columns Together

Using `loc[]`:

print(df.loc[0:2, ['A', 'C']])

Output:

         A        C
0    alpha  coconut
1    apple    curse
2  arsenic  cassava

Using `iloc[]`:

print(df.iloc[[2, 4], 0:3])

Output:

         A  B         C
2  arsenic  3   cassava
4  android  5  clarinet
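One detail in the two examples above is easy to miss: a slice in `loc[]` includes its end label, while a slice in `iloc[]` excludes its end position, as ordinary Python slicing does. A short sketch on a cut-down version of the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['alpha', 'apple', 'arsenic', 'angel', 'android'],
    'B': [1, 2, 3, 4, 5],
})

# Label slice: rows labelled 0, 1 and 2 (the end label is included)
label_slice = df.loc[0:2]

# Position slice: rows at positions 0 and 1 (the end position is excluded)
position_slice = df.iloc[0:2]

print(len(label_slice))
print(len(position_slice))
```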

Key Takeaways

Pandas DataFrames are an essential tool for working with tabular data. Selecting a single row or column from a DataFrame returns a pandas Series, which keeps data manipulation intuitive and consistent. The rich set of methods and attributes available in pandas allows for sophisticated data operations with minimal code. As you gain experience with pandas, you’ll find it an invaluable part of your data science toolkit.

Additional Resources

For more detailed information, refer to the official pandas documentation.

By mastering the fundamentals of pandas, you'll be on your way to becoming a proficient and effective data professional.
