How to Utilize Python Pandas GroupBy Function for Data Analysis

The pandas `groupby()` function is a powerful tool for working with tabular data in Python. It allows you to group rows based on one or more columns and then perform various operations on each group. This functionality is essential for data summarization, analysis, and manipulation.

Table of Contents

1. Basic Principle of GroupBy.

Grouping: You specify a column (or list of columns) in your DataFrame to create groups. Each unique value in the grouping column(s) becomes a group name.

Operations on Groups: The `groupby()` function returns a `GroupBy` object, which provides methods to perform operations on each group.

2. Common GroupBy Operations.

There are three main categories of operations you can perform on grouped data:
Aggregation (agg): This involves summarizing data within each group using functions like `sum()`, `mean()`, `median()`, `count()`, etc.

Transformation (transform): This applies a function to each group, potentially creating new columns or modifying existing ones. Common functions include `fillna()`, `apply()`, and custom functions.
Application (apply): This allows you to apply a function to each group (like a mini-DataFrame) to perform more complex computations.

3. Example Data.

Let’s create a sample DataFrame to illustrate these concepts:

import pandas as pd

def create_df_data():
    # Define example data
    data = {'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C', 'Store C'],
            'product': ['Shirt', 'Jeans', 'Shirt', 'Hat', 'Shirt', 'Hat'],
            'sales': [100, 200, 150, 75, 225, None]}  # Include a missing value (None) for demonstration
    
    # Create DataFrame from the data
    df = pd.DataFrame(data)
    
    # Print the original DataFrame
    print('\r\nOriginal df:')
    print(df)
    
    # Return the created DataFrame
    return df


# The following code will only execute if this script is run directly,
# not if it is imported as a module in another script.
if __name__ == "__main__":

    # Call the function to create and display the DataFrame
    df = create_df_data()

Output.

Original df:
     store product  sales
0  Store A   Shirt  100.0
1  Store A   Jeans  200.0
2  Store B   Shirt  150.0
3  Store B     Hat   75.0
4  Store C   Shirt  225.0
5  Store C     Hat    NaN

4. Aggregation (agg).

Aggregate functions operate on each group and return a single value summarizing that group.

Here’s how to calculate total sales per store and the average sales per product category:

import pandas as pd

def create_df_data():
    # Define example data
    data = {'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C', 'Store C'],
            'product': ['Shirt', 'Jeans', 'Shirt', 'Hat', 'Shirt', 'Hat'],
            'sales': [100, 200, 150, 75, 225, None]}  # Include a missing value (None) for demonstration
    
    # Create DataFrame from the data
    df = pd.DataFrame(data)
    
    # Print the original DataFrame
    print('\r\nOriginal df:')
    print(df)
    
    # Return the created DataFrame
    return df

def test_agg_function(df):
    # Total sales per store
    store_totals = df.groupby('store')['sales'].sum()
    print('\r\nTotal sales per store:')
    print(store_totals)

    # Average sales per product
    product_avg_sales = df.groupby('product')['sales'].mean()
    print('\r\nAverage sales per product:')
    print(product_avg_sales)

    # Median sales per product
    product_medians = df.groupby('product')['sales'].median()
    print("\r\nMedian Sales per Product:")
    print(product_medians)

    # Count of sales per product
    product_counts = df.groupby('product')['sales'].count()
    print("\r\nCount of Sales per Product:")
    print(product_counts)


# The following code will only execute if this script is run directly,
# not if it is imported as a module in another script.
if __name__ == "__main__":

    # Call the function to create and display the DataFrame
    df = create_df_data()
    test_agg_function(df)

Output.

Original df:
     store product  sales
0  Store A   Shirt  100.0
1  Store A   Jeans  200.0
2  Store B   Shirt  150.0
3  Store B     Hat   75.0
4  Store C   Shirt  225.0
5  Store C     Hat    NaN

Total sales per store:
store
Store A    300.0
Store B    225.0
Store C    225.0
Name: sales, dtype: float64

Average sales per product:
product
Hat       75.000000
Jeans    200.000000
Shirt    158.333333
Name: sales, dtype: float64

Median Sales per Product:
product
Hat       75.0
Jeans    200.0
Shirt    150.0
Name: sales, dtype: float64

Count of Sales per Product:
product
Hat      1
Jeans    1
Shirt    3
Name: sales, dtype: int64

5. Transformation (transform).

Transformation functions modify or create new columns within each group.

Here’s an example of filling missing sales values with the group mean and calculating sales percentages within each store:

import pandas as pd

def create_df_data():
    # Define example data
    data = {'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C', 'Store C'],
            'product': ['Shirt', 'Jeans', 'Shirt', 'Hat', 'Shirt', 'Hat'],
            'sales': [100, 200, 150, 75, 225, None]}  # Include a missing value (None) for demonstration
    
    # Create DataFrame from the data
    df = pd.DataFrame(data)
    
    # Print the original DataFrame
    print('\r\nOriginal df:')
    print(df)
    
    # Return the created DataFrame
    return df


def test_transformation_function(df):
    # Fill missing sales with group mean (assuming no missing values here)
    df['sales_filled'] = df.groupby('store')['sales'].transform('fillna', df['sales'].mean())

    # Calculate sales percentages within each store (assuming no zeros in 'sales')
    df['sales_pct'] = df.groupby('store')['sales_filled'].transform(lambda x: x / x.sum())

    print('\r\nTransformaed df:')
    print(df)


# The following code will only execute if this script is run directly,
# not if it is imported as a module in another script.
if __name__ == "__main__":

    # Call the function to create and display the DataFrame
    df = create_df_data()
    
    test_transformation_function(df)

Output.

Original df:
     store product  sales
0  Store A   Shirt  100.0
1  Store A   Jeans  200.0
2  Store B   Shirt  150.0
3  Store B     Hat   75.0
4  Store C   Shirt  225.0
5  Store C     Hat    NaN

Transformaed df:
     store product  sales  sales_filled  sales_pct
0  Store A   Shirt  100.0         100.0   0.333333
1  Store A   Jeans  200.0         200.0   0.666667
2  Store B   Shirt  150.0         150.0   0.666667
3  Store B     Hat   75.0          75.0   0.333333
4  Store C   Shirt  225.0         225.0   0.600000
5  Store C     Hat    NaN         150.0   0.400000

6. Application (apply).

The `apply()` method allows you to apply a custom function to each group (like a mini-DataFrame).

Here’s an example of finding the most expensive product (by sales) within each store:

import pandas as pd

def create_df_data():
    # Define example data
    data = {'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C', 'Store C'],
            'product': ['Shirt', 'Jeans', 'Shirt', 'Hat', 'Shirt', 'Hat'],
            'sales': [100, 200, 150, 75, 225, None]}  # Include a missing value (None) for demonstration
    
    # Create DataFrame from the data
    df = pd.DataFrame(data)
    
    # Print the original DataFrame
    print('\r\nOriginal df:')
    print(df)
    
    # Return the created DataFrame
    return df


def get_most_expensive(group):
    # Define a function to retrieve the row with the maximum sales value within each group
    return group.loc[group['sales'].idxmax()]

def test_apply_function(df):
    # Group the DataFrame by 'store' and apply the custom function to get the most expensive item in each store
    most_expensive_per_store = df.groupby('store').apply(get_most_expensive)
    
    # Print the result
    print(most_expensive_per_store)



# The following code will only execute if this script is run directly,
# not if it is imported as a module in another script.
if __name__ == "__main__":

    # Call the function to create and display the DataFrame
    df = create_df_data()

    test_apply_function(df)

Output.

Original df:
     store product  sales
0  Store A   Shirt  100.0
1  Store A   Jeans  200.0
2  Store B   Shirt  150.0
3  Store B     Hat   75.0
4  Store C   Shirt  225.0
5  Store C     Hat    NaN
           store product  sales
store                          
Store A  Store A   Jeans  200.0
Store B  Store B   Shirt  150.0
Store C  Store C   Shirt  225.0

1. Basic Principle of GroupBy.

2. Common GroupBy Operations.

3. Example Data.

4. Aggregation (agg).

5. Transformation (transform).

6. Application (apply).

Leave a Comment Cancel Reply