How to Utilize Python Pandas GroupBy Function for Data Analysis

The pandas `groupby()` function is a powerful tool for working with tabular data in Python. It allows you to group rows based on one or more columns and then perform various operations on each group. This functionality is essential for data summarization, analysis, and manipulation.

1. Basic Principle of GroupBy.

  1. Grouping: You specify a column (or list of columns) in your DataFrame to create groups. Each unique value in the grouping column(s) becomes a group name.
  2. Operations on Groups: The `groupby()` function returns a `GroupBy` object, which provides methods to perform operations on each group.

2. Common GroupBy Operations.

  1. There are three main categories of operations you can perform on grouped data:
  2. Aggregation (agg): This involves summarizing data within each group using functions like `sum()`, `mean()`, `median()`, `count()`, etc.
  3. Transformation (transform): This applies a function to each group, potentially creating new columns or modifying existing ones. Common functions include `fillna()`, `apply()`, and custom functions.
  4. Application (apply): This allows you to apply a function to each group (like a mini-DataFrame) to perform more complex computations.

3. Example Data.

  1. Let’s create a sample DataFrame to illustrate these concepts:
    import pandas as pd
    
    def create_df_data():
        # Define example data
        data = {'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C', 'Store C'],
                'product': ['Shirt', 'Jeans', 'Shirt', 'Hat', 'Shirt', 'Hat'],
                'sales': [100, 200, 150, 75, 225, None]}  # Include a missing value (None) for demonstration
        
        # Create DataFrame from the data
        df = pd.DataFrame(data)
        
        # Print the original DataFrame
        print('\r\nOriginal df:')
        print(df)
        
        # Return the created DataFrame
        return df
    
    
    # The following code will only execute if this script is run directly,
    # not if it is imported as a module in another script.
    if __name__ == "__main__":
    
        # Call the function to create and display the DataFrame
        df = create_df_data()
  2. Output.
    Original df:
         store product  sales
    0  Store A   Shirt  100.0
    1  Store A   Jeans  200.0
    2  Store B   Shirt  150.0
    3  Store B     Hat   75.0
    4  Store C   Shirt  225.0
    5  Store C     Hat    NaN

4. Aggregation (agg).

  1. Aggregate functions operate on each group and return a single value summarizing that group.
  2. Here’s how to calculate total sales per store and the average sales per product category:
    import pandas as pd
    
    def create_df_data():
        # Define example data
        data = {'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C', 'Store C'],
                'product': ['Shirt', 'Jeans', 'Shirt', 'Hat', 'Shirt', 'Hat'],
                'sales': [100, 200, 150, 75, 225, None]}  # Include a missing value (None) for demonstration
        
        # Create DataFrame from the data
        df = pd.DataFrame(data)
        
        # Print the original DataFrame
        print('\r\nOriginal df:')
        print(df)
        
        # Return the created DataFrame
        return df
    
    def test_agg_function(df):
        # Total sales per store
        store_totals = df.groupby('store')['sales'].sum()
        print('\r\nTotal sales per store:')
        print(store_totals)
    
        # Average sales per product
        product_avg_sales = df.groupby('product')['sales'].mean()
        print('\r\nAverage sales per product:')
        print(product_avg_sales)
    
        # Median sales per product
        product_medians = df.groupby('product')['sales'].median()
        print("\r\nMedian Sales per Product:")
        print(product_medians)
    
        # Count of sales per product
        product_counts = df.groupby('product')['sales'].count()
        print("\r\nCount of Sales per Product:")
        print(product_counts)
    
    
    # The following code will only execute if this script is run directly,
    # not if it is imported as a module in another script.
    if __name__ == "__main__":
    
        # Call the function to create and display the DataFrame
        df = create_df_data()
        test_agg_function(df)
    
  3. Output.
    Original df:
         store product  sales
    0  Store A   Shirt  100.0
    1  Store A   Jeans  200.0
    2  Store B   Shirt  150.0
    3  Store B     Hat   75.0
    4  Store C   Shirt  225.0
    5  Store C     Hat    NaN
    
    Total sales per store:
    store
    Store A    300.0
    Store B    225.0
    Store C    225.0
    Name: sales, dtype: float64
    
    Average sales per product:
    product
    Hat       75.000000
    Jeans    200.000000
    Shirt    158.333333
    Name: sales, dtype: float64
    
    Median Sales per Product:
    product
    Hat       75.0
    Jeans    200.0
    Shirt    150.0
    Name: sales, dtype: float64
    
    Count of Sales per Product:
    product
    Hat      1
    Jeans    1
    Shirt    3
    Name: sales, dtype: int64

5. Transformation (transform).

  1. Transformation functions modify or create new columns within each group.
  2. Here’s an example of filling missing sales values with the group mean and calculating sales percentages within each store:
    import pandas as pd
    
    def create_df_data():
        # Define example data
        data = {'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C', 'Store C'],
                'product': ['Shirt', 'Jeans', 'Shirt', 'Hat', 'Shirt', 'Hat'],
                'sales': [100, 200, 150, 75, 225, None]}  # Include a missing value (None) for demonstration
        
        # Create DataFrame from the data
        df = pd.DataFrame(data)
        
        # Print the original DataFrame
        print('\r\nOriginal df:')
        print(df)
        
        # Return the created DataFrame
        return df
    
    
    def test_transformation_function(df):
        # Fill missing sales with group mean (assuming no missing values here)
        df['sales_filled'] = df.groupby('store')['sales'].transform('fillna', df['sales'].mean())
    
        # Calculate sales percentages within each store (assuming no zeros in 'sales')
        df['sales_pct'] = df.groupby('store')['sales_filled'].transform(lambda x: x / x.sum())
    
        print('\r\nTransformaed df:')
        print(df)
    
    
    # The following code will only execute if this script is run directly,
    # not if it is imported as a module in another script.
    if __name__ == "__main__":
    
        # Call the function to create and display the DataFrame
        df = create_df_data()
        
        test_transformation_function(df)
    
  3. Output.
    Original df:
         store product  sales
    0  Store A   Shirt  100.0
    1  Store A   Jeans  200.0
    2  Store B   Shirt  150.0
    3  Store B     Hat   75.0
    4  Store C   Shirt  225.0
    5  Store C     Hat    NaN
    
    Transformaed df:
         store product  sales  sales_filled  sales_pct
    0  Store A   Shirt  100.0         100.0   0.333333
    1  Store A   Jeans  200.0         200.0   0.666667
    2  Store B   Shirt  150.0         150.0   0.666667
    3  Store B     Hat   75.0          75.0   0.333333
    4  Store C   Shirt  225.0         225.0   0.600000
    5  Store C     Hat    NaN         150.0   0.400000

6. Application (apply).

  1. The `apply()` method allows you to apply a custom function to each group (like a mini-DataFrame).
  2. Here’s an example of finding the most expensive product (by sales) within each store:
    import pandas as pd
    
    def create_df_data():
        # Define example data
        data = {'store': ['Store A', 'Store A', 'Store B', 'Store B', 'Store C', 'Store C'],
                'product': ['Shirt', 'Jeans', 'Shirt', 'Hat', 'Shirt', 'Hat'],
                'sales': [100, 200, 150, 75, 225, None]}  # Include a missing value (None) for demonstration
        
        # Create DataFrame from the data
        df = pd.DataFrame(data)
        
        # Print the original DataFrame
        print('\r\nOriginal df:')
        print(df)
        
        # Return the created DataFrame
        return df
    
    
    def get_most_expensive(group):
        # Define a function to retrieve the row with the maximum sales value within each group
        return group.loc[group['sales'].idxmax()]
    
    def test_apply_function(df):
        # Group the DataFrame by 'store' and apply the custom function to get the most expensive item in each store
        most_expensive_per_store = df.groupby('store').apply(get_most_expensive)
        
        # Print the result
        print(most_expensive_per_store)
    
    
    
    # The following code will only execute if this script is run directly,
    # not if it is imported as a module in another script.
    if __name__ == "__main__":
    
        # Call the function to create and display the DataFrame
        df = create_df_data()
    
        test_apply_function(df)
    
  3. Output.
    Original df:
         store product  sales
    0  Store A   Shirt  100.0
    1  Store A   Jeans  200.0
    2  Store B   Shirt  150.0
    3  Store B     Hat   75.0
    4  Store C   Shirt  225.0
    5  Store C     Hat    NaN
               store product  sales
    store                          
    Store A  Store A   Jeans  200.0
    Store B  Store B   Shirt  150.0
    Store C  Store C   Shirt  225.0

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.