When working with multiple datasets in Python using the Pandas library, you might encounter scenarios where you need to transform and replace values in a column based on multiple matching conditions. In this article, we’ll explore a real-world example and demonstrate how to achieve this using Pandas.
1. Problem Statement.
- We have two datasets, A and B, and we want to replace values in the `project_id` column of dataset A based on matching conditions with dataset B.
- The matching conditions involve comparing multiple columns in both datasets.
2. Example Datasets.
- Let’s consider the following datasets A and B:
- Dataset A.
Year ID delivery Gen type vendor project_id 0 2022 BR Q2 2022 L aa d BR2 aa1 Q2 2022 - L 1 2022 BR Q2 2022 L dd d BR2 dd1 Q2 2022 - L 2 2022 BR Q2 2022 L dd d BR2 dd2 Q2 2022 - L 3 2022 BR Q3 2022 L bb d BR2 bb1 Q3 2022 - L 4 2022 BR Q4 2022 L aa d BR2 aa1 Q4 2022 - L 5 2022 BR Q4 2022 L dd nd BR2 dd1 Q4 2022 - L
- Dataset B.
Tags ID qtr TYPE msc Project Name 0 2022 BOLOL Q4 2022 aa d BB H_AA01 Q4 2022 1 2022 BR Q4 2022 dd nd BR2 H_DD_nd02 Q4 2022 2 2022 BR Q3 2022 bb d BR2 BB01 Q3.2022 3 2022 BR Q2 2022 dd d BR2 H_DD01 Q2 2022 4 2022 BR Q2 2022 dd d BR2 H_DD02 Q2 2022 5 2022 BR Q2 2022 aa d BR2 H_AA01 Q2 2022
3. Example Source Code.
- We can use the `merge` function in Pandas to combine the datasets based on matching conditions and then update the `project_id` column in dataset A.
- Here’s the step-by-step process:
- Merge datasets A and B based on matching columns.
- Update the `project_id` column in dataset A with the corresponding values from dataset B.
- Handle cases where there are no matches to avoid introducing NaN values.
- Below is the example source code.
# import the pandas library. import pandas as pd # define the dataset A. data_A = { 'Year': [2022, 2022, 2022, 2022, 2022, 2022], 'ID': ['BR', 'BR', 'BR', 'BR', 'BR', 'BR'], 'delivery': ['Q2 2022', 'Q2 2022', 'Q2 2022', 'Q3 2022', 'Q4 2022', 'Q4 2022'], 'Gen': ['L', 'L', 'L', 'L', 'L', 'L'], 'type': ['aa', 'dd', 'dd', 'bb', 'aa', 'dd'], 'vendor': ['d', 'd', 'd', 'd', 'd', 'nd'], 'project_id': ['BR2 aa1 Q2 2022 - L', 'BR2 dd1 Q2 2022 - L', 'BR2 dd2 Q2 2022 - L', 'BR2 bb1 Q3 2022 - L', 'BR2 aa1 Q4 2022 - L', 'BR2 dd1 Q4 2022 - L'] } df_A = pd.DataFrame(data_A) print(df_A) print("========================================================================================") # define the Dataset B data_B = { 'Project Name': ['BB H_AA01 Q4 2022', 'BR2 H_DD_nd02 Q4 2022', 'BR2 BB01 Q3.2022', 'BR2 H_DD01 Q2 2022', 'BR2 H_DD02 Q2 2022', 'BR2 H_AA01 Q2 2022'], 'Tags': [2022, 2022, 2022, 2022, 2022, 2022], 'ID': ['BOLOL', 'BR', 'BR', 'BR', 'BR', 'BR'], 'qtr': ['Q4 2022', 'Q4 2022', 'Q3 2022', 'Q2 2022', 'Q2 2022', 'Q2 2022'], 'TYPE': ['aa', 'dd', 'bb', 'dd', 'dd', 'aa'], 'msc': ['d', 'nd', 'd', 'd', 'd', 'd'], 'NUM': [1, 2, 1, 1, 2, 1] } df_B = pd.DataFrame(data_B) df_B = df_B[['Tags', 'ID', 'qtr', 'TYPE', 'msc', 'Project Name']] print(df_B) print("========================================================================================") # Step 1: Merge datasets A and B based on matching columns merged_df = pd.merge(df_A[['Year', 'ID', 'delivery', 'type', 'vendor', 'project_id', 'Gen']], df_B[['Tags', 'ID', 'qtr', 'TYPE', 'msc', 'Project Name']], left_on=['Year', 'ID', 'delivery', 'type', 'vendor'], right_on=['Tags', 'ID', 'qtr', 'TYPE', 'msc'], how='left') print(merged_df) print("========================================================================================") # define the columns list that will be used to check duplicate rows. duplicate_subset=['Year', 'ID', 'delivery', 'type', 'vendor', 'Tags', 'ID', 'qtr', 'TYPE', 'msc', 'Project Name'] # drop duplicate rows based on the above columns. remove_duplicate_df = merged_df.drop_duplicates(subset=duplicate_subset).reset_index(drop=True) print("remove_duplicate_df: ") print(remove_duplicate_df) print("========================================================================================") # if the Project Name is none, then use the project_id to replace it. remove_duplicate_df['project_name'] = remove_duplicate_df['Project Name'].fillna(remove_duplicate_df['project_id']) print("remove_duplicate_df replace none value Project Name: ") print(remove_duplicate_df) print("========================================================================================") # get the final data set with the related columns. final_df = remove_duplicate_df[['Year', 'ID', 'delivery', 'type', 'vendor', 'project_name', 'Gen']] print("final_df: ") print(final_df)
- When you run the above source code, you will get the below output.
Year ID delivery Gen type vendor project_id 0 2022 BR Q2 2022 L aa d BR2 aa1 Q2 2022 - L 1 2022 BR Q2 2022 L dd d BR2 dd1 Q2 2022 - L 2 2022 BR Q2 2022 L dd d BR2 dd2 Q2 2022 - L 3 2022 BR Q3 2022 L bb d BR2 bb1 Q3 2022 - L 4 2022 BR Q4 2022 L aa d BR2 aa1 Q4 2022 - L 5 2022 BR Q4 2022 L dd nd BR2 dd1 Q4 2022 - L ======================================================================================== Tags ID qtr TYPE msc Project Name 0 2022 BOLOL Q4 2022 aa d BB H_AA01 Q4 2022 1 2022 BR Q4 2022 dd nd BR2 H_DD_nd02 Q4 2022 2 2022 BR Q3 2022 bb d BR2 BB01 Q3.2022 3 2022 BR Q2 2022 dd d BR2 H_DD01 Q2 2022 4 2022 BR Q2 2022 dd d BR2 H_DD02 Q2 2022 5 2022 BR Q2 2022 aa d BR2 H_AA01 Q2 2022 ======================================================================================== Year ID delivery type vendor project_id Gen Tags qtr TYPE msc Project Name 0 2022 BR Q2 2022 aa d BR2 aa1 Q2 2022 - L L 2022.0 Q2 2022 aa d BR2 H_AA01 Q2 2022 1 2022 BR Q2 2022 dd d BR2 dd1 Q2 2022 - L L 2022.0 Q2 2022 dd d BR2 H_DD01 Q2 2022 2 2022 BR Q2 2022 dd d BR2 dd1 Q2 2022 - L L 2022.0 Q2 2022 dd d BR2 H_DD02 Q2 2022 3 2022 BR Q2 2022 dd d BR2 dd2 Q2 2022 - L L 2022.0 Q2 2022 dd d BR2 H_DD01 Q2 2022 4 2022 BR Q2 2022 dd d BR2 dd2 Q2 2022 - L L 2022.0 Q2 2022 dd d BR2 H_DD02 Q2 2022 5 2022 BR Q3 2022 bb d BR2 bb1 Q3 2022 - L L 2022.0 Q3 2022 bb d BR2 BB01 Q3.2022 6 2022 BR Q4 2022 aa d BR2 aa1 Q4 2022 - L L NaN NaN NaN NaN NaN 7 2022 BR Q4 2022 dd nd BR2 dd1 Q4 2022 - L L 2022.0 Q4 2022 dd nd BR2 H_DD_nd02 Q4 2022 ======================================================================================== remove_duplicate_df: Year ID delivery type vendor project_id Gen Tags qtr TYPE msc Project Name 0 2022 BR Q2 2022 aa d BR2 aa1 Q2 2022 - L L 2022.0 Q2 2022 aa d BR2 H_AA01 Q2 2022 1 2022 BR Q2 2022 dd d BR2 dd1 Q2 2022 - L L 2022.0 Q2 2022 dd d BR2 H_DD01 Q2 2022 2 2022 BR Q2 2022 dd d BR2 dd1 Q2 2022 - L L 2022.0 Q2 2022 dd d BR2 H_DD02 Q2 2022 3 2022 BR Q3 2022 bb d BR2 bb1 Q3 2022 - L L 2022.0 Q3 2022 bb d BR2 BB01 Q3.2022 4 2022 BR Q4 2022 aa d BR2 aa1 Q4 2022 - L L NaN NaN NaN NaN NaN 5 2022 BR Q4 2022 dd nd BR2 dd1 Q4 2022 - L L 2022.0 Q4 2022 dd nd BR2 H_DD_nd02 Q4 2022 ======================================================================================== remove_duplicate_df replace none value Project Name: Year ID delivery type vendor project_id Gen Tags qtr TYPE msc Project Name project_name 0 2022 BR Q2 2022 aa d BR2 aa1 Q2 2022 - L L 2022.0 Q2 2022 aa d BR2 H_AA01 Q2 2022 BR2 H_AA01 Q2 2022 1 2022 BR Q2 2022 dd d BR2 dd1 Q2 2022 - L L 2022.0 Q2 2022 dd d BR2 H_DD01 Q2 2022 BR2 H_DD01 Q2 2022 2 2022 BR Q2 2022 dd d BR2 dd1 Q2 2022 - L L 2022.0 Q2 2022 dd d BR2 H_DD02 Q2 2022 BR2 H_DD02 Q2 2022 3 2022 BR Q3 2022 bb d BR2 bb1 Q3 2022 - L L 2022.0 Q3 2022 bb d BR2 BB01 Q3.2022 BR2 BB01 Q3.2022 4 2022 BR Q4 2022 aa d BR2 aa1 Q4 2022 - L L NaN NaN NaN NaN NaN BR2 aa1 Q4 2022 - L 5 2022 BR Q4 2022 dd nd BR2 dd1 Q4 2022 - L L 2022.0 Q4 2022 dd nd BR2 H_DD_nd02 Q4 2022 BR2 H_DD_nd02 Q4 2022 ======================================================================================== final_df: Year ID delivery type vendor project_name Gen 0 2022 BR Q2 2022 aa d BR2 H_AA01 Q2 2022 L 1 2022 BR Q2 2022 dd d BR2 H_DD01 Q2 2022 L 2 2022 BR Q2 2022 dd d BR2 H_DD02 Q2 2022 L 3 2022 BR Q3 2022 bb d BR2 BB01 Q3.2022 L 4 2022 BR Q4 2022 aa d BR2 aa1 Q4 2022 - L L 5 2022 BR Q4 2022 dd nd BR2 H_DD_nd02 Q4 2022 L
4. Conclusion.
- This article provided a step-by-step guide on how to transform and replace values in two Pandas DataFrame based on multiple conditions using a real-world example.
- By leveraging the `merge` function and handling missing values, you can efficiently update specific columns in a DataFrame.