How to Extract Specific Rows from Multiple CSV Files Using Pandas

In data analysis, extracting specific rows from a multitude of CSV files is a common task. Using Python and the powerful Pandas library, you can efficiently accomplish this task. Suppose you have a directory containing 500 CSV files, and you need to extract rows based on a certain criterion from a subset of files with a particular naming convention, such as files ending with ‘_add.csv‘. Here’s a step-by-step guide on how to achieve this:

1. Setting Up the Environment.

  1. Before we dive into the code, make sure you have Python and Pandas installed on your system.
  2. You can run the command pip show pandas to confirm whether it has been installed or not.
  3. The below output shows that pandas have been installed on your system.
    $ pip show pandas
    Name: pandas
    Version: 1.5.3
    Summary: Powerful data structures for data analysis, time series, and statistics
    Home-page: https://pandas.pydata.org
    Author: The Pandas Development Team
    Author-email: [email protected]
    License: BSD-3-Clause
    Location: /Users/songzhao/anaconda3/lib/python3.11/site-packages
    Requires: numpy, numpy, python-dateutil, pytz
    Required-by: bokeh, datashader, holoviews, hvplot, panel, seaborn, statsmodels, xarray
  4. If you can not find pandas on your system, you can install Pandas using the pip command like below.
    pip install pandas
  5. Next, import the required libraries in your Python script.
    import os
    import pandas as pd

2. Example Dataset.

  1. To illustrate the process, let’s consider a set of CSV files containing information about sales transactions.
  2. We want to extract rows where the value in the 10th column matches a specific criterion.
  3. Example CSV File 1: sales_data_1_add.csv.
    Date       ,Product,Price,Quantity,Total ,CustomerID,Category,Region,Discount,PaymentMethod
    2024-01-01,Apple  ,1.99 ,5       ,9.95  ,1001      ,Fruit   ,North  ,0.05    ,Credit Card   
    2024-01-02,Banana ,0.99 ,10      ,9.90  ,1002      ,Fruit   ,South  ,0.00    ,PayPal        
    2024-01-03,Orange ,2.49 ,3       ,7.47  ,1003      ,Fruit   ,West   ,0.10    ,Cash          
    2024-01-04,Apple  ,1.99 ,7       ,13.93 ,1004      ,Fruit   ,East   ,0.15    ,Debit Card    
    2024-01-05,Banana ,0.99 ,8       ,7.92  ,1005      ,Fruit   ,North  ,0.05    ,Cash          
    2024-01-06,Orange ,2.49 ,6       ,14.94 ,1006      ,Fruit   ,South  ,0.00    ,PayPal        
    2024-01-07,Apple  ,1.99 ,4       ,7.96  ,1007      ,Fruit   ,West   ,0.10    ,Credit Card   
    2024-01-08,Banana ,0.99 ,9       ,8.91  ,1008      ,Fruit   ,East   ,0.15    ,PayPal        
    2024-01-09,Orange ,2.49 ,2       ,4.98  ,1009      ,Fruit   ,North  ,0.05    ,Debit Card    
    2024-01-10,Apple  ,1.99 ,6       ,11.94 ,1010      ,Fruit   ,South  ,0.00    ,Cash          
    2024-01-11,Banana ,0.99 ,7       ,6.93  ,1011      ,Fruit   ,West   ,0.10    ,PayPal        
    2024-01-12,Orange ,2.49 ,5       ,12.45 ,1012      ,Fruit   ,East   ,0.15    ,Credit Card   
    2024-01-13,Apple  ,1.99 ,3       ,5.97  ,1013      ,Fruit   ,North  ,0.05    ,PayPal        
    2024-01-14,Banana ,0.99 ,6       ,5.94  ,1014      ,Fruit   ,South  ,0.00    ,Debit Card    
    2024-01-15,Orange ,2.49 ,4       ,9.96  ,1015      ,Fruit   ,West   ,0.10    ,Cash          
    2024-01-16,Apple  ,1.99 ,8       ,15.92 ,1016      ,Fruit   ,East   ,0.15    ,Credit Card   
    
  4. Example CSV File 2: sales_data_2_add.csv.
    Date       ,Product,Price,Quantity,Total ,CustomerID,Category,Region,Discount,PaymentMethod
    2024-01-17,Apple  ,1.99 ,5       ,9.95  ,1017      ,Fruit   ,North  ,0.05    ,Cash          
    2024-01-18,Banana ,0.99 ,10      ,9.90  ,1018      ,Fruit   ,South  ,0.00    ,Debit Card    
    2024-01-19,Orange ,2.49 ,3       ,7.47  ,1019      ,Fruit   ,West   ,0.10    ,Credit Card   
    2024-01-20,Apple  ,1.99 ,7       ,13.93 ,1020      ,Fruit   ,East   ,0.15    ,PayPal        
    2024-01-21,Banana ,0.99 ,8       ,7.92  ,1021      ,Fruit   ,North  ,0.05    ,Cash          
    2024-01-22,Orange ,2.49 ,6       ,14.94 ,1022      ,Fruit   ,South  ,0.00    ,Debit Card    
    2024-01-23,Apple  ,1.99 ,4       ,7.96  ,1023      ,Fruit   ,West   ,0.10    ,Credit Card   
    2024-01-24,Banana ,0.99 ,9       ,8.91  ,1024      ,Fruit   ,East   ,0.15    ,PayPal        
    2024-01-25,Orange ,2.49 ,2       ,4.98  ,1025      ,Fruit   ,North  ,0.05    ,Cash          
    2024-01-26,Apple  ,1.99 ,6       ,11.94 ,1026      ,Fruit   ,South  ,0.00    ,Debit Card    
    2024-01-27,Banana ,0.99 ,7       ,6.93  ,1027      ,Fruit   ,West   ,0.10    ,PayPal        
    2024-01-28,Orange ,2.49 ,5       ,12.45 ,1028      ,Fruit   ,East   ,0.15    ,Credit Card   
    2024-01-29,Apple  ,1.99 ,3       ,5.97  ,1029      ,Fruit   ,North  ,0.05    ,Cash          
    2024-01-30,Banana ,0.99 ,6       ,5.94  ,1030      ,Fruit   ,South  ,0.00    ,Debit Card    
    2024-01-31,Orange ,2.49 ,4       ,9.96  ,1031      ,Fruit   ,West   ,0.10    ,Credit Card   
    2024-02-01,Apple  ,1.99 ,8       ,15.92 ,1032      ,Fruit   ,East   ,0.15    ,PayPal        
    

3. Extracting Rows Based on Criteria.

  1. Now, let’s proceed with the Python code to extract rows based on a specific criterion.
  2. In this example, we’ll search for rows where the value in the 10th column (Condition) is “PayPal” from a subset of CSV files ending with ‘_add.csv‘.

3.1 Define File Paths and Parameters.

  1. Set up the file paths for your source directory containing the CSV files and the destination directory where you want to save the extracted rows.
  2. Define any parameters such as the column number based on which you want to extract rows (in this case, column number 10).
    source_directory = 'path/to/source/directory'
    destination_directory = 'path/to/destination/directory'
    target_files_pattern = '_add.csv'
    column_to_check = 10
    desired_value = "PayPal".strip().lower()
    

3.2 Iterate Through Files and Extract Rows.

  1. Iterate through each file in the source directory. Check if the file matches the specified pattern.
  2. If it does, load the CSV file into a Pandas DataFrame and filter rows based on the criterion (value in column number 10). Append the filtered rows to a list.
    extracted_rows = []
    
        for file in os.listdir(source_directory):
            if file.endswith(target_files_pattern):
                file_path = os.path.join(source_directory, file)
                df = pd.read_csv(file_path)
                #print(df)
                rows = len(df.index) 
                #print(rows)
    
                for i in range(rows):
                    column_value = df.iat[i, column_to_check - 1]
                    if column_value.strip().lower() == desired_value:
                        extracted_rows.append(df.iloc[i])
    

3.3 Concatenate and Save Extracted Rows.

  1. Concatenate all the extracted rows into a single DataFrame and save it to a new CSV file in the destination directory.
    result_df = pd.DataFrame(extracted_rows)
    
    print(result_df)
    
    result_df.to_csv(os.path.join(destination_directory, 'extracted_rows.csv'), index=False)
    

4. Full Example Source Code.

  1. First, save the example dataset in section 2 to two CSV files under the ./resource-files folder.
  2. Then create a Python file with the name how-to-extract-specific-rows-from-multiple-csv-files-using-pandas.py.
  3. Copy and paste the below Python source code to the Python file.
    import os, pandas as pd
    
    def load_rows_from_multiple_csv():
        # Define source and destination directories
        source_directory = './resource-files'
        destination_directory = './resource-files'
        
        # Define pattern for target CSV files and column number to check
        target_files_pattern = '_add.csv'
        column_to_check = 10
        
        # Define desired value after stripping whitespace and converting to lowercase
        desired_value = "PayPal".strip().lower()
    
        # Initialize list to store extracted rows
        extracted_rows = []
    
        # Iterate through files in the source directory
        for file in os.listdir(source_directory):
            # Check if the file matches the target pattern
            if file.endswith(target_files_pattern):
                file_path = os.path.join(source_directory, file)
                df = pd.read_csv(file_path)
                
                # Get the number of rows in the DataFrame
                rows = len(df.index) 
    
                # Iterate through each row
                for i in range(rows):
                    # Extract the value in the specified column
                    column_value = df.iat[i, column_to_check - 1]
                    
                    # Check if the value matches the desired value after stripping whitespace and converting to lowercase
                    if column_value.strip().lower() == desired_value:
                        # If match, append the entire row to the list of extracted rows
                        extracted_rows.append(df.iloc[i])
    
        # Create a DataFrame from the list of extracted rows
        result_df = pd.DataFrame(extracted_rows)
    
        # Print the resulting DataFrame
        print(result_df)
    
        # Write the resulting DataFrame to a CSV file in the destination directory
        result_df.to_csv(os.path.join(destination_directory, 'extracted_rows.csv'), index=False)
    
    
    if __name__ == "__main__":
        # Call the function to load rows from multiple CSV files
        load_rows_from_multiple_csv()
  4. When you run the above code, it will generate the target file ./resource-files/extracted_rows.csv.
  5. Below is the above-generated file content.
    Date       ,Product,Price,Quantity,Total ,CustomerID,Category,Region,Discount,PaymentMethod
    2024-01-20,Apple  ,1.99,7,13.93,1020,Fruit   ,East   ,0.15,PayPal        
    2024-01-24,Banana ,0.99,9,8.91,1024,Fruit   ,East   ,0.15,PayPal        
    2024-01-27,Banana ,0.99,7,6.93,1027,Fruit   ,West   ,0.1,PayPal        
    2024-02-01,Apple  ,1.99,8,15.92,1032,Fruit   ,East   ,0.15,PayPal        
    2024-01-02,Banana ,0.99,10,9.9,1002,Fruit   ,South  ,0.0,PayPal        
    2024-01-06,Orange ,2.49,6,14.94,1006,Fruit   ,South  ,0.0,PayPal        
    2024-01-08,Banana ,0.99,9,8.91,1008,Fruit   ,East   ,0.15,PayPal        
    2024-01-11,Banana ,0.99,7,6.93,1011,Fruit   ,West   ,0.1,PayPal        
    2024-01-13,Apple  ,1.99,3,5.97,1013,Fruit   ,North  ,0.05,PayPal        
    

5. Conclusion.

  1. By utilizing the power of Pandas, we can efficiently extract specific rows from a subset of CSV files based on custom criteria.
  2. This method proves invaluable when dealing with large datasets spread across multiple files, enabling streamlined data extraction and analysis.
  3. Now you can adapt this code to your specific requirements, such as modifying the criteria or handling different file structures, to effectively extract the desired data from your CSV files.

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.