How to Extract Specific Rows from Multiple CSV Files Using Pandas

In data analysis, extracting specific rows from a multitude of CSV files is a common task. Using Python and the powerful Pandas library, you can efficiently accomplish this task. Suppose you have a directory containing 500 CSV files, and you need to extract rows based on a certain criterion from a subset of files with a particular naming convention, such as files ending with ‘_add.csv‘. Here’s a step-by-step guide on how to achieve this:

Table of Contents

1. Setting Up the Environment.

Before we dive into the code, make sure you have Python and Pandas installed on your system.

You can run the command pip show pandas to confirm whether it has been installed or not.

The below output shows that pandas have been installed on your system.

$ pip show pandas
Name: pandas
Version: 1.5.3
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: The Pandas Development Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /Users/songzhao/anaconda3/lib/python3.11/site-packages
Requires: numpy, numpy, python-dateutil, pytz
Required-by: bokeh, datashader, holoviews, hvplot, panel, seaborn, statsmodels, xarray

If you can not find pandas on your system, you can install Pandas using the pip command like below.
```
pip install pandas
```

Next, import the required libraries in your Python script.
```
import os
import pandas as pd
```

2. Example Dataset.

To illustrate the process, let’s consider a set of CSV files containing information about sales transactions.
We want to extract rows where the value in the 10th column matches a specific criterion.

Example CSV File 1: sales_data_1_add.csv.

Date       ,Product,Price,Quantity,Total ,CustomerID,Category,Region,Discount,PaymentMethod
2024-01-01,Apple  ,1.99 ,5       ,9.95  ,1001      ,Fruit   ,North  ,0.05    ,Credit Card   
2024-01-02,Banana ,0.99 ,10      ,9.90  ,1002      ,Fruit   ,South  ,0.00    ,PayPal        
2024-01-03,Orange ,2.49 ,3       ,7.47  ,1003      ,Fruit   ,West   ,0.10    ,Cash          
2024-01-04,Apple  ,1.99 ,7       ,13.93 ,1004      ,Fruit   ,East   ,0.15    ,Debit Card    
2024-01-05,Banana ,0.99 ,8       ,7.92  ,1005      ,Fruit   ,North  ,0.05    ,Cash          
2024-01-06,Orange ,2.49 ,6       ,14.94 ,1006      ,Fruit   ,South  ,0.00    ,PayPal        
2024-01-07,Apple  ,1.99 ,4       ,7.96  ,1007      ,Fruit   ,West   ,0.10    ,Credit Card   
2024-01-08,Banana ,0.99 ,9       ,8.91  ,1008      ,Fruit   ,East   ,0.15    ,PayPal        
2024-01-09,Orange ,2.49 ,2       ,4.98  ,1009      ,Fruit   ,North  ,0.05    ,Debit Card    
2024-01-10,Apple  ,1.99 ,6       ,11.94 ,1010      ,Fruit   ,South  ,0.00    ,Cash          
2024-01-11,Banana ,0.99 ,7       ,6.93  ,1011      ,Fruit   ,West   ,0.10    ,PayPal        
2024-01-12,Orange ,2.49 ,5       ,12.45 ,1012      ,Fruit   ,East   ,0.15    ,Credit Card   
2024-01-13,Apple  ,1.99 ,3       ,5.97  ,1013      ,Fruit   ,North  ,0.05    ,PayPal        
2024-01-14,Banana ,0.99 ,6       ,5.94  ,1014      ,Fruit   ,South  ,0.00    ,Debit Card    
2024-01-15,Orange ,2.49 ,4       ,9.96  ,1015      ,Fruit   ,West   ,0.10    ,Cash          
2024-01-16,Apple  ,1.99 ,8       ,15.92 ,1016      ,Fruit   ,East   ,0.15    ,Credit Card

Example CSV File 2: sales_data_2_add.csv.

Date       ,Product,Price,Quantity,Total ,CustomerID,Category,Region,Discount,PaymentMethod
2024-01-17,Apple  ,1.99 ,5       ,9.95  ,1017      ,Fruit   ,North  ,0.05    ,Cash          
2024-01-18,Banana ,0.99 ,10      ,9.90  ,1018      ,Fruit   ,South  ,0.00    ,Debit Card    
2024-01-19,Orange ,2.49 ,3       ,7.47  ,1019      ,Fruit   ,West   ,0.10    ,Credit Card   
2024-01-20,Apple  ,1.99 ,7       ,13.93 ,1020      ,Fruit   ,East   ,0.15    ,PayPal        
2024-01-21,Banana ,0.99 ,8       ,7.92  ,1021      ,Fruit   ,North  ,0.05    ,Cash          
2024-01-22,Orange ,2.49 ,6       ,14.94 ,1022      ,Fruit   ,South  ,0.00    ,Debit Card    
2024-01-23,Apple  ,1.99 ,4       ,7.96  ,1023      ,Fruit   ,West   ,0.10    ,Credit Card   
2024-01-24,Banana ,0.99 ,9       ,8.91  ,1024      ,Fruit   ,East   ,0.15    ,PayPal        
2024-01-25,Orange ,2.49 ,2       ,4.98  ,1025      ,Fruit   ,North  ,0.05    ,Cash          
2024-01-26,Apple  ,1.99 ,6       ,11.94 ,1026      ,Fruit   ,South  ,0.00    ,Debit Card    
2024-01-27,Banana ,0.99 ,7       ,6.93  ,1027      ,Fruit   ,West   ,0.10    ,PayPal        
2024-01-28,Orange ,2.49 ,5       ,12.45 ,1028      ,Fruit   ,East   ,0.15    ,Credit Card   
2024-01-29,Apple  ,1.99 ,3       ,5.97  ,1029      ,Fruit   ,North  ,0.05    ,Cash          
2024-01-30,Banana ,0.99 ,6       ,5.94  ,1030      ,Fruit   ,South  ,0.00    ,Debit Card    
2024-01-31,Orange ,2.49 ,4       ,9.96  ,1031      ,Fruit   ,West   ,0.10    ,Credit Card   
2024-02-01,Apple  ,1.99 ,8       ,15.92 ,1032      ,Fruit   ,East   ,0.15    ,PayPal

3. Extracting Rows Based on Criteria.

Now, let’s proceed with the Python code to extract rows based on a specific criterion.

In this example, we’ll search for rows where the value in the 10th column (Condition) is “PayPal” from a subset of CSV files ending with ‘_add.csv‘.

3.1 Define File Paths and Parameters.

Set up the file paths for your source directory containing the CSV files and the destination directory where you want to save the extracted rows.

Define any parameters such as the column number based on which you want to extract rows (in this case, column number 10).

source_directory = 'path/to/source/directory'
destination_directory = 'path/to/destination/directory'
target_files_pattern = '_add.csv'
column_to_check = 10
desired_value = "PayPal".strip().lower()

3.2 Iterate Through Files and Extract Rows.

Iterate through each file in the source directory. Check if the file matches the specified pattern.

If it does, load the CSV file into a Pandas DataFrame and filter rows based on the criterion (value in column number 10). Append the filtered rows to a list.

extracted_rows = []

    for file in os.listdir(source_directory):
        if file.endswith(target_files_pattern):
            file_path = os.path.join(source_directory, file)
            df = pd.read_csv(file_path)
            #print(df)
            rows = len(df.index) 
            #print(rows)

            for i in range(rows):
                column_value = df.iat[i, column_to_check - 1]
                if column_value.strip().lower() == desired_value:
                    extracted_rows.append(df.iloc[i])

3.3 Concatenate and Save Extracted Rows.

Concatenate all the extracted rows into a single DataFrame and save it to a new CSV file in the destination directory.

result_df = pd.DataFrame(extracted_rows)

print(result_df)

result_df.to_csv(os.path.join(destination_directory, 'extracted_rows.csv'), index=False)

4. Full Example Source Code.

First, save the example dataset in section 2 to two CSV files under the ./resource-files folder.
Then create a Python file with the name how-to-extract-specific-rows-from-multiple-csv-files-using-pandas.py.

Copy and paste the below Python source code to the Python file.

import os, pandas as pd

def load_rows_from_multiple_csv():
    # Define source and destination directories
    source_directory = './resource-files'
    destination_directory = './resource-files'
    
    # Define pattern for target CSV files and column number to check
    target_files_pattern = '_add.csv'
    column_to_check = 10
    
    # Define desired value after stripping whitespace and converting to lowercase
    desired_value = "PayPal".strip().lower()

    # Initialize list to store extracted rows
    extracted_rows = []

    # Iterate through files in the source directory
    for file in os.listdir(source_directory):
        # Check if the file matches the target pattern
        if file.endswith(target_files_pattern):
            file_path = os.path.join(source_directory, file)
            df = pd.read_csv(file_path)
            
            # Get the number of rows in the DataFrame
            rows = len(df.index) 

            # Iterate through each row
            for i in range(rows):
                # Extract the value in the specified column
                column_value = df.iat[i, column_to_check - 1]
                
                # Check if the value matches the desired value after stripping whitespace and converting to lowercase
                if column_value.strip().lower() == desired_value:
                    # If match, append the entire row to the list of extracted rows
                    extracted_rows.append(df.iloc[i])

    # Create a DataFrame from the list of extracted rows
    result_df = pd.DataFrame(extracted_rows)

    # Print the resulting DataFrame
    print(result_df)

    # Write the resulting DataFrame to a CSV file in the destination directory
    result_df.to_csv(os.path.join(destination_directory, 'extracted_rows.csv'), index=False)


if __name__ == "__main__":
    # Call the function to load rows from multiple CSV files
    load_rows_from_multiple_csv()

When you run the above code, it will generate the target file ./resource-files/extracted_rows.csv.

Below is the above-generated file content.

Date       ,Product,Price,Quantity,Total ,CustomerID,Category,Region,Discount,PaymentMethod
2024-01-20,Apple  ,1.99,7,13.93,1020,Fruit   ,East   ,0.15,PayPal        
2024-01-24,Banana ,0.99,9,8.91,1024,Fruit   ,East   ,0.15,PayPal        
2024-01-27,Banana ,0.99,7,6.93,1027,Fruit   ,West   ,0.1,PayPal        
2024-02-01,Apple  ,1.99,8,15.92,1032,Fruit   ,East   ,0.15,PayPal        
2024-01-02,Banana ,0.99,10,9.9,1002,Fruit   ,South  ,0.0,PayPal        
2024-01-06,Orange ,2.49,6,14.94,1006,Fruit   ,South  ,0.0,PayPal        
2024-01-08,Banana ,0.99,9,8.91,1008,Fruit   ,East   ,0.15,PayPal        
2024-01-11,Banana ,0.99,7,6.93,1011,Fruit   ,West   ,0.1,PayPal        
2024-01-13,Apple  ,1.99,3,5.97,1013,Fruit   ,North  ,0.05,PayPal

5. Conclusion.

By utilizing the power of Pandas, we can efficiently extract specific rows from a subset of CSV files based on custom criteria.

This method proves invaluable when dealing with large datasets spread across multiple files, enabling streamlined data extraction and analysis.
Now you can adapt this code to your specific requirements, such as modifying the criteria or handling different file structures, to effectively extract the desired data from your CSV files.