How To Read Excel File With Python Pandas

This article will show you how to use the python pandas module to read Microsoft Excel file’s one worksheet, multiple worksheets, all worksheets, and specified worksheet columns data.

1. How To Use Python Pandas Module To Read Microsoft Excel Files.

  1. Open a terminal and run the below command to make sure you have installed the python pandas library in your python environment. You can refer to the article How To Install Python Package Numpy, Pandas, Scipy, Matplotlib On Windows, Mac, And Linux to learn more.
    $ pip show pandas
    Name: pandas
    Version: 1.2.3
    Summary: Powerful data structures for data analysis, time series, and statistics
    Home-page: https://pandas.pydata.org
    Author: None
    Author-email: None
    License: BSD
    Location: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages
    Requires: numpy, python-dateutil, pytz
    Required-by: xarray, vaex-core, streamlit, statsmodels, plotly-express, bqplot, altair
  2. Import the python pandas library in your python source code.
    import pandas as pd
  3. Invoke the python pandas module’s read_excel() function to read an excel file worksheet, the first parameter is the excel file path ( you can add r at the beginning of the file path string to avoid character escape issue, for example, the string r’C:\abc\read.xlsx’ will not treat \r as return character. ), the second parameter is the worksheet name. It returns a pandas.core.frame.DataFrame object.
    df = pd.read_excel(excel_file_path, worksheet_name)
  4. If you want to read multiple worksheets in one excel file, you can pass a worksheet name list to the second parameter of the read_excel() function. In this case, the function returns a python dictionary object, you can iterate the python dictionary object to get each worksheet data.
    df = pd.read_excel(excel_file_path, ['worksheet_1', 'worksheet_2'])
  5. When you pass None to the second parameter of the pandas read_excel function, it will return all the worksheets data in a python dictionary object.
    df = pd.read_excel(excel_file_path, None)
  6. If you just want to get the specified excel columns data, you can use the returned pandas.core.frame.DataFrame object’s DataFrame() function.
    # Get the pandas.core.frame.DataFrame object
    df = pd.read_excel(excel_file_path, worksheet_name)
    
    # Read the special columns in the worksheet.
    pd.DataFrame(df_one_sheet, columns= ['Name', 'Salary'])
  7. Resolve column name misalignment. The default output result of pandas will cause the problems of column misalignment or incomplete display of multiple rows and columns. These two problems can be solved by using the set_option function.

    pd.set_option('display.unicode.east_asian_width', True)
  8. Solve the problem of incomplete row and column display. By setting the display.max_rows and display.max_columns to the maximum number of rows and columns.
    pd.set_option('display.max.rows', 1000)
    pd.set_option('display.max.columns', 1000)

2. Read Excel File With Pandas Examples.

2.1 The Example Excel WorkSheet Data Introduction.

  1. In this example, there are 2 worksheets ( sheet_1, sheet_2 ) in the excel file.
  2. Below is the data in Excel sheet_1.
          Name   Hire Date  Salary
    0    jerry  2010-01-01   16000
    1      tom  2011-08-19    6000
    2    kevin  2009-02-08   13000
    3  richard  2012-03-19    5000
    4   jackie  2015-06-08   28000
    5   steven  2008-02-01   36000
    6     jack  2006-09-19    8000
    7     gary  2018-01-16   19000
    8     john  2017-10-01   16600
  3. Below is the data in Excel sheet_2.
               Job Title                  Job Description
    0  Software Engineer  Experience in java, python, c++
    1                 QA               WebDriver Selenium

2.2 Read Excel File With Pandas Examples.

  1. The example python file name is ReadExcelFiles.py, it contains 5 example functions, we will introduce them one by one.
2.2.1 read_one_excel_sheet().
  1. This function will read one worksheet data in the Excel file.
    '''
    Created on Oct 16, 2021
    
    @author: songzhao
    '''
    
    # Import the python pandas module.
    import pandas as pd
    
    excel_file_path = "./employee_info.xlsx"
    
    '''
    This function will read the excel file worksheet data with the specified value.
    '''
    def read_work_sheet(excel_file_path, work_sheet_name):
        
        # Align column names by setting display.unicode.east_asian_width to True.
        pd.set_option('display.unicode.east_asian_width', True)
        
        # Solve the problem of incomplete row and column display
        pd.set_option('display.max.rows', 1000) 
        pd.set_option('display.max.columns', 1000)
        
        # Invoke the pandas module read_excel method to load the specified excel file worksheet data.
        df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name)
        
        return df
    
    # This function print out the specified python object's data type.
    def print_object_type(prefix, object):
        
        object_type = type(object)
        
        print(prefix + str(object_type))
    
    
    # This function will read one worksheet in an excel file.
    def read_one_excel_sheet():
        
        print("****** read_one_excel_sheet() ******\r\n")
        
        # Define the Excel work sheet name.
        work_sheet_name = "sheet_1"
        
        # Get the Excel file work sheet object.
        df_one_sheet = read_work_sheet(excel_file_path, work_sheet_name)
        
        # Print out the work sheet data type.
        print_object_type("work_sheet_type: ", df_one_sheet)
    
        # Print out the loaded excel file work sheet data.
        print(df_one_sheet)
    
        print("=============================")
        
        # Print out the first 3 rows in the Excel sheet, if not specified then the default print out row count number is 5.
        print(df_one_sheet.head(3))
        
    
    if __name__ == '__main__':
        
        read_one_excel_sheet()
    
  2. Below is the above example execution output.
    ****** read_one_excel_sheet() ******
    
    work_sheet_type: <class 'pandas.core.frame.DataFrame'>
          Name   Hire Date  Salary
    0    jerry  2010-01-01   16000
    1      tom  2011-08-19    6000
    2    kevin  2009-02-08   13000
    3  richard  2012-03-19    5000
    4   jackie  2015-06-08   28000
    5   steven  2008-02-01   36000
    6     jack  2006-09-19    8000
    7     gary  2018-01-16   19000
    8     john  2017-10-01   16600
    =============================
        Name   Hire Date  Salary
    0  jerry  2010-01-01   16000
    1    tom  2011-08-19    6000
    2  kevin  2009-02-08   13000
2.2.2 read_multiple_excel_sheet().
  1. This function will read multiple Excel worksheets data.
    '''
    Created on Oct 16, 2021
    
    @author: songzhao
    '''
    
    # Import the python pandas module.
    import pandas as pd
    
    excel_file_path = "./employee_info.xlsx"
    
    '''
    This function will read the excel file worksheet data with the specified value.
    '''
    def read_work_sheet(excel_file_path, work_sheet_name):
        
        # Align column names by setting display.unicode.east_asian_width to True.
        pd.set_option('display.unicode.east_asian_width', True)
        
        # Solve the problem of incomplete row and column display
        pd.set_option('display.max.rows', 1000) 
        pd.set_option('display.max.columns', 1000)
        
        # Invoke the pandas module read_excel method to load the specified excel file worksheet data.
        df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name)
        
        return df
    
    # This function print out the specified python object's data type.
    def print_object_type(prefix, object):
        
        object_type = type(object)
        
        print(prefix + str(object_type))
       
    
    # This function will read multiple worksheets in one excel file.
    def read_multiple_excel_sheet():
        
        print("\r\n****** read_multiple_excel_sheet() ******\r\n")
         
        work_sheet_name_array = ["sheet_1", "sheet_2"]
        
        df_multiple_sheet = read_work_sheet(excel_file_path, work_sheet_name_array)
        
        df_sheet_1 = df_multiple_sheet['sheet_1']
        
        # Print out the work sheet data type.
        print_object_type("work_sheet_1_type: ", df_sheet_1)
        
        print("\n")
        
        print(df_sheet_1)
        
        print("\n\n")
        
        
        df_sheet_2 = df_multiple_sheet['sheet_2']
        
        # Print out the work sheet data type.
        print_object_type("work_sheet_2_type: ", df_sheet_2)    
        
        print("\n")
          
        print(df_sheet_2)
        
        
    if __name__ == '__main__':
        
        read_multiple_excel_sheet()
  2. Below is the above example source code execution result.
    ****** read_multiple_excel_sheet() ******
    
    work_sheet_1_type: <class 'pandas.core.frame.DataFrame'>
    
    
          Name   Hire Date  Salary
    0    jerry  2010-01-01   16000
    1      tom  2011-08-19    6000
    2    kevin  2009-02-08   13000
    3  richard  2012-03-19    5000
    4   jackie  2015-06-08   28000
    5   steven  2008-02-01   36000
    6     jack  2006-09-19    8000
    7     gary  2018-01-16   19000
    8     john  2017-10-01   16600
    
    
    
    work_sheet_2_type: <class 'pandas.core.frame.DataFrame'>
    
    
               Job Title                  Job Description
    0  Software Engineer  Experience in java, python, c++
    1                 QA               WebDriver Selenium
2.2.3 read_all_excel_sheet().
  1. This function will read all the Excel worksheets data.
    '''
    Created on Oct 16, 2021
    
    @author: songzhao
    '''
    
    # Import the python pandas module.
    import pandas as pd
    
    excel_file_path = "./employee_info.xlsx"
    
    '''
    This function will read the excel file worksheet data with the specified value.
    '''
    def read_work_sheet(excel_file_path, work_sheet_name):
        
        # Align column names by setting display.unicode.east_asian_width to True.
        pd.set_option('display.unicode.east_asian_width', True)
        
        # Solve the problem of incomplete row and column display
        pd.set_option('display.max.rows', 1000) 
        pd.set_option('display.max.columns', 1000)
        
        # Invoke the pandas module read_excel method to load the specified excel file worksheet data.
        df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name)
        
        return df
    
    # This function print out the specified python object's data type.
    def print_object_type(prefix, object):
        
        object_type = type(object)
        
        print(prefix + str(object_type))
        
        
    # This function will read all worksheets in an excel file.    
    def read_all_excel_sheet():
        
        print("****** read_all_excel_sheet() ******\r\n")
        
        # return a python dictionary object.
        df_all_sheet = read_work_sheet(excel_file_path, None)
        
        # loop in the python dictionary object.
        for key in df_all_sheet:
            
            # print out the key and value. key is the worksheet name, value is the workhsheet data.
            print(key)
            
            print(df_all_sheet[key])
            
            print("\n\n")
               
    
    if __name__ == '__main__':
        
        read_all_excel_sheet()
2.2.4 read_excel_columns_by_filtering_dataframe().
  1. This function will read all the data in one Excel worksheet, and then filter out 2 columns of data from the returned DataFrame object.
    '''
    Created on Oct 16, 2021
    
    @author: songzhao
    '''
    
    # Import the python pandas module.
    import pandas as pd
    
    excel_file_path = "./employee_info.xlsx"
    
    '''
    This function will read the excel file worksheet data with the specified value.
    '''
    def read_work_sheet(excel_file_path, work_sheet_name):
        
        # Align column names by setting display.unicode.east_asian_width to True.
        pd.set_option('display.unicode.east_asian_width', True)
        
        # Solve the problem of incomplete row and column display
        pd.set_option('display.max.rows', 1000) 
        pd.set_option('display.max.columns', 1000)
        
        # Invoke the pandas module read_excel method to load the specified excel file worksheet data.
        df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name)
        
        return df
    
    # This function print out the specified python object's data type.
    def print_object_type(prefix, object):
        
        object_type = type(object)
        
        print(prefix + str(object_type))
    
            
    # This function read the provided columns in the worksheet.        
    def read_excel_columns_by_filtering_dataframe():
        
        print("****** read_excel_columns_by_filtering_dataframe() ******\r\n")
        
        # Define the Excel work sheet name.
        work_sheet_name = "sheet_1"
        
        # Get the whole Excel file work sheet data.
        df_one_sheet = read_work_sheet(excel_file_path, work_sheet_name)
        
        # Return the 2 specified excel columns value in the above Excel worksheet.
        df_columns = pd.DataFrame(df_one_sheet, columns= ['Name', 'Salary'])
        
        # Print the Excel columns value.
        print(df_columns)
        
    
    if __name__ == '__main__':
        
        read_excel_columns_by_filtering_dataframe()
    
  2. Below is the above example execution output.
    ****** read_excel_columns_by_filtering_dataframe() ******
    
          Name  Salary
    0    jerry   16000
    1      tom    6000
    2    kevin   13000
    3  richard    5000
    4   jackie   28000
    5   steven   36000
    6     jack    8000
    7     gary   19000
    8     john   16600
    
2.2.5 read_excel_columns_by_pandas_read_excel_method().
  1. Besides using the python pandas module’s DataFrame class’s columns attributes to filter out the required columns from the whole worksheet data, you can also specify the columns when invoking the pandas.read_excel() method and provide the usecols parameter like below. This way is a more efficient way because it only returns the required columns from the Excel worksheet, not the whole data of the Excel file worksheet.
    df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name, usecols=['Name', 'Salary'])
  2. The usecols parameter’s value is a python list object, the list element value can be number index or the column name.
    # Get the specified columns in the Excel worksheet. 
    # usecols=[0, 1] return the first and second column. 
    df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name, usecols=[0, 1])
    
    # usecols=['Name', 'Salary'] return the Name, Salary column.
    df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name, usecols=['Name', 'Salary'])
  3. The below example source code will read the 2 columns ( ‘Name’, ‘Salary’ ) data only from the Excel worksheet and then print out each column and it’s value.
    '''
    Created on Oct 16, 2021
    
    @author: songzhao
    '''
    
    # Import the python pandas module.
    import pandas as pd
    
    excel_file_path = "./employee_info.xlsx"
        
    def read_excel_columns_by_pandas_read_excel_method():
        
            # Align column names by setting display.unicode.east_asian_width to True.
        pd.set_option('display.unicode.east_asian_width', True)
        
        # Solve the problem of incomplete row and column display
        pd.set_option('display.max.rows', 1000) 
        pd.set_option('display.max.columns', 1000)
        
        # Define the Excel work sheet name.
        work_sheet_name = "sheet_1"
        
        # Invoke the pandas module read_excel method to load the specified excel file worksheet data.
        
        # Get the specified columns in the Excel worksheet. usecols=[0, 1] return the first and second column, usecols=['Name', 'Salary'] return the Name, Salary column. 
        #df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name, usecols=[0, 1])
        df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name, usecols=['Name', 'Salary'])
        
        print(df)
        
        print_dataframe_object(df)
        
    
    '''
       Loop the passed in DataFrame object's columns and print out each column name and column value. 
    '''
    def print_dataframe_object(df):
        
        print('============ print_dataframe_object ================')
        
        for col in df.columns:
            
            print('column : ', col)
            
            series = df[col]
            
            print('series.index : ', series.index)
            
            for index in series.index:
                
                value = series[index]
                
                print(index, ' : ', value)
            
            print('\n\n')
     
    
    if __name__ == '__main__':
    
        read_excel_columns_by_pandas_read_excel_method()
    
  4. Below is the above example execution output.
          Name  Salary
    0    jerry   16000
    1      tom    6000
    2    kevin   13000
    3  richard    5000
    4   jackie   28000
    5   steven   36000
    6     jack    8000
    7     gary   19000
    8     john   16600
    ============ print_dataframe_object ================
    column :  Name
    series.index :  RangeIndex(start=0, stop=9, step=1)
    0  :  jerry
    1  :  tom
    2  :  kevin
    3  :  richard
    4  :  jackie
    5  :  steven
    6  :  jack
    7  :  gary
    8  :  john
    
    
    column :  Salary
    series.index :  RangeIndex(start=0, stop=9, step=1)
    0  :  16000
    1  :  6000
    2  :  13000
    3  :  5000
    4  :  28000
    5  :  36000
    6  :  8000
    7  :  19000
    8  :  16600
    
2.2.6 read_excel_columns_set_column_index().
  1. When you call the python pandas module’s read_excel() method, you can pass a header input parameter, this parameter value defines which row data is used as the column index.
  2. Below is an example, I pass header=1 to the read_excel() function, so it will use the first row’s data in the Excel worksheet as the column index.
    '''
    Created on Oct 16, 2021
    
    @author: songzhao
    '''
    
    # Import the python pandas module.
    import pandas as pd
    
    excel_file_path = "./employee_info.xlsx" 
        
    def read_excel_columns_set_column_index():
        
        # Align column names by setting display.unicode.east_asian_width to True.
        pd.set_option('display.unicode.east_asian_width', True)
        
        # Solve the problem of incomplete row and column display
        pd.set_option('display.max.rows', 1000) 
        pd.set_option('display.max.columns', 1000)
        
        # Define the Excel work sheet name.
        work_sheet_name = "sheet_1"
        
        # Invoke the pandas module read_excel method to load the specified excel file worksheet data.
        
        # Set the second column as the row index (index_col=1).
        #df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name, index_col=0)
        
        # Set the first row as the column index (header=1).You can set header value to None to use number as the column index.
        df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name, header=1)
        
        print(df)
        
        print_dataframe_object(df)
        
    
    '''
       Loop the passed in DataFrame object's columns and print out each column name and column value. 
    '''
    def print_dataframe_object(df):
        
        print('============ print_dataframe_object ================')
        
        for col in df.columns:
            
            print('column : ', col)
            
            series = df[col]
            
            print('series.index : ', series.index)
            
            for index in series.index:
                
                value = series[index]
                
                print(index, ' : ', value)
            
            print('\n\n')
     
    
    if __name__ == '__main__':
        
        read_excel_columns_set_column_index()
  3. Below is the above example execution output. The Excel worksheet first row’s value jerry, 2010-01-01, 16000 will be used as the column index.
         jerry  2010-01-01  16000
    0      tom  2011-08-19   6000
    1    kevin  2009-02-08  13000
    2  richard  2012-03-19   5000
    3   jackie  2015-06-08  28000
    4   steven  2008-02-01  36000
    5     jack  2006-09-19   8000
    6     gary  2018-01-16  19000
    7     john  2017-10-01  16600
    ============ print_dataframe_object ================
    column :  jerry
    series.index :  RangeIndex(start=0, stop=8, step=1)
    0  :  tom
    1  :  kevin
    2  :  richard
    3  :  jackie
    4  :  steven
    5  :  jack
    6  :  gary
    7  :  john
    
    column :  2010-01-01
    series.index :  RangeIndex(start=0, stop=8, step=1)
    0  :  2011-08-19
    1  :  2009-02-08
    2  :  2012-03-19
    3  :  2015-06-08
    4  :  2008-02-01
    5  :  2006-09-19
    6  :  2018-01-16
    7  :  2017-10-01
    
    column :  16000
    series.index :  RangeIndex(start=0, stop=8, step=1)
    0  :  6000
    1  :  13000
    2  :  5000
    3  :  28000
    4  :  36000
    5  :  8000
    6  :  19000
    7  :  16600
2.2.7 read_excel_columns_set_row_index().
  1. This function will pass the input parameter index_col=0 to the pandas.read_excel() method.
  2. It will use the first Excel worksheet column ( ‘Name’ ) as each row’s index. Below is the source code.
    '''
    Created on Oct 16, 2021
    
    @author: songzhao
    '''
    
    # Import the python pandas module.
    import pandas as pd
    
    excel_file_path = "./employee_info.xlsx"
        
    def read_excel_columns_set_row_index():
        
        # Align column names by setting display.unicode.east_asian_width to True.
        pd.set_option('display.unicode.east_asian_width', True)
        
        # Solve the problem of incomplete row and column display
        pd.set_option('display.max.rows', 1000) 
        pd.set_option('display.max.columns', 1000)
        
        # Define the Excel work sheet name.
        work_sheet_name = "sheet_1"
        
        # Invoke the pandas module read_excel method to load the specified excel file worksheet data.
        
        # Set the second column as the row index (index_col=1).
        df = pd.read_excel(excel_file_path, sheet_name = work_sheet_name, index_col=0)
        
        print(df)
        
        print_dataframe_object(df)     
        
    
    '''
       Loop the passed in DataFrame object's columns and print out each column name and column value. 
    '''
    def print_dataframe_object(df):
        
        print('============ print_dataframe_object ================')
        
        for col in df.columns:
            
            print('column : ', col)
            
            series = df[col]
            
            print('series.index : ', series.index)
            
            for index in series.index:
                
                value = series[index]
                
                print(index, ' : ', value)
            
            print('\n\n')
     
    
    if __name__ == '__main__':
        
        read_excel_columns_set_row_index()
    
  3. When you run the above example, it will generate the below output. You can see for each column data, the row data index is ‘Name’.
              Hire Date  Salary
    Name                       
    jerry    2010-01-01   16000
    tom      2011-08-19    6000
    kevin    2009-02-08   13000
    richard  2012-03-19    5000
    jackie   2015-06-08   28000
    steven   2008-02-01   36000
    jack     2006-09-19    8000
    gary     2018-01-16   19000
    john     2017-10-01   16600
    ============ print_dataframe_object ================
    column :  Hire Date
    series.index :  Index(['jerry', 'tom', 'kevin', 'richard', 'jackie', 'steven', 'jack', 'gary',
           'john'],
          dtype='object', name='Name')
    jerry  :  2010-01-01
    tom  :  2011-08-19
    kevin  :  2009-02-08
    richard  :  2012-03-19
    jackie  :  2015-06-08
    steven  :  2008-02-01
    jack  :  2006-09-19
    gary  :  2018-01-16
    john  :  2017-10-01
    
    
    
    column :  Salary
    series.index :  Index(['jerry', 'tom', 'kevin', 'richard', 'jackie', 'steven', 'jack', 'gary',
           'john'],
          dtype='object', name='Name')
    jerry  :  16000
    tom  :  6000
    kevin  :  13000
    richard  :  5000
    jackie  :  28000
    steven  :  36000
    jack  :  8000
    gary  :  19000
    john  :  16600

References

  1. Python Pandas Read/Write CSV File And Convert To Excel File Example

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.