Pandas Read_Html Example

This article will tell you how to read Html files through a web page URL in the Python Pandas module. It will also show you an example, in this example, it will retrieve all the NBA player’s salaries from the ESPN website URL list such as http://www.espn.com/nba/salaries/_/page/1.

1. How To Read Html Web Page Data Through Python Pandas read_html() Method.

  1. Import python pandas module.
  2. Invoke the Python pandas module’s read_html(web_page_url) method to get and parse the web page.
  3. The pandas module’s read_html(web_page_url) method will return a list of the pandas DataFrame object.
  4. The web page must contain an Html table tag to construct the data tables, you should use the web browser inspector to inspect the source code to verify ( you can refer to the article How To Find Web Element Using Web Browser Inspector ). And the pandas module’s read_html(web_page_url) method will return all the data in the Html table tag in the DataFrame list.

2. Python Pandas read_html() Method Example.

  1. This example will get all the NBA player’s salaries from ESPN website URL page by page.
  2. For each page, it will check whether there are NBA players’ salaries data on the page or not, if there are no more data, the program will exit.
  3. It will retrieve and parse out the data on a separate page and then append them all to a global python pandas DataFrame object.
  4. When it parses out all the pages data, it will write the DataFrame object to a CSV file.
  5. The python file name is ReadHtmlFiles.py, it contains three functions, they are read_html_table(url), read_nba_salary_web_pages(), and parse_dataframe_list(df_list).
  6. Below is the full source code of the ReadHtmlFiles.py file, you can see each function’s comments for the detailed explanation.
    '''
    Created on Oct 30, 2021
    
    @author: songzhao
    '''
    
    import pandas as pd
    
    
    '''
    
     This function will invoke the pandas read_html() method to retrieve the web page specified by the page url.
     
     It will return a DataFrame list.
     
    '''
    def read_html_table(url):
        
        print('========== start read_html_table ==========')  
        
        print('========== url= ', url, ' ==========') 
        
        df_list = pd.read_html(url)
        
        print('========== end read_html_table ==========') 
        
        return df_list
        
        
    # This function will retrieve all the ESPN NBA player's salaries web page and parse them.    
    def read_nba_salary_web_pages():
        
        print('========== start read_nba_salary_web_pages ==========')  
        
        # Create an empty DataFrame object, this object will be used to store all the web page DataFrame objects.
        df_all = pd.DataFrame()
        
        # Define the base web page url.
        base_url = 'https://www.espn.com/nba/salaries/_/page/'
        
        # Define the page number.
        i = 1
        
        # Create the first web page URL.
        url = base_url + str(i)
        
        # Read and parse the web page URL, return a DataFrame objects list.
        df_list = read_html_table(url)
        
        # Get and parse the first page.
        (contain_data, df_all_one_page) = parse_dataframe_list(df_list)
        
        # When the web page contains NBA player's salary data.
        while contain_data :
            
            # Append the web page DataFrame object to the df_all object.
            df_all = df_all.append(df_all_one_page)
            
            # Web page number plus 1.
            i = i + 1
            
            # Create the next web page URL.
            url = base_url + str(i)  
            
            # Retrieve the next web page.
            df_list = read_html_table(url)
            
            # Parse the next web page DataFrame list.
            (contain_data, df_all_one_page) = parse_dataframe_list(df_list)
            
        # Print all the web page DataFrame object data.
        print('df_all = \r\n', df_all)
        
        # Reset the DataFrame object index.
        df_all = df_all.reset_index(drop = True)
        
        # Write all the DataFrame data to a CSV file. 
        df_all.to_csv('./nba_salary.csv')
            
            
        print('========== end read_nba_salary_web_pages ==========')    
            
          
    '''
     This function will parse out the DataFrame objects list and return a tuple object. 
     
     The first element in the return tuple is a boolean value, True means the DataFrame list contains data, False means the DataFrame list does not contain data.
     
     The second element in the return tuple is a DataFrame object that contains the web page table data.
    '''    
    def parse_dataframe_list(df_list):
        
        # Define a boolean variable.
        contain_data = True
        
        # Create an Empty DataFrame object.
        df_all_one_page = pd.DataFrame()
        
        # Get the DataFrame list size.
        size = len(df_list)
        
        # Loop in the DataFrame list.
        for i in range(0, size):
            
            # Get the DataFrame object in the list.
            df = df_list[i]
            
            # Append one DataFrame object in the list to the df_all_one_page object.
            df_all_one_page = df_all_one_page.append(df)
            
            # Get the DataFrame object index values.
            index = df.index
            
            # Get the index start and stop value.
            start = index.start
            stop = index.stop
            
            print('index start = ', start)
            print('index stop = ', stop)
            
            # stop == 1 means there is no more NBA player's salary data.
            if stop == 1:
                
                contain_data = False
                
            #print(df)    
         
        # Return the tuple data.    
        return (contain_data, df_all_one_page)           
        
    
    if __name__ == '__main__':
        
        read_nba_salary_web_pages()
  7. When you run the above example, you will see the below output in the console.
    ========== start read_nba_salary_web_pages ==========
    ========== start read_html_table ==========
    ========== url=  https://www.espn.com/nba/salaries/_/page/1  ==========
    ========== end read_html_table ==========
    index start =  0
    index stop =  44
    ......
    ========== start read_html_table ==========
    ========== url=  https://www.espn.com/nba/salaries/_/page/13  ==========
    ========== end read_html_table ==========
    index start =  0
    index stop =  2
    ========== start read_html_table ==========
    ========== url=  https://www.espn.com/nba/salaries/_/page/14  ==========
    ========== end read_html_table ==========
    index start =  0
    index stop =  1
    df_all = 
           0                       1                      2            3
    0    RK                    NAME                   TEAM       SALARY
    1     1       Stephen Curry, PG  Golden State Warriors  $45,780,966
    2     2        James Harden, SG          Brooklyn Nets  $44,310,840
    3     3           John Wall, PG        Houston Rockets  $44,310,840
    4     4   Russell Westbrook, PG     Los Angeles Lakers  $44,211,146
    ..  ...                     ...                    ...          ...
    41  478  Brandon Boston Jr., SG            LA Clippers     $925,258
    42  479           Luka Garza, C        Detroit Pistons     $925,258
    43  480      Marko Simonovic, C          Chicago Bulls     $925,258
    0    RK                    NAME                   TEAM       SALARY
    1   481         Ayo Dosunmu, SG          Chicago Bulls     $925,258
    
    [530 rows x 4 columns]
    ========== end read_nba_salary_web_pages ==========
  8. And it will also generate a CSV file nba_salary.csv in the current python file executing directory.

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.