Python Get Webpage Html Examples

Python offers several modules for getting a web page's HTML source code from a URL: the built-in urllib (urllib2 exists only in Python 2), the third-party urllib3, and requests. This article shows how to use these Python modules to get a web page's HTML source code, with examples.

1. Python Get Webpage Html Use urllib Module Example.

1.1 Python urllib Module Introduction.

  1. Python’s built-in urllib library is used to obtain the HTML source code of web pages.
  2. The urllib library is a standard library module of Python and does not need to be installed separately.

1.2 Python urllib Library request Module Introduction.

  1. Before you can use the urllib library request module, you need to import it into your source code.
    # import the urllib.request module
    import urllib.request
    
    # or use the below method to import the request module from the urllib library
    from urllib import request
  2. urllib.request.urlopen(url, timeout): this function sends a request to the URL and returns a response object.
    url: the URL of the web page to request.
    
    timeout: the response timeout in seconds. Pass it by keyword (for example timeout=10), because the second positional parameter of urlopen is data. If no response is received within this time, a timeout exception is raised.
  3. urllib.request.Request(url, headers): this class creates a request object that wraps the URL together with the request headers. For example, you can set the User-Agent header (which identifies the client, normally the user's browser) so that the request looks like it comes from a browser rather than from a script.
    url: the URL of the web page to request.
    
    headers: a dictionary of request headers, passed by keyword (headers=...).
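The two calls above can be combined as in the following minimal sketch. The helper names build_request and fetch_page, the User-Agent string, and the 10-second timeout are illustrative assumptions and are not part of urllib.

```python
import urllib.request

# Hypothetical browser-like User-Agent string, used only for illustration.
DEFAULT_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleFetcher/1.0"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Create a Request object that carries a custom User-Agent header."""
    # headers is passed by keyword: the second positional parameter is data.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_page(url, timeout=10):
    """Send the request and return the response body as bytes."""
    req = build_request(url)
    # timeout is given in seconds and is also passed by keyword.
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    req = build_request("https://www.bing.com")
    # Request stores header names via str.capitalize(), hence "User-agent".
    print(req.get_header("User-agent"))
```

Building the request separately from sending it makes it easy to inspect or adjust the headers before any network traffic happens.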

1.3 The http.client.HTTPResponse Class.

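  1. For HTTP and HTTPS URLs, urllib.request.urlopen() returns an http.client.HTTPResponse object. The items below describe its most useful methods, together with the bytes and string conversions you need to handle its data.
  2. read(): read the response body as bytes data.
  3. bytes.decode("utf-8"): convert the bytes data to string data (a method of the bytes object, not of the response).
  4. str.encode("utf-8"): convert string data to bytes data (a method of the string object).
  5. geturl(): return the URL of the response, which can differ from the requested URL after a redirect.
  6. getcode(): return the HTTP status code of the response.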
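The bytes and string conversions listed above can be sketched as follows. The helper name decode_body and its UTF-8 fallback are assumptions made for illustration. With a real response object, the charset argument could come from response.headers.get_content_charset().

```python
def decode_body(body, charset=None):
    """Convert response bytes to str, falling back to UTF-8."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    return body.decode(charset or "utf-8", errors="replace")


if __name__ == "__main__":
    raw = "<p>café</p>".encode("utf-8")   # str -> bytes
    print(raw)
    print(decode_body(raw))               # bytes -> str
```

Decoding with the wrong charset produces garbled text, so it is safer to use the charset declared by the server when one is available.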

1.4 Use Python urllib.request To Crawl Web Page Examples.

  1. This example shows how to use the Python urllib.request module to request a web page by URL and read its HTML content and response headers.
    import urllib.request
    
    # or 
    # from urllib import request
    
    # this function will request the url page and get the response object. 
    def urllib_request_web_page(url):
        
        # send request to the url web page and get the response object.
        response = urllib.request.urlopen(url)
        # print out the response object.
        print(response)
        
        # get the response url.
        resp_url = response.geturl()
        print('Response url : ', resp_url)
        
        # get the response code.
        resp_code = response.getcode()
        print('Response code : ', resp_code)  
        
        # get all the response headers in a list object.
        resp_headers_list = response.getheaders()
        # loop in the response headers.
        for resp_headers in resp_headers_list:
            # get the response header name 
            header_name = resp_headers[0]
            # get the response header value.
            header_value = resp_headers[1]
            
            print(resp_headers)
            print(header_name, ' = ', header_value)  
        
        
        # read the response content as a bytes object.
        # (the variable is not named "bytes" to avoid shadowing the built-in type.)
        html_bytes = response.read()
        print(html_bytes)
        
        # convert the bytes object to a string.
        html_content = html_bytes.decode('utf-8')
        print(html_content)
    
    
    if __name__ == '__main__':
        
        url = "https://www.bing.com"
        
        urllib_request_web_page(url)
  2. When you run the above source code, you should see output similar to the following.
    <http.client.HTTPResponse object at 0x7f9d82d7c910>
    Response url :  https://www.bing.com
    Response code :  200
    ('Cache-Control', 'private')
    Cache-Control  =  private
    ('Transfer-Encoding', 'chunked')
    Transfer-Encoding  =  chunked
    ......
    ......
    b'<!doctype html>......</html>'
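The example above assumes the request succeeds. A minimal sketch of the same fetch with error handling might look like the following. The function name fetch_html and the choice to return None on failure are assumptions for illustration, not a fixed convention.

```python
import socket
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the page HTML as a string, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # Use the charset declared by the server, else assume UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered with an error status such as 404 or 500.
        print("HTTP error:", err.code, err.reason)
    except urllib.error.URLError as err:
        # No valid response was received (DNS failure, refused connection, ...).
        print("URL error:", err.reason)
    except socket.timeout:
        # A timeout while reading the body is raised directly, not as URLError.
        print("Request timed out")
    return None
```

HTTPError is caught before URLError because it is a subclass of URLError. Calling fetch_html("https://www.bing.com") returns the page source as a string when the request succeeds.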

2. Python Get Webpage Html Use urllib3 Module Example.

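  1. The Python 2 module urllib2 does not exist in Python 3, where its functionality was moved into urllib.request and urllib.error. urllib3 is a separate third-party library that offers features similar to urllib, plus extras such as connection pooling.
  2. The urllib3 module is not part of the Python standard library, so you need to install it in your Python environment first.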

2.1 How To Install Python urllib3 Module.

  1. Open a terminal and run the command pip install urllib3 to install the python module urllib3.
    > pip install urllib3
    Defaulting to user installation because normal site-packages is not writeable
    Collecting urllib3
      Downloading urllib3-1.26.12-py2.py3-none-any.whl (140 kB)
         ---------------------------------------- 140.4/140.4 kB 106.8 kB/s eta 0:00:00
    Installing collected packages: urllib3
    Successfully installed urllib3-1.26.12
  2. Run the command pip show urllib3 to get the installed urllib3 module’s information.
    > pip show urllib3
    Name: urllib3
    Version: 1.26.12
    Summary: HTTP library with thread-safe connection pooling, file post, and more.
    Home-page: https://urllib3.readthedocs.io/
    Author: Andrey Petrov
    Author-email: [email protected]
    License: MIT
    Location: c:\users\zhao song\appdata\roaming\python\python39\site-packages
    Requires:
    Required-by:
  3. Below is the example source code that uses the python module urllib3 to request a web page and get the page Html source code.
    # import the urllib3 module first.
    import urllib3
    
    # define the function to get Html web page source code by URL.
    def get_webpage_html_use_urllib3(url):
    
        # Get the HTTP pool manager object in urllib3.
        http_pool_manager = urllib3.PoolManager()
    
        # Send the request to the url using the http pool manager.
        response = http_pool_manager.request('GET', url)
    
        # Print out the response status code. 
        print(response.status)
    
        # Print out the response header.
        print(response.headers)
    
        # Print out the response web page HTML source code.
        # response.data is a bytes object, so decode it to get a string.
        print(response.data.decode('utf-8'))
    
    
    if __name__ == '__main__':
        
        url = "https://www.bing.com"
        
        get_webpage_html_use_urllib3(url)
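The PoolManager in the example above uses the default settings. The sketch below shows one way to set an explicit timeout and retry count and to return the page as a string. The helper names make_pool_manager and get_html_text and the numeric values are assumptions chosen for illustration.

```python
import urllib3


def make_pool_manager():
    """Create a PoolManager with explicit timeout and retry settings."""
    # Illustrative values: 2 s to connect, 5 s to read, up to 3 retries.
    timeout = urllib3.Timeout(connect=2.0, read=5.0)
    retries = urllib3.Retry(total=3)
    return urllib3.PoolManager(timeout=timeout, retries=retries)


def get_html_text(http, url):
    """Request the URL and return (status, html) with html as a str."""
    response = http.request("GET", url)
    # response.data holds the body as bytes; decode it to get text.
    return response.status, response.data.decode("utf-8", errors="replace")
```

A single PoolManager can be reused for many requests, for example status, html = get_html_text(make_pool_manager(), "https://www.bing.com"), which lets urllib3 reuse connections to the same host.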
    

 
