How To Use The Python urllib Module To Get Web Page HTML Data With Examples

Python’s built-in urllib library can be used to fetch the HTML content of web pages. Because urllib is part of Python’s standard library, it does not need to be installed separately, which makes it a common choice for Python web crawlers. This article shows you how to use the Python urllib library to get HTML web page content, with examples.

1. Python urllib Library request Module Introduction.

  1. Before you can use the urllib library's request module, you need to import it in your source code.
    # import the urllib.request module
    import urllib.request
    
    # or use the below method to import the request module from the urllib library
    from urllib import request
  2. urllib.request.urlopen(url, timeout): This method sends a request to the given web page URL and returns a response object.
    url: the requested web page URL.
    
    timeout: the response timeout in seconds. If no response is received within the specified time, a timeout exception is raised.
  3. urllib.request.Request(url, headers): This method creates a request object and wraps the request headers in it. A common use is to set a custom User-Agent (the string that identifies the user's browser) so that the request looks like it comes from a real browser rather than a script. A usage sketch for both methods follows in the next item.
    url: the requested web page URL.
    
    headers: the request headers.
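  4. The sketch below is a minimal example of both methods. It assumes https://www.bing.com as a test URL; the 5-second timeout and the sample User-Agent string are illustrative values only, not required ones.
    import socket
    import urllib.error
    import urllib.request
    
    url = "https://www.bing.com"
    
    # urlopen with a timeout: if no response arrives within 5 seconds,
    # a timeout exception is raised.
    try:
        response = urllib.request.urlopen(url, timeout=5)
        print('Response code : ', response.getcode())
    except socket.timeout:
        print('The request timed out.')
    except urllib.error.URLError as err:
        # a timeout may also surface as a URLError wrapping a socket.timeout.
        print('The request failed : ', err.reason)
    
    # Request with custom headers: override the default User-Agent so the
    # request looks like it comes from a normal browser.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    request_object = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request_object, timeout=5)
    print('Response code : ', response.getcode())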

2. The http.client.HTTPResponse Class.

  1. The urllib.request.urlopen() method above returns an http.client.HTTPResponse object, whether it is called with a URL string or a Request object. Its commonly used methods are introduced below.
  2. read(): read the bytes data from the response object.
  3. bytes.decode("utf-8"): a method of the bytes object returned by read(), not of HTTPResponse itself; it converts the bytes data to a string.
  4. string.encode("utf-8"): the reverse operation on a string object; it converts string data back to bytes. A short conversion sketch follows at the end of this list.
  5. geturl(): return the URL address of the response object.
  6. getcode(): return the HTTP response code.
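  7. Below is a minimal sketch of the bytes/string conversion described in points 3 and 4, assuming the response body is UTF-8 encoded; the bytes literal here stands in for a real response body, purely for illustration.
    # read() returns the response body as a bytes object; here we use a
    # hard-coded bytes literal instead of a real response.
    resp_bytes = b'<!doctype html><html>hello</html>'
    
    # bytes.decode("utf-8") converts the bytes data to a string.
    html_text = resp_bytes.decode("utf-8")
    print(html_text)
    
    # string.encode("utf-8") converts the string back to bytes.
    back_to_bytes = html_text.encode("utf-8")
    print(back_to_bytes)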

3. Examples Of Using Python urllib.request To Crawl A Web Page.

  1. This example shows you how to use the Python urllib.request module to request a web page by URL, and how to read the response page content and headers.
    import urllib.request
    
    # or 
    # from urllib import request
    
    # this function will request the url page and get the response object. 
    def urllib_request_web_page(url):
        
        # send request to the url web page and get the response object.
        response = urllib.request.urlopen(url)
        # print out the response object.
        print(response)
        
        # get the response url.
        resp_url = response.geturl()
        print('Response url : ', resp_url)
        
        # get the response code.
        resp_code = response.getcode()
        print('Response code : ', resp_code)  
        
        # get all the response headers in a list object.
        resp_headers_list = response.getheaders()
        # loop in the response headers.
        for resp_headers in resp_headers_list:
            # get the response header name 
            header_name = resp_headers[0]
            # get the response header value.
            header_value = resp_headers[1]
            
            print(resp_headers)
            print(header_name, ' = ', header_value)  
        
        
        # read the response content as a bytes object
        # (avoid naming the variable "bytes", which would shadow the built-in type).
        resp_bytes = response.read()
        print(resp_bytes)
        
        # convert the bytes object to a string.
        html_content = resp_bytes.decode('utf-8')
        print(html_content)
    
    
    if __name__ == '__main__':
        
        url = "https://www.bing.com"
        
        urllib_request_web_page(url)
  2. When you run the above source code, you should see output similar to the following.
    <http.client.HTTPResponse object at 0x7f9d82d7c910>
    Response url :  https://www.bing.com
    Response code :  200
    ('Cache-Control', 'private')
    Cache-Control  =  private
    ('Transfer-Encoding', 'chunked')
    Transfer-Encoding  =  chunked
    ......
    ......
    b'<!doctype html>......</html>'
