How To Get Redirection URL In Python Crawler

When you write a web crawler in Python, a page you request may redirect to a different URL. If the crawler does not follow the redirect, or keeps using the original address afterwards, it may not get the page content it expects. This article introduces two methods to obtain the redirection URL when a Python crawler requests a web page.

1. Use Python urllib Library To Get Redirection URL.

  1. The urllib library is part of the Python standard library, so it is installed with Python and needs no separate installation.
  2. Open a terminal and run the command python to enter the interactive console.
  3. Import the urllib.request module.
    >>> from urllib import request
  4. Define a web page URL. Assume this URL redirects when you send a request to it.
    >>> url = 'http://www.google.com/'
  5. Get the response object.
    >>> response = request.urlopen(url)
  6. Get the status code of the response. Note that urlopen follows redirects automatically, so this is the status of the final page (typically 200), not the 301 or 302 that the original URL may have returned.
    >>> response.status
    200
  7. Get the final URL from the above response object. If it differs from the URL you requested, the request was redirected.
    >>> new_url = response.geturl()
  8. Print new_url and compare it with url to check whether a redirect happened.
    >>> print(new_url)
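
Because urlopen follows redirects silently, the status code of the redirect itself is never visible in the steps above. The following is a minimal sketch of one way to see both the final URL and the first redirect response with urllib. The names NoRedirect, final_url and first_redirect are made up for this illustration, and the 10 second timeout is an arbitrary assumption.

```python
from urllib import request
from urllib.error import HTTPError

# Status codes that normally carry a Location header.
REDIRECT_CODES = (301, 302, 303, 307, 308)


class NoRedirect(request.HTTPRedirectHandler):
    """Hypothetical handler that refuses to follow redirects."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None means "do not build a follow-up request",
        # so urllib raises HTTPError for the 3xx response instead.
        return None


def final_url(url, timeout=10):
    """Follow redirects (urllib's default behaviour) and return the final URL."""
    with request.urlopen(url, timeout=timeout) as response:
        return response.geturl()


def first_redirect(url, timeout=10):
    """Return (status_code, location) for the first response, without following it.

    location is None when the URL does not redirect.
    """
    opener = request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status, None
    except HTTPError as err:
        if err.code in REDIRECT_CODES:
            # The Location header may be relative to the requested URL.
            return err.code, err.headers.get("Location")
        raise
```

Calling final_url(url) gives the same value as response.geturl() in step 7, while first_redirect(url) exposes the 3xx status code and the raw Location header that urlopen normally hides.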

2. Use Python Requests Module To Get The Redirection URL.

  1. Run the commands below to check whether the Python requests module is installed, and to install it if it is missing.
    # Show python requests module installation information.
    $ pip show requests
    
    # If the python requests module does not exist then install it.
    $ pip install requests
  2. Get a User-Agent value by entering chrome://version in the Google Chrome address bar and pressing the Enter key. The page that opens lists the browser's User-Agent value.
    Google Chrome	88.0.4324.150 (Official Build) (x86_64)
    ......
    User Agent	Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko)
  3. Add a User-Agent header to simulate a real web browser to request the web page URL.
    >>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko)'}
  4. Import the Python requests module.
    >>> import requests
  5. Define a web page URL string.
    >>> url = 'http://www.google.com'
  6. Request the above URL with the HTTP GET method and get the response object.
    >>> response = requests.get(url, headers=headers)
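  7. Get the status code of the response. By default requests follows redirects, so response.status_code belongs to the final page. The redirect responses themselves (status code 301, 302, 303, 307 or 308) are stored in the response.history list; if that list is not empty, the URL has been redirected.
    >>> status_code = response.status_code
  8. Get the final URL after any redirects.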
    >>> new_url = response.url
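
The steps above can be sketched as two small helper functions. This is an illustration under stated assumptions rather than a definitive implementation: the function names redirect_chain and peek_redirect are hypothetical, and the 10 second timeout is an arbitrary choice. The first helper lets requests follow redirects and reads every hop from response.history; the second passes allow_redirects=False to look at the first response only.

```python
import requests

# Status codes that normally carry a Location header.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_chain(url, headers=None, timeout=10):
    """Follow redirects and return a list of (status_code, url) hops.

    The last item is the final response; earlier items are the redirects.
    """
    response = requests.get(url, headers=headers, timeout=timeout)
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def peek_redirect(url, headers=None, timeout=10):
    """Return (status_code, location) for the first response, without following it.

    location is None when the URL does not redirect.
    """
    response = requests.get(
        url, headers=headers, allow_redirects=False, timeout=timeout
    )
    if response.status_code in REDIRECT_CODES:
        # The Location header may be relative to the requested URL.
        return response.status_code, response.headers.get("Location")
    return response.status_code, None
```

If redirect_chain(url, headers=headers) returns more than one hop, the URL was redirected, and the last hop holds the same value as response.url in step 8.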

