How To Use Python Scrapy To Crawl Javascript Dynamically Loaded Pagination Web Page

Most websites use the URL link to implement pagination, but some website does not have such pagination links on their web page, they use javascript to load the next page content dynamically when a user scrolls the web page.

For example the website https://unsplash.com/. It is an image repository website, it will show next page images when you scroll the web page, and you can not find the pagination link on the web page. This article will tell you how to use Python Scrapy to crawl such kinds of web pages.

1. Find The Pagination URL For JavaScript Dynamically Loaded Web Page.

  1. Open the web page https://unsplash.com/ in Google Chrome.
  2. Right-click the page, click Inspect menu item in the popup menu list.
  3. Click the Network tab on the web page right side.
  4. Reload the web page until it loads all the first page images.
  5. Input photos? in the Network filters text box, scroll down the web page slowly, then you can see it will load the pagination URL like photos?per_page=12&page=10, click it to get the full pagination URL is https://unsplash.com/napi/photos?per_page=12&page=10.
    get javascript dynamically loaded pagination url in google chrome inspector
  6. If you do not know the pagination URL characters such as photos? you need to find it carefully in the Network tab loaded resources name list.
  7. You can also sort the javascript loaded web resources by Type column in the Network tab, and you can find all the pagination URL resource has fetch Type.
    sort the network loaded resource list by resource type fetch

2. Get The Image URL.

  1. Open the URL https://unsplash.com/napi/photos?per_page=12&page=10 in Firefox, it will display the URL returned text in a JSON viewer like below.
    firefox json viewer shows a web url returned json text
  2. We can also install a Google Chrome JSON viewer extension to view the server returned JSON text as Firefox also. We will tell you how to install it later.
  3. From the Firefox JSON viewer, we can see that the webserver returns 12 JSON text in an array, and each item in the array contains one image information.
  4. And the image download URL link is saved in one-image-json-item—>links—>download.
  5. Now we will verify this in the python source code.
  6. Open a terminal, and run the command scrapy shell 'https://unsplash.com/napi/photos?per_page=12&page=10', it will download the pagination URL in the Scrapy shell. Please note the URL should be wrapped with '', otherwise, it will fetch the URL https://unsplash.com/napi/photos?per_page=12 , which is not correct.
    $ scrapy shell 'https://unsplash.com/napi/photos?per_page=12&page=10'
    ......
    [s]   view(response)    View response in a browser
    >>> 
    
  7. Execute the command print(response.text) in the Scrapy shell, it will print the response JSON format text of the URL page.
    >>> print(response.text)
  8. Now we will use the Python json module to load the webserver returned JSON format text and get the image download URL.
    # Import the python json module.
    >>> import json
    >>> 
    # Loads the pagination URL web page content into a JSON array list.
    >>> image_json_list = json.loads(response.text)
    >>> 
    # Print the list length.
    >>> print(len(image_json_list))
    12
    # Print the first item in the list.
    >>> print(image_json_list[0])
    >>>
    # Get the 'links' item in one image json text. 
    >>> image_json_list[0]['links']
    {'self': 'https://api.unsplash.com/photos/e59Y6vqbL7Y', 'html': 'https://unsplash.com/photos/e59Y6vqbL7Y', 'download': 'https://unsplash.com/photos/e59Y6vqbL7Y/download', 'download_location': 'https://api.unsplash.com/photos/e59Y6vqbL7Y/download'}
    >>> 
    # Get image download URL.
    >>> image_json_list[0]['links']['download']
    'https://unsplash.com/photos/e59Y6vqbL7Y/download'
    
    

3. Download The Above Images In Scrapy Project Automatically Example.

  1. Now we will create a Scrapy project and crawl the website https://unsplash.com/ to download the images. We will write it later if any reader wants to learn. 🙂

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.