Most websites implement pagination with URL links, but some websites do not have pagination links on their pages; instead, they use JavaScript to load the next page's content dynamically when the user scrolls. The website https://unsplash.com/ is one example. It is an image repository that shows the next page of images as you scroll, and you cannot find any pagination link on the page. This article shows you how to use Python Scrapy to crawl such web pages.
1. Find The Pagination URL For JavaScript Dynamically Loaded Web Page.
- Open the web page https://unsplash.com/ in Google Chrome.
- Right-click the page and click the Inspect menu item in the popup menu.
- Click the Network tab on the right side of the page.
- Reload the web page and wait until all of the first page's images have loaded.
- Input photos? in the Network filters text box, then scroll down the web page slowly. You will see it load a pagination URL such as photos?per_page=12&page=10; click it to get the full pagination URL, https://unsplash.com/napi/photos?per_page=12&page=10.
- If you do not know which characters appear in the pagination URL (such as photos?), you need to look through the resource name list in the Network tab carefully.
- You can also sort the JavaScript-loaded web resources by the Type column in the Network tab; all the pagination URL resources have the fetch type.
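Once you have found the endpoint in the Network tab, generating the URLs for later pages is just a matter of changing the page parameter. The sketch below shows one way to build them in Python; the per_page value and page range are example values, not anything the site requires.

```python
# Build the pagination URLs discovered in the Network tab.
# per_page=12 and the page range are example values.
from urllib.parse import urlencode

BASE = "https://unsplash.com/napi/photos"

def pagination_url(page, per_page=12):
    """Return the pagination URL for a given page number."""
    return f"{BASE}?{urlencode({'per_page': per_page, 'page': page})}"

# Generate the URLs for the first three pages.
urls = [pagination_url(p) for p in range(1, 4)]
print(urls[0])  # https://unsplash.com/napi/photos?per_page=12&page=1
```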
2. Get The Image URL.
- Open the URL https://unsplash.com/napi/photos?per_page=12&page=10 in Firefox; it displays the text the URL returns in a JSON viewer like the one below.
- We can also install a Google Chrome JSON viewer extension to view the server-returned JSON text as Firefox does. We will show how to install it later.
- In the Firefox JSON viewer, we can see that the web server returns an array of 12 JSON objects, and each item in the array contains one image's information.
- The image download URL is saved in one-image-json-item -> links -> download.
- Now we will verify this in Python source code.
- Open a terminal and run the command scrapy shell 'https://unsplash.com/napi/photos?per_page=12&page=10'; it downloads the pagination URL in the Scrapy shell. Please note that the URL should be wrapped in single quotes, otherwise the shell will fetch the URL https://unsplash.com/napi/photos?per_page=12, which is not correct.

```
$ scrapy shell 'https://unsplash.com/napi/photos?per_page=12&page=10'
......
[s]   view(response)    View response in a browser
>>>
```
- Execute print(response.text) in the Scrapy shell; it prints the JSON-formatted text of the URL's response.

```
>>> print(response.text)
```
- Now we will use the Python json module to load the JSON text the web server returned and get the image download URL.

```
>>> # Import the python json module.
>>> import json
>>> # Load the pagination URL web page content into a JSON array list.
>>> image_json_list = json.loads(response.text)
>>> # Print the list length.
>>> print(len(image_json_list))
12
>>> # Print the first item in the list.
>>> print(image_json_list[0])
>>> # Get the 'links' item in one image json text.
>>> image_json_list[0]['links']
{'self': 'https://api.unsplash.com/photos/e59Y6vqbL7Y', 'html': 'https://unsplash.com/photos/e59Y6vqbL7Y', 'download': 'https://unsplash.com/photos/e59Y6vqbL7Y/download', 'download_location': 'https://api.unsplash.com/photos/e59Y6vqbL7Y/download'}
>>> # Get the image download URL.
>>> image_json_list[0]['links']['download']
'https://unsplash.com/photos/e59Y6vqbL7Y/download'
```
3. Example: Download The Above Images Automatically In A Scrapy Project.
- Now we will create a Scrapy project and crawl the website https://unsplash.com/ to download the images. We will write this up later if readers want to learn more. 🙂