How To Create Scrapy Project To Crawl Web Page Example

Scrapy is a Python library that can be used to crawl web pages and extract web page elements by XPath or CSS selectors in Python code. This article will show you how to create a Scrapy project and how to implement the Scrapy related classes in the project to crawl and extract the job list pages of a job search website. The spider extracts the job items one by one, and it also extracts the job list pagination URL link so that it can parse all the job list pages.

1. Create Python Scrapy Project Steps.

  1. Scrapy is a Python package, so you should first run the command pip show scrapy in a terminal to check whether it has already been installed in your Python environment.
    $ pip show scrapy
    Name: Scrapy
    Version: 2.4.1
    Summary: A high-level Web Crawling and Web Scraping framework
    Home-page: https://scrapy.org
    Author: Scrapy developers
    Author-email: None
    License: BSD
    Location: /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages
    Requires: queuelib, itemadapter, pyOpenSSL, parsel, w3lib, lxml, Twisted, cssselect, service-identity, cryptography, protego, itemloaders, PyDispatcher, zope.interface
    Required-by: 
    
  2. If Scrapy is not installed, run the command pip install scrapy in a terminal to install it, like below. If you meet errors during the Scrapy installation, you can read the article How To Fix Running Setup.py Install For Twisted Error When Install Python Scrapy.
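    $ pip install scrapy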
  3. After installing Scrapy, create a local folder and then run the command scrapy startproject project-name in a terminal from inside that folder. It will create the Python Scrapy project file skeleton in the folder like below. All we need to do is edit the generated Python files accordingly, and then we can run the project to crawl web pages.
    └── TestScrapyProject
        ├── TestScrapyProject
        │   ├── items.py
        │   ├── middlewares.py
        │   ├── pipelines.py
        │   ├── settings.py
        │   └── spiders
        └── scrapy.cfg
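    For example, with the project name used in this article, the exact command that produces the above skeleton would be:
    $ scrapy startproject TestScrapyProject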
    

2. Define Web Page Element Data Transfer Object Class.

  1. In this example, we will crawl the web page https://www.indeed.com/q-python-developer-jobs.html and then extract the Python job data from the page.
  2. The web page lists a lot of Python jobs, and each job item contains a title, company, location, salary, description, etc.
  3. We need to extract all these job elements and save them in a data transfer object ( DTO ).
  4. The DTO is implemented in the items.py file of the Scrapy project. This file is created by Scrapy automatically.
  5. Open the Scrapy project in Eclipse PyDev ( or any IDE you prefer ), and edit the items.py file as below. We can see the project item class extends the scrapy.Item class. We should declare all the item fields ( of scrapy.Field type ) related to the web element data ( the job item properties ) in this file.
    import scrapy
    
    class TestscrapyprojectItem(scrapy.Item):
        # define the fields for your item here like:
        title = scrapy.Field()
        
        company = scrapy.Field()
        
        location = scrapy.Field()
        
        salary = scrapy.Field()
        
        description = scrapy.Field()
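
  6. As a quick side note ( an illustrative sketch only, not part of the generated project files ), a scrapy.Item instance behaves much like a Python dict, which is exactly how the spider class in the next section will fill it.
    from TestScrapyProject.items import TestscrapyprojectItem
    
    # Create an item object and set its fields like dict keys.
    item = TestscrapyprojectItem()
    item['title'] = 'Python Developer'
    item['company'] = 'Example Company'
    
    # Read a field back, again like a dict.
    print(item['title'])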

3. Create Spider Class.

  1. Open a terminal and go to the Scrapy project spiders folder ( TestScrapyProject / spiders ).
  2. Run the command scrapy genspider job_url https://www.indeed.com/q-python-developer-jobs.html, and it will create a new Python file job_url.py in the spiders folder.
    └── TestScrapyProject
        ├── TestScrapyProject
        │   ├── items.py
        │   ├── middlewares.py
        │   ├── pipelines.py
        │   ├── settings.py
        │   └── spiders
        │       └── job_url.py
        └── scrapy.cfg
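    The newly generated job_url.py file contains only a minimal spider skeleton, roughly like below ( the exact allowed_domains and start_urls values depend on the argument you passed to genspider, and we will adjust them in the next steps ):
    import scrapy
    
    class JobUrlSpider(scrapy.Spider):
        name = 'job_url'
        allowed_domains = ['www.indeed.com']
        start_urls = ['https://www.indeed.com/q-python-developer-jobs.html']
    
        def parse(self, response):
            pass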
    
  3. This file defines the spider class ( JobUrlSpider ), and we will write Python code in this class to extract the job item data.
  4. The Scrapy spider class has a parse(self, response) method; the Scrapy framework will call this method to parse the web page elements by XPath. The response parameter contains the whole web page content. You can read the article How To Use Scrapy Xpath Selectors To Extract Data In Scrapy Shell to learn how to get and verify the web element XPath values, for example in a quick Scrapy shell session like the one below.
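    The below shell session is only a sketch; the XPath values in it assume the Indeed page structure at the time of writing and are the same selectors used in the spider code later in this article.
    $ scrapy shell https://www.indeed.com/q-python-developer-jobs.html
    >>> job_list = response.xpath('//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]')
    >>> len(job_list)
    >>> job_list[0].xpath('./h2/a/@title').extract()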
  5. Edit the job_url.py file as below; please see the code comments for details.
    import scrapy
    
    # The inspect_response function can be used to debug a response in the Scrapy shell; it is optional here.
    from scrapy.shell import inspect_response
    
    from TestScrapyProject.items import TestscrapyprojectItem
    
    class JobUrlSpider(scrapy.Spider):
        
        # This is the spider name.
        name = 'job_url'
        
        # Define the domains that this spider is allowed to crawl. The values must be domains only, not web page urls.
        allowed_domains = ['www.indeed.com']
        
        # Define the page urls that this spider starts crawling from. We can add multiple page urls to this list.
        start_urls = ['https://www.indeed.com/q-python-developer-jobs.html', 'https://www.indeed.com/q-angular-jobs.html']
        #start_urls = ['https://www.indeed.com/q-python-developer-jobs.html']
    
        # This method extracts the job list data from the response object.
        def parse(self, response):
            
            # Extract all child web elements from the job list page.
            job_list = response.xpath('//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]')
              
            print("There are total ", len(job_list), " jobs.")
            
            # Loop in the job item list.
            for job_item in job_list:
                
                # Create a TestscrapyprojectItem object.
                item = TestscrapyprojectItem()
                
                # Extract the job title. Please note the xpath value starts with . (dot), which means it is relative to job_item.
                title = job_item.xpath('./h2/a/@title').extract()
                
                # For debug only.
                #print("Title in parse method: ", title)
                
                company = job_item.xpath('./div[@class="sjcl"]/div/span[@class="company"]//text()').extract()
                
                location = job_item.xpath('./div[@class="sjcl"]/span[@class="location accessible-contrast-color-location"]//text()').extract()
                
                salary = job_item.xpath('./div[@class="salarySnippet holisticSalary"]/span/span[@class="salaryText"]/text()').extract()
                
                description = job_item.xpath('./div[@class="summary"]//text()').extract()
                
                # Assign the parsed web element values to the Scrapy DTO object.
                item['title'] = title
                
                item['company'] = company
                
                item['location'] = location
                
                item['salary'] = salary
                
                item['description'] = description
                
                # Use the yield keyword to return the Scrapy DTO object to the Scrapy engine. We do not use return, because return would terminate the loop,
                # while yield only creates a generator and does not terminate the loop.
                # The Scrapy engine will collect all these items and pass them to the pipeline class defined in the file pipelines.py.
                yield item
    
               
            # Extract the next page url web element list by xpath. This code sits at the end of the parse method, outside the above job item loop.
            next_page_url_list = response.xpath('//ul[@class="pagination-list"]//a[@aria-label="Next"]/@href').extract()
            
            # Get the next page url in the above list.
            if next_page_url_list and len(next_page_url_list) > 0:
                
                next_page_url = next_page_url_list[0]
                
                # Send a request to the next page url. The callback parameter is the method that will parse the response web page.
                yield scrapy.Request('https://www.indeed.com/'+next_page_url, callback=self.parse)
                
    
  6. When the parse method completes, the Scrapy engine will pass all the parsed items to the project pipeline class defined in the pipelines.py file.

4. Save DTO Data In Scrapy Pipeline Class.

  1. Edit the TestScrapyProject / pipelines.py file like below. We just need to edit the process_item method in the pipelines.py file. The item parameter contains the parsed job item data; we can save this job data to a database or a file. In this example, we print the item data to the console for simplicity.
    class TestscrapyprojectPipeline:
        
        '''
        This method receives the item object passed from the Scrapy Spider class ( JobUrlSpider ).
        You can save all these items to a file or database.
        We just print them out in this example.
        '''
        def process_item(self, item, spider):
            
            print("********* One Job Item ************")
            
            print("Title: ", item['title'])
            
            print("Company: ", item['company'])
            
            print("Location: ", item['location'])
            
            print("Salary: ", item['salary'])
            
            print("Description: ", item['description'])
    

5. Modify HTTP Request Header In Scrapy Project settings.py File.

  1. If you want to modify the HTTP request headers, such as adding a User-Agent header value to disguise the crawler as a real web browser, you should edit the TestScrapyProject/TestScrapyProject/settings.py file.
  2. Uncomment the DEFAULT_REQUEST_HEADERS section and add the User-Agent HTTP request header to it.
    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
       'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Language': 'en',
    }
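
  3. A commonly used alternative for the user agent value only is Scrapy's dedicated USER_AGENT setting in the same settings.py file; the other headers still belong in DEFAULT_REQUEST_HEADERS.
    USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'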

6. Parse The Next Job List Web Page.

  1. If you want to parse all the jobs from the website, you need to parse out the next job list page URL link and then request the next page by the URL link.
  2. Then you can parse the next page web content returned by the request.
  3. The below python code implements this. It is the same code that appears at the end of the parse method in the job_url.py file ( the Scrapy project spider file ) shown above.
    # Extract the next page url web element list by xpath.
    next_page_url_list = response.xpath('//ul[@class="pagination-list"]//a[@aria-label="Next"]/@href').extract()
    
    # Get the next page url in the above list.
    if next_page_url_list and len(next_page_url_list) > 0:
        
        next_page_url = next_page_url_list[0]
        
        # Send a request to the next page url. The callback parameter is the method that will parse the response web page.
        yield scrapy.Request('https://www.indeed.com/'+next_page_url, callback=self.parse)
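
  4. The hard-coded 'https://www.indeed.com/' prefix in the above code assumes that the extracted href value is always a relative path. An optional, slightly more robust variation is to let Scrapy resolve the relative url against the current page url with response.urljoin, for example:
    # Let Scrapy resolve the relative href against the current page url.
    yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)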

7. Run The Scrapy Spider Class.

  1. Open a terminal and go to the Scrapy project root directory.
    $ pwd
    /Users/songzhao/Documents/WorkSpace/dev2qa.com-example-code/PythonExampleProject/com/dev2qa/example/crawler/TestScrapyProject
  2. Run the command scrapy crawl job_url in the terminal, and it will print all the parsed job data in the console.

    $ scrapy crawl job_url
    2021-02-18 13:25:22 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: TestScrapyProject)
    ......
    2021-02-18 13:25:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indeed.com/robots.txt> (referer: None)
    2021-02-18 13:25:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indeed.com/q-python-developer-jobs.html> (referer: None)
    ......
    2021-02-18 13:25:27 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.indeed.com/q-python-developer-jobs.html>
    {'company': ['\nOXO Solutions'],
     'description': ['\n',
                     ' \n ',
                     'Collaborate with other ',
                     'developers',
                     ', testers, and system engineers to ensure quality product '
                     'enhancements.',
                     '\n ',
                     'We are looking for a ',
                     'Python',
                     ' Web ',
                     'Developer',
                     ' responsible…',
                     '\n'],
     'location': ['Ruby, SC'],
     'salary': ['\n$4,000 a week'],
     'title': ['Python Developer Intern']}
    
    2021-02-18 13:25:29 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.indeed.com/q-angular-jobs.html>
    {'company': ['\n', '\nInfosys Limited'],
     'description': ['\n',
                     ' \n ',
                     'In the role of Technology Lead, you will interface with key '
                     'stakeholders and apply your technical proficiency across '
                     'different stages of the Software…',
                     '\n'],
     'location': ['Atlanta, GA'],
     'salary': [],
     'title': ['Angular Developer']}
    2021-02-18 13:25:29 [scrapy.core.engine] INFO: Closing spider (finished)
    2021-02-18 13:25:29 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    ......
    2021-02-18 13:25:29 [scrapy.core.engine] INFO: Spider closed (finished)
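
  3. If you just want to quickly save the crawled items to a file instead of printing them, Scrapy's built-in feed export can do that directly from the command line without any pipeline code, for example:
    $ scrapy crawl job_url -o jobs.json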

If you like this article, stay tuned; in later articles we will continue with how to save the crawled data into a JSON file or a database, and how to display the crawled data graphically.
