How To Use Scrapy XPath Selectors To Extract Data In Scrapy Shell

This article will show you how to start the Scrapy shell debugging tool to extract information from a web page, how to simulate a real web browser by specifying the User-Agent header value when sending a request to a web server, and how to extract web element values with XPath in the Scrapy shell.

1. Start Scrapy Shell To Fetch Data From A Web Page Url.

The web page URL I want to grab is https://www.indeed.com/q-python-developer-jobs.html. Open a terminal and run the command scrapy shell https://www.indeed.com/q-python-developer-jobs.html to open the Scrapy shell. When it displays the prompt character ( >>> ), that means the Scrapy shell is ready for you to run commands to extract data from the web page.

$ scrapy shell https://www.indeed.com/q-python-developer-jobs.html
2021-02-04 11:10:08 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: TestScrapyProject)
2021-02-04 11:10:08 [scrapy.utils.log] INFO: Versions: lxml 4.6.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:52:21) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 20.0.1 (OpenSSL 1.1.1i  8 Dec 2020), cryptography 3.3.1, Platform Darwin-19.6.0-x86_64-i386-64bit
2021-02-04 11:10:08 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-02-04 11:10:08 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'TestScrapyProject',
 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0,
 'NEWSPIDER_MODULE': 'TestScrapyProject.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['TestScrapyProject.spiders']}
2021-02-04 11:10:08 [scrapy.extensions.telnet] INFO: Telnet Password: 5a73008066b4649c
2021-02-04 11:10:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2021-02-04 11:10:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-02-04 11:10:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-02-04 11:10:08 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-02-04 11:10:08 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-02-04 11:10:08 [scrapy.core.engine] INFO: Spider opened
2021-02-04 11:11:35 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.indeed.com/robots.txt> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2021-02-04 11:11:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indeed.com/robots.txt> (referer: None)
2021-02-04 11:13:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.indeed.com/q-python-developer-jobs.html> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fbdbec1f160>
[s]   item       {}
[s]   request    <GET https://www.indeed.com/q-python-developer-jobs.html>
[s]   response   <200 https://www.indeed.com/q-python-developer-jobs.html>
[s]   settings   <scrapy.settings.Settings object at 0x7fbdbec1cb70>
[s]   spider     <DefaultSpider 'default' at 0x7fbdbef7f710>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> 

From the above console output, we can see the server response status is 200 ( response <200 https://www.indeed.com/q-python-developer-jobs.html> ), which means the web page was fetched successfully.

But if the response status code is 403, it means the target website has enabled anti-crawler protection and does not allow the default Scrapy client to crawl its data. To solve this problem, we need to disguise Scrapy as a browser.
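
You can check the status code of the last fetched response directly in the Scrapy shell, because the response object exposes it as the status attribute.

>>> response.status
200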

In order to disguise Scrapy as a real web browser, it is necessary to set the User-Agent header when sending the request. So first we should get a real User-Agent header value from a web browser such as Google Chrome by following the steps below.

  1. Open Google Chrome web browser and browse the web page https://www.indeed.com/q-python-developer-jobs.html.
  2. Right-click the web page, click Inspect menu item in the popup menu list.
  3. Click the Network tab in the developer tools panel, then reload the web page.
  4. Click any web resource name under the Network tab; its Headers information will be displayed on the right side.
  5. Scroll down to the Request Headers section on the right side, and get the User-Agent header value as below.
    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36

Now we can start the Scrapy shell with the above User-Agent value as below.

$ scrapy shell -s USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36'   https://www.indeed.com/q-python-developer-jobs.html
......
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f95423f1128>
[s]   item       {}
[s]   request    <GET https://www.indeed.com/q-python-developer-jobs.html>
[s]   response   <200 https://www.indeed.com/q-python-developer-jobs.html>
[s]   settings   <scrapy.settings.Settings object at 0x7f95423ecb38>
[s]   spider     <DefaultSpider 'default' at 0x7f95427746a0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> 
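
If you do not want to pass the setting on the command line every time, you can also set the USER_AGENT value in your Scrapy project's settings.py file, or build a scrapy.Request with a custom headers dict inside the shell and pass it to the fetch() shortcut listed in the shell help above. Below is a minimal sketch of the in-shell approach; the header value is the one copied from Chrome above.

>>> req = scrapy.Request('https://www.indeed.com/q-python-developer-jobs.html',
...     headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.146 Safari/537.36'})
>>> fetch(req)  # fetch the scrapy.Request and update the local response object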

2. Use XPath To Extract Web Element Data.

2.1 Get The Job Title Web Element XPath Value.

  1. Right-click the web element ( job title ) in Google Chrome.
  2. Click the Inspect menu item in the popup menu list.
  3. Click the Elements tab in the developer tools panel.
  4. Right-click the web element ( job title ) HTML tag in the Elements panel.
  5. Click the Copy > Copy XPath menu item in the popup menu list.
  6. Now you get the job title web element XPath value, like //*[@id="jl_032dbf1f2fe46842"].
  7. If you are not satisfied with the XPath value ( for example, if you need a more general one ), you can write the web element's XPath yourself. The XPath value below will get the title attribute value of every matching HTML a tag.
    "//div/h2/a/@title"

2.2 Extract The Web Element Data By XPath In Scrapy Shell.

  1. Input the response.xpath("//div/h2/a/@title").extract() command in the Scrapy shell and press the Enter key; it will extract all the job title strings into a list ( see the note after this step about get() and getall() ).
    >>> response.xpath("//div/h2/a/@title").extract()
    ['Python Developer Intern', 'Python Developer', 'Python Developer', 'Python Developer', 'Software Developer – Entry Level', 'Python Developer', 'Python/Zenoss Developer', 'Python Developer', 'Python Developer', 'Python Developer', 'Python Developer', 'Python Developer', 'PYTHON DEVELOPER', 'Python Developer Trainee', 'Python Developer']
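
Note that extract() returns every match as a list of strings. In recent Scrapy versions, getall() is the recommended equivalent of extract(), and get() returns only the first match ( or None if there is no match ), which is convenient when you expect a single value.

>>> response.xpath("//div/h2/a/@title").get()
'Python Developer Intern'
>>> response.xpath("//div/h2/a/@title").getall() == response.xpath("//div/h2/a/@title").extract()
True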

2.3 Extract Job Item In the Job List One By One.

But what we want is to extract each job item in the job list one by one, and then extract each job item’s title, company, location, etc. We can do it like this.

  1. Open a terminal and run the command scrapy shell https://www.indeed.com/q-python-developer-jobs.html.
  2. Execute the command response.xpath('//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]') to get all the job elements in the job list area by XPath.
    >>> response.xpath('//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]')
    [<Selector xpath='//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]' data='<div class="jobsearch-SerpJobCard uni...'>, <Selector xpath='//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]' data='<div class="jobsearch-SerpJobCard uni...'>, <Selector xpath='//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]' data='<div class="jobsearch-SerpJobCard uni...'>, <Selector xpath='//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]' data='<div class="jobsearch-SerpJobCard uni...'>, <Selector xpath='//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]' data='<div class="jobsearch-SerpJobCard uni...'>, <Selector xpath='//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]' data='<div class="jobsearch-SerpJobCard uni...'>, <Selector xpath='//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]' data='<div class="jobsearch-SerpJobCard uni...'>, <Selector xpath='//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]' data='<div class="jobsearch-SerpJobCard uni...'>, <Selector xpath='//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]' data='<div class="jobsearch-SerpJobCard uni...'>, <Selector xpath='//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]' data='<div class="jobsearch-SerpJobCard uni...'>, <Selector xpath='//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]' data='<div class="jobsearch-SerpJobCard uni...'>, <Selector xpath='//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]' data='<div class="jobsearch-SerpJobCard uni...'>, <Selector xpath='//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]' data='<div class="jobsearch-SerpJobCard uni...'>, <Selector xpath='//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]' data='<div class="jobsearch-SerpJobCard uni...'>, <Selector xpath='//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]' data='<div class="jobsearch-SerpJobCard uni...'>]
  3. Loop over the above job item list, and extract each job item's title by its relative XPath value ( './h2/a/@title' ). A sketch of the same loop inside a spider follows this list.
    >>> for job_item in response.xpath('//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]'):
    ...     title = job_item.xpath('./h2/a/@title').extract()
    ...     print(title)
    ... 
    ['Python Developer Intern']
    ['Python Developer with Geophysicist']
    ['Python Developer']
    ['Python Developer']
    ['PYTHON DEVELOPER']
    ['Software Developer – Entry Level']
    ['PYTHON DEVELOPER']
    ['Python Developer']
    ['Python Developer']
    ['Python Developer']
    ['Python Developer']
    ['Python Developer']
    ['Senior Level Python Developer']
    ['Python Developer']
    ['Python Developer']
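
Once the XPath expressions have been verified in the Scrapy shell, you can move the same logic into a spider. Below is a minimal sketch, assuming a hypothetical spider named indeed_jobs inside the example project; the XPath values and the USER_AGENT value are the ones used above.

import scrapy


class IndeedJobsSpider(scrapy.Spider):
    # hypothetical spider name, for illustration only
    name = 'indeed_jobs'
    start_urls = ['https://www.indeed.com/q-python-developer-jobs.html']
    # override the User-Agent per spider, as explained in section 1
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/88.0.4324.146 Safari/537.36',
    }

    def parse(self, response):
        # the same XPath values verified in the Scrapy shell above
        for job_item in response.xpath('//td[@id="resultsCol"]/div[@data-tn-component="organicJob"]'):
            yield {'title': job_item.xpath('./h2/a/@title').get()}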

2.4 Extract Job List Next Page Url.

  1. Get the job list page pagination link XPath value '//ul[@class="pagination-list"]//a[@aria-label="Next"]/@href'.
  2. Verify the XPath value in the Scrapy shell ( a sketch of joining and following this link in a spider comes after this list ).
    >>> response.xpath('//ul[@class="pagination-list"]//a[@aria-label="Next"]/@href')
    [<Selector xpath='//ul[@class="pagination-list"]//a[@aria-label="Next"]/@href' data='/jobs?q=python+developer&start=10'>]
    >>> 
    >>> response.xpath('//ul[@class="pagination-list"]//a[@aria-label="Next"]/@href').extract()
    ['/jobs?q=python+developer&start=10']
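
The extracted href is a relative URL, so it must be joined with the page's base URL before it can be requested. In the shell you can verify the joined URL with response.urljoin(), and in a spider the response.follow() shortcut resolves the relative URL for you. The parse method below is a sketch that extends the hypothetical spider from section 2.3 to crawl every result page.

>>> response.urljoin('/jobs?q=python+developer&start=10')
'https://www.indeed.com/jobs?q=python+developer&start=10'

    def parse(self, response):
        # ... extract the job items as shown in section 2.3 ...
        # follow the verified next-page link; response.follow() resolves
        # the relative URL against the current page URL automatically
        next_page = response.xpath('//ul[@class="pagination-list"]//a[@aria-label="Next"]/@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)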
