Let us explore how to navigate beyond the first page of a website and scrape data from all available pages. In our example, we want to extract quotes from every page on https://quotes.toscrape.com.
Inspecting the website’s HTML code reveals a link with the relevant information. Here’s the markup:
<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
    </li>
</ul>
We can use Scrapy’s CSS Selector capabilities to retrieve the link. Here’s how to achieve this:
response.css('li.next a') selects the anchor element within the <li> tag with the class "next".
To get the link target rather than the whole element, append the ::attr(href) extension. Scrapy recognizes this extension and retrieves the value of the href attribute: response.css("li.next a::attr(href)").get(). If no element matches, get() returns None.
Now, let us modify our spider to recursively follow the “Next Page” link and extract data from all linked pages. Here is the improved code:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        # Extract quotes as usual... (existing code)

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            # Build a complete URL using urljoin
            next_page = response.urljoin(next_page)
            # Yield a new request for the next page
            yield scrapy.Request(next_page, callback=self.parse)
After extracting the data, the parse method checks for the “Next Page” link.
If the link exists, response.urljoin(next_page) constructs a complete URL from the relative link found on the page.
A new scrapy.Request object is created for the next page, specifying the parse method as the callback to handle data extraction when the request finishes.
Yielding Request objects within a callback triggers Scrapy to schedule and send those requests.
The callback (self.parse) then processes the downloaded content of the new page.
On the last page there is no “Next Page” link, so next_page is None, no further request is yielded, and the crawl ends.
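The follow-until-no-link logic can be illustrated without network access. The sketch below is a simplified simulation, not Scrapy itself: the FAKE_SITE dictionary and the crawl function are hypothetical stand-ins for the website and the spider, and the standard library's urllib.parse.urljoin stands in for response.urljoin by resolving each relative link against the current page URL. Scrapy schedules requests asynchronously rather than in a simple loop, but the stopping condition is the same.

```python
from urllib.parse import urljoin

BASE = "https://quotes.toscrape.com"

# Hypothetical stand-in for the site: each page URL maps to the relative
# href of its "Next" link, or to None on the last page.
FAKE_SITE = {
    f"{BASE}/page/1/": "/page/2/",
    f"{BASE}/page/2/": "/page/3/",
    f"{BASE}/page/3/": None,
}


def crawl(start_url):
    """Follow "Next" links until a page has none; return the visited URLs."""
    visited = []
    url = start_url
    while url is not None:
        visited.append(url)
        next_href = FAKE_SITE[url]
        # Resolve the relative link against the current page URL,
        # or stop when the page has no "Next" link.
        url = urljoin(url, next_href) if next_href is not None else None
    return visited


pages = crawl(f"{BASE}/page/1/")
print(pages)
```

Each relative link such as /page/2/ becomes an absolute URL before it is followed, and the loop ends on the first page that has no next link.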
By implementing link following, you can scrape data from entire websites with pagination structures, making your scraping tasks more comprehensive.
© Copyright 2013-2024 www.ashokcharan.com. All Rights Reserved.