Following Links

Let us explore how to navigate beyond the initial pages of a website and scrape data from all available pages. In our example, we want to extract quotes from every page on https://quotes.toscrape.com.

Identifying the “Next Page” Link:

Inspecting the website’s HTML reveals a link that points to the next page. Here’s the markup:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
    </li>
</ul>
Extracting the Link URL:

We can use Scrapy’s CSS Selector capabilities to retrieve the link. Here’s how to achieve this:

  1. Selecting the Anchor Tag: response.css("li.next a") selects the anchor element within the <li> tag with the class "next".
  2. Extracting the Href Attribute: To obtain the actual URL, we can use the ::attr(href) extension. Scrapy recognizes this extension and retrieves the value of the href attribute: response.css("li.next a::attr(href)").get() (see the shell session below).
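
As a quick sanity check, both selectors can be tried in the Scrapy shell. The session below is a minimal sketch; the returned values assume the first page of https://quotes.toscrape.com, whose pager markup is shown above:

$ scrapy shell "https://quotes.toscrape.com/page/1/"
>>> response.css("li.next a").get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
>>> response.css("li.next a::attr(href)").get()
'/page/2/'
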
Recursive Link Following:

Now, let us modify our spider to recursively follow the “Next Page” link and extract data from all linked pages. Here is the improved code:

Spider Recursively Following Links
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        # Extract quotes as usual (the same extraction shown earlier)
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        # Check for a link to the next page
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            # Build a complete URL using urljoin
            next_page = response.urljoin(next_page)
            # Yield a new request for the next page
            yield scrapy.Request(next_page, callback=self.parse)
Explanation:
  1. After extracting data from the current page, the parse method checks for the “Next Page” link.
  2. If a link exists, response.urljoin(next_page) constructs a complete URL from the relative link found on the page.
  3. A new scrapy.Request object is yielded for the next page, with the parse method specified as the callback that processes the response once it has been downloaded (a shortcut for this is sketched after this list).
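
Scrapy also provides response.follow, which accepts relative URLs directly, so the explicit urljoin step can be skipped. Below is a minimal sketch of the same parse method using that shortcut (the quote extraction is elided, exactly as above):

    def parse(self, response):
        # ... extract quotes as shown above ...
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            # response.follow resolves relative URLs internally,
            # so no call to response.urljoin is needed
            yield response.follow(next_page, callback=self.parse)
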
Key Points:
  • Yielding Request objects within a callback triggers Scrapy to schedule and send those requests.
  • The specified callback function (in this case, self.parse) processes the downloaded content of the new page.
  • This mechanism allows you to build complex crawlers that follow links according to defined rules and extract data from different sections of a website (see the CrawlSpider sketch below).
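
For crawls driven by declarative rules, Scrapy also offers the CrawlSpider class, where link following is configured through Rule and LinkExtractor objects rather than written by hand. The sketch below is one possible way to express the same pagination crawl; the names quotes_crawl and parse_page are illustrative choices, not part of the example above:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class QuotesCrawlSpider(CrawlSpider):
    name = "quotes_crawl"
    start_urls = ["https://quotes.toscrape.com/page/1/"]

    # Follow every link found inside the "li.next" pager element
    # and hand each followed page to parse_page
    rules = (
        Rule(LinkExtractor(restrict_css="li.next"), callback="parse_page", follow=True),
    )

    def parse_start_url(self, response):
        # The start URL is not routed through the rules by default,
        # so parse it explicitly to avoid skipping the first page
        return self.parse_page(response)

    def parse_page(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
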

By implementing link following, you can scrape data from entire websites with pagination structures, making your scraping tasks more comprehensive.
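
Assuming the spider above is saved inside a standard Scrapy project, a run along the following lines crawls every page and exports the collected quotes (the quotes.json filename is only an illustration):

$ scrapy crawl quotes -o quotes.json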

