Following Links

Let us explore how to navigate beyond the initial pages of a website and scrape data from all available pages. In our example, we want to extract quotes from every page on https://quotes.toscrape.com.

Identifying the “Next Page” Link:

Inspecting the website’s HTML reveals a link that points to the next page. Here’s the markup:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
    </li>
</ul>
Extracting the Link URL:

We can use Scrapy’s CSS Selector capabilities to retrieve the link. Here’s how to achieve this:

  1. Selecting the Anchor Tag: response.css("li.next a") selects the anchor element within the <li> tag with the class "next".
  2. Extracting the Href Attribute: To obtain the actual URL, we can use the ::attr(href) extension. Scrapy recognizes this extension and retrieves the value of the href attribute: response.css("li.next a::attr(href)").get() (see the shell session below).
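
As a quick sanity check, both selectors can be tried in the Scrapy shell. The session below is a minimal sketch; the returned values assume the first page of https://quotes.toscrape.com, whose pager markup is shown above:

$ scrapy shell "https://quotes.toscrape.com/page/1/"
>>> response.css("li.next a").get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
>>> response.css("li.next a::attr(href)").get()
'/page/2/'
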
Recursive Link Following:

Now, let us modify our spider to recursively follow the “Next Page” link and extract data from all linked pages. Here is the improved code:

Spider Recursively Following Links
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        # Extract quotes as usual (the same extraction shown earlier)
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        # Check for a link to the next page
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            # Build a complete URL using urljoin
            next_page = response.urljoin(next_page)
            # Yield a new request for the next page
            yield scrapy.Request(next_page, callback=self.parse)
Explanation:
  1. After extracting data from the current page, the parse method checks for the “Next Page” link.
  2. If a link exists, response.urljoin(next_page) constructs a complete URL from the relative link found on the page.
  3. A new scrapy.Request object is yielded for the next page, with the parse method specified as the callback that processes the response once it has been downloaded (a shortcut for this is sketched after this list).
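
Scrapy also provides response.follow, which accepts relative URLs directly, so the explicit urljoin step can be skipped. Below is a minimal sketch of the same parse method using that shortcut (the quote extraction is elided, exactly as above):

    def parse(self, response):
        # ... extract quotes as shown above ...
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            # response.follow resolves relative URLs internally,
            # so no call to response.urljoin is needed
            yield response.follow(next_page, callback=self.parse)
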
Key Points:
  • Yielding Request objects within a callback triggers Scrapy to schedule and send those requests.
  • The specified callback function (in this case, self.parse) processes the downloaded content of the new page.
  • This mechanism allows you to build complex crawlers that follow links according to defined rules and extract data from different sections of a website (see the CrawlSpider sketch below).
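
For crawls driven by declarative rules, Scrapy also offers the CrawlSpider class, where link following is configured through Rule and LinkExtractor objects rather than written by hand. The sketch below is one possible way to express the same pagination crawl; the names quotes_crawl and parse_page are illustrative choices, not part of the example above:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class QuotesCrawlSpider(CrawlSpider):
    name = "quotes_crawl"
    start_urls = ["https://quotes.toscrape.com/page/1/"]

    # Follow every link found inside the "li.next" pager element
    # and hand each followed page to parse_page
    rules = (
        Rule(LinkExtractor(restrict_css="li.next"), callback="parse_page", follow=True),
    )

    def parse_start_url(self, response):
        # The start URL is not routed through the rules by default,
        # so parse it explicitly to avoid skipping the first page
        return self.parse_page(response)

    def parse_page(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
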

By implementing link following, you can scrape data from entire websites with pagination structures, making your scraping tasks more comprehensive.
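
Assuming the spider above is saved inside a standard Scrapy project, a run along the following lines crawls every page and exports the collected quotes (the quotes.json filename is only an illustration):

$ scrapy crawl quotes -o quotes.json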

