Python Scrapy: Writing a Spider

Spiders are Python classes that you define within a Scrapy project. They serve as the core components for instructing Scrapy on how to navigate and extract data from websites.

To create a spider, subclass scrapy.Spider and specify where crawling should begin: either list the starting URLs in the start_urls class attribute or generate requests dynamically by overriding the start_requests() method. In addition, a parse() method is needed to handle the downloaded page content and extract the desired data. You may also define rules for how to follow links on the pages, such as whether to follow all links or only specific types.
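
When a fixed list of pages is all you need, the start_urls shortcut is sufficient: Scrapy's default start_requests() implementation generates the initial requests from it and routes each response to parse(). A minimal sketch (the class and spider name here are illustrative):

import scrapy

class MinimalQuotesSpider(scrapy.Spider):
    name = "quotes-minimal"  # illustrative name; must be unique within the project
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        # By default, Scrapy calls parse() with each downloaded response
        self.log(f"Visited {response.url}")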

Here is the code for the Spider, this time generating the start requests explicitly:

from pathlib import Path
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  

    def start_requests(self):
        # List of URLs to scrape 
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        # Yield a request for each URL; Scrapy schedules and downloads them
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]  # page number, e.g., 1 or 2
        filename = f"quotes-{page}.html"  # quotes-1.html, quotes-2.html
        Path(filename).write_bytes(response.body) # Save the page
        self.log(f"Saved file {filename}")  # Maintain a log

  • name: This identifies the spider and is what you use when running it (for example, scrapy crawl quotes). It must be unique within a project.
  • start_requests(): Returns an iterable of scrapy.Request objects from which the spider will begin to crawl.
  • parse(): This method is invoked by Scrapy to process the response received for each request. The response parameter is a TextResponse object containing the downloaded page content and providing various methods for working with it.
    Within the parse method, you typically do the following (see the sketch after this list):
    • Parse the Response: Extract the desired data from the response using XPath or CSS selectors.
    • Yield Items: Generate Item objects (or plain dicts) containing the extracted data; these can be further processed by item pipelines.
    • Follow Links: If necessary, identify and create new Request objects to follow links on the page, triggering additional requests and parsing.
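
Putting those three steps together, here is a sketch of a parse() method that extracts quotes and follows pagination. It assumes the markup of quotes.toscrape.com (quotes inside div.quote elements, with a li.next link to the following page); the spider name is illustrative:

import scrapy

class QuotesParseSpider(scrapy.Spider):
    name = "quotes-parse"  # illustrative name
    start_urls = ["https://quotes.toscrape.com/page/1/"]

    def parse(self, response):
        # Parse the response: extract data with CSS selectors
        for quote in response.css("div.quote"):
            # Yield items: plain dicts also work; item pipelines can process them further
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
        # Follow links: queue the next page and parse it the same way
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Running scrapy crawl quotes-parse -o quotes.json from the project root writes the yielded items to a JSON file.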
