Python Scrapy: Writing a Spider

Spiders are Python classes that you define within a Scrapy project. They serve as the core components for instructing Scrapy on how to navigate and extract data from websites.

To create a spider, subclass scrapy.Spider and specify where crawling should begin: either list the starting URLs in the start_urls class attribute or generate requests dynamically by overriding the start_requests() method. In addition, a parse() method is needed to handle the downloaded page content and extract the desired data. You may also define rules for how to follow links on the pages, such as whether to follow all links or only specific types.
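
When a fixed list of pages is all you need, the start_urls shortcut is sufficient: Scrapy's default start_requests() implementation generates the initial requests from it and routes each response to parse(). A minimal sketch (the class and spider name here are illustrative):

import scrapy

class MinimalQuotesSpider(scrapy.Spider):
    name = "quotes-minimal"  # illustrative name; must be unique within the project
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        # By default, Scrapy calls parse() with each downloaded response
        self.log(f"Visited {response.url}")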

Here is the code for the Spider, this time generating the start requests explicitly:

from pathlib import Path
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  

    def start_requests(self):
        # List of URLs to scrape 
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        # Yield a request for each URL; Scrapy schedules and downloads them
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]  # page number, e.g., 1 or 2
        filename = f"quotes-{page}.html"  # quotes-1.html, quotes-2.html
        Path(filename).write_bytes(response.body) # Save the page
        self.log(f"Saved file {filename}")  # Maintain a log

  • name: This identifies the spider and is what you use when running it (for example, scrapy crawl quotes). It must be unique within a project.
  • start_requests(): Returns an iterable of scrapy.Request objects from which the spider will begin to crawl.
  • parse(): This method is invoked by Scrapy to process the response received for each request. The response parameter is a TextResponse object containing the downloaded page content and providing various methods for working with it.
    Within the parse method, you typically do the following (see the sketch after this list):
    • Parse the Response: Extract the desired data from the response using XPath or CSS selectors.
    • Yield Items: Generate Item objects (or plain dicts) containing the extracted data; these can be further processed by item pipelines.
    • Follow Links: If necessary, identify and create new Request objects to follow links on the page, triggering additional requests and parsing.
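
Putting those three steps together, here is a sketch of a parse() method that extracts quotes and follows pagination. It assumes the markup of quotes.toscrape.com (quotes inside div.quote elements, with a li.next link to the following page); the spider name is illustrative:

import scrapy

class QuotesParseSpider(scrapy.Spider):
    name = "quotes-parse"  # illustrative name
    start_urls = ["https://quotes.toscrape.com/page/1/"]

    def parse(self, response):
        # Parse the response: extract data with CSS selectors
        for quote in response.css("div.quote"):
            # Yield items: plain dicts also work; item pipelines can process them further
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
        # Follow links: queue the next page and parse it the same way
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Running scrapy crawl quotes-parse -o quotes.json from the project root writes the yielded items to a JSON file.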
