Extracting Data in the Spider

Let us revisit the spider. Up to this point, it has only saved each HTML page in full to a local file, without extracting any specific data. We will now build the extraction logic into the spider itself.

A Scrapy spider typically generates many dictionaries, each holding the data extracted from one item on the page. To do this, we use Python's yield keyword in the callback method (parse), which emits one dictionary per quote, as demonstrated in Exhibit 25.62.

Spider Extracting Quotes from quotes.toscrape.com
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

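    def parse(self, response):
        # Each quote on the page sits in its own <div class="quote"> element.
        for quote in response.css("div.quote"):
            # Yield one dictionary per quote; Scrapy logs each as a scraped item.
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                # getall() returns a list, since a quote can carry several tags.
                "tags": quote.css("div.tags a.tag::text").getall(),
            }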
Output Data with Log
2024-09-03 20:05:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2024-09-03 20:05:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

Exhibit 25.62 Spider scraping quotes.toscrape.com to extract quotes.
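
To see why yield suits this job, it may help to look at the same pattern in plain Python, without Scrapy. The sketch below is only an illustration: SAMPLE_QUOTES and parse_quotes are hypothetical names, and the two records are copied from the log output in Exhibit 25.62 in place of a real response. It shows that a function containing yield returns a generator, which hands back one dictionary at a time instead of building a full list first.

```python
# Plain-Python sketch of the yield pattern used in parse(); no Scrapy required.
# SAMPLE_QUOTES and parse_quotes are hypothetical names used for illustration.
# The records are copied from the log output shown in Exhibit 25.62.
SAMPLE_QUOTES = [
    {
        "text": "“It is better to be hated for what you are than to be loved for what you are not.”",
        "author": "André Gide",
        "tags": ["life", "love"],
    },
    {
        "text": "“I have not failed. I've just found 10,000 ways that won't work.”",
        "author": "Thomas A. Edison",
        "tags": ["edison", "failure", "inspirational", "paraphrased"],
    },
]


def parse_quotes(records):
    """Yield one dictionary per record, much as parse() does per div.quote."""
    for record in records:
        yield {
            "text": record["text"],
            "author": record["author"],
            "tags": list(record["tags"]),
        }


items = parse_quotes(SAMPLE_QUOTES)  # a generator; the loop body has not run yet
first = next(items)                  # runs the body up to the first yield
remaining = list(items)              # consumes whatever is left

if __name__ == "__main__":
    print(first["author"])
    print(len(remaining))
```

In the real spider, Scrapy plays the role of the caller: it iterates over whatever parse yields and treats each dictionary as one scraped item, which is why each item appears as its own entry in the log.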
