Let us revisit the spider. Up to this point, it has only been saving the entire HTML page to a local file without extracting specific data. We will now integrate the extraction logic into our spider.
A Scrapy spider typically generates multiple dictionaries containing the extracted data. To achieve this, we utilize the yield
keyword in the callback method, as demonstrated in Exhibit 25.62.
import scrapy
class QuotesSpider(scrapy.Spider):
name = "quotes"
start_urls = [
"https://quotes.toscrape.com/page/1/",
"https://quotes.toscrape.com/page/2/",
]
def parse(self, response):
for quote in response.css("div.quote"):
yield {
"text": quote.css("span.text::text").get(),
"author": quote.css("small.author::text").get(),
"tags": quote.css("div.tags a.tag::text").getall(),
}
2024-09-03 20:05:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/> {'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'} 2024-09-03 20:05:10 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/page/1/> {'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}
Use the Search Bar to find content on MarketingMind.
Contact | Privacy Statement | Disclaimer: Opinions and views expressed on www.ashokcharan.com are the author’s personal views, and do not represent the official views of the National University of Singapore (NUS) or the NUS Business School | © Copyright 2013-2025 www.ashokcharan.com. All Rights Reserved.