Python Scrapy: Running the Spider

To run the spider, navigate to the project’s top-level directory and run:
scrapy crawl quotes

This command runs the spider, which sends requests to the quotes.toscrape.com domain. The output will look something like this:

2024-01-01 19:19:10 [scrapy.core.engine] INFO: Spider opened
...
2024-01-01 19:19:10 [quotes] DEBUG: Saved file quotes-1.html
2024-01-01 19:19:10 [quotes] DEBUG: Saved file quotes-2.html
2024-01-01 19:19:10 [scrapy.core.engine] INFO: Closing spider (finished)

Two new files have been created, quotes-1.html and quotes-2.html, containing the content of the URLs listed in the spider.

Alternatively, instead of implementing start_requests(), you can define a start_urls class attribute as follows:

Spider using start_urls
import scrapy
from pathlib import Path

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]
    # No start_requests() needed: Scrapy's default implementation
    # generates a request for each URL in start_urls.

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")
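The filename logic in parse() can be checked on its own, without running Scrapy. A minimal standalone sketch (plain Python; the example URL matches those in start_urls):

```python
# A URL like https://quotes.toscrape.com/page/2/ ends with a slash,
# so split("/") produces a trailing empty string and index [-2]
# holds the page number.
url = "https://quotes.toscrape.com/page/2/"
page = url.split("/")[-2]
filename = f"quotes-{page}.html"
print(filename)  # quotes-2.html
```

This is why the spider saves quotes-1.html and quotes-2.html: the page number is taken from the second-to-last path segment of each response URL.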

