Scrapy Framework



Exhibit 25.37 Scrapy architecture. Source: docs.scrapy.org.

Scrapy’s framework is built around two independent components:

  • Spiders: The Spiders are responsible for crawling websites and extracting content based on the rules defined in the Spider classes.
  • Pipelines: The Pipelines process the extracted data, performing tasks like cleaning, validation, and storing the data in the desired format or database.

These components operate independently due to their distinct dependencies. The Spider depends on the website’s structure, while the Pipeline is focused on processing and storing the data.
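To illustrate this separation, here is a minimal sketch of a pipeline, assuming a job-listing item with hypothetical fields such as title and company. Scrapy item pipelines are ordinary Python classes that implement process_item(item, spider); the class name, the field names, and the in-memory stored list are assumptions made for illustration. The sketch raises ValueError to stay dependency-free, whereas a real Scrapy project would typically raise scrapy.exceptions.DropItem to discard an item.

```python
class JobCleaningPipeline:
    """Hypothetical pipeline: cleans and validates items.

    It contains no selectors or URLs, so it does not depend on
    the structure of the website being scraped.
    """

    def __init__(self):
        # Stand-in for a file or database; an assumption for this sketch.
        self.stored = []

    def process_item(self, item, spider):
        # Collapse runs of whitespace in every field value.
        cleaned = {key: " ".join(str(value).split()) for key, value in item.items()}
        # Validation: reject items without a title.
        # (In Scrapy, raise scrapy.exceptions.DropItem instead.)
        if not cleaned.get("title"):
            raise ValueError("item has no title")
        self.stored.append(cleaned)
        return cleaned


pipeline = JobCleaningPipeline()
raw_item = {"title": "  Data   Engineer\n", "company": " Acme ", "location": "Remote"}
print(pipeline.process_item(raw_item, spider=None))
# {'title': 'Data Engineer', 'company': 'Acme', 'location': 'Remote'}
```

If the website changes its layout, only the Spider's extraction rules would need updating; a pipeline like this one would continue to work as long as the item fields stay the same.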

Scrapy’s architecture and data flow are depicted in Exhibit 25.37. The execution engine orchestrates the data flow in Scrapy, following these steps:

  1. Initialization: The Engine obtains the initial requests from the Spider.
  2. Scheduling: Requests are scheduled in the Scheduler, which manages the order and prioritization of requests.
  3. Request Retrieval: The Engine retrieves the next request from the Scheduler.
  4. Downloading: The request is sent to the Downloader, passing through Downloader Middlewares for potential modifications.
  5. Response Processing: Once the page has finished downloading, the Downloader generates a Response and sends it back to the Engine, passing through Downloader Middlewares again.
  6. Spider Processing: The Engine forwards the Response to the Spider for processing, passing through Spider Middlewares.
  7. Item Extraction and New Requests: The Spider extracts items from the Response and generates new Requests to follow, passing them through Spider Middlewares.
  8. Item Pipelines: Processed items are sent to Item Pipelines for further processing, storage, or validation.
  9. Request Scheduling: New Requests are returned to the Scheduler for scheduling, and the process repeats from step 3 until there are no more pending requests.

This cyclical process ensures efficient and organized web scraping within Scrapy.
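The nine steps above can be sketched as a toy loop in plain Python. This is not Scrapy's implementation: it omits the middlewares and asynchronous downloading, and every name in it (FAKE_SITE, start_requests, download, parse, run_engine) is hypothetical. The in-memory site stands in for real HTTP requests, and the seen set is a simplified version of the duplicate filtering that Scrapy applies to requests by default.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page text, links found on the page).
FAKE_SITE = {
    "/jobs?page=1": ("Data Engineer", ["/jobs?page=2"]),
    "/jobs?page=2": ("Web Developer", ["/jobs?page=1"]),  # links back to page 1
}


def start_requests():
    """Step 1: the Spider supplies the initial requests."""
    return ["/jobs?page=1"]


def download(url):
    """Steps 4-5: stand-in for the Downloader; builds a response."""
    text, links = FAKE_SITE[url]
    return {"url": url, "text": text, "links": links}


def parse(response):
    """Steps 6-7: the Spider yields items (dicts) and new requests (strings)."""
    yield {"title": response["text"], "source": response["url"]}
    for link in response["links"]:
        yield link


def run_engine():
    scheduler = deque()  # pending requests, first in, first out
    seen = set()         # URLs already scheduled, to avoid endless loops
    stored = []          # stand-in for the Item Pipelines' storage

    def schedule(url):
        if url not in seen:
            seen.add(url)
            scheduler.append(url)

    for url in start_requests():        # steps 1-2
        schedule(url)
    while scheduler:                    # repeat until nothing is pending
        url = scheduler.popleft()       # step 3
        response = download(url)        # steps 4-5
        for result in parse(response):  # steps 6-7
            if isinstance(result, dict):
                stored.append(result)   # step 8: item goes to the pipeline
            else:
                schedule(result)        # step 9: new request is scheduled
    return stored


print(run_engine())
# [{'title': 'Data Engineer', 'source': '/jobs?page=1'},
#  {'title': 'Web Developer', 'source': '/jobs?page=2'}]
```

The loop ends once the scheduler is empty, which mirrors the stopping condition in step 9. Page 2 links back to page 1, but that request is never scheduled again because its URL is already in the seen set.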

