Python Scrapy: Extracting Data — CSS Method

To begin learning the CSS method, open the Scrapy shell and try this:

>>> response.css("title")
[<Selector query='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

The result is a SelectorList: a list-like object holding Selector objects, each of which wraps an XML/HTML element and can be queried further.

To narrow the selection, refine the query. The code below, for instance, extracts the text from the <title> element.

>>> response.css("title::text").get()
'Quotes to Scrape'

There are two things to note here. First, ::text has been added to the CSS query to select only the text inside the <title> element. Second, the get() method returns only the first match, or None if nothing matches.

If there is more than one instance of the element, use the getall() method to retrieve all results. The output is a list containing every match:

>>> response.css("title::text").getall()
['Quotes to Scrape']

Besides the getall() and get() methods, you can also use the re() method to filter and extract content using regular expressions:

>>> response.css("title::text").re(r"Quotes.*")
['Quotes to Scrape']
>>> response.css("title::text").re(r"Q\w+")
['Quotes']
>>> response.css("title::text").re(r"(\w+) to (\w+)")
['Quotes', 'Scrape']

To find the right CSS selectors, it can help to open the response page in your web browser from the shell using view(response). You can then use the browser's developer tools to inspect the HTML and come up with a selector.

