Scraping content of a tag by title with Scrapy
I am scraping listings on a real estate website. The property details are located within a table, and they all have the same class name.
However, the values are occasionally ordered differently or missing, so when I run my spider I get values in the wrong columns:
type = response.css('div.carac-value span::text').extract()[1]
year = response.css('div.carac-value span::text').extract()[2]
area = response.css('div.carac-value span::text').extract()[3]
(i.e., in the property area column I would get the construction year). How can I extract only the content of an element with a specific title, such as "Superficie nette"?
- I used default='' since not all of the pages have those properties (year, type, area).
- I'm using XPath to find the specific div whose text contains a given word, and then getting the text of its following sibling.
- I renamed type to type1, since type is the name of a Python built-in.
scrapy shell
In [1]: url = 'https://www.centris.ca/fr/condo~a-vendre~montreal-rosemont-la-petite-patrie/19783085?view=Summary&uc=3'
In [2]: headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}
In [3]: req = scrapy.Request(url=url, headers=headers)
In [4]: fetch(req)
[scrapy.core.engine] INFO: Spider opened
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.centris.ca/fr/condo~a-vendre~montreal-rosemont-la-petite-patrie/19783085?view=Summary&uc=3> (referer: None)
In [5]: type1 = response.xpath('//div[@class="carac-title"][contains(text(), "Type")]/following-sibling::div[@class="carac-value"]//text()').get(default='')
In [6]: year = response.xpath('//div[@class="carac-title"][contains(text(), "Année")]/following-sibling::div[@class="carac-value"]//text()').get(default='')
In [7]: area = response.xpath('//div[@class="carac-title"][contains(text(), "Superficie")]/following-sibling::div[@class="carac-value"]//text()').get(default='')
In [8]: type1
Out[8]: 'Divise'
In [9]: year
Out[9]: '2015'
In [10]: area
Out[10]: ''
# example with a page that has an area value
In [11]: url = 'https://www.centris.ca/fr/maison~a-vendre~laval-sainte-rose/19672961?view=Summary&uc=2'
In [12]: req = scrapy.Request(url=url, headers=headers)
In [13]: fetch(req)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.centris.ca/fr/maison~a-vendre~laval-sainte-rose/19672961?view=Summary&uc=2> (referer: None)
In [14]: area = response.xpath('//div[@class="carac-title"][contains(text(), "Superficie")]/following-sibling::div[@class="carac-value"]//text()').get(default='')
In [15]: area
Out[15]: '7 500 pc'
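The shell session above repeats the same XPath for each field, changing only the label. In a spider callback this could be wrapped in a small helper. The following is only a minimal sketch: the names FIELD_LABELS, value_xpath and extract_details are hypothetical (they are not part of Scrapy), and it assumes the labels contain no double quotes and that response is a Scrapy response or any object with the same xpath(...).get(default=...) interface.

```python
# Hypothetical mapping: item field name -> label text shown in div.carac-title.
FIELD_LABELS = {
    "type1": "Type",
    "year": "Année",
    "area": "Superficie",
}


def value_xpath(label):
    """Build the XPath from the session above for one label.

    Finds the title div whose text contains `label`, then selects the text
    of its following sibling div with class "carac-value".
    Assumes `label` contains no double quotes.
    """
    return (
        '//div[@class="carac-title"][contains(text(), "{}")]'
        '/following-sibling::div[@class="carac-value"]//text()'
    ).format(label)


def extract_details(response):
    """Return {field: text}; a field missing from the page becomes ''."""
    return {
        field: response.xpath(value_xpath(label)).get(default="").strip()
        for field, label in FIELD_LABELS.items()
    }
```

In a spider, the parse callback could then do something like yield extract_details(response), so each value is looked up by its title rather than by its position in the table.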