Scraping content of a tag by title with Scrapy

I am scraping listings on a real estate website. The property details are located within a table, and all of the value cells have the same class name.

However, the values are sometimes in a different order or missing altogether, so when I run my spider some values end up in the wrong columns:

    type = response.css('div.carac-value span::text').extract()[1]
    year = response.css('div.carac-value span::text').extract()[2]
    area = response.css('div.carac-value span::text').extract()[3]

(i.e. in the property area column I would get the construction year instead). How can I extract only the content of the cell that has a specific title, like "Superficie nette"?

I'm using XPath to find the title div that contains a given word, and then taking the text of its following sibling value div.
I used default='' since not all of the pages have all of those properties (year, type, area).
I also renamed type to type1, which avoids shadowing Python's built-in type.

scrapy shell

In [1]: url = 'https://www.centris.ca/fr/condo~a-vendre~montreal-rosemont-la-petite-patrie/19783085?view=Summary&uc=3'

In [2]: headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'}

In [3]: req = scrapy.Request(url=url, headers=headers)

In [4]: fetch(req)
[scrapy.core.engine] INFO: Spider opened
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.centris.ca/fr/condo~a-vendre~montreal-rosemont-la-petite-patrie/19783085?view=Summary&uc=3> (referer: None)

In [5]: type1 = response.xpath('//div[@class="carac-title"][contains(text(), "Type")]/following-sibling::div[@class="carac-value"]//text()').get(default='')

In [6]: year = response.xpath('//div[@class="carac-title"][contains(text(), "Année")]/following-sibling::div[@class="carac-value"]//text()').get(default='')

In [7]: area = response.xpath('//div[@class="carac-title"][contains(text(), "Superficie")]/following-sibling::div[@class="carac-value"]//text()').get(default='')

In [8]: type1
Out[8]: 'Divise'

In [9]: year
Out[9]: '2015'

In [10]: area
Out[10]: ''

# example with a page that has an area value
In [11]: url = 'https://www.centris.ca/fr/maison~a-vendre~laval-sainte-rose/19672961?view=Summary&uc=2'

In [12]: req = scrapy.Request(url=url, headers=headers)

In [13]: fetch(req)
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.centris.ca/fr/maison~a-vendre~laval-sainte-rose/19672961?view=Summary&uc=2> (referer: None)

In [14]: area = response.xpath('//div[@class="carac-title"][contains(text(), "Superficie")]/following-sibling::div[@class="carac-value"]//text()').get(default='')

In [15]: area
Out[15]: '7 500 pc'
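
If you want the same lookup inside a spider rather than the shell, it could be wrapped in a small helper. This is only a sketch: the names FIELDS, carac_xpath, extract_carac and demo are mine, and the HTML inside demo is a made-up fragment (including the carac-container wrapper and the full label texts) that assumes each title div is followed by its value div, as the session above suggests.

```python
# Labels as matched in the session above; the dict keys are illustrative field names.
FIELDS = {"type1": "Type", "year": "Année", "area": "Superficie"}


def carac_xpath(label):
    """Build the XPath for the value cell that follows a title cell containing `label`."""
    # Assumption: `label` contains no double quotes, so plain interpolation is safe.
    return (
        f'//div[@class="carac-title"][contains(text(), "{label}")]'
        '/following-sibling::div[@class="carac-value"]//text()'
    )


def extract_carac(response, label, default=""):
    """Return the stripped value text for `label`, or `default` when the row is missing."""
    value = response.xpath(carac_xpath(label)).get()
    return value.strip() if value is not None else default


def demo():
    """Run the helpers against a made-up HTML fragment (not the real page markup)."""
    # parsel is the selector library Scrapy is built on; it is imported here so the
    # helpers above work with any object exposing .xpath(), e.g. a Scrapy response.
    from parsel import Selector

    html = """
    <div class="carac-container">
      <div class="carac-title">Type de copropriété</div>
      <div class="carac-value"><span>Divise</span></div>
    </div>
    <div class="carac-container">
      <div class="carac-title">Année de construction</div>
      <div class="carac-value"><span>2015</span></div>
    </div>
    """
    sel = Selector(text=html)
    # "Superficie" is absent from the fragment, so area falls back to the default.
    return {name: extract_carac(sel, label) for name, label in FIELDS.items()}
```

In a spider's parse method you would call extract_carac(response, label) for each entry in FIELDS and yield the resulting dict, so a missing row gives an empty string instead of shifting the other columns.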
