find a word on a website and get its page link
The main problem is a wrong `allowed_domains` - it has to be the bare domain, without scheme or path `/`:

```python
allowed_domains = ["www.reichelt.com"]
```
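For example, the difference looks like this (the wrong value is only my guess at the typical mistake - depending on the Scrapy version, a URL entry is either ignored with a warning or makes the offsite filter drop every request):

```python
# wrong - a URL instead of a domain
#allowed_domains = ["https://www.reichelt.com/"]

# right - bare domain only
allowed_domains = ["www.reichelt.com"]
```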
Another problem may be that the tutorial is three years old (it links to the documentation for Scrapy 1.5, while the newest version is 2.5.0).
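If you are not sure which version you have installed (and which documentation to read), you can check it quickly:

```python
import scrapy

print(scrapy.__version__)  # i.e. '2.5.0'
```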
It also has some useless lines of code. It reads the `Content-Type` header but never uses it to decode `response.body`. Your URL uses `iso-8859-1` for the original language and `utf-8` for `?LANGUAGE=PL` - but you can simply use `response.text` and it will decode automatically. It also sets `ok = False` and checks it later, which is totally useless.
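A short sketch of the difference (my example, not the tutorial's exact code):

```python
def parse(self, response):
    # manual way - read the header and decode the raw bytes yourself
    content_type = response.headers.get('Content-Type', b'').decode('utf-8').lower()
    data = response.body.decode('iso-8859-1')  # you would have to pick the right encoding

    # simpler way - Scrapy works out the encoding from headers/meta for you
    data = response.text
```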
Minimal working code below - you can copy it to a single file and run it as `python script.py`, without creating a project.
```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import re

wordlist = [
    "katalog",
    "catalog",
    "downloads",
    "download",
]

def find_all_substrings(string, sub):
    return [match.start() for match in re.finditer(re.escape(sub), string)]

class WebsiteSpider(CrawlSpider):
    name = "webcrawler"
    allowed_domains = ["www.reichelt.com"]
    start_urls = ["https://www.reichelt.com/"]
    #start_urls = ["https://www.reichelt.com/?LANGUAGE=PL"]

    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    #crawl_count = 0
    #words_found = 0

    def check_buzzwords(self, response):
        print('[check_buzzwords] url:', response.url)
        #self.crawl_count += 1

        #content_type = response.headers.get("content-type", "").decode('utf-8').lower()
        #print('content_type:', content_type)
        #data = response.body.decode('utf-8')
        data = response.text

        for word in wordlist:
            print('[check_buzzwords] check word:', word)
            substrings = find_all_substrings(data, word)
            print('[check_buzzwords] substrings:', substrings)

            for pos in substrings:
                #self.words_found += 1
                # only display
                print('[check_buzzwords] word: {} | pos: {} | sub: {} | url: {}'.format(word, pos, data[pos-20:pos+20], response.url))
                # send to file
                yield {'word': word, 'pos': pos, 'sub': data[pos-20:pos+20], 'url': response.url}

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(WebsiteSpider)
c.start()
```
EDIT:

I added `data[pos-20:pos+20]` to the yielded data to see where the substring is - and sometimes it is in a URL like `.../elements/adw_2018/catalog/...` or in another place like `<img alt="catalog"`, so using `regex` isn't necessarily a good idea. Maybe it is better to use an `xpath` or `css` selector to search the text only in certain places, or only in links.
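For example, a minimal sketch (my own untested variant) which scans only the text nodes inside `<body>`, so matches in URLs and attributes are skipped:

```python
def check_buzzwords(self, response):
    # join all visible text nodes into one string - tags and attributes are excluded
    page_text = ' '.join(response.xpath('//body//text()').getall())

    for word in wordlist:
        for pos in find_all_substrings(page_text, word):
            yield {'word': word, 'pos': pos, 'sub': page_text[pos-20:pos+20], 'url': response.url}
```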
EDIT:

Here is a version which searches links for words from the list. It uses `response.xpath` to get all links and then checks whether a word appears in the `href` - so it doesn't need `regex`. One problem is that it treats a link with `-downloads-` (with `s`) as a link with both the word `download` and the word `downloads`, so it would need a more complex check (e.g. using `regex` after all) to treat it only as a link with the word `downloads` - see the sketch after the code below.
```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

wordlist = [
    "katalog",
    "catalog",
    "downloads",
    "download",
]

class WebsiteSpider(CrawlSpider):
    name = "webcrawler"
    allowed_domains = ["www.reichelt.com"]
    start_urls = ["https://www.reichelt.com/"]

    rules = [Rule(LinkExtractor(), follow=True, callback="check_buzzwords")]

    def check_buzzwords(self, response):
        print('[check_buzzwords] url:', response.url)

        links = response.xpath('//a[@href]')

        for word in wordlist:
            for link in links:
                url = link.attrib.get('href')
                if word in url:
                    print('[check_buzzwords] word: {} | url: {} | page: {}'.format(word, url, response.url))
                    # send to file
                    yield {'word': word, 'url': url, 'page': response.url}

# --- run without project and save in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # save in file CSV, JSON or XML
    'FEEDS': {'output.csv': {'format': 'csv'}},  # new in 2.1
})
c.crawl(WebsiteSpider)
c.start()
```
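A minimal sketch of such a stricter check (the helper name `href_has_word` is mine, not from any tutorial):

```python
import re

def href_has_word(href, word):
    # \b is a word boundary: in "-downloads-" it matches "downloads"
    # but not "download", because "download" is followed by the word
    # character "s" ("_" also counts as a word character, "-" does not)
    return re.search(r'\b{}\b'.format(re.escape(word)), href) is not None
```

In the spider you would then use `if href_has_word(url, word):` instead of `if word in url:`.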
You can do it with `requests-html` and rendering the page:
```python
from requests_html import HTMLSession

session = HTMLSession()
url = "https://www.reichelt.com/"

r = session.get(url)
r.html.render(sleep=2)

if "your_word" in r.html.text:  # or r.html.html if you want the raw html
    print([link for link in r.html.absolute_links if "your_word" in link])
```
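Note: the first `render()` call downloads a Chromium build in the background (`requests-html` uses `pyppeteer`), so the first run can take noticeably longer.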