Improving speed of clicks and sending keys with selenium [duplicate]
I'm trying to scrape a JavaScript website using Scrapy and Selenium. I open the JavaScript site with Selenium and a Chrome driver, scrape all the links to the individual listings on the current page with Scrapy, and store them in a list (this has been the best approach so far, as following links with SeleniumRequest and calling back to a parse-new-page function caused a lot of errors). I then loop through the list of URLs, open each one in the Selenium driver and scrape the info from the page. So far this scrapes about 16 pages per minute, which is not ideal given the number of listings on the site. I would ideally have the Selenium drivers opening links in parallel, like the following implementations:
How can I make Selenium run in parallel with Scrapy?
https://gist.github.com/miraculixx/2f9549b79b451b522dde292c4a44177b
However, I can't figure out how to implement parallel processing in my Selenium-Scrapy code.
import scrapy
import time
from scrapy.selector import Selector
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
class MarketPagSpider(scrapy.Spider):
    name = 'marketPagination'

    responses = []

    def start_requests(self):
        yield SeleniumRequest(
            url="https://www.cryptoslam.io/nba-top-shot/marketplace",
            wait_time=5,
            wait_until=EC.presence_of_element_located((By.XPATH, '//SELECT[@name="table_length"]')),
            callback=self.parse
        )

    def parse(self, response):
        # initialize driver
        driver = response.meta['driver']
        driver.set_window_size(1920, 1080)
        time.sleep(1)
        WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, "(//th[@class='nowrap sorting'])[1]"))
        )

        # re-read the page source after the wait so the table rows are present
        response_obj = Selector(text=driver.page_source)
        rows = response_obj.xpath("//tbody/tr[@role='row']")
        for row in rows:
            link = row.xpath(".//td[4]/a/@href").get()
            absolute_url = response.urljoin(link)
            self.responses.append(absolute_url)

        for resp in self.responses:
            driver.get(resp)
            html = driver.page_source
            response_obj = Selector(text=html)

            yield {
                'name': response_obj.xpath("//div[@class='ibox-content animated fadeIn fetchable-content js-attributes-wrapper']/h4[4]/span/a/text()").get(),
                'price': response_obj.xpath("//span[@class='js-auction-current-price']/text()").get()
            }
I know that scrapy-splash can handle requests concurrently, but the website I'm trying to scrape doesn't seem to render in Splash (at least I don't think it does).
I've also removed the pagination code to keep the example concise.
I'm very new to this and open to any suggestions or solutions for multiprocessing with Selenium.
The following sample program creates a thread pool with only 2 threads for demo purposes and then scrapes 4 URLs to get their titles:
from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup
from selenium import webdriver
import threading
import gc
class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        # suppress logging:
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)
        print('The driver was just created.')

    def __del__(self):
        self.driver.quit()  # clean up driver when we are cleaned up
        print('The driver has terminated.')

threadLocal = threading.local()

def create_driver():
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver

def get_title(url):
    driver = create_driver()
    driver.get(url)
    source = BeautifulSoup(driver.page_source, "lxml")
    title = source.select_one("title").text
    print(f"{url}: '{title}'")

# just 2 threads in our pool for demo purposes:
with ThreadPool(2) as pool:
    urls = [
        'https://www.google.com',
        'https://www.microsoft.com',
        'https://www.ibm.com',
        'https://www.yahoo.com'
    ]
    pool.map(get_title, urls)
    # must be done before terminate is explicitly or implicitly called on the pool:
    del threadLocal
    gc.collect()

# pool.terminate() is called at exit of with block
Prints:
The driver was just created.
The driver was just created.
https://www.google.com: 'Google'
https://www.microsoft.com: 'Microsoft - Official Home Page'
https://www.ibm.com: 'IBM - United States'
https://www.yahoo.com: 'Yahoo'
The driver has terminated.
The driver has terminated.
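
If it helps, here is a rough, untested sketch of how the same thread-local driver pattern could be applied to your listing pages instead of fetching them one by one through a single driver. The urls list is a stand-in for the self.responses list your spider builds, the name/price XPaths are copied from your spider, and collecting results with pool.map sidesteps Scrapy's item pipeline, so treat it as a starting point rather than a drop-in replacement:

from multiprocessing.pool import ThreadPool
from scrapy.selector import Selector
from selenium import webdriver
import threading
import gc

class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        self.driver = webdriver.Chrome(options=options)

    def __del__(self):
        self.driver.quit()  # quit the browser when this wrapper is garbage-collected

threadLocal = threading.local()

def create_driver():
    # one cached driver per worker thread, created lazily on first use
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver

def scrape_listing(url):
    driver = create_driver()
    driver.get(url)
    response_obj = Selector(text=driver.page_source)
    return {
        'url': url,
        # XPaths copied from the question; adjust if the page layout differs
        'name': response_obj.xpath("//div[@class='ibox-content animated fadeIn fetchable-content js-attributes-wrapper']/h4[4]/span/a/text()").get(),
        'price': response_obj.xpath("//span[@class='js-auction-current-price']/text()").get(),
    }

if __name__ == '__main__':
    # stand-in for the list of absolute listing URLs you build in self.responses
    urls = [
        'https://www.cryptoslam.io/nba-top-shot/marketplace',
    ]
    with ThreadPool(4) as pool:
        items = pool.map(scrape_listing, urls)
        # drop the thread-local drivers before the pool terminates
        del threadLocal
        gc.collect()
    for item in items:
        print(item)

The key idea is the same as in the demo above: each worker thread lazily creates one driver and reuses it for every URL it handles, and the del threadLocal / gc.collect() pair forces the drivers' __del__ methods to run (and quit the browsers) before the pool shuts down.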