Improving speed of clicks and sending keys with selenium [duplicate]

I'm trying to scrape a JavaScript website using Scrapy and Selenium. I open the site with Selenium and a Chrome driver, scrape all the links to the individual listings on the current page with Scrapy, and store them in a list (this has been the most reliable approach so far, since following the links with SeleniumRequest and calling back to a parse-new-page function caused a lot of errors). I then loop through the list of URLs, open each one in the Selenium driver, and scrape the info from the page. So far this scrapes 16 pages/minute, which is not ideal given the number of listings on this site. I would ideally have the Selenium drivers opening links in parallel, as in the following implementations:

How can I make Selenium run in parallel with Scrapy?

https://gist.github.com/miraculixx/2f9549b79b451b522dde292c4a44177b

However, I can't figure out how to implement parallel processing in my selenium-scrapy code.

import scrapy
import time
from scrapy.selector import Selector
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC


class MarketPagSpider(scrapy.Spider):
    name = 'marketPagination'

    responses = []

    def start_requests(self):
        yield SeleniumRequest(
            url="https://www.cryptoslam.io/nba-top-shot/marketplace",
            wait_time=5,
            wait_until=EC.presence_of_element_located((By.XPATH, '//SELECT[@name="table_length"]')),
            callback=self.parse
        )

    def parse(self, response):
        # grab the driver that scrapy-selenium stored on the response
        driver = response.meta['driver']
        driver.set_window_size(1920, 1080)

        time.sleep(1)
        WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, "(//th[@class='nowrap sorting'])[1]"))
        )

        # collect the absolute URLs of every listing on the current page
        rows = response.xpath("//tbody/tr[@role='row']")
        for row in rows:
            link = row.xpath(".//td[4]/a/@href").get()
            absolute_url = response.urljoin(link)
            self.responses.append(absolute_url)

        # visit each listing one at a time and scrape its details
        for resp in self.responses:
            driver.get(resp)
            html = driver.page_source
            response_obj = Selector(text=html)

            yield {
                'name': response_obj.xpath("//div[@class='ibox-content animated fadeIn fetchable-content js-attributes-wrapper']/h4[4]/span/a/text()").get(),
                'price': response_obj.xpath("//span[@class='js-auction-current-price']/text()").get()
            }

I know that scrapy-splash can handle multiprocessing, but the website I'm trying to scrape doesn't open in Splash (at least I don't think so).

I've also removed the pagination code to keep the example concise.

I'm very new to this and open to any suggestions or solutions for multiprocessing with Selenium.


The following sample program creates a thread pool with only 2 threads for demo purposes and then scrapes 4 URLs to get their titles:

from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup
from selenium import webdriver
import threading
import gc

class Driver:
    def __init__(self):
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")
        # suppress logging:
        options.add_experimental_option('excludeSwitches', ['enable-logging'])
        self.driver = webdriver.Chrome(options=options)
        print('The driver was just created.')

    def __del__(self):
        self.driver.quit() # clean up driver when we are cleaned up
        print('The driver has terminated.')


threadLocal = threading.local()

def create_driver():
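    # Each thread gets its own Driver instance: it is created the first time the
    # thread calls create_driver() and is then reused for every URL that thread handles.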
    the_driver = getattr(threadLocal, 'the_driver', None)
    if the_driver is None:
        the_driver = Driver()
        setattr(threadLocal, 'the_driver', the_driver)
    return the_driver.driver


def get_title(url):
    driver = create_driver()
    driver.get(url)
    source = BeautifulSoup(driver.page_source, "lxml")
    title = source.select_one("title").text
    print(f"{url}: '{title}'")

# just 2 threads in our pool for demo purposes:
with ThreadPool(2) as pool:
    urls = [
        'https://www.google.com',
        'https://www.microsoft.com',
        'https://www.ibm.com',
        'https://www.yahoo.com'
    ]
    pool.map(get_title, urls)
    # must be done before terminate is explicitly or implicitly called on the pool:
    del threadLocal
    gc.collect()
# pool.terminate() is called at exit of with block
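Deleting threadLocal and forcing a garbage-collection pass while still inside the with block is what releases the thread-local Driver instances while the worker threads still exist, so each Driver.__del__ runs and quits its browser before the pool is terminated.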

Prints:

The driver was just created.
The driver was just created.
https://www.google.com: 'Google'
https://www.microsoft.com: 'Microsoft - Official Home Page'
https://www.ibm.com: 'IBM - United States'
https://www.yahoo.com: 'Yahoo'
The driver has terminated.
The driver has terminated.
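
Applied to your spider, a minimal sketch might look like the following. It assumes the Driver class, threadLocal object and create_driver() helper defined above, that listing_urls holds the absolute URLs you currently collect into self.responses, and that a pool of 4 workers is just a placeholder you would tune; the XPaths are copied from your parse method.

from multiprocessing.pool import ThreadPool
from scrapy.selector import Selector
import gc

# Driver, threadLocal and create_driver() are defined exactly as in the demo above.

def scrape_listing(url):
    driver = create_driver()  # reuses this thread's Chrome instance
    driver.get(url)
    sel = Selector(text=driver.page_source)
    return {
        'name': sel.xpath("//div[@class='ibox-content animated fadeIn fetchable-content js-attributes-wrapper']/h4[4]/span/a/text()").get(),
        'price': sel.xpath("//span[@class='js-auction-current-price']/text()").get(),
    }

listing_urls = []  # the absolute URLs you currently append to self.responses

with ThreadPool(4) as pool:
    items = pool.map(scrape_listing, listing_urls)
    # drop the thread-local drivers so the browsers quit before
    # the pool's threads are terminated:
    del threadLocal
    gc.collect()

for item in items:
    print(item)

Because pool.map blocks until every URL has been processed, you could call this from inside parse() and yield the returned dicts, but bear in mind that Scrapy will not do anything else while the pool is running.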