I have been developing web crawlers for a while, and the most common issue I run into is waiting for a page to be completely loaded, including requests, frames, and scripts. I mean completely done.

I have tried several methods to fix it, but whenever I use more than one thread to crawl websites I get this kind of problem: the driver opens, goes to the URL, doesn't wait, and moves on to the next URL.

My attempts are:

// Checks readyState once; if the page is not complete, sleeps for one second and moves on.
JavascriptExecutor js = (JavascriptExecutor) driver.getWebDriver();
String result = js.executeScript("return document.readyState").toString();
if (!result.equals("complete")) {
    Thread.sleep(1000);
}

wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath));

When I ran single-threaded code I had no problems with pages, but with multiple threads it becomes a nightmare. The network cannot serve pages as quickly as it does in the single-threaded case, which is why I need waits in the meantime. I am looking for a definitive solution. Is there a progress listener or something like that?

I am waiting for your advice.

Similar question:

Selenium -- How to wait until page is completely loaded


Solution 1:

Waiting for document.readyState to be complete is not a foolproof way to ensure the presence, visibility, or interactability of an element.

Hence, the function:

JavascriptExecutor js = (JavascriptExecutor) driver.getWebDriver();
String result = js.executeScript("return document.readyState").toString();
if (!result.equals("complete")) {
    Thread.sleep(1000);
}

And even waiting for jQuery.active == 0:

public void WaitForAjax2Complete() throws InterruptedException
{
    // Poll every 100 ms until jQuery reports no active AJAX requests.
    while (true)
    {
        if ((Boolean) ((JavascriptExecutor) driver).executeScript("return jQuery.active == 0")) {
            break;
        }
        Thread.sleep(100);
    }
}

Both of the above will be pure overhead.

You can find a couple of relevant discussions in:

  • Selenium IE WebDriver only works while debugging
  • Do we have any generic function to check if page has completely loaded in Selenium

Solution

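The effective approach is to use WebDriverWait in conjunction with ExpectedConditions to wait for one of the following:

  • presence of the element (presenceOfElementLocated)
  • visibility of the element (visibilityOfElementLocated)
  • interactability of the element (elementToBeClickable)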

You can find a couple of relevant discussions in:

  • Selenium: How selenium identifies elements visible or not? Is is possible that it is loaded in DOM but not rendered on UI?
  • WebDriverWait not working as expected

More than one thread to crawl

WebDriver is not thread-safe. Having said that, if you can serialise access to the underlying driver instance, you can share a reference in more than one thread. This is not advisable. But you can always instantiate one WebDriver instance for each thread.

The thread-safety issue lies not in your code but in the browser bindings, which all assume there will be only one command at a time (as with a real user). On the other hand, you can always instantiate one WebDriver instance per thread, which launches multiple browser tabs/windows. Up to this point your program seems fine.

Different threads can run against the same WebDriver, but then the results will not be what you expect. When you use multi-threading to run different tests on different tabs/windows, some thread-safety code is required; otherwise actions such as click() or sendKeys() go to whichever tab/window currently has focus, regardless of the thread issuing them. In effect, all the tests run simultaneously on the focused tab/window rather than on the intended one.
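As a rough illustration of the one-instance-per-thread idea, here is a minimal sketch in plain Java. PerThread and PerThreadDemo are hypothetical names, and a plain Object stands in for the WebDriver so that the example runs without Selenium; in a real crawler the factory would create a driver, and each thread would call quit() on its own instance when it is done.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hands each thread its own instance of T, so no instance is shared between threads. */
final class PerThread<T> {
    private final ThreadLocal<T> holder;

    PerThread(Supplier<T> factory) {
        // withInitial runs the factory once per thread, on that thread's first get().
        this.holder = ThreadLocal.withInitial(factory);
    }

    /** Returns the calling thread's own instance, creating it on first use. */
    T get() {
        return holder.get();
    }

    /** Forgets the calling thread's instance (with a real driver, call quit() first). */
    void remove() {
        holder.remove();
    }
}

final class PerThreadDemo {
    private PerThreadDemo() {}

    /** Runs `tasks` jobs on `threads` workers and returns the distinct instances they used. */
    static Set<Object> distinctInstances(int threads, int tasks) {
        // Assumption: Object is only a stand-in for a WebDriver.
        PerThread<Object> drivers = new PerThread<>(Object::new);
        Set<Object> seen = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < tasks; i++) {
            // Each task only ever touches the instance owned by its worker thread.
            pool.execute(() -> seen.add(drivers.get()));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen;
    }
}
```

With 3 worker threads and 12 tasks, PerThreadDemo.distinctInstances(3, 12) yields three instances: each worker builds its own on first use and reuses it for later tasks, so a command issued by one thread never lands on another thread's browser.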

Solution 2:

In your code you check readyState and, if the value is not complete, just sleep for one second and proceed to the next steps. Here is code that waits up to 10 seconds using WebDriverWait (alternatively, you can use a simple for loop):

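// Selenium 3 constructor; Selenium 4 takes a Duration instead: new WebDriverWait(driver, Duration.ofSeconds(10))
WebDriverWait wait = new WebDriverWait(driver, 10);
wait.until(d -> ((JavascriptExecutor) d).executeScript("return document.readyState !== 'loading'"));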

Or, accepting either complete or interactive explicitly:

wait.until(d -> ((JavascriptExecutor) d).executeScript("return (document.readyState === 'complete' || document.readyState === 'interactive')"));
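
The "simple for loop" mentioned above might look roughly like the following sketch. ReadyStatePoller is a hypothetical helper, and the readyState value is passed in as a Supplier so that the example runs without a browser; with Selenium, the supplier would presumably wrap the same executeScript("return document.readyState") call used above.

```java
import java.util.function.Supplier;

final class ReadyStatePoller {
    private ReadyStatePoller() {}

    /**
     * Checks the ready state up to maxAttempts times, pausing pauseMillis between checks.
     * Returns true as soon as the state is "complete"; returns false if the attempts
     * run out or the thread is interrupted while pausing.
     */
    static boolean waitForComplete(Supplier<String> readyState, int maxAttempts, long pauseMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if ("complete".equals(readyState.get())) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up waiting.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

Unlike the single check in the question, the caller gets a boolean back and can decide what to do when the page never reaches complete, instead of silently moving on to the next URL.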