How do I scrape / automatically download PDF files from a document search web interface in R?

Some people suggest using RSelenium, which simulates browser actions so that the web server renders the page as if a human were visiting the site. In my experience it is almost never necessary to go down that route. The JavaScript part of the website talks to an API, and we can use that API directly to bypass the JavaScript and get the raw JSON data.

In Firefox (and Chrome is similar in that regard, I assume) you can right-click on the page, select “Inspect Element (Q)”, go to the “Network” tab and reload the page. Within a few seconds you will see every request the browser makes to the web server. We are interested in the ones whose “Type” is json. Right-clicking on an entry lets you select “Open in New Tab”. One of the requests that returns JSON has the following URL attached to it:

https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=dogs%20toronto&page=1

Opening that URL in Firefox gives you a GUI that lets you explore the JSON data structure. You will see a “results” entry that contains the data for the first 25 results of your search. Each result has a “path” entry pointing to the page that displays the embedded PDF. It turns out that if you replace the “.html” part of that path with “.pdf”, it leads directly to the PDF file. The code below uses all of this information.

library(tidyverse) # tidyverse for the pipe and for `purrr::map*()` functions.
library(httr) # this should already be installed on your machine as `rvest` builds on it
library(pdftools) # to extract the text from the PDF files
#> Using poppler version 20.09.0
library(tidytext) # for `unnest_tokens()` and the `stop_words` data set
library(textrank) # for `textrank_sentences()` (TextRank summarization)

base_url <- "https://www.canlii.org"

json_url_search_p1 <-
  "https://www.canlii.org/en/search/ajaxSearch.do?type=decision&text=dogs%20toronto&page=1"

This downloads the JSON for page 1, i.e. results 1 to 25.

results_p1 <-
  GET(json_url_search_p1) %>%
  content(as = "parsed", type = "application/json")
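
To get a feel for what came back you can inspect the top level of the parsed list. If the response looks as described above, the “results” element should contain 25 entries, one per search result:

str(results_p1, max.level = 1)
length(results_p1$results)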

For each result we extract only the “path” entry.

result_html_paths_p1 <-
  map_chr(results_p1$results,
          ~ .$path)

We replace “.html” with “.pdf” and combine the base URL with each path to get the full URLs pointing to the PDFs. Finally, we pipe the result into purrr::map() and pdftools::pdf_text() to extract the text from all 25 PDFs.

pdf_texts_p1 <-
  gsub(".html$", ".pdf", result_html_paths_p1) %>%
  paste0(base_url, .) %>%
  map(pdf_text)
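
If you also want to keep local copies of the PDF files rather than only extracting their text, a minimal sketch using base R’s download.file() could look like the following. Naming the files with basename() is my own choice, not something the site requires:

pdf_urls_p1 <-
  gsub(".html$", ".pdf", result_html_paths_p1) %>%
  paste0(base_url, .)

# download each file into the working directory, named after the last part of
# its path; mode = "wb" makes sure the binary content survives on Windows
walk(pdf_urls_p1, ~ download.file(.x, destfile = basename(.x), mode = "wb"))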

If you want to do this for more than just the first page, you might want to wrap the above code in a function that lets you switch out the “&page=” parameter. You could also make the “&text=” parameter an argument of the function so you can automatically scrape results for other searches; a sketch of such a function follows below.
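
Here is a minimal sketch of such a wrapper. The function name get_pdf_texts() and the use of URLencode() are my own choices; I am assuming the URL scheme we discovered above stays the same for other search terms and pages:

get_pdf_texts <- function(search_text, page = 1) {
  # build the ajaxSearch URL for the given search term and result page
  json_url <- paste0(
    "https://www.canlii.org/en/search/ajaxSearch.do?type=decision",
    "&text=", URLencode(search_text, reserved = TRUE),
    "&page=", page
  )
  
  results <-
    GET(json_url) %>%
    content(as = "parsed", type = "application/json")
  
  # extract the paths, point them at the PDFs and pull out the text
  map_chr(results$results, ~ .$path) %>%
    gsub(".html$", ".pdf", .) %>%
    paste0(base_url, .) %>%
    map(pdf_text)
}

# e.g. results 26 to 50 of the same search
pdf_texts_p2 <- get_pdf_texts("dogs toronto", page = 2)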

For the remaining part of the task we can build on the code you already have. We turn it into a function that can be applied to any article and then apply that function to each PDF text, again using purrr::map().

extract_article_summary <-
  function(article) {
    # split the article (a character vector with one element per PDF page)
    # into sentences, one row per sentence
    article_sentences <- tibble(text = article) %>%
      unnest_tokens(sentence, text, token = "sentences") %>%
      mutate(sentence_id = row_number()) %>%
      select(sentence_id, sentence)
    
    # tokenize the sentences into words and drop stop words
    article_words <- article_sentences %>%
      unnest_tokens(word, sentence) %>%
      anti_join(stop_words, by = "word")
    
    textrank_sentences(data = article_sentences, terminology = article_words)
  }

Be warned, this will take a long time to run!

article_summaries_p1 <- 
  map(pdf_texts_p1, extract_article_summary)

Alternatively, you could use furrr::future_map() to utilize all the CPU cores of your machine and speed up the process.

library(furrr) # make sure the package is installed first
plan(multisession)
article_summaries_p1 <- 
  future_map(pdf_texts_p1, extract_article_summary)
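
To pull the actual summaries out of the returned textrank objects you can, if I recall the textrank API correctly, use its summary() method; the number of sentences (3 here) is an arbitrary choice:

# top 3 sentences of the first article, kept in their original order
summary(article_summaries_p1[[1]], n = 3, keep.sentence.order = TRUE)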

Disclaimer

The code in the answer above is for educational purposes only. As many websites do, this service restricts automated access to its contents: its robots.txt explicitly disallows bots from accessing the /search path. You should therefore get in contact with the site owner before downloading large amounts of data. CanLII offers API access on an individual-request basis, see the documentation here. That would be the correct and safest way to access their data.