How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?
There are good answers on SO about how to use readHTMLTable from the XML package, and I have done that with regular http pages. However, I am not able to solve my problem with https pages.
I am trying to read the table on this website (url string):
library(RTidyHTML)
library(XML)
url <- "https://ned.nih.gov/search/ViewDetails.aspx?NIHID=0010121048"
h = htmlParse(url)
tables <- readHTMLTable(url)
But I get this error: File https://ned.nih.gov/search/Vi...does not exist.
I tried to get past the https problem with the first two lines of code below, based on a solution I found via Google (for example here: http://tonybreyal.wordpress.com/2012/01/13/r-a-quick-scrape-of-top-grossing-films-from-boxofficemojo-com/).
This trick helps to see more of the page, but any attempts to extract the table are not working. Any advice appreciated. I need the table fields like Organization, Organizational Title, Manager.
#attempt to get past the https problem
raw <- getURL(url, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
head(raw)
[1] "\r\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html;
...
h = htmlParse(raw)
Error in htmlParse(raw) : File ...
tables <- readHTMLTable(raw)
Error in htmlParse(doc) : File ...
The new package httr provides a wrapper around RCurl to make it easier to scrape all kinds of pages.
Still, this page gave me a fair amount of trouble. The following works, but no doubt there are easier ways of doing it.
library("httr")
library("XML")
# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
# Read page
page <- GET(
  "https://ned.nih.gov/",
  path = "search/ViewDetails.aspx",
  query = "NIHID=0010121048",
  config(cainfo = cafile)
)
# Use regex to extract the desired table
# content(as = "text") supersedes the older text_content()
x <- content(page, as = "text")
tab <- sub('.*(<table class="grid".*?>.*</table>).*', '\\1', x)
# Parse the table
readHTMLTable(tab)
The results:
$ctl00_ContentPlaceHolder_dvPerson
                V1                                      V2
1      Legal Name:                    Dr Francis S Collins
2  Preferred Name:                      Dr Francis Collins
3          E-mail:                       [email protected]
4        Location: BG 1 RM 1261 CENTER DRBETHESDA MD 20814
5       Mail Stop:
6           Phone:                            301-496-2433
7             Fax:
8              IC:             OD (Office of the Director)
9    Organization:            Office of the Director (HNA)
10 Classification:                                Employee
11            TTY:
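The regex-then-parse step can be tried offline before pointing it at the live page. The sketch below assumes a small hypothetical HTML string (the names, phone number and table id are made up) in place of the downloaded text, and passes asText = TRUE so that htmlParse treats its input as markup rather than as a file name.

```r
library(XML)

# Hypothetical stand-in for the downloaded page text (not the real NIH page)
x <- paste0(
  '<html><body><div>header</div>',
  '<table class="grid" id="demo">',
  '<tr><td>Legal Name:</td><td>Jane Doe</td></tr>',
  '<tr><td>Phone:</td><td>555-0100</td></tr>',
  '</table>',
  '<p>footer</p></body></html>'
)

# Keep only the <table class="grid" ...> ... </table> fragment
tab <- sub('.*(<table class="grid".*?>.*</table>).*', '\\1', x)

# Parse the fragment explicitly as text, then convert every table found
doc <- htmlParse(tab, asText = TRUE)
tables <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)

tables[[1]]
```

If the regex finds no match, sub() returns its input unchanged, so checking that tab starts with "<table" is a cheap guard before parsing.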
Get httr here: http://cran.r-project.org/web/packages/httr/index.html
EDIT: Useful page with FAQ about the RCurl package: http://www.omegahat.org/RCurl/FAQ.html
Using Andrie's great way to get past the https problem, here is a way to get at the data without readHTMLTable.
A table in HTML may have an ID. In this case the table has a convenient one, so an XPath expression passed to getNodeSet selects it directly.
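library("httr")
library("XML")
# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
# Read page (note: ssl.verifypeer = FALSE turns off certificate verification)
page <- GET(
  "https://ned.nih.gov/",
  path = "search/ViewDetails.aspx",
  query = "NIHID=0010121048",
  config(cainfo = cafile, ssl.verifypeer = FALSE)
)
# Parse the response body as text, then select the table by its id
h <- htmlParse(content(page, as = "text"), asText = TRUE)
ns <- getNodeSet(h, "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']")
ns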
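To turn the selected node into usable values, one option is to run relative XPath queries against it. This is a minimal sketch on a hypothetical inline fragment that reuses the table id from above; the two-cell row layout (label, value) is an assumption based on the readHTMLTable output shown earlier, and the variable names are illustrative.

```r
library(XML)

# Hypothetical fragment mimicking the label/value layout of the real table
html <- '<html><body>
<table id="other"><tr><td>ignore</td><td>me</td></tr></table>
<table id="ctl00_ContentPlaceHolder_dvPerson">
<tr><td>Organization:</td><td>Office of the Director (HNA)</td></tr>
<tr><td>Classification:</td><td>Employee</td></tr>
</table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)
ns <- getNodeSet(doc, "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']")

# First cell of each row is the label, second cell is the value
labels <- xpathSApply(ns[[1]], ".//tr/td[1]", xmlValue)
values <- xpathSApply(ns[[1]], ".//tr/td[2]", xmlValue)

# Named character vector keyed by label, with the trailing colon dropped
person <- setNames(values, sub(":$", "", labels))
person[["Organization"]]
```

Selecting by id keeps the query independent of where the table sits on the page, which is why the second table in the fragment is ignored.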
I still need to extract the IDs behind the hyperlinks.
For example, instead of Colleen Barros as manager, I need to get the ID 0010080638:
Manager: Colleen Barros
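One possible approach is to read the href attribute of each link inside the table and pull the ID out of it. The sketch below is untested against the live page: it assumes the manager's name is wrapped in an <a> element whose href carries an NIHID query parameter (as the page's own URL does), and the HTML fragment is a hypothetical stand-in.

```r
library(XML)

# Hypothetical fragment; the real markup of the manager link is an assumption
html <- '<table id="ctl00_ContentPlaceHolder_dvPerson">
<tr><td>Manager:</td>
<td><a href="ViewDetails.aspx?NIHID=0010080638">Colleen Barros</a></td></tr>
</table>'

doc <- htmlParse(html, asText = TRUE)

# All links with an href inside the person table
links <- getNodeSet(
  doc, "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']//a[@href]"
)
hrefs <- sapply(links, xmlGetAttr, "href")
link_text <- sapply(links, xmlValue)

# Keep only links that carry an NIHID, then extract the digits
has_id <- grepl("NIHID=[0-9]+", hrefs)
ids <- sub(".*NIHID=([0-9]+).*", "\\1", hrefs[has_id])

data.frame(name = link_text[has_id], id = ids, stringsAsFactors = FALSE)
```

The ID is kept as a character string so that its leading zeros survive.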