How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?

There are good answers on SO about how to use readHTMLTable from the XML package, and I have done that with regular http pages. However, I am not able to solve my problem with https pages.

I am trying to read the table on this website (url string):

url <- ""
h = htmlParse(url)
tables <- readHTMLTable(url)

But I get this error: File not exist.

I tried to get past the https problem with the first two lines of the code below, a workaround I found by searching Google.

This trick lets me see more of the page, but every attempt to extract the table fails. Any advice is appreciated. I need table fields such as Organization, Organizational Title, and Manager.

 #attempt to get past the https problem 
 raw <- getURL(url, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
[1] "\r\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"\">\n<html xmlns=\"\" xml:lang=\"en\" lang=\"en\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; 
 h = htmlParse(raw)
Error in htmlParse(raw) : File ...
tables <- readHTMLTable(raw)
Error in htmlParse(doc) : File ...

The new package httr provides a wrapper around RCurl to make it easier to scrape all kinds of pages.

Still, this page gave me a fair amount of trouble. The following works, but no doubt there are easier ways of doing it.


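library(httr)
library(XML)

# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

# Read page (url is the string defined in the question)
page <- GET(url, config(cainfo = cafile))

# Use regex to extract the desired table
# (content(page, as = "text") replaces the older text_content(page))
x <- content(page, as = "text")
tab <- sub('.*(<table class="grid".*?>.*</table>).*', '\\1', x)

# Parse the table
readHTMLTable(htmlParse(tab, asText = TRUE))
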
The results:

                V1                                      V2
1      Legal Name:                    Dr Francis S Collins
2  Preferred Name:                      Dr Francis Collins
3          E-mail:                 [email protected]
4        Location: BG 1 RM 1261 CENTER DRBETHESDA MD 20814
5       Mail Stop:                                       Â
6           Phone:                            301-496-2433
7             Fax:                                       Â
8              IC:             OD (Office of the Director)
9    Organization:            Office of the Director (HNA)
10 Classification:                                Employee
11            TTY:                                       Â
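
The regex-then-parse step can be tried offline. The sketch below is only an illustration under stated assumptions: the HTML string `x` is a made-up stand-in for the downloaded page (its names and values are hypothetical, not taken from the real site), and `asText = TRUE` is passed explicitly so that htmlParse treats the string as markup rather than as a file name.

```r
library(XML)

# Hypothetical stand-in for the downloaded page text (not the real page)
x <- paste0(
  '<html><body><p>header</p>',
  '<table class="grid" id="demo">',
  '<tr><td>Legal Name:</td><td>Jane Q Example</td></tr>',
  '<tr><td>Phone:</td><td>555-0100</td></tr>',
  '</table><p>footer</p></body></html>'
)

# Same regex as in the answer: keep only the <table class="grid"> ... </table> span
tab <- sub('.*(<table class="grid".*?>.*</table>).*', '\\1', x)

# asText = TRUE tells htmlParse the string is markup, not a file name
doc <- htmlParse(tab, asText = TRUE)

# readHTMLTable on a document returns a list of tables; take the first one
result <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)[[1]]
print(result)
```

If the page ever contains more than one table with class "grid", the greedy parts of the regex will span from the last opening tag to the last closing tag, so selecting the node with XPath may be more robust.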

httr can be installed from CRAN.

EDIT: The RCurl package also has a useful FAQ page.

Using Andrie's great way to get past the https problem, there is also a way to get at the data without readHTMLTable, shown below.

A table in HTML may have an ID. In this case the table has a convenient one, and an XPath expression passed to the getNodeSet function selects it directly.

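# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
# Read page (url is the string defined in the question)
page <- GET(url, config(cainfo = cafile, ssl.verifypeer = FALSE))

# Parse the response body as text, then select the table by its id
h <- htmlParse(content(page, as = "text"), asText = TRUE)
ns <- getNodeSet(h, "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']")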
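
To see what the XPath selection returns without hitting the site, here is a minimal offline sketch. The HTML fragment is invented for illustration; only the table id is taken from the code above. It also shows that a selected table node can still be handed to readHTMLTable if a data frame is wanted.

```r
library(XML)

# Hypothetical page fragment; only the table id matches the one used above
html <- paste0(
  '<html><body>',
  '<table id="other"><tr><td>ignore</td><td>me</td></tr></table>',
  '<table id="ctl00_ContentPlaceHolder_dvPerson">',
  '<tr><td>Organization:</td><td>Office of the Director (HNA)</td></tr>',
  '<tr><td>Classification:</td><td>Employee</td></tr>',
  '</table></body></html>'
)

h <- htmlParse(html, asText = TRUE)

# Select only the table whose id attribute matches
ns <- getNodeSet(h, "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']")

# readHTMLTable also accepts a single <table> node
person <- readHTMLTable(ns[[1]], header = FALSE, stringsAsFactors = FALSE)
print(person)
```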

I still need to extract the IDs behind the hyperlinks.

For example, instead of Colleen Barros as manager, I need to get the ID 0010080638.

Manager:Colleen Barros
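
One possible approach, sketched under assumptions: the link targets can be read with xpathSApply and xmlGetAttr from the XML package, and the ID cut out of the href with a regex. The post does not show the real markup, so the href format (`details.aspx?id=...`) and the `id=` pattern below are hypothetical and would need adjusting to the actual page.

```r
library(XML)

# Hypothetical markup: the real href format on the page is not shown in the post
html <- paste0(
  '<table id="ctl00_ContentPlaceHolder_dvPerson">',
  '<tr><td>Manager:</td>',
  '<td><a href="details.aspx?id=0010080638">Colleen Barros</a></td></tr>',
  '</table>'
)
h <- htmlParse(html, asText = TRUE)

# All hyperlinks inside the person table
links <- "//table[@id = 'ctl00_ContentPlaceHolder_dvPerson']//a"
link_text <- xpathSApply(h, links, xmlValue)
link_href <- xpathSApply(h, links, xmlGetAttr, "href")

# Pull the run of digits after "id=" (pattern is an assumption about the URL)
link_id <- sub(".*id=([0-9]+).*", "\\1", link_href)

# Keep the ID as character so leading zeros are preserved
managers <- data.frame(name = link_text, id = link_id, stringsAsFactors = FALSE)
print(managers)
```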