Error: XML Content does not seem to be XML | R 3.1.0
I am trying to parse this XML file, but I am unable to. I checked the other answers on the same topic, but I couldn't understand them. I am an R newbie.
> library(XML)
> fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
> doc <- xmlTreeParse(fileURL,useInternal=TRUE)
Error: XML content does not seem to be XML: 'https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml'
Can you please help?
Remove the s from https:
library(XML)
fileURL<-"https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
# Anchor the pattern so only the scheme is changed, not the first "s" found anywhere
doc <- xmlTreeParse(sub("^https", "http", fileURL), useInternal = TRUE)
class(doc)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
You can use RCurl to fetch the content first; XML then seems to be able to handle it:
library(XML)
library(RCurl)
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
xData <- getURL(fileURL)
doc <- xmlParse(xData)
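Once the content is held in a string, it can be parsed and queried without any further network access. The following is a minimal sketch of that step only: the inline XML and its element names (response, row, name, zipcode) are invented stand-ins for whatever getURL returns, not the structure of the real file.

```r
library(XML)

# Stand-in for the text that getURL() would return. The element names are
# assumptions made for illustration, not taken from restaurants.xml.
xData <- paste0(
  "<response>",
  "<row><name>Cafe A</name><zipcode>21231</zipcode></row>",
  "<row><name>Cafe B</name><zipcode>21201</zipcode></row>",
  "</response>"
)

# asText = TRUE states explicitly that the argument is XML content,
# not a file name or URL.
doc <- xmlParse(xData, asText = TRUE)

# Extract the text of every <name> element with an XPath query.
restaurant_names <- xpathSApply(doc, "//name", xmlValue)
print(restaurant_names)
```

Passing asText = TRUE is optional when the string clearly looks like XML, but it avoids any guessing about whether the argument is a path or content.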
xmlTreeParse does not support https. You can load the data with getURL (from RCurl) and then parse it.
The answer is at http://www.omegahat.net/RCurl/installed/RCurl/html/getURL.html. The key point is to pass ssl.verifypeer = FALSE to getURL if a certificate error is shown; note that this turns off certificate verification.
library(RCurl)
library(XML)
curlVersion()$features
curlVersion()$protocols
## These should show ssl and https. I can see these on Windows 8.1 at least.
## It may differ on other OSes.
temp <- getURL("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml", ssl.verifypeer = FALSE)
DFX <- xmlTreeParse(temp, useInternal = TRUE)
If ssl or https capability is not shown by these libcurl functions, look into how to use RCurl with HTTPS.
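The capability check can also be done programmatically instead of by reading the printed output. This is a rough sketch; it assumes that curlVersion() returns protocols as a character vector and features as a named vector, which is how the installed RCurl reports them as far as I can tell. The variable names are made up for the example.

```r
library(RCurl)

info <- curlVersion()

# TRUE if the libcurl build behind RCurl lists https among its protocols
has_https <- "https" %in% info$protocols

# TRUE if ssl appears among the feature names (assumed to be a named vector)
has_ssl <- "ssl" %in% names(info$features)

if (!(has_https && has_ssl)) {
  warning("This libcurl build does not report https/ssl support.")
}
```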
Using download.file avoids introducing another dependency. The following function returns the output of XML::xmlParse even when the URL starts with https. It caches the file in a temporary directory, so the file is downloaded only once if the function is called many times during an R session.
xml_parse <- function(xml_url) {
  # Temporary copy of the xml file, valid for this R session
  xml_temp_file <- file.path(tempdir(), basename(xml_url))
  # Download only if the file is not already cached
  if (!file.exists(xml_temp_file)) {
    print(sprintf("Downloading to %s.", xml_temp_file))
    download.file(xml_url, xml_temp_file)
  }
  return(XML::xmlParse(xml_temp_file))
}
# Example
xml_content <- xml_parse("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml")