How to transform XML data into a data.frame?
Solution 1:
Ordinarily, I would suggest trying the xmlToDataFrame() function, but I believe that will be fairly tricky here because the XML isn't well structured for it to begin with.
I would recommend working with this function:
xmlToList(books)
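To see why xmlToList() is a convenient starting point, it can help to inspect what it returns. The sketch below parses a small inline document and prints its structure. The one-book fragment is made up for illustration (shaped like the books file, not the real data), and the names txt, doc and book_list are assumptions.

```r
library(XML)

# Hypothetical one-book fragment modelled on the books file (assumption, not the real data)
txt <- paste0(
  '<bookstore>',
  '<book category="WEB">',
  '<title lang="en">Learning XML</title>',
  '<author>Erik T. Ray</author>',
  '<year>2003</year>',
  '<price>39.95</price>',
  '</book>',
  '</bookstore>'
)

# Parse the string explicitly as text, then convert the root node to a nested list
doc <- xmlParse(txt, asText = TRUE)
book_list <- xmlToList(xmlRoot(doc))

# One list element per <book>; node attributes are stored under ".attrs"
str(book_list)
```

Each child element becomes a named list entry, so a book with several authors simply ends up with several entries named "author".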
One problem is that there are multiple authors per book, so you will need to decide how to handle that when you're structuring your data frame.
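One possible way to settle the multiple-author question is to collapse all authors of a book into a single string, so that every book ends up with the same fields. This is only a sketch: the records below are made up to resemble xmlToList() output, and collapse_authors() is a hypothetical helper, not part of XML or plyr.

```r
# Hypothetical records shaped like xmlToList() output (assumption, for illustration only)
books_list <- list(
  book = list(title = "Learning XML", author = "Erik T. Ray", year = "2003"),
  book = list(title = "XQuery Kick Start", author = "James McGovern",
              author = "Per Bothner", year = "2003")
)

# Collapse every "author" entry of one book into a single "; "-separated string
collapse_authors <- function(x) {
  is_author <- names(x) == "author"
  authors <- paste(unlist(x[is_author]), collapse = "; ")
  c(x[!is_author], list(author = authors))
}

flat <- lapply(books_list, collapse_authors)
flat[[2]]$author
```

Other choices, such as keeping only the first author or spreading authors over several columns, are equally valid; which one fits depends on how the data frame will be used.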
Once you have decided how to handle multiple authors, it's fairly straightforward to turn the book list into a data frame with the ldply() function in plyr (or just use lapply and convert the return value into a data.frame with do.call("rbind", ...)).
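The base-R route mentioned above can be sketched as follows. The records are made up and already flattened, and the sketch assumes every record has the same fields, which is what rbind needs.

```r
# Hypothetical, already-flattened records (assumption: every record has the same fields)
records <- list(
  list(title = "Everyday Italian", year = "2005", price = "30.00"),
  list(title = "Learning XML",     year = "2003", price = "39.95")
)

# Turn each record into a one-row data frame, then stack the rows
rows <- lapply(records, function(x) data.frame(x, stringsAsFactors = FALSE))
books_df <- do.call("rbind", rows)

books_df
```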
Here's a complete example (excluding author):
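library(XML)
library(plyr)
# File name or URL of the XML document (include the scheme, e.g. http://, for a remote file);
# xmlToList() parses it before converting
books <- "w3schools.com/xsl/books.xml"
# Convert each book to a one-row data frame, dropping the author entries
ldply(xmlToList(books), function(x) { data.frame(x[names(x) != "author"]) } )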
   .id        title.text title..attrs year price   .attrs
1 book  Everyday Italian           en 2005 30.00  COOKING
2 book      Harry Potter           en 2005 29.99 CHILDREN
3 book XQuery Kick Start           en 2003 49.99      WEB
4 book      Learning XML           en 2003 39.95      WEB
Here's what it looks like with author included. You need ldply here because the list is "jagged" (books have different numbers of authors), which lapply with do.call("rbind", ...) can't handle properly. [Otherwise you can use lapply with rbind.fill (also courtesy of Hadley), but why bother when plyr does it for you automatically?]:
ldply(xmlToList(books), data.frame)
   .id        title.text title..attrs              author year price   .attrs
1 book  Everyday Italian           en Giada De Laurentiis 2005 30.00  COOKING
2 book      Harry Potter           en        J K. Rowling 2005 29.99 CHILDREN
3 book XQuery Kick Start           en      James McGovern 2003 49.99      WEB
4 book      Learning XML           en         Erik T. Ray 2003 39.95      WEB
     author.1   author.2   author.3               author.4
1        <NA>       <NA>       <NA>                   <NA>
2        <NA>       <NA>       <NA>                   <NA>
3 Per Bothner Kurt Cagle James Linn Vaidyanathan Nagarajan
4        <NA>       <NA>       <NA>                   <NA>
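For completeness, the bracketed alternative above (lapply with rbind.fill) might look roughly like this. The one-row data frames are made up to mimic the jagged author columns; only rbind.fill() itself comes from plyr.

```r
library(plyr)

# Hypothetical one-row data frames with differing columns
# (assumption, mimics books with different numbers of authors)
rows <- list(
  data.frame(title = "Learning XML", author = "Erik T. Ray",
             stringsAsFactors = FALSE),
  data.frame(title = "XQuery Kick Start", author = "James McGovern",
             author.1 = "Per Bothner", stringsAsFactors = FALSE)
)

# do.call("rbind", rows) would stop with an error here because the column sets differ;
# rbind.fill() pads the missing columns with NA instead
books_df <- rbind.fill(rows)
books_df
```

The result has the same shape as the ldply output: books with fewer authors get NA in the extra author columns.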