How to transform XML data into a data.frame?
Question
I'm trying to learn R's XML package. I'm trying to create a data.frame from the books.xml sample XML data file. Here's what I have so far:
library(XML)
books <- "http://www.w3schools.com/XQuery/books.xml"
doc <- xmlTreeParse(books, useInternalNodes = TRUE)
doc
xpathApply(doc, "//book", function(x) do.call(paste, as.list(xmlValue(x))))
xpathSApply(doc, "//book", function(x) strsplit(xmlValue(x), " "))
xpathSApply(doc, "//book/child::*", xmlValue)
None of these xpathApply/xpathSApply calls gets me anywhere close to what I want. How should one proceed toward a well-formed data.frame?
Recommended answer
Ordinarily, I would suggest trying the xmlToDataFrame() function, but I believe that will be fairly tricky here because the XML isn't well structured to begin with.
I would recommend working with this function instead:
xmlToList(books)
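To see what xmlToList() hands back, here is a minimal sketch that parses a small inline document instead of fetching the URL. The inline XML is an assumption: a trimmed stand-in modelled on books.xml, not the real file.

```r
library(XML)

# Assumption: a trimmed, inline stand-in for books.xml (two books only)
xml_text <- '<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
  </book>
  <book category="WEB">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <year>2003</year>
  </book>
</bookstore>'

doc <- xmlParse(xml_text, asText = TRUE)
book_list <- xmlToList(doc)

length(book_list)      # one list element per <book>
names(book_list[[2]])  # repeated <author> tags appear as repeated names
str(book_list[[1]])    # nested structure of a single book
```

Each book comes back as a nested list, and a book with several authors simply carries several elements named author, which is why the list cannot be bound into a rectangle without a decision about those extra entries.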
One problem is that some books have multiple authors, so you will need to decide how to handle that when you structure your data frame.
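One possible way to deal with multiple authors is to collapse them into a single string per book. The sketch below is only an illustration: the books list is a hand-built, simplified stand-in for what xmlToList() returns (title attributes omitted), and collapse_authors is a hypothetical helper, not part of the XML package.

```r
# Assumption: simplified stand-in for the list returned by xmlToList()
books <- list(
  book = list(title = "Learning XML", author = "Erik T. Ray", year = "2003"),
  book = list(title = "XQuery Kick Start", author = "James McGovern",
              author = "Per Bothner", year = "2003")
)

# Hypothetical helper: join every "author" entry of one book into one string
collapse_authors <- function(book) {
  authors <- unlist(book[names(book) == "author"], use.names = FALSE)
  paste(authors, collapse = "; ")
}

authors <- vapply(books, collapse_authors, character(1), USE.NAMES = FALSE)
authors
```

Collapsing keeps one row per book with a fixed set of columns; the alternative is to keep one column per author and accept NA values for books with fewer authors.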
Once you have decided what to do about the multiple authors, it's fairly straightforward to turn your book list into a data frame with the ldply() function in plyr (or just use lapply and convert the return value into a data.frame with do.call("rbind", ...)).
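The base-R route mentioned above (lapply plus do.call with rbind) might look roughly like this. Again the books list is a hand-built, simplified stand-in for the xmlToList() result, with the author field already left out so that every row has the same columns.

```r
# Assumption: simplified stand-in for the xmlToList() result, authors dropped
books <- list(
  book = list(title = "Everyday Italian", year = "2005", price = "30.00"),
  book = list(title = "Learning XML",     year = "2003", price = "39.95")
)

# Build a one-row data.frame per book, converting the text to proper types
rows <- lapply(books, function(b) {
  data.frame(title = b$title,
             year  = as.integer(b$year),
             price = as.numeric(b$price),
             stringsAsFactors = FALSE)
})

# Stack the rows; unname() keeps the list names out of the row names
books_df <- do.call(rbind, unname(rows))
books_df
```

This only works when every one-row data.frame has identical columns, which is the situation once the authors have been dropped or collapsed.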
Here's a complete example (excluding author):
library(XML)
books <- "http://www.w3schools.com/XQuery/books.xml"
library(plyr)
ldply(xmlToList(books), function(x) { data.frame(x[!names(x)=="author"]) } )
   .id        title.text title..attrs year price   .attrs
1 book  Everyday Italian           en 2005 30.00  COOKING
2 book      Harry Potter           en 2005 29.99 CHILDREN
3 book XQuery Kick Start           en 2003 49.99      WEB
4 book      Learning XML           en 2003 39.95      WEB
Here's what it looks like with author included. You need to use ldply in this instance because the list is "jagged" (the books have different numbers of authors), and a plain lapply can't handle that properly. [Otherwise you can use lapply with rbind.fill (also courtesy of Hadley), but why bother when plyr does it for you automatically?]:
ldply(xmlToList(books), data.frame)
   .id        title.text title..attrs              author year price   .attrs
1 book  Everyday Italian           en Giada De Laurentiis 2005 30.00  COOKING
2 book      Harry Potter           en        J K. Rowling 2005 29.99 CHILDREN
3 book XQuery Kick Start           en      James McGovern 2003 49.99      WEB
4 book      Learning XML           en         Erik T. Ray 2003 39.95      WEB
     author.1   author.2   author.3               author.4
1        <NA>       <NA>       <NA>                   <NA>
2        <NA>       <NA>       <NA>                   <NA>
3 Per Bothner Kurt Cagle James Linn Vaidyanathan Nagarajan
4        <NA>       <NA>       <NA>                   <NA>
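For comparison, the lapply plus rbind.fill alternative mentioned above could be sketched as follows. The books list is a hand-built, simplified stand-in for the xmlToList() result (an assumption, not the real data), with one single-author and one two-author book to make the list jagged.

```r
library(plyr)

# Assumption: simplified, jagged stand-in for the xmlToList() result
books <- list(
  list(title = "Learning XML", author = "Erik T. Ray"),
  list(title = "XQuery Kick Start", author = "James McGovern",
       author = "Per Bothner")
)

# data.frame() renames the repeated "author" entry to "author.1"
rows <- lapply(books, function(b) data.frame(b, stringsAsFactors = FALSE))

# rbind.fill() pads columns that are missing from a row with NA
filled <- rbind.fill(rows)
filled
```

The result has the same shape as the ldply output: extra author columns that are NA for books with a single author.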