How to transform XML data into a data.frame?
Question
I'm trying to learn R's XML package. I'm trying to create a data.frame from the books.xml sample XML data file. Here's what I have so far:
library(XML)
books <- "http://www.w3schools.com/XQuery/books.xml"
doc <- xmlTreeParse(books, useInternalNodes = TRUE)
doc
xpathApply(doc, "//book", function(x) do.call(paste, as.list(xmlValue(x))))
xpathSApply(doc, "//book", function(x) strsplit(xmlValue(x), " "))
xpathSApply(doc, "//book/child::*", xmlValue)
None of these xpathApply/xpathSApply calls gets me even close to what I want. How should one proceed toward a well-formed data.frame?
Recommended answer
Ordinarily, I would suggest trying the xmlToDataFrame() function, but I believe that this will actually be fairly tricky here because the XML isn't well structured for that to begin with.
I would suggest working with this function instead:
xmlToList(books)
One problem is that some books have multiple authors, so you will need to decide how to handle that when you're structuring your data frame.
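One possible way to deal with multiple authors is to collapse them into a single string per book. The sketch below is only an illustration, not part of the original answer: the inline XML is a shortened, hypothetical stand-in for books.xml (so nothing has to be downloaded), and `book_to_row` and the "; " separator are names and choices made up for this example.

```r
library(XML)

# Hypothetical inline sample modelled on books.xml (two books only).
xml_text <- '<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <year>2003</year>
    <price>49.99</price>
  </book>
</bookstore>'

doc <- xmlParse(xml_text, asText = TRUE)

# Illustrative helper: one row per <book>; all <author> values of a book
# are joined into a single string separated by "; ".
book_to_row <- function(node) {
  data.frame(
    category = xmlGetAttr(node, "category"),
    title    = xpathSApply(node, "./title", xmlValue),
    authors  = paste(xpathSApply(node, "./author", xmlValue), collapse = "; "),
    year     = as.integer(xpathSApply(node, "./year", xmlValue)),
    price    = as.numeric(xpathSApply(node, "./price", xmlValue)),
    stringsAsFactors = FALSE
  )
}

# Apply the helper to every <book> node and stack the single-row frames.
books_df <- do.call(rbind, lapply(getNodeSet(doc, "//book"), book_to_row))
books_df
```

Collapsing keeps one row per book and a fixed set of columns; the alternative of one column per author is shown further down.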
Once you have decided what to do about the multiple authors, it's fairly straightforward to turn your book list into a data frame with the ldply() function in plyr (or just use lapply and convert the return value into a data.frame with do.call("rbind", ...)).
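The base-R route mentioned above (lapply followed by do.call("rbind", ...)) might look roughly like this when the author elements are dropped. This is a sketch under assumptions: the inline XML is a shortened, hypothetical stand-in for books.xml, and the variable names are invented for the example.

```r
library(XML)

# Hypothetical inline sample modelled on books.xml, written without
# whitespace between tags so the parsed tree has no blank text nodes.
xml_text <- paste0(
  '<bookstore>',
  '<book category="COOKING"><title lang="en">Everyday Italian</title>',
  '<author>Giada De Laurentiis</author><year>2005</year><price>30.00</price></book>',
  '<book category="WEB"><title lang="en">XQuery Kick Start</title>',
  '<author>James McGovern</author><author>Per Bothner</author>',
  '<year>2003</year><price>49.99</price></book>',
  '</bookstore>'
)

doc <- xmlParse(xml_text, asText = TRUE)
book_list <- xmlToList(doc)   # one list element per <book>

# Drop every "author" entry so that all books have the same fields,
# then turn each book into a single-row data.frame.
rows <- lapply(book_list, function(x) {
  data.frame(x[names(x) != "author"], stringsAsFactors = FALSE)
})

# Stack the rows; unname() keeps the list names out of the row names.
books_df <- do.call("rbind", unname(rows))
rownames(books_df) <- NULL
books_df
```

Base rbind only works here because every row ends up with the same columns once the authors are removed.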
Here's a complete example (excluding author):
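library(XML)
library(plyr)
# xmlToList() accepts a file name or URL; a URL needs its scheme
books <- "http://w3schools.com/xsl/books.xml"
# One single-row data.frame per book, with every author element dropped
ldply(xmlToList(books), function(x) data.frame(x[names(x) != "author"]))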
   .id        title.text title..attrs year price   .attrs
1 book  Everyday Italian           en 2005 30.00  COOKING
2 book      Harry Potter           en 2005 29.99 CHILDREN
3 book XQuery Kick Start           en 2003 49.99      WEB
4 book      Learning XML           en 2003 39.95      WEB
Here's what it looks like with author included. You need to use ldply in this instance because the list is "jagged" (books have different numbers of authors), which lapply combined with do.call("rbind", ...) can't handle properly. [Otherwise you can use lapply with rbind.fill (also courtesy of Hadley), but why bother when plyr automatically does it for you?]:
ldply(xmlToList(books), data.frame)
   .id        title.text title..attrs              author year price   .attrs
1 book  Everyday Italian           en Giada De Laurentiis 2005 30.00  COOKING
2 book      Harry Potter           en        J K. Rowling 2005 29.99 CHILDREN
3 book XQuery Kick Start           en      James McGovern 2003 49.99      WEB
4 book      Learning XML           en         Erik T. Ray 2003 39.95      WEB
     author.1   author.2   author.3               author.4
1        <NA>       <NA>       <NA>                   <NA>
2        <NA>       <NA>       <NA>                   <NA>
3 Per Bothner Kurt Cagle James Linn Vaidyanathan Nagarajan
4        <NA>       <NA>       <NA>                   <NA>
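For comparison, the lapply plus rbind.fill alternative mentioned in the bracketed remark could be sketched as follows. As before, the inline XML is a shortened, hypothetical stand-in for books.xml and the variable names are made up for the example; rbind.fill() comes from plyr.

```r
library(XML)
library(plyr)

# Hypothetical inline sample modelled on books.xml, written without
# whitespace between tags so the parsed tree has no blank text nodes.
xml_text <- paste0(
  '<bookstore>',
  '<book category="COOKING"><title lang="en">Everyday Italian</title>',
  '<author>Giada De Laurentiis</author><year>2005</year><price>30.00</price></book>',
  '<book category="WEB"><title lang="en">XQuery Kick Start</title>',
  '<author>James McGovern</author><author>Per Bothner</author>',
  '<year>2003</year><price>49.99</price></book>',
  '</bookstore>'
)

doc <- xmlParse(xml_text, asText = TRUE)

# Each book becomes a single-row data.frame; repeated "author" fields
# are renamed to author, author.1, ... by data.frame().
rows <- lapply(xmlToList(doc), data.frame, stringsAsFactors = FALSE)

# rbind.fill() stacks frames with differing columns and fills gaps with NA.
books_df <- rbind.fill(rows)
books_df
```

Unlike ldply, this version does not add the .id column; otherwise the shape should match the output above, with NA in the extra author columns for single-author books.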