How to transform XML data into a data.frame?


Question

I'm trying to learn R's XML package by building a data.frame from the books.xml sample data file. Here's what I have so far:

library(XML)
books <- "http://www.w3schools.com/XQuery/books.xml"
doc <- xmlTreeParse(books, useInternalNodes = TRUE)
doc
xpathApply(doc, "//book", function(x) do.call(paste, as.list(xmlValue(x))))
xpathSApply(doc, "//book", function(x) strsplit(xmlValue(x), " "))
xpathSApply(doc, "//book/child::*", xmlValue)

None of these xpathApply/xpathSApply calls gets me anywhere close to what I want. How should I proceed toward a well-formed data.frame?

Recommended answer

Ordinarily, I would suggest trying the xmlToDataFrame() function, but I believe that will be fairly tricky here because the data isn't well structured to begin with.
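For contrast, here is a minimal sketch of the kind of input where xmlToDataFrame() tends to work directly: flat records in which every field appears exactly once. The inline XML and the names flat_xml, flat_doc, and flat_df are assumptions made up for illustration; they are not part of books.xml.

```r
library(XML)

# Hypothetical flat sample: one record per <book>, each field exactly once.
flat_xml <- '<books>
  <book><title>Everyday Italian</title><year>2005</year><price>30.00</price></book>
  <book><title>Learning XML</title><year>2003</year><price>39.95</price></book>
</books>'

# Parse the string as XML text rather than as a file name or URL.
flat_doc <- xmlParse(flat_xml, asText = TRUE)

# Each child of the root becomes a row, each grandchild a column.
flat_df <- xmlToDataFrame(flat_doc, stringsAsFactors = FALSE)
flat_df
```

All columns come back as character here, so year and price would still need converting with as.integer() and as.numeric().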

I would recommend working with this function instead:

xmlToList(books)

One problem is that a book can have multiple authors, so you will need to decide how to handle that when you structure your data frame.

Once you have decided how to handle multiple authors, it is fairly straightforward to turn the book list into a data frame with the ldply() function in plyr (or use lapply and convert the return value into a data.frame with do.call("rbind", ...)).
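As one possible way to settle the multiple-author question, the sketch below collapses all authors of a book into a single string and then takes the lapply plus do.call(rbind, ...) route. The inline XML is a small assumed sample that mirrors the structure of books.xml so the code runs without network access, and book_to_row is a hypothetical helper, not something provided by the XML package.

```r
library(XML)

# Assumed inline sample mirroring the structure of books.xml.
xml_text <- '<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <year>2003</year>
    <price>49.99</price>
  </book>
</bookstore>'

books_list <- xmlToList(xmlParse(xml_text, asText = TRUE))

# Hypothetical helper: turn one book (a nested list) into a one-row data.frame.
book_to_row <- function(b) {
  data.frame(
    title    = b$title$text,
    # Gather every element named "author" and join them into one string.
    authors  = paste(unlist(b[names(b) == "author"]), collapse = "; "),
    year     = as.integer(b$year),
    price    = as.numeric(b$price),
    category = b$.attrs[["category"]],
    stringsAsFactors = FALSE
  )
}

# Every row now has the same columns, so a plain rbind is enough.
books_df <- do.call(rbind, lapply(books_list, book_to_row))
rownames(books_df) <- NULL
books_df
```

Collapsing keeps one row per book with a fixed set of columns; if one row per author is preferred instead, the helper could return several rows per book.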

Here's a complete example (excluding author):

library(XML)
books <- "http://www.w3schools.com/XQuery/books.xml"
library(plyr)
ldply(xmlToList(books), function(x) { data.frame(x[!names(x)=="author"]) } )

   .id        title.text title..attrs year price   .attrs
1 book  Everyday Italian           en 2005 30.00  COOKING
2 book      Harry Potter           en 2005 29.99 CHILDREN
3 book XQuery Kick Start           en 2003 49.99      WEB
4 book      Learning XML           en 2003 39.95      WEB

Here's what it looks like with author included. You need ldply in this case because the list is "jagged": each book yields a different number of author columns, and lapply followed by a plain rbind can't handle that. [Alternatively, you can use lapply with rbind.fill (also courtesy of Hadley), but why bother when plyr does it for you automatically?]:

ldply(xmlToList(books), data.frame)

   .id        title.text title..attrs              author year price   .attrs
1 book  Everyday Italian           en Giada De Laurentiis 2005 30.00  COOKING
2 book      Harry Potter           en        J K. Rowling 2005 29.99 CHILDREN
3 book XQuery Kick Start           en      James McGovern 2003 49.99      WEB
4 book      Learning XML           en         Erik T. Ray 2003 39.95      WEB
     author.1   author.2   author.3               author.4
1        <NA>       <NA>       <NA>                   <NA>
2        <NA>       <NA>       <NA>                   <NA>
3 Per Bothner Kurt Cagle James Linn Vaidyanathan Nagarajan
4        <NA>       <NA>       <NA>                   <NA>
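The lapply plus rbind.fill alternative mentioned above can be sketched as follows. The inline XML is an assumed two-book sample shaped like books.xml (used so the code runs offline), and the variable names are illustrative only.

```r
library(XML)
library(plyr)

# Assumed inline sample: the second book has two authors, so the
# per-book data frames end up with different numbers of columns.
xml_text <- '<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <year>2003</year>
    <price>49.99</price>
  </book>
</bookstore>'

books_list <- xmlToList(xmlParse(xml_text, asText = TRUE))

# One data.frame per book; repeated "author" names become author, author.1, ...
per_book <- lapply(books_list, data.frame)

# rbind.fill pads columns missing from a given book with NA.
jagged_df <- rbind.fill(per_book)
jagged_df
```

Unlike ldply, this route does not add the .id column, which may or may not matter for your use.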
