How to transform XML data into a data.frame?

Question

I'm trying to learn R's XML package. I'm trying to create a data.frame from the books.xml sample data file. Here's what I have so far:

library(XML)
books <- "http://www.w3schools.com/XQuery/books.xml"
doc <- xmlTreeParse(books, useInternalNodes = TRUE)
doc
xpathApply(doc, "//book", function(x) do.call(paste, as.list(xmlValue(x))))
xpathSApply(doc, "//book", function(x) strsplit(xmlValue(x), " "))
xpathSApply(doc, "//book/child::*", xmlValue)

None of these xpathApply/xpathSApply calls gets me anywhere close to what I want. How should one proceed toward a well-formed data.frame?

Recommended answer

Ordinarily, I would suggest trying the xmlToDataFrame() function, but I believe that will actually be fairly tricky here because the data isn't regularly structured to begin with.

I would recommend working with this function instead:

xmlToList(books)

One problem is that some books have multiple authors, so you will need to decide how to handle that when you're structuring your data frame.
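One possible way to handle the multiple-author case is to collapse all authors of a book into a single string. The sketch below is only an illustration under stated assumptions: the inline `xml_text` is a small hypothetical stand-in for books.xml, and the "; " separator and the column names are arbitrary choices, not something the answer prescribes.

```r
library(XML)

# Hypothetical inline sample mimicking the structure of books.xml
xml_text <- '<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <year>2003</year>
    <price>49.99</price>
  </book>
</bookstore>'

doc <- xmlParse(xml_text, asText = TRUE)

# One row per <book>; every <author> child is pasted into one string
rows <- lapply(getNodeSet(doc, "//book"), function(book) {
  data.frame(
    title   = xpathSApply(book, "./title", xmlValue),
    authors = paste(xpathSApply(book, "./author", xmlValue), collapse = "; "),
    year    = as.integer(xpathSApply(book, "./year", xmlValue)),
    price   = as.numeric(xpathSApply(book, "./price", xmlValue)),
    stringsAsFactors = FALSE
  )
})

books_df <- do.call(rbind, rows)
print(books_df)
```

This keeps exactly one row per book at the cost of losing the individual author fields; splitting the string again later is possible with strsplit().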

Once you have decided what to do about the multiple authors, it's fairly straightforward to turn your book list into a data frame with the ldply() function in plyr (or just use lapply and convert the return value into a data.frame with do.call("rbind", ...)).
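The lapply plus do.call("rbind", ...) variant mentioned above might look roughly like this when the authors are simply dropped. It is a sketch, not the answer's own code: the inline `xml_text` is a hypothetical stand-in for books.xml so that the example does not depend on a network connection.

```r
library(XML)

# Hypothetical inline sample mimicking the structure of books.xml
xml_text <- '<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <year>2003</year>
    <price>49.99</price>
  </book>
</bookstore>'

# Nested list with one element per <book>
book_list <- xmlToList(xmlParse(xml_text, asText = TRUE))

# Drop the "author" entries so every book yields the same columns,
# which is what rbind() needs
rows <- lapply(book_list, function(x) {
  data.frame(x[names(x) != "author"], stringsAsFactors = FALSE)
})

# unname() keeps the repeated "book" names from being passed as argument names
books_df <- do.call("rbind", unname(rows))
print(books_df)
```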

Here's a complete example (excluding author):

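library(XML)
library(plyr)
# "http://www." added so the string is read as a URL, not a local file path
books <- "http://www.w3schools.com/xsl/books.xml"
# One row per book, dropping the repeated "author" entries
ldply(xmlToList(books), function(x) data.frame(x[names(x) != "author"]))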

   .id        title.text title..attrs year price   .attrs
1 book  Everyday Italian           en 2005 30.00  COOKING
2 book      Harry Potter           en 2005 29.99 CHILDREN
3 book XQuery Kick Start           en 2003 49.99      WEB
4 book      Learning XML           en 2003 39.95      WEB

Here's what it looks like with author included. You need to use ldply in this instance because the list is "jagged": the books have different numbers of authors, so a plain lapply followed by do.call("rbind", ...) can't handle it properly. [Otherwise you can use lapply with rbind.fill (also courtesy of Hadley), but why bother when plyr automatically does it for you?]:

ldply(xmlToList(books), data.frame)

   .id        title.text title..attrs              author year price   .attrs
1 book  Everyday Italian           en Giada De Laurentiis 2005 30.00  COOKING
2 book      Harry Potter           en        J K. Rowling 2005 29.99 CHILDREN
3 book XQuery Kick Start           en      James McGovern 2003 49.99      WEB
4 book      Learning XML           en         Erik T. Ray 2003 39.95      WEB
     author.1   author.2   author.3               author.4
1        <NA>       <NA>       <NA>                   <NA>
2        <NA>       <NA>       <NA>                   <NA>
3 Per Bothner Kurt Cagle James Linn Vaidyanathan Nagarajan
4        <NA>       <NA>       <NA>                   <NA>
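For comparison, the lapply plus rbind.fill alternative mentioned above could be sketched as follows. The inline `xml_text` is a hypothetical stand-in for books.xml, and the column names (such as author.1) are simply what data.frame() generates for the repeated author entries.

```r
library(XML)
library(plyr)

# Hypothetical inline sample mimicking the structure of books.xml
xml_text <- '<bookstore>
  <book category="COOKING">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="WEB">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <year>2003</year>
    <price>49.99</price>
  </book>
</bookstore>'

book_list <- xmlToList(xmlParse(xml_text, asText = TRUE))

# One data frame per book; books with several authors get extra columns
rows <- lapply(book_list, data.frame, stringsAsFactors = FALSE)

# rbind.fill() pads the columns a book lacks with NA
books_df <- rbind.fill(unname(rows))
print(books_df)
```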
