R Fast XML Parsing

Question

What is currently the fastest way to convert XML files to data frames in R?

The XML looks like this (note that not all rows have all fields):

  <row>
    <ID>001</ID>
    <age>50</age>
    <field3>blah</field3>
    <field4 />
  </row>
  <row>
    <ID>001</ID>
    <age>50</age>
    <field4 />
  </row>

I tried two approaches:

  1. The xmlToDataFrame function from the XML library
  2. The speed-oriented xmlToDF function posted here

For an 8.5 MB file with 1.6k "rows" and 114 "columns", xmlToDataFrame took 25.1 seconds and xmlToDF took 16.7 seconds on my machine.

These times are quite long compared with Python XML parsers (e.g. xml.etree.ElementTree), which did the same job in 0.4 seconds.

Is there a faster way to do this in R, or is there something fundamental in R that prevents us from making this faster?

Some light on this would be really helpful!

Recommended answer

Updated in response to the comments

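library(XML)

# doc is assumed to be the already parsed document, e.g. the result of xmlParse()
d = xmlRoot(doc)
size = xmlSize(d)

# First pass: collect the union of field names across all rows
names = NULL
for(i in seq_len(size)){
    v = getChildrenStrings(d[[i]])
    names = unique(c(names, names(v)))
}

# Write the header line (this also starts a.csv afresh)
cat(paste(names, collapse=","), "\n", file="a.csv", sep="")

# Second pass: append one comma-separated line per row;
# fields missing from a row are written as NA
for(i in seq_len(size)){
    v = getChildrenStrings(d[[i]])
    cat(paste(v[names], collapse=","), "\n", file="a.csv", append=TRUE, sep="")
}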

This finishes in about 0.4 seconds for an XML document with 1000 records of 100 fields each. If you know the variable names in advance, you can even omit the first for loop.
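
As a rough illustration of skipping the first loop, the sketch below runs the CSV approach on the sample from the question with the field names supplied up front. The `<rows>` root element, the `fields` vector, and the temporary output file are assumptions made for the example; the question shows only the `<row>` fragments.

```r
library(XML)

# Sample modelled on the question; the <rows> root is an assumption
xml_text <- paste0(
    "<rows>",
    "<row><ID>001</ID><age>50</age><field3>blah</field3><field4 /></row>",
    "<row><ID>001</ID><age>50</age><field4 /></row>",
    "</rows>")

doc <- xmlParse(xml_text, asText = TRUE)
d <- xmlRoot(doc)

# Field names known in advance, so no first pass over the rows is needed
fields <- c("ID", "age", "field3", "field4")

out_file <- tempfile(fileext = ".csv")

# Write the header once, then one line per row
cat(paste(fields, collapse = ","), "\n", file = out_file, sep = "")
for (i in seq_len(xmlSize(d))) {
    v <- getChildrenStrings(d[[i]])
    # v[fields] yields NA for fields that this row lacks
    cat(paste(v[fields], collapse = ","), "\n",
        file = out_file, append = TRUE, sep = "")
}

# Read everything back as character so that IDs such as "001" keep their zeros
df <- read.csv(out_file, colClasses = "character")
```

Reading with `colClasses = "character"` is a choice made for this example; dropping it lets `read.csv` guess column types, at the cost of turning `001` into `1`.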

Note: if your XML content contains commas or quotation marks, you will have to take special care with them. In that case, I recommend the data.table method below.
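
If you would rather stay with the CSV route, one possible way to handle such fields is to quote every value and double any embedded quotation marks before pasting the line together. The sketch below uses only base R; `quote_csv` is a hypothetical helper name and the sample values are made up for illustration.

```r
# Quote one or more fields for CSV: wrap each in double quotes and double
# any embedded quotes. Missing values are written as a bare NA.
# (quote_csv is a hypothetical helper, not part of any package.)
quote_csv <- function(x) {
    ifelse(is.na(x), "NA",
           paste0('"', gsub('"', '""', x, fixed = TRUE), '"'))
}

# Made-up row containing both a comma and quotation marks
v <- c(ID = "001", note = 'said "hi", then left', field4 = "")
line <- paste(quote_csv(v), collapse = ",")

out_file <- tempfile(fileext = ".csv")
cat(paste(names(v), collapse = ","), "\n", line, "\n",
    file = out_file, sep = "")

# read.csv understands doubled quotes inside a quoted field
back <- read.csv(out_file, colClasses = "character")
```

In the loop from the answer, this would amount to replacing `paste(v[names], collapse=",")` with `paste(quote_csv(v[names]), collapse=",")`.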

If you want to construct the data frame dynamically, you can do so with data.table. It is a little slower than the CSV method above, but faster than filling a data.frame.

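library(data.table)

# Preallocate an all-NA character table: one row per record, one column per field
m = data.table(matrix(NA_character_, ncol=length(names), nrow=size))
setnames(m, names)
# Fill each row by reference; fields missing from a row stay NA
for(i in seq_len(size)){
    v = getChildrenStrings(d[[i]])
    m[i, (names(v)) := as.list(v)]
}
# Convert each column from character to its natural type
for (n in names) m[, (n) := type.convert(m[[n]], as.is=TRUE)]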

It finishes in about 1.1 seconds for the same document.
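
A variant that is not part of the original answer, and whose speed has not been measured here, is to collect one named list per row and let `rbindlist(fill = TRUE)` from data.table align the columns by name. That removes the need to know the field names beforehand. The `<rows>` root element in the sample is an assumption, as above.

```r
library(XML)
library(data.table)

# Sample modelled on the question; the <rows> root is an assumption
xml_text <- paste0(
    "<rows>",
    "<row><ID>001</ID><age>50</age><field3>blah</field3><field4 /></row>",
    "<row><ID>001</ID><age>50</age><field4 /></row>",
    "</rows>")

doc <- xmlParse(xml_text, asText = TRUE)
d <- xmlRoot(doc)

# One named list per row
rows <- lapply(seq_len(xmlSize(d)),
               function(i) as.list(getChildrenStrings(d[[i]])))

# fill = TRUE matches columns by name and pads missing fields with NA
m <- rbindlist(rows, fill = TRUE)

# Convert columns from character to their natural types, as in the answer
for (n in names(m)) {
    set(m, j = n, value = type.convert(m[[n]], as.is = TRUE))
}
```

Note that `type.convert` turns an ID such as `001` into the integer `1`; skip the conversion for columns where leading zeros matter.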
