Improving performance of appending rows to a data.table


Problem description

I'm parsing a bunch of XML files containing similar table-like data and want to combine them into a single data.table for later calculations. I use the XML package for parsing. There are ~10,000 XML files to parse, each with 15-150 rows (I don't know the exact number in advance). My current approach is:

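library(data.table)
library(XML)

sol <- data.table()
for (i in seq_along(xml_list)) {
  i.xml <- xmlParse(xml_list[[i]])  # parse one file
  i.component <- as.data.table(xmlToDataFrame(..))  # arguments elided in the original
  # Each pass copies everything accumulated so far into a new table
  sol <- rbindlist(list(i.component, sol), use.names = TRUE, fill = TRUE)
}
sol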

This process takes about an hour on my data. Could somebody point me towards a way to substantially improve performance of this parsing?

Possible approaches I'm considering: somehow pre-allocate memory for the larger data.table and append rows instead of re-copying the whole thing on each step? Or maybe there's a faster XML parser I could use? Or possibly parse the XML files in the list in parallel rather than sequentially (since they are all alike)?

Solution

You are repeatedly rbinding your growing data.table with each new small addition (10,000+ calls to rbindlist!), and each call copies everything accumulated so far. It is better to create a long list of data.tables and then call rbindlist once:

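# Parse each file into its own data.table; xmlToDataFrame is the conversion step used in the question
ll <- lapply(xml_list, function(x) as.data.table(xmlToDataFrame(xmlParse(x))))
# Bind once; use.names and fill carry over from the question's call
dt <- rbindlist(ll, use.names = TRUE, fill = TRUE)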
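To see the collect-then-bind pattern in isolation, here is a minimal sketch on synthetic data. The `parts` list is a hypothetical stand-in for the per-file tables (it is not from the question), and `idcol` is an optional extra that records which list element each row came from.

```r
library(data.table)

# Hypothetical stand-in for the parsed files: three small tables with
# different row counts and partially different columns.
parts <- list(
  data.table(id = 1:2, value = c("a", "b")),
  data.table(id = 3L, value = "c", extra = 1.5),
  data.table(id = 4:6, value = c("d", "e", "f"))
)

# Bind once. use.names matches columns by name, fill pads columns that
# are missing from some tables with NA, and idcol adds a column holding
# the index of the list element each row came from.
dt <- rbindlist(parts, use.names = TRUE, fill = TRUE, idcol = "file")

print(dt)
```

Because all pieces are combined in a single call, the accumulated rows are copied once rather than once per file.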

With this approach, I'd expect most of your processing time to be spent reading and parsing the XML files.
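If parsing does turn out to dominate, the question's idea of parsing files in parallel may be worth trying. The following is only a sketch under assumptions: the inline XML strings, the `parse_one` helper, and the choice of two cores are all hypothetical, and `mclapply` from the base `parallel` package forks processes, which is not available on Windows (hence the fallback to one core there). Each worker parses and converts its own file, so only plain data.tables are returned to the parent process.

```r
library(data.table)
library(XML)
library(parallel)

# Hypothetical inputs: XML supplied as strings so the sketch is
# self-contained. Real code would pass file paths instead.
xml_texts <- c(
  "<rows><row><id>1</id><value>a</value></row><row><id>2</id><value>b</value></row></rows>",
  "<rows><row><id>3</id><value>c</value></row></rows>"
)

# Hypothetical helper: parse one document and convert it to a data.table.
# Parsing and conversion both happen inside the worker.
parse_one <- function(txt) {
  doc <- xmlParse(txt, asText = TRUE)
  as.data.table(xmlToDataFrame(doc, stringsAsFactors = FALSE))
}

# Forking is unavailable on Windows, so use a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

parts <- mclapply(xml_texts, parse_one, mc.cores = n_cores)

# Still bind only once at the end.
dt <- rbindlist(parts, use.names = TRUE, fill = TRUE)
print(dt)
```

Whether this helps in practice depends on how much of the hour is spent in parsing versus disk reads, so it is worth timing on a subset of files first.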
