Improving performance of appending rows to a data.table
Problem description
I'm parsing a bunch of XML files with similar table-like data and want to join them into a single data.table to do my calculations afterwards. I use the XML package for parsing. There are ~10,000 XML files to parse, and each has 15-150 rows (I don't know the exact number in advance). My current approach is:

    sol <- data.table()
    for (i in seq_along(xml_list)) {
      i.xml <- xmlParse(xml_list[[i]])
      i.component <- as.data.table(xmlToDataFrame(..))
      sol <- rbindlist(list(i.component, sol), use.names = TRUE, fill = TRUE)
    }
    sol

This process takes about an hour on my data. Could somebody point me towards a way to substantially improve performance of this parsing?
The options I'm considering are: pre-allocating memory for the larger data.table and appending rows instead of re-copying the whole thing at each step; using a faster XML parser, if there is one; or parsing the XML files in the list simultaneously rather than sequentially (since they are all alike).
Solution
You are repeatedly rbinding your growing data.table with each new small addition (10,000+ calls to rbindlist!), so every call copies all the rows accumulated so far. It is better to create a long list of data.tables and then call rbindlist once:

    ll <- lapply(xml_list, function(x) as.data.table(xmlParse(x)))
    dt <- rbindlist(ll)

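A slightly fuller sketch of the same pattern, under assumptions that are not in the original post: each file is flat enough for xmlToDataFrame() to handle with its default arguments, and the two inline XML strings are made-up stand-ins for real files. The helper name parse_one is hypothetical. use.names and fill are carried over from the question so that files with differing columns still bind.

```r
library(XML)
library(data.table)

# Made-up stand-ins for the real files; the columns differ on purpose.
xml_list <- list(
  "<rows><row><a>1</a><b>x</b></row><row><a>2</a><b>y</b></row></rows>",
  "<rows><row><a>3</a><c>z</c></row></rows>"
)

# Hypothetical helper: parse one document and convert it to a data.table.
parse_one <- function(x) {
  doc <- xmlParse(x, asText = TRUE)  # drop asText = TRUE when x is a file path
  as.data.table(xmlToDataFrame(doc, stringsAsFactors = FALSE))
}

# Build the full list first, then bind a single time.
ll <- lapply(xml_list, parse_one)
dt <- rbindlist(ll, use.names = TRUE, fill = TRUE)
print(dt)
```

Columns missing from a given file are filled with NA, which is what fill = TRUE is for. If the real files need the arguments elided in the question's xmlToDataFrame(..) call, they would go inside parse_one.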
I imagine that, done this way, the majority of your processing time will be spent reading and parsing the XML files.
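One way to check where the time goes is to time the steps separately with system.time(). The sketch below is only an illustration: it uses synthetic tables instead of parsed XML (the piece count, column names, and values are all made up) and compares the two binding strategies. Wrapping the lapply() parsing step in system.time() in the same way would show how much of the hour is left in parsing.

```r
library(data.table)

# Synthetic stand-ins for the parsed files: 2,000 small tables of 15-150 rows.
set.seed(1)
pieces <- lapply(seq_len(2000), function(i) {
  n <- sample(15:150, 1)
  data.table(file = i, value = rnorm(n))
})

# Growing the table inside the loop re-copies all accumulated rows each time.
t_grow <- system.time({
  sol <- data.table()
  for (p in pieces) {
    sol <- rbindlist(list(sol, p), use.names = TRUE, fill = TRUE)
  }
})

# Binding once copies every row a single time.
t_once <- system.time(
  dt <- rbindlist(pieces, use.names = TRUE, fill = TRUE)
)

# Elapsed seconds for each strategy; the numbers depend on the machine.
print(c(grow = t_grow[["elapsed"]], once = t_once[["elapsed"]]))
```

Both strategies produce the same rows; only the amount of copying differs, and the gap widens as the number of pieces grows.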