Why is rbindlist "better" than rbind?
Question
I am going through the documentation of data.table and have also noticed from some of the conversations here on SO that rbindlist is supposed to be better than rbind.
I would like to know why rbindlist is better than rbind, and in which scenarios rbindlist really excels over rbind.
Is there any advantage in terms of memory utilization?
Recommended answer
rbindlist is an optimized version of do.call(rbind, list(...)), which is known for being slow when using rbind.data.frame.
Some questions that show where rbindlist shines are:
Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply
These have benchmarks that show how fast it can be.
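A rough timing comparison can be sketched as follows. The test data (2,000 two-column data.frames) and all variable names are made up for illustration, and the timings will vary by machine and package version, so treat this as a way to try the comparison yourself rather than as a reference benchmark.

```r
library(data.table)

# Hypothetical test data: 2,000 one-row data.frames
dfs <- lapply(1:2000, function(i) data.frame(id = i, value = i * 2))

# Base R: dispatches to rbind.data.frame
t_base <- system.time(base_res <- do.call(rbind, dfs))

# data.table: a single call on the whole list
t_dt <- system.time(dt_res <- rbindlist(dfs))

print(t_base)
print(t_dt)
```

Both calls should produce the same rows; only the elapsed time and the class of the result (data.frame versus data.table) are expected to differ.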
rbind.data.frame does lots of checking and matches columns by name (i.e. it accounts for the fact that columns may be in different orders and matches them up by name). rbindlist doesn't do this kind of checking and joins by position.
For example:
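library(data.table)
# rbind.data.frame matches columns by name
do.call(rbind, list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
##   a b
## 1 1 2
## 2 2 3
## 3 2 1
## 4 3 2
# rbindlist binds by position, so the second item's b values end up under a
rbindlist(list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3)))
##    a b
## 1: 1 2
## 2: 2 3
## 3: 1 2
## 4: 2 3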
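If matching by name is what you want, more recent data.table releases give rbindlist a use.names argument. The sketch below assumes such a release is installed (the argument is not part of the behaviour described in this answer) and contrasts the two modes on the same input; the variable names are illustrative.

```r
library(data.table)

l <- list(data.frame(a = 1:2, b = 2:3), data.frame(b = 1:2, a = 2:3))

# Bind strictly by position, ignoring the column names
by_position <- rbindlist(l, use.names = FALSE)

# Bind by column name, comparable to what rbind.data.frame does
by_name <- rbindlist(l, use.names = TRUE)

print(by_position)
print(by_name)
```

Spelling out use.names explicitly also makes the intent clear to readers when the inputs are known to have differently ordered columns.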
Some other limitations of rbindlist
It used to struggle to deal with factors, due to a bug that has since been fixed:
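rbindlist two data.tables where one has factor and other has character type for a column (Bug #2650)
It has problems with duplicate column names; see
Warning message: in rbindlist(allargs) : NAs introduced by coercion: possible bug in data.table? (Bug #2384)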
rbindlist can handle lists, data.frames and data.tables, and will return a data.table without rownames.
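As a small illustration of the mixed inputs, the sketch below binds a plain list, a data.frame with custom row names, and a data.table in one call. The column names and values are invented for the example.

```r
library(data.table)

mixed <- list(
  list(a = 1:2, b = c("x", "y")),                      # plain list
  data.frame(a = 3:4, b = c("z", "w"),
             row.names = c("r1", "r2"),
             stringsAsFactors = FALSE),                # data.frame with row names
  data.table(a = 5:6, b = c("u", "v"))                 # data.table
)

res <- rbindlist(mixed)

class(res)      # includes "data.table"
rownames(res)   # the custom row names "r1" and "r2" are not carried over
print(res)
```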
You can get into a muddle with rownames using do.call(rbind, list(...)); see
How to avoid renaming of rows when using rbind inside do.call?
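The row name issue is easy to reproduce with a named list. In the sketch below the list names (first, second) are hypothetical; idcol is an rbindlist argument from later data.table releases, shown here as one possible way to keep the origin of each row in a column instead of in the row names.

```r
library(data.table)

# A named list of data.frames (names chosen for illustration)
parts <- list(first = data.frame(x = 1:2), second = data.frame(x = 3:4))

# Base R builds row names from the list names, e.g. "first.1"
base_res <- do.call(rbind, parts)
print(rownames(base_res))

# rbindlist drops row names; idcol stores the list names in a column instead
dt_res <- rbindlist(parts, idcol = "source")
print(dt_res)
```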
In terms of memory, rbindlist is implemented in C, so it is memory efficient; it uses setattr to set attributes by reference.
rbind.data.frame is implemented in R; it does lots of assigning and uses attr<- (and class<- and rownames<-), all of which will (internally) create copies of the created data.frame.
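To see what "by reference" means in practice, the following sketch uses data.table's address() helper to check that setattr leaves the object at the same memory location. The attribute name "note" is arbitrary. Whether base attr<- copies in a given situation depends on the R version and on how many references to the object exist, so the sketch only demonstrates the setattr side.

```r
library(data.table)

x <- c(1, 2, 3)
before <- address(x)

# setattr modifies x in place and returns it invisibly
setattr(x, "note", "set by reference")
after <- address(x)

print(identical(before, after))  # same address means no copy was made
print(attr(x, "note"))
```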