data.table vs dplyr: can one do something well the other can't or does poorly?
Overview
I'm relatively familiar with data.table
, not so much with dplyr
. I've read through some dplyr
vignettes and examples that have popped up on SO, and so far my conclusions are that:
1. data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
2. dplyr has more accessible syntax
3. dplyr abstracts (or will) potential DB interactions
4. There are some minor functionality differences (see "Examples/Usage" below)
In my mind 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).
Question
What I want to know is:
- Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).
- Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another.
One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr
would offer much beyond what I can already do in data.table
. Here is the dplyr
solution (data at end of Q):
dat %.%
group_by(name, job) %.%
filter(job != "Boss" | year == min(year)) %.%
mutate(cumu_job2 = cumsum(job2))
Which was much better than my hack attempt at a data.table
solution. That said, good data.table
solutions are also pretty good (thanks Jean-Robert, Arun, and note here I favored single statement over the strictly most optimal solution):
setDT(dat)[,
.SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)],
by=list(id, job)
]
The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table
(i.e. doesn't use some of the more esoteric tricks).
Ideally what I'd like to see is some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.
Examples
Usage

- dplyr does not allow grouped operations that return an arbitrary number of rows (from eddi's question; note: this looks like it will be implemented in dplyr 0.5; also, @beginneR shows a potential work-around using do in the answer to @eddi's question).
- data.table supports rolling joins (thanks @dholstius) as well as overlap joins.
- data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing, which uses binary search while using the same base R syntax. See here for some more details and a tiny benchmark.
- dplyr offers standard evaluation versions of functions (e.g. regroup, summarize_each_) that can simplify the programmatic use of dplyr (note: programmatic use of data.table is definitely possible, it just requires some careful thought, substitution/quoting, etc., at least to my knowledge).
Benchmarks

- I ran my own benchmarks and found both packages to be comparable in "split apply combine" style analysis, except when there are very large numbers of groups (>100K), at which point data.table becomes substantially faster.
- @Arun ran some benchmarks on joins, showing that data.table scales better than dplyr as the number of groups increases (updated with recent enhancements in both packages and a recent version of R). Also, a benchmark when trying to get unique values has data.table ~6x faster.
- (Unverified) has data.table 75% faster on larger versions of a group/apply/sort, while dplyr was 40% faster on the smaller ones (another SO question from comments, thanks danas).
- Matt, the main author of data.table, has benchmarked grouping operations on data.table, dplyr and python pandas on up to 2 billion rows (~100GB in RAM).
- An older benchmark on 80K groups has data.table ~8x faster.
Data
This is for the first example I showed in the question section.
dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane",
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob",
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L,
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L,
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager",
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager",
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L,
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id",
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA,
-16L))
We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.
My intent is to cover each one of these as clearly as possible from data.table perspective.
Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface whose internals are in C++ using Rcpp.
The data.table syntax is consistent in its form - DT[i, j, by]. Keeping i, j and by together is by design. By keeping related operations together, it allows us to easily optimise operations for speed and, more importantly, memory usage, and also to provide some powerful features, all while maintaining the consistency in syntax.
1. Speed
Quite a few benchmarks (though mostly on grouping operations) have been added to the question already, showing data.table gets faster than dplyr as the number of groups and/or rows to group by increases, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 to 10 million groups and varying grouping columns, which also compares pandas.
On benchmarks, it would be great to cover these remaining aspects as well:
Grouping operations involving a subset of rows - i.e.,
DT[x > val, sum(y), by=z]
type operations.Benchmark other operations such as update and joins.
Also benchmark memory footprint for each operation in addition to runtime.
2. Memory usage
Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.

Note that Hadley's comment talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory.
The data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).

# sub-assign by reference, updates 'y' in-place
DT[x >= 1L, y := NA]
But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):
# copies the entire 'y' column
ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
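One way to see the difference is to watch whether the object survives the operation in place. A minimal sketch using data.table's address() helper to print where an object lives; the data is made up for illustration, and exact behaviour may vary across versions:

```r
library(data.table)
library(dplyr)

DT <- data.table(x = 0:2, y = c(1L, 2L, 3L))
before <- address(DT)
DT[x >= 1L, y := NA]            # sub-assign by reference
identical(before, address(DT))  # TRUE: same object, modified in place

DF <- data.frame(x = 0:2, y = c(1L, 2L, 3L))
before <- address(DF)
DF <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
identical(before, address(DF))  # FALSE: mutate() built a new copy
```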
A concern for this is referential transparency. Updating a data.table object by reference, especially within a function may not be always desirable. But this is an incredibly useful feature: see this and this posts for interesting cases. And we want to keep it.
Therefore we are working towards exporting a shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable to not modify the input data.table within a function, one can then do:

foo <- function(DT) {
    DT = shallow(DT)          ## shallow copy DT
    DT[, newcol := 1L]        ## does not affect the original DT
    DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in the shallow copied DT
    DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise the original DT will
                              ## also get modified.
}
By not using shallow(), the old functionality is retained:

bar <- function(DT) {
    DT[, newcol := 1L]   ## old behaviour, original DT gets updated by reference
    DT[x > 2L, x := 3L]  ## old behaviour, update column x in the original DT.
}
By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to ensure that columns you modify are copied only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilities.

Also, once shallow() is exported, dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables. But it will still lack many features that data.table provides, including (sub)-assignment by reference.
Aggregate while joining:
Suppose you have two data.tables as follows:
DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
#    x y z
# 1: 1 a 1
# 2: 1 a 2
# 3: 1 b 3
# 4: 1 b 4
# 5: 2 a 5
# 6: 2 a 6
# 7: 2 b 7
# 8: 2 b 8
DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
#    x y mul
# 1: 1 a   4
# 2: 2 b   3
And you would like to get
sum(z) * mul
for each row inDT2
while joining by columnsx,y
. We can either:1) aggregate
DT1
to getsum(z)
, 2) perform a join and 3) multiply (or)# data.table way DT1[, .(z=sum(z)), keyby=.(x,y)][DT2][, z := z*mul][] # dplyr equivalent DF1 %>% group_by(x,y) %>% summarise(z=sum(z)) %>% right_join(DF2) %>% mutate(z=z*mul)
2) do it all in one go (using
by=.EACHI
feature):

DT1[DT2, list(z=sum(z) * mul), by=.EACHI]
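With the two small tables above, here is roughly what (2) computes; the values are worked out by hand from the sample data:

```r
# j is evaluated once per row of DT2, on the matching rows of DT1
DT1[DT2, list(z = sum(z) * mul), by = .EACHI]
#    x y  z
# 1: 1 a 12    # (1+2) * 4
# 2: 2 b 45    # (7+8) * 3
```

No intermediate aggregated table is ever materialised along the way.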
What is the advantage?
- We don't have to allocate memory for the intermediate result.
- We don't have to group/hash twice (once for aggregation and again for joining).
- And more importantly, the operation we wanted to perform is clear by looking at j in (2).
Check this post for a detailed explanation of
by=.EACHI
. No intermediate results are materialised, and the join+aggregate is performed all in one go. Have a look at this, this and this post for real usage scenarios.
In
dplyr
you would have to join and aggregate or aggregate first and then join, neither of which is as efficient in terms of memory (which in turn translates to speed).

Update and joins:
Consider the data.table code shown below:
DT1[DT2, col := i.mul]
adds/updates DT1's column col with mul from DT2 on those rows where DT2's key columns match DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., without resorting to a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary.

Check this post for a real usage scenario.
To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!
3. Syntax
Let's now look at syntax. Hadley commented here:
Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...
I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.
We will work with the dummy data shown below:
DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)
Basic aggregation/update operations.
# case (a)
DT[, sum(y), by=z]                                   ## data.table syntax
DF %>% group_by(z) %>% summarise(sum(y))             ## dplyr syntax
DT[, y := cumsum(y), by=z]
ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))

# case (b)
DT[x > 2, sum(y), by=z]
DF %>% filter(x > 2) %>% group_by(z) %>% summarise(sum(y))
DT[x > 2, y := cumsum(y), by=z]
ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y)))

# case (c)
DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by=z]
DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L]-y[2L] else y[2L])
DT[, if(any(x > 5L)) y[1L]-y[2L], by=z]
DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L]-y[2L])
data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a).
In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table however, we express both operations with the same logic - operate on rows where x > 2, but in the first case, get sum(y), whereas in the second case update those rows for y with its cumulative sum.

This is what we mean when we say the
DT[i, j, by]
form is consistent.

Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied and skip otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise, because summarise() always expects a single value.

While it returns the same result, using filter() here makes the actual operation less obvious.

It might very well be possible to use filter() in the first case as well (it does not seem obvious to me), but my point is that we should not have to.
Aggregation / update on multiple columns
# case (a)
DT[, lapply(.SD, sum), by=z]                         ## data.table syntax
DF %>% group_by(z) %>% summarise_each(funs(sum))     ## dplyr syntax
DT[, (cols) := lapply(.SD, sum), by=z]
ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))

# case (b)
DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by=z]
DF %>% group_by(z) %>% summarise_each(funs(sum, mean))

# case (c)
DT[, c(.N, lapply(.SD, sum)), by=z]
DF %>% group_by(z) %>% summarise_each(funs(n(), mean))
In case (a), the codes are more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to funs().

data.table's := requires column names to be provided, whereas dplyr generates them automatically.

In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.

In case (c) though, dplyr would return n() as many times as there are columns, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So, we can use, once again, the familiar base function c() to concatenate .N to a list, which returns a list.
Note: Once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in the result. You can use c(), as.list(), lapply(), list() etc. base functions to accomplish this, without having to learn any new functions.

You will need to learn just the special variables - .N and .SD - at least. The equivalents in dplyr are n() and .
Joins
dplyr provides separate functions for each type of join, whereas data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.

setkey(DT1, x, y)

# 1. normal join
DT1[DT2]               ## data.table syntax
left_join(DT2, DT1)    ## dplyr syntax

# 2. select columns while joining
DT1[DT2, .(z, i.mul)]
left_join(select(DT2, x, y, mul), select(DT1, x, y, z))

# 3. aggregate while joining
DT1[DT2, .(sum(z)*i.mul), by=.EACHI]
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
    inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)

# 4. update while joining
DT1[DT2, z := cumsum(z)*i.mul, by=.EACHI]
??

# 5. rolling join
DT1[DT2, roll = -Inf]
??

# 6. other arguments to control output
DT1[DT2, mult = "first"]
??
Some might find a separate function for each join much nicer (left, right, inner, anti, semi etc.), whereas others might like data.table's DT[i, j, by], or merge() which is similar to base R.

However dplyr joins do just that. Nothing more. Nothing less.

data.tables can select columns while joining (2), and in dplyr you will need to select() first on both data.frames before joining, as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later, and that is inefficient.

data.tables can aggregate while joining (3) and also update while joining (4), using the by=.EACHI feature. Why materialise the entire join result to add/update just a few columns?

data.table is capable of rolling joins (5) - roll forward, LOCF, roll backward, NOCB, nearest.

data.table also has the mult= argument which selects first, last or all matches (6).

data.table has the allow.cartesian=TRUE argument to protect from accidental invalid joins.
Once again, the syntax is consistent with
DT[i, j, by]
with additional arguments allowing for controlling the output further.
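As a small illustration of the rolling join in (5), which dplyr has no direct counterpart for - the data here is invented for the example:

```r
library(data.table)

# observations at irregular times; roll = TRUE carries the last one forward (LOCF)
obs <- data.table(t = c(1, 5, 10), v = c("a", "b", "c"), key = "t")
qry <- data.table(t = c(4, 11))
obs[qry, roll = TRUE]
#     t v
# 1:  4 a    # no observation at t=4, so the value at t=1 is rolled forward
# 2: 11 c
```

roll = -Inf rolls backwards (NOCB) and roll = "nearest" picks the closest observation.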
do() ...

dplyr's summarise() is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand what all your functions return.

DT[, list(x[1], y[1]), by=z]                   ## data.table syntax
DF %>% group_by(z) %>% summarise(x[1], y[1])   ## dplyr syntax

DT[, list(x[1:2], y[1]), by=z]
DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))

DT[, quantile(x, 0.25), by=z]
DF %>% group_by(z) %>% summarise(quantile(x, 0.25))

DT[, quantile(x, c(0.25, 0.75)), by=z]
DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))

DT[, as.list(summary(x)), by=z]
DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
.SD's equivalent is .

In data.table, you can throw pretty much anything in j - the only thing to remember is that it should return a list, so that each element of the list gets converted to a column.

In dplyr, you cannot do that. You have to resort to do() depending on how sure you are as to whether your function would always return a single value. And it is quite slow.
Once again, data.table's syntax is consistent with
DT[i, j, by]
. We can just keep throwing expressions in j without having to worry about these things.
Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforwardly using dplyr's syntax...
To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about the most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on it.
data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here.
But one should also consider the number of features that dplyr lacks in comparison to data.table.
4. Features
I have pointed out most of the features here and also in this post. In addition:
- fread - fast file reader, has been available for a long time now.
- fwrite - NEW in the current devel, v1.9.7: a parallelised fast file writer is now available. See this post for a detailed explanation of the implementation and #1664 for keeping track of further developments.
- Automatic indexing - another handy feature to optimise base R syntax as-is, internally.
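As a sketch of what automatic indexing buys you (table size and values here are arbitrary):

```r
library(data.table)

DT <- data.table(x = sample(1:1000, 1e6, replace = TRUE), y = runif(1e6))

# The first call builds an index on 'x' behind the scenes (a one-off cost);
# subsequent calls of this form use binary search instead of a full vector
# scan, with no change to the base-R-style syntax.
DT[x == 989L]
DT[x %in% c(10L, 20L)]
```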
- Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not always be desirable.
- Numerous advantages in data.table joins (for speed / memory efficiency and syntax), mentioned above.
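To make the ad-hoc grouping point above concrete - a toy sketch where group order is the observable difference:

```r
library(data.table)
library(dplyr)

DT <- data.table(z = c(2, 2, 1, 1), y = 1:4)
DF <- as.data.frame(DT)

DT[, sum(y), by = z]     # groups appear in order of first appearance: z=2, then z=1
DT[, sum(y), keyby = z]  # keyby sorts by the grouping column, if that is what you want
DF %>% group_by(z) %>% summarise(sum(y))   # result comes back sorted by z
```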
- Non-equi joins: a NEW feature available from v1.9.7+. It allows joins using other operators <=, <, >, >=, along with all the other advantages of data.table joins.
- Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.
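A sketch of a non-equi join using the v1.9.7+ on= syntax (toy data; the column names are invented for the example):

```r
library(data.table)

trades   <- data.table(id = 1:3, t = c(3, 7, 12))
sessions <- data.table(lo = c(1, 10), hi = c(5, 15), name = c("am", "pm"))

# for each trade, find the session whose [lo, hi] window contains its time t
sessions[trades, on = .(lo <= t, hi >= t)]
```

Each trade time is matched with <=/>= conditions rather than equality; a trade falling in no window gets NA, just as with an unmatched equi join.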
- setorder() function in data.table allows really fast reordering of data.tables by reference.
- dplyr provides an interface to databases using the same syntax, which data.table does not at the moment.
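For instance, a brief setorder() sketch (data invented for illustration):

```r
library(data.table)

DT <- data.table(x = c(2, 1, 3), y = c("b", "a", "c"))
setorder(DT, -x)   # reorder rows by descending x, by reference - no copy of DT is made
setorder(DT, y)    # and again, ascending by y
```

Compare DF <- DF %>% arrange(x), which allocates a reordered copy and re-assigns.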
- data.table provides faster equivalents of set operations from v1.9.7+ (written by Jan Gorecki) - fsetdiff(), fintersect(), funion() and fsetequal(), with an additional all argument (as in SQL).
- data.table loads cleanly with no masking warnings, and has a mechanism described here for
[.data.frame compatibility when passed to any R package. dplyr changes the base functions filter, lag and [, which can cause problems; e.g. here and here.
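The set operations mentioned above can be sketched as follows (v1.9.7+; toy data):

```r
library(data.table)

x <- data.table(a = c(1, 2, 2, 3))
y <- data.table(a = c(2, 3, 4))

fsetdiff(x, y)             # rows of x not present in y
fintersect(x, y)           # rows common to both
funion(x, y)               # duplicates removed, like SQL UNION
funion(x, y, all = TRUE)   # duplicates kept, like SQL UNION ALL
fsetequal(x, y)            # FALSE here - are they equal as sets of rows?
```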
Finally:
On databases - there is no reason why data.table cannot provide a similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature... not sure.
On parallelism - everything is difficult until someone goes ahead and does it. Of course it will take effort (being thread safe). Progress is currently being made (in v1.9.7 devel) towards parallelising known time-consuming parts for incremental performance gains using OpenMP.