data.table vs dplyr: can one do something well the other can't or does poorly?
Overview
I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:
- data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
- dplyr has more accessible syntax
- dplyr abstracts (or will) potential DB interactions
- There are some minor functionality differences (see "Examples/Usage" below)
In my mind 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).

Question
What I want to know is:
- Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).
- Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another.
One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):

dat %.%
  group_by(name, job) %.%
  filter(job != "Boss" | year == min(year)) %.%
  mutate(cumu_job2 = cumsum(job2))
Which was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert, Arun, and note here I favored single statement over the strictly most optimal solution):

setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)],
  by = list(id, job)
]
The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (i.e. doesn't use some of the more esoteric tricks).

Ideally what I'd like to see is some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.

Examples

Usage
- dplyr does not allow grouped operations that return arbitrary number of rows (from eddi's question; note: this looks like it will be implemented in dplyr 0.5; also, @beginneR shows a potential work-around using do in the answer to @eddi's question).
- data.table supports rolling joins (thanks @dholstius) as well as overlap joins.
- data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing, which uses binary search while using the same base R syntax. See here for some more details and a tiny benchmark.
- dplyr offers standard evaluation versions of functions (e.g. regroup, summarize_each_) that can simplify the programmatic use of dplyr (note programmatic use of data.table is definitely possible, just requires some careful thought, substitution/quoting, etc., at least to my knowledge).

Benchmarks
- I ran my own benchmarks and found both packages to be comparable in "split apply combine" style analysis, except when there are very large numbers of groups (>100K), at which point data.table becomes substantially faster.
- @Arun ran some benchmarks on joins, showing that data.table scales better than dplyr as the number of groups increases (updated with recent enhancements in both packages and a recent version of R). Also, a benchmark when trying to get unique values has data.table ~6x faster.
- (Unverified) has data.table 75% faster on larger versions of a group/apply/sort while dplyr was 40% faster on the smaller ones (another SO question from comments, thanks danas).
- Matt, the main author of data.table, has benchmarked grouping operations on data.table, dplyr and python pandas on up to 2 billion rows (~100GB in RAM).
- An older benchmark on 80K groups has data.table ~8x faster.

Data
This is for the first example I showed in the question section.
dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager", "Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", "Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id", "name", "year", "job", "job2"), class = "data.frame", row.names = c(NA, -16L))
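A quick runnable sanity check of the data.table solution above (a sketch; assumes the data.table package is installed): the filter keeps every "Manager" row but only the first "Boss" year per (id, job) group, leaving 11 of the 16 rows.

```r
library(data.table)

# same data as above, reflowed for readability
dat <- structure(list(
  id   = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  name = rep(c("Jane", "Bob"), each = 8L),
  year = c(1980:1987, 1985:1992),
  job  = rep(c("Manager", "Boss", "Manager", "Boss"), c(6L, 2L, 3L, 5L)),
  job2 = rep(c(1L, 0L, 1L, 0L), c(6L, 2L, 3L, 5L))),
  class = "data.frame", row.names = c(NA, -16L))

res <- setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)],
  by = list(id, job)
]

# Jane: 6 Manager rows + first Boss year; Bob: 3 Manager rows + first Boss year
nrow(res)  # 11
```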
Solution

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.

My intent is to cover each one of these as clearly as possible from a data.table perspective.
Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface whose internals are in C++ using Rcpp.
The data.table syntax is consistent in its form - DT[i, j, by]. Keeping i, j and by together is by design. By keeping related operations together, it allows us to easily optimise operations for speed and, more importantly, memory usage, and also to provide some powerful features, all while maintaining the consistency in syntax.

1. Speed
Quite a few benchmarks (though mostly on grouping operations) have been added to the question already, showing data.table gets faster than dplyr as the number of groups and/or rows to group by increases, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 to 10 million groups and varying grouping columns, which also compare pandas.

On benchmarks, it would be great to cover these remaining aspects as well:
- Grouping operations involving a subset of rows - i.e., DT[x > val, sum(y), by=z] type operations.
- Benchmark other operations such as update and joins.
- Also benchmark memory footprint for each operation in addition to runtime.
2. Memory usage
Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.

Note that Hadley's comment talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory.
The data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).
# sub-assign by reference, updates 'y' in-place
DT[x >= 1L, y := NA]
But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):
# copies the entire 'y' column
ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
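To make the contrast concrete, here is a minimal runnable sketch (assumes both packages are installed; data.table's address() is used to show that := updates in place, while mutate() leaves the original untouched):

```r
library(data.table)
suppressMessages(library(dplyr))

DT <- data.table(x = 1:5, y = 1:5)
DF <- data.frame(x = 1:5, y = 1:5)

addr_before <- address(DT)     # address of the data.table before the update
DT[x >= 3L, y := NA_integer_]  # sub-assign by reference; DT is modified in place
address(DT) == addr_before     # TRUE - no copy was made

# dplyr: the whole 'y' column is copied and the result must be re-assigned
DF2 <- DF %>% mutate(y = replace(y, which(x >= 3L), NA_integer_))
identical(DF$y, 1:5)           # TRUE - the original DF is untouched
```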
A concern for this is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this and this post for interesting cases. And we want to keep it.
Therefore we are working towards exporting a shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable to not modify the input data.table within a function, one can then do:

foo <- function(DT) {
  DT = shallow(DT)          ## shallow copy DT
  DT[, newcol := 1L]        ## does not affect the original DT
  DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
  DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise the original DT will
                            ## also get modified.
}
By not using shallow(), the old functionality is retained:

bar <- function(DT) {
  DT[, newcol := 1L]   ## old behaviour, original DT gets updated by reference
  DT[x > 2L, x := 3L]  ## old behaviour, update column x in original DT.
}

By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to ensure that columns you modify are copied only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilities.

Also, once shallow() is exported, dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables.

But it will still lack many features that data.table provides, including (sub)-assignment by reference.
Aggregate while joining:

Suppose you have two data.tables as follows:

DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
#    x y z
# 1: 1 a 1
# 2: 1 a 2
# 3: 1 b 3
# 4: 1 b 4
# 5: 2 a 5
# 6: 2 a 6
# 7: 2 b 7
# 8: 2 b 8

DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
#    x y mul
# 1: 1 a   4
# 2: 2 b   3
And you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y. We can either:
1) aggregate DT1 to get sum(z), 2) perform a join and 3) multiply (or)

# data.table way
DT1[, .(z=sum(z)), keyby=.(x,y)][DT2][, z := z*mul][]

# dplyr equivalent
DF1 %>% group_by(x,y) %>% summarise(z=sum(z)) %>%
  right_join(DF2) %>% mutate(z=z*mul)
2) do it all in one go (using the by=.EACHI feature):

DT1[DT2, list(z=sum(z) * mul), by=.EACHI]
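Both routes produce the same sums; a small runnable check (a sketch, assuming data.table is installed; the dplyr route is omitted here for brevity):

```r
library(data.table)

DT1 <- data.table(x = c(1,1,1,1,2,2,2,2), y = c("a","a","b","b"), z = 1:8, key = c("x","y"))
DT2 <- data.table(x = 1:2, y = c("a","b"), mul = 4:3, key = c("x","y"))

# route 1: aggregate, then join, then multiply
res1 <- DT1[, .(z = sum(z)), keyby = .(x, y)][DT2][, z := z * mul][]

# route 2: aggregate while joining, via by=.EACHI (no intermediate table)
res2 <- DT1[DT2, .(z = sum(z) * mul), by = .EACHI]

res1$z  # 12 45
res2$z  # 12 45
```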
What is the advantage?

- We don't have to allocate memory for the intermediate result.
- We don't have to group/hash twice (once for aggregation and again for joining).
- And more importantly, the operation we wanted to perform is clear by looking at j in (2).

Check this post for a detailed explanation of by=.EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go.

Have a look at this, this and this post for real usage scenarios.
In dplyr you would have to join and then aggregate, or aggregate first and then join, neither of which is as efficient in terms of memory (which in turn translates to speed).

Update and joins:
Consider the data.table code shown below:
DT1[DT2, col := i.mul]
This adds/updates DT1's column col with mul from DT2 on those rows where DT2's key columns match DT1's. I don't think there is an exact equivalent of this operation in dplyr, i.e., without resorting to a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary.

Check this post for a real usage scenario.
To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!
3. Syntax
Let's now look at syntax. Hadley commented here:
Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...
I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.
We will work with the dummy data shown below:

DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)
Basic aggregation/update operations.
# case (a)
DT[, sum(y), by=z]                                    ## data.table syntax
DF %>% group_by(z) %>% summarise(sum(y))              ## dplyr syntax
DT[, y := cumsum(y), by=z]
ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))

# case (b)
DT[x > 2, sum(y), by=z]
DF %>% filter(x > 2) %>% group_by(z) %>% summarise(sum(y))
DT[x > 2, y := cumsum(y), by=z]
ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y)))

# case (c)
DT[, if(any(x > 5L)) y[1L] - y[2L] else y[2L], by=z]
DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
DT[, if(any(x > 5L)) y[1L] - y[2L], by=z]
DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
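As a concrete check, running case (c)'s first form on the dummy data (a sketch, assuming data.table is attached): group z == 1 has no x > 5L, so y[2L] is returned; group z == 2 does, so y[1L] - y[2L] is returned:

```r
library(data.table)

DT <- data.table(x = 1:10, y = 11:20, z = rep(1:2, each = 5))

# one row per group: z=1 gives y[2L] = 12; z=2 gives 16 - 17 = -1
res <- DT[, if (any(x > 5L)) y[1L] - y[2L] else y[2L], by = z]
res$V1  # 12 -1
```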
data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a).

In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table however, we express both operations with the same logic - operate on rows where x > 2, but in the first case, get sum(y), whereas in the second case update those rows of y with its cumulative sum.

This is what we mean when we say the DT[i, j, by] form is consistent.

Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied and skip the rest, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise, because summarise() always expects a single value.

While it returns the same result, using filter() here makes the actual operation less obvious.

It might very well be possible to use filter() in the first case as well (it does not seem obvious to me), but my point is that we should not have to.

Aggregation / update on multiple columns
# case (a)
DT[, lapply(.SD, sum), by=z]                           ## data.table syntax
DF %>% group_by(z) %>% summarise_each(funs(sum))       ## dplyr syntax
DT[, (cols) := lapply(.SD, sum), by=z]
ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))

# case (b)
DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by=z]
DF %>% group_by(z) %>% summarise_each(funs(sum, mean))

# case (c)
DT[, c(.N, lapply(.SD, sum)), by=z]
DF %>% group_by(z) %>% summarise_each(funs(n(), mean))
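Case (c) can be checked on the same dummy data (a sketch, assuming data.table is attached); .N and each per-column sum become columns of the result:

```r
library(data.table)

DT <- data.table(x = 1:10, y = 11:20, z = rep(1:2, each = 5))

# j returns a list: the group size .N plus the sum of every .SD column
res <- DT[, c(.N, lapply(.SD, sum)), by = z]
res$x  # 15 40
res$y  # 65 90
```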
In case (a), the code is more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to funs().

data.table's := requires column names to be provided, whereas dplyr generates them automatically.

In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.
In case (c) though, dplyr would return n() as many times as there are columns, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So, we can use, once again, the familiar base function c() to concatenate .N to a list, which returns a list.
j
. Each element of the list will become a column in result. You can usec()
,as.list()
,lapply()
,list()
etc... base functions to accomplish this, without having to learn any new functions.You will need to learn just the special variables -
.N
and.SD
at least. The equivalent in dplyr aren()
and.
Joins
dplyr provides separate functions for each type of join, whereas data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.

setkey(DT1, x, y)

# 1. normal join
DT1[DT2]             ## data.table syntax
left_join(DT2, DT1)  ## dplyr syntax

# 2. select columns while joining
DT1[DT2, .(z, i.mul)]
left_join(select(DT2, x, y, mul), select(DT1, x, y, z))

# 3. aggregate while joining
DT1[DT2, .(sum(z) * i.mul), by=.EACHI]
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
  inner_join(DF2) %>% mutate(z = z * mul) %>% select(-mul)

# 4. update while joining
DT1[DT2, z := cumsum(z) * i.mul, by=.EACHI]
??

# 5. rolling join
DT1[DT2, roll = -Inf]
??

# 6. other arguments to control output
DT1[DT2, mult = "first"]
??
Some might find a separate function for each join much nicer (left, right, inner, anti, semi, etc.), whereas others might like data.table's DT[i, j, by], or merge() which is similar to base R.

However dplyr joins do just that. Nothing more. Nothing less.
- data.tables can select columns while joining (2); in dplyr you will need to select() on both data.frames first before joining, as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later, and that is inefficient.
- data.tables can aggregate while joining (3) and also update while joining (4), using the by=.EACHI feature. Why materialise the entire join result just to add/update a few columns?
- data.table is capable of rolling joins (5) - roll forward, LOCF; roll backward, NOCB; nearest.
- data.table also has the mult= argument which selects "first", "last" or "all" matches (6).
- data.table has the allow.cartesian=TRUE argument to protect from accidental invalid joins.
Once again, the syntax is consistent with DT[i, j, by], with additional arguments allowing for further control of the output.
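A small runnable sketch of a rolling join (5) and mult= (6), with hypothetical prices/queries tables of my own (assumes data.table is installed):

```r
library(data.table)

prices  <- data.table(t = c(1L, 5L, 10L), p = c(100, 105, 110), key = "t")
queries <- data.table(t = c(3L, 7L))

# rolling join: t = 3 takes the last known price (t = 1), t = 7 takes t = 5's (LOCF)
res <- prices[queries, roll = TRUE]
res$p  # 100 105

# mult=: when a row of i matches several rows, keep only the first match
DT <- data.table(a = c(1L, 1L, 2L), b = 1:3, key = "a")
first_b <- DT[.(1L), mult = "first"]$b
first_b  # 1
```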
do()...

dplyr's summarise is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand what all your functions return.

DT[, list(x[1], y[1]), by=z]                    ## data.table syntax
DF %>% group_by(z) %>% summarise(x[1], y[1])    ## dplyr syntax

DT[, list(x[1:2], y[1]), by=z]
DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))

DT[, quantile(x, 0.25), by=z]
DF %>% group_by(z) %>% summarise(quantile(x, 0.25))

DT[, quantile(x, c(0.25, 0.75)), by=z]
DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))

DT[, as.list(summary(x)), by=z]
DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
.SD's equivalent is ".".
In data.table, you can throw pretty much anything in j - the only thing to remember is that it should return a list, so that each element of the list gets converted to a column.

In dplyr, you cannot do that. You have to resort to do(), depending on how sure you are as to whether your function would always return a single value. And it is quite slow.
Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.

Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforwardly using dplyr's syntax...
To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about the most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on it.
data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here.
But one should also consider the number of features that dplyr lacks in comparison to data.table.
4. Features
I have pointed out most of the features here and also in this post. In addition:
fread - fast file reader has been available for a long time now.
fwrite - NEW in the current devel, v1.9.7, a parallelised fast file writer is now available. See this post for a detailed explanation on the implementation and #1664 for keeping track of further developments.
Automatic indexing - another handy feature to optimise base R syntax as is, internally.
Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not always be desirable.

Numerous advantages in data.table joins (for speed / memory efficiency and syntax) mentioned above.
Non-equi joins: a NEW feature available from v1.9.7+. It allows joins using other operators <=, <, >, >= along with all the other advantages of data.table joins.

Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.
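A brief sketch of a non-equi join (assumes a data.table version where the feature is available, v1.9.7+ per the text; the events/windows tables are hypothetical):

```r
library(data.table)

events  <- data.table(ev_time = c(1L, 4L, 9L))
windows <- data.table(start = 3L, end = 10L)

# keep only the events whose time falls inside the window [start, end]
res <- events[windows, on = .(ev_time >= start, ev_time <= end), nomatch = 0L]
nrow(res)  # 2
```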
The setorder() function in data.table allows really fast reordering of data.tables by reference.

dplyr provides an interface to databases using the same syntax, which data.table does not at the moment.
data.table provides faster equivalents of set operations from v1.9.7+ (written by Jan Gorecki) - fsetdiff(), fintersect(), funion() and fsetequal() with an additional all argument (as in SQL).

data.table loads cleanly with no masking warnings and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes the base functions filter, lag and [, which can cause problems; e.g. here and here.
Finally:
On databases - there is no reason why data.table cannot provide similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature.. not sure.
On parallelism - Everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe).
- Progress is currently being made (in v1.9.7 devel) towards parallelising known time-consuming parts for incremental performance gains using OpenMP.