data.table vs dplyr: can one do something well the other can't or does poorly?


Overview

I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:

  1. data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
  2. dplyr has more accessible syntax
  3. dplyr abstracts (or will) potential DB interactions
  4. There are some minor functionality differences (see "Examples/Usage" below)

In my mind 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).

Question

What I want to know is:

  1. Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).
  2. Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another.

One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):

dat %.%
  group_by(name, job) %.%
  filter(job != "Boss" | year == min(year)) %.%
  mutate(cumu_job2 = cumsum(job2))

Which was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert, Arun; note that here I favored a single statement over the strictly most optimal solution):

setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], 
  by=list(id, job)
]

The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (i.e. doesn't use some of the more esoteric tricks).

Ideally what I'd like to see is some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.

Examples

Usage
  • dplyr does not allow grouped operations that return an arbitrary number of rows (from eddi's question; note: this looks like it will be implemented in dplyr 0.5. Also, @beginneR shows a potential work-around using do in the answer to @eddi's question).
  • data.table supports rolling joins (thanks @dholstius) as well as overlap joins
  • data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing, which uses binary search while keeping the same base R syntax. See here for some more details and a tiny benchmark, and the sketch after this list.
  • dplyr offers standard evaluation versions of functions (e.g. regroup, summarize_each_) that can simplify the programmatic use of dplyr (note that programmatic use of data.table is definitely possible, it just requires some careful thought, substitution/quoting, etc., at least to my knowledge).
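
A quick sketch of the auto-indexing behaviour mentioned above (my own toy example; the timings are illustrative only):

library(data.table)
set.seed(1L)
DT <- data.table(id = sample(1e5L, 1e7L, TRUE), val = runif(1e7))

# The first subset builds an index on 'id' as a side effect;
# later subsets on the same column then use binary search.
system.time(DT[id == 100L])  # slower: builds the index
system.time(DT[id == 100L])  # faster: reuses the index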
Benchmarks

  • I ran my own benchmarks and found the two packages comparable in split-apply-combine style analysis, except when there are very large numbers of groups (>100K), at which point data.table becomes substantially faster.
  • @Arun ran some benchmarks on joins, showing that data.table scales better than dplyr as the number of groups increases (updated with recent enhancements in both packages and a recent version of R). Also, a benchmark when trying to get unique values has data.table ~6x faster.
  • (Unverified) data.table is 75% faster on larger versions of a group/apply/sort, while dplyr was 40% faster on the smaller versions (another SO question from comments, thanks danas).
  • Matt, the main author of data.table, has benchmarked grouping operations on data.table, dplyr and python pandas on up to 2 billion rows (~100GB in RAM).
  • An older benchmark has data.table ~8x faster.

Data

This is for the first example I showed in the question section.

dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane", 
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", 
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L, 
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager", 
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", 
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L, 
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id", 
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA, 
-16L))

Solution

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.

My intent is to cover each one of these as clearly as possible from data.table perspective.

Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface whose internals are in C++ using Rcpp.


The data.table syntax is consistent in its form: DT[i, j, by]. Keeping i, j and by together is by design. By keeping related operations together, it allows operations to be easily optimised for speed and, more importantly, memory usage, and also provides some powerful features, all while maintaining consistency in the syntax.

1. Speed

Quite a few benchmarks (though mostly on grouping operations) have been added to the question already showing data.table gets faster than dplyr as the number of groups and/or rows to group by increase, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 - 10 million groups and varying grouping columns, which also compares pandas.

On benchmarks, it would be great to cover these remaining aspects as well:

  • Grouping operations involving a subset of rows - i.e., DT[x > val, sum(y), by=z] type operations (a sketch follows this list).

  • Benchmark other operations such as update and joins.

  • Also benchmark memory footprint for each operation in addition to runtime.
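
As referenced in the first bullet, a minimal sketch of how such a benchmark might be set up (my own illustrative code; assumes the microbenchmark package is installed):

library(data.table)
library(dplyr)
library(microbenchmark)

set.seed(42L)
N  <- 1e7L
DT <- data.table(x = runif(N), y = rnorm(N), z = sample(1e5L, N, TRUE))
DF <- as.data.frame(DT)

# grouped aggregation restricted to a subset of rows, both ways
microbenchmark(
    data.table = DT[x > 0.5, sum(y), by = z],
    dplyr      = DF %>% filter(x > 0.5) %>% group_by(z) %>% summarise(sum(y)),
    times = 10L
)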

2. Memory usage

  1. Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.

    Note that Hadley's comment talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory.

  2. The data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).

    # sub-assign by reference, updates 'y' in-place
    DT[x >= 1L, y := NA]
    

    But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):

    # copies the entire 'y' column
    ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
    

    A concern for this is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this and this post for interesting cases. And we want to keep it.

    Therefore we are working towards exporting a shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable to not modify the input data.table within a function, one can then do:

    foo <- function(DT) {
        DT = shallow(DT)          ## shallow copy DT
        DT[, newcol := 1L]        ## does not affect the original DT 
        DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
        DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise original DT will 
                                  ## also get modified.
    }
    

    By not using shallow(), the old functionality is retained:

    bar <- function(DT) {
        DT[, newcol := 1L]        ## old behaviour, original DT gets updated by reference
        DT[x > 2L, x := 3L]       ## old behaviour, update column x in original DT.
    }
    

    By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to ensure that columns you modify are copied only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether, while providing the user with both possibilities.

    Also, once shallow() is exported, dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables.

    But it will still lack many features that data.table provides, including (sub)-assignment by reference.

  3. Aggregate while joining:

    Suppose you have two data.tables as follows:

    DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
    #    x y z
    # 1: 1 a 1
    # 2: 1 a 2
    # 3: 1 b 3
    # 4: 1 b 4
    # 5: 2 a 5
    # 6: 2 a 6
    # 7: 2 b 7
    # 8: 2 b 8
    DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
    #    x y mul
    # 1: 1 a   4
    # 2: 2 b   3
    

    And you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y. We can either:

    • 1) aggregate DT1 to get sum(z), 2) perform a join and 3) multiply (or)

      # data.table way
      DT1[, .(z=sum(z)), keyby=.(x,y)][DT2][, z := z*mul][]
      
      # dplyr equivalent
      DF1 %>% group_by(x,y) %>% summarise(z=sum(z)) %>% 
          right_join(DF2) %>% mutate(z=z*mul)
      

    • 2) do it all in one go (using by=.EACHI feature):

      DT1[DT2, list(z=sum(z) * mul), by=.EACHI]
      

    What is the advantage?

    • We don't have to allocate memory for the intermediate result.

    • We don't have to group/hash twice (one for aggregation and other for joining).

    • And more importantly, the operation we want to perform is clear just by looking at j in (2).

    Check this post for a detailed explanation of by=.EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go.

    Have a look at this, this and this post for real usage scenarios.

    In dplyr you would have to join and aggregate, or aggregate first and then join, neither of which is as efficient in terms of memory (which in turn translates to speed).

  4. Update and joins:

    Consider the data.table code shown below:

    DT1[DT2, col := i.mul]
    

    This adds/updates DT1's column col with mul from DT2 on those rows where DT2's key columns match DT1's. I don't think there is an exact equivalent of this operation in dplyr, i.e., anything short of a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary (a rough approximation is sketched below).

    Check this post for a real usage scenario.
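
    For contrast, a rough dplyr approximation of the update-while-join above (my own sketch, not from the original answer), which has to materialise an entirely new copy of DF1 just to carry one extra column:

    # hypothetical dplyr counterpart of DT1[DT2, col := i.mul]
    DF1 <- DF1 %>%
        left_join(DF2, by = c("x", "y")) %>%  # allocates a whole new data.frame
        rename(col = mul)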

To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!

3. Syntax

Let's now look at syntax. Hadley commented here:

Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...

I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.

We will work with the dummy data shown below:

DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)

  1. Basic aggregation/update operations.

    # case (a)
    DT[, sum(y), by=z]                       ## data.table syntax
    DF %>% group_by(z) %>% summarise(sum(y)) ## dplyr syntax
    DT[, y := cumsum(y), by=z]
    ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))
    
    # case (b)
    DT[x > 2, sum(y), by=z]
    DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
    DT[x > 2, y := cumsum(y), by=z]
    ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x>2), cumsum(y)))
    
    # case (c)
    DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by=z]
    DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L]-y[2L] else y[2L])
    DT[, if(any(x > 5L)) y[1L]-y[2L], by=z]
    DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L]-y[2L])
    

    • data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a).

    • In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table, however, we express both operations with the same logic - operate on rows where x > 2, but in the first case get sum(y), whereas in the second case update those rows of y with its cumulative sum.

      This is what we mean when we say the DT[i, j, by] form is consistent.

    • Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied, and skip otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise, because summarise() always expects a single value.

      While it returns the same result, using filter() here makes the actual operation less obvious.

      It might very well be possible to use filter() in the first case as well (does not seem obvious to me), but my point is that we should not have to.

  2. Aggregation / update on multiple columns

    # case (a)
    DT[, lapply(.SD, sum), by=z]                     ## data.table syntax
    DF %>% group_by(z) %>% summarise_each(funs(sum)) ## dplyr syntax
    DT[, (cols) := lapply(.SD, sum), by=z]
    ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))
    
    # case (b)
    DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by=z]
    DF %>% group_by(z) %>% summarise_each(funs(sum, mean))
    
    # case (c)
    DT[, c(.N, lapply(.SD, sum)), by=z]     
    DF %>% group_by(z) %>% summarise_each(funs(n(), sum))
    

    • In case (a), the codes are more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions passed to funs().

    • data.table's := requires column names to be provided, whereas dplyr generates them automatically.

    • In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.

    • In case (c) though, dplyr would return n() once for each column, instead of just once. In data.table, all we need to do is return a list in j. Each element of the list will become a column in the result. So we can, once again, use the familiar base function c() to concatenate .N to the list returned by lapply(), which again gives a list.

    Note: Once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in the result. You can use c(), as.list(), lapply(), list() etc... base functions to accomplish this, without having to learn any new functions.

    You will need to learn just the special variables - .N and .SD - at least. The equivalents in dplyr are n() and .

  3. Joins

    dplyr provides separate functions for each type of join, whereas data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.

    setkey(DT1, x, y)
    
    # 1. normal join
    DT1[DT2]            ## data.table syntax
    left_join(DT2, DT1) ## dplyr syntax
    
    # 2. select columns while join    
    DT1[DT2, .(z, i.mul)]
    left_join(select(DT2, x,y,mul), select(DT1, x,y,z))
    
    # 3. aggregate while join
    DT1[DT2, .(sum(z)*i.mul), by=.EACHI]
    DF1 %>% group_by(x, y) %>% summarise(z=sum(z)) %>% 
        inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)
    
    # 4. update while join
    DT1[DT2, z := cumsum(z)*i.mul, by=.EACHI]
    ??
    
    # 5. rolling join
    DT1[DT2, roll = -Inf]
    ??
    
    # 6. other arguments to control output
    DT1[DT2, mult = "first"]
    ??
    

    • Some might find a separate function for each join much nicer (left, right, inner, anti, semi, etc.), whereas others might like data.table's DT[i, j, by], or merge(), which is similar to base R.

    • However dplyr joins do just that. Nothing more. Nothing less.

    • data.tables can select columns while joining (2); in dplyr you will need to select() on both data.frames first before joining, as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later, and that is inefficient.

    • data.tables can aggregate while joining (3) and also update while joining (4), using the by=.EACHI feature. Why materialise the entire join result to add/update just a few columns?

    • data.table is capable of rolling joins (5) - roll forward, LOCF, roll backward, NOCB, nearest.

    • data.table also has mult= argument which selects first, last or all matches (6).

    • data.table has allow.cartesian=TRUE argument to protect from accidental invalid joins.

Once again, the syntax is consistent with DT[i, j, by] with additional arguments allowing for controlling the output further.
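
To make the rolling join in (5) concrete, here is a minimal sketch on toy data of my own (not from the original answer):

library(data.table)

trades <- data.table(time = c(1, 5, 9),  price = c(100, 101, 99), key = "time")
quotes <- data.table(time = c(2, 6, 10), bid = c(99.5, 100.5, 98.5), key = "time")

# for each quote, roll the most recent trade price forward (LOCF)
trades[quotes, roll = TRUE]
#    time price   bid
# 1:    2   100  99.5
# 2:    6   101 100.5
# 3:   10    99  98.5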

  4. do()...

    dplyr's summarise is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). And you have to know beforehand whether all your functions return a single value.

    DT[, list(x[1], y[1]), by=z]                 ## data.table syntax
    DF %>% group_by(z) %>% summarise(x[1], y[1]) ## dplyr syntax
    DT[, list(x[1:2], y[1]), by=z]
    DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))
    
    DT[, quantile(x, 0.25), by=z]
    DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
    DT[, quantile(x, c(0.25, 0.75)), by=z]
    DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))
    
    DT[, as.list(summary(x)), by=z]
    DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
    

    • .SD's equivalent is .

    • In data.table, you can throw pretty much anything in j - the only thing to remember is for it to return a list so that each element of the list gets converted to a column.

    • In dplyr, you cannot do that. You have to resort to do(), depending on how sure you can be of whether your function will always return a single value. And it is quite slow.

Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.
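
For instance (my own toy illustration, reusing the DT defined earlier): j can hold an arbitrary multi-line expression, so long as the final value is a list:

DT[, {
    tmp <- y - mean(y)                  # any intermediate computation
    list(lo = min(tmp), hi = max(tmp))  # returning a list creates columns
}, by = z]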

Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforwardly using dplyr's syntax...

To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about the most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on it.

data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here.

But one should also consider the number of features that dplyr lacks in comparison to data.table.

4. Features

I have pointed out most of the features here and also in this post. In addition:

  • fread - a fast file reader - has been available for a long time now.

  • fwrite - NEW in the current devel version, v1.9.7: a parallelised fast file writer is now available. See this post for a detailed explanation of the implementation and #1664 to keep track of further developments.

  • Automatic indexing - another handy feature to optimise base R syntax as is, internally.

  • Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not always be desirable (see the small illustration after this list).

  • Numerous advantages in data.table joins (for speed / memory efficiency and syntax) mentioned above.

  • Non-equi joins: a NEW feature available from v1.9.7+. It allows joins using other operators <=, <, >, >=, along with all the other advantages of data.table joins.

  • Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.

  • setorder() - a function in data.table that allows really fast reordering of data.tables by reference.

  • dplyr provides an interface to databases using the same syntax, which data.table does not at the moment.

  • data.table provides faster equivalents of set operations from v1.9.7+ (written by Jan Gorecki) - fsetdiff(), fintersect(), funion() and fsetequal(), with an additional all argument (as in SQL).

  • data.table loads cleanly with no masking warnings, and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes the base functions filter, lag and [, which can cause problems; e.g. here and here.
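
A small illustration of the ad-hoc grouping point above (my own toy example):

library(data.table)
d <- data.table(g = c("b", "a", "b", "a"), v = 1:4)

d[, sum(v), by = g]     # by= keeps groups in order of first appearance: b, a
d[, sum(v), keyby = g]  # keyby= sorts the groups: a, b
# dplyr's summarise(), at the time of writing, always returns the sorted order.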


Finally:

  • On databases - there is no reason why data.table cannot provide a similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature... not sure.

  • On parallelism - Everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe).

    • Progress is being made currently (in v1.9.7 devel) towards parallelising known time consuming parts for incremental performance gains using OpenMP.
