data.table vs dplyr: can one do something well the other can't or does poorly?

Overview

I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:

  1. data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
  2. dplyr has more accessible syntax
  3. dplyr abstracts (or will) potential DB interactions
  4. There are some minor functionality differences (see "Examples/Usage" below)

In my mind 2. doesn't bear much weight because I am already fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).

Question

What I want to know is:

  1. Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).
  2. Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another.

One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):

dat %>%
  group_by(name, job) %>%
  filter(job != "Boss" | year == min(year)) %>%
  mutate(cumu_job2 = cumsum(job2))

Which was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert, Arun, and note here I favored single statement over the strictly most optimal solution):

setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], 
  by=list(id, job)
]

The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (i.e. doesn't use some of the more esoteric tricks).

Ideally what I'd like to see is some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.

Examples

Usage
  • dplyr does not allow grouped operations that return an arbitrary number of rows (from eddi's question, note: this looks like it will be implemented in dplyr 0.5; also, @beginneR shows a potential work-around using do in the answer to @eddi's question).
  • data.table supports rolling joins (thanks @dholstius) as well as overlap joins
  • data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing, which uses binary search while keeping the same base R syntax (a small sketch follows this list). See here for some more details and a tiny benchmark.
  • dplyr offers standard evaluation versions of functions (e.g. regroup, summarize_each_) that can simplify the programmatic use of dplyr (note programmatic use of data.table is definitely possible, just requires some careful thought, substitution/quoting, etc, at least to my knowledge)
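
As a small illustration of the automatic-indexing bullet above, a minimal sketch (hypothetical data; the benefit only shows up on larger tables):

    library(data.table)
    DT <- data.table(col = sample(1e7L))
    system.time(DT[col == 100L])  # first run builds an index on 'col' internally
    system.time(DT[col == 100L])  # later runs use binary search on that index
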
Benchmarks

Data

This is for the first example I showed in the question section.

dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane", 
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", 
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L, 
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager", 
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", 
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L, 
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id", 
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA, 
-16L))

Solution

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.

My intent is to cover each one of these as clearly as possible from a data.table perspective.

Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface, whose internals are implemented in C++ using Rcpp.


The data.table syntax is consistent in its form - DT[i, j, by]. Keeping i, j and by together is by design. Keeping related operations together makes it possible to optimise operations for speed and, more importantly, memory usage, and also enables some powerful features, all while maintaining consistency in syntax.

1. Speed

Quite a few benchmarks (though mostly on grouping operations) have been added to the question already, showing that data.table gets faster than dplyr as the number of groups and/or rows to group increases, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) with 100 to 10 million groups and varying grouping columns, which also compare pandas. See also the updated benchmarks, which include Spark and pydatatable as well.

On benchmarks, it would be great to cover these remaining aspects as well:

  • Grouping operations involving a subset of rows - i.e., DT[x > val, sum(y), by = z] type operations (a minimal timing sketch follows this list).

  • Benchmark other operations such as update and joins.

  • Also benchmark memory footprint for each operation in addition to runtime.
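
For instance, a minimal timing sketch of the first bullet (hypothetical data and sizes; runtime only - a memory comparison would need e.g. Rprofmem() on top):

    library(data.table)
    library(dplyr)

    set.seed(1)
    N  <- 1e7L
    DT <- data.table(x = runif(N), y = runif(N), z = sample(1e5L, N, TRUE))
    DF <- as.data.frame(DT)

    # grouped aggregation on a subset of rows
    system.time(DT[x > 0.5, sum(y), by = z])
    system.time(DF %>% filter(x > 0.5) %>% group_by(z) %>% summarise(sum(y)))
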

2. Memory usage

  1. Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.

    Note that Hadley's comment talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory.

  2. The data.table interface currently allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).

     # sub-assign by reference, updates 'y' in-place
     DT[x >= 1L, y := NA]
    

    But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):

     # copies the entire 'y' column
     ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
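
    To see the by-reference behaviour concretely, data.table's exported address() helper confirms that := does not copy (a minimal sketch):

     library(data.table)
     DT <- data.table(x = 1:5, y = 6:10)
     before <- address(DT)
     DT[x >= 3L, y := NA]            # sub-assign by reference, as above
     identical(before, address(DT))  # TRUE: same object, no copy was made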
    

    A concern here is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this and this post for interesting cases. And we want to keep it.

    Therefore we are working towards exporting a shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable to not modify the input data.table within a function, one can then do:

     foo <- function(DT) {
         DT = shallow(DT)          ## shallow copy DT
         DT[, newcol := 1L]        ## does not affect the original DT 
         DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
         DT[x > 2L, x := 3L]       ## has to copy (as base R / dplyr always do); otherwise the original DT
                                   ## would also get modified.
     }
    

    By not using shallow(), the old functionality is retained:

     bar <- function(DT) {
         DT[, newcol := 1L]        ## old behaviour, original DT gets updated by reference
         DT[x > 2L, x := 3L]       ## old behaviour, update column x in original DT.
     }
    

    By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to ensure that the columns you modify are copied only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilities.

    Also, once shallow() is exported, dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables.

    But it will still lack many features that data.table provides, including (sub)-assignment by reference.
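
    Until shallow() is exported, copy() (a deep copy, which data.table already exports) is the safe way to get value semantics inside a function - it uses more memory than a shallow copy would, but gives the same guarantee. A sketch:

     foo_copy <- function(DT) {
         DT <- data.table::copy(DT)  ## deep copy; the caller's DT is untouched
         DT[, newcol := 1L]
         DT[]                        ## return the modified copy
     }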

  3. Aggregate while joining:

    Suppose you have two data.tables as follows:

     DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
     #    x y z
     # 1: 1 a 1
     # 2: 1 a 2
     # 3: 1 b 3
     # 4: 1 b 4
     # 5: 2 a 5
     # 6: 2 a 6
     # 7: 2 b 7
     # 8: 2 b 8
     DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
     #    x y mul
     # 1: 1 a   4
     # 2: 2 b   3
    

    And you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y. We can either:

      1. aggregate DT1 to get sum(z), 2) perform a join and 3) multiply (or)

        data.table way

        DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]

        dplyr equivalent

        DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% right_join(DF2) %>% mutate(z = z * mul)

      2. do it all in one go (using the by = .EACHI feature):

        DT1[DT2, list(z=sum(z) * mul), by = .EACHI]

    What is the advantage?

    • We don't have to allocate memory for the intermediate result.

    • We don't have to group/hash twice (one for aggregation and other for joining).

    • And more importantly, the operation we want to perform is clear just by looking at j in (2).

    Check this post for a detailed explanation of by = .EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go.
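
    As a quick sanity check on DT1 and DT2 above, both approaches yield the same z values (a sketch):

     res1 <- DT1[, .(z = sum(z)), keyby = .(x, y)][DT2][, z := z * mul][]
     res2 <- DT1[DT2, .(z = sum(z) * mul), by = .EACHI]
     all.equal(res1$z, res2$z)  ## TRUE - only the intermediate allocations differ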

    Have a look at this, this and this for real usage scenarios.

    In dplyr you would have to join and aggregate or aggregate first and then join, neither of which is as efficient in terms of memory (which in turn translates to speed).

  4. Update and joins:

    Consider the data.table code shown below:

     DT1[DT2, col := i.mul]
    

    adds/updates DT1's column col with mul from DT2 on those rows where DT2's key columns match DT1's. I don't think there is an exact equivalent of this operation in dplyr, i.e., without resorting to a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary.
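
    The closest dplyr approximation I can think of (my own sketch, not an established idiom) materialises that copy via a join:

     # copies all of DF1 just to add one column
     DF1 <- DF1 %>%
         left_join(select(DF2, x, y, mul), by = c("x", "y")) %>%
         rename(col = mul)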

    Check this post for a real usage scenario.

To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!

3. Syntax

Let's now look at syntax. Hadley commented here:

Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...

I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.

We will work with the dummy data shown below:

DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)

  1. Basic aggregation/update operations.

     # case (a)
     DT[, sum(y), by = z]                       ## data.table syntax
     DF %>% group_by(z) %>% summarise(sum(y)) ## dplyr syntax
     DT[, y := cumsum(y), by = z]
     ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))
    
     # case (b)
     DT[x > 2, sum(y), by = z]
     DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
     DT[x > 2, y := cumsum(y), by = z]
     ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y)))
    
     # case (c)
     DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
     DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
     DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
     DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
    

    • data.table syntax is compact and dplyr's is quite verbose. Things are more or less equivalent in case (a).

    • In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table however, we express both operations with the same logic - operate on rows where x > 2, but in the first case, get sum(y), whereas in the second case update those rows of y with its cumulative sum.

      This is what we mean when we say the DT[i, j, by] form is consistent.

    • Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied and skip otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise, because summarise() always expects a single value.

      While it returns the same result, using filter() here makes the actual operation less obvious.

      It might very well be possible to use filter() in the first case as well (does not seem obvious to me), but my point is that we should not have to.

  2. Aggregation / update on multiple columns

     # case (a)
     DT[, lapply(.SD, sum), by = z]                   ## data.table syntax
     DF %>% group_by(z) %>% summarise_each(funs(sum)) ## dplyr syntax
     cols = c("x", "y")  ## the columns to update must be named ('cols' was left undefined above)
     DT[, (cols) := lapply(.SD, sum), by = z]
     ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))

     # case (b)
     DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
     DF %>% group_by(z) %>% summarise_each(funs(sum, mean))

     # case (c)
     DT[, c(.N, lapply(.SD, sum)), by = z]
     DF %>% group_by(z) %>% summarise_each(funs(n(), sum))
    

    • In case (a), the codes are more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to pass to funs().

    • data.table's := requires column names to be provided, whereas dplyr generates the names automatically.

    • In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.

    • In case (c) though, dplyr would return n() once for each column, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So we can, once again, use the familiar base function c() to concatenate .N to the list returned by lapply().

    Note: Once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in the result. You can use c(), as.list(), lapply(), list() etc... base functions to accomplish this, without having to learn any new functions.

    You will need to learn just the special variables - .N and .SD at least. The equivalents in dplyr are n() and .

  3. Joins

    dplyr provides separate functions for each type of join, whereas data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.

     setkey(DT1, x, y)
    
     # 1. normal join
     DT1[DT2]            ## data.table syntax
     left_join(DT2, DT1) ## dplyr syntax
    
     # 2. select columns while join    
     DT1[DT2, .(z, i.mul)]
     left_join(select(DT2, x, y, mul), select(DT1, x, y, z))
    
     # 3. aggregate while join
     DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
     DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% 
         inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)
    
     # 4. update while join
     DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
     ??
    
     # 5. rolling join
     DT1[DT2, roll = -Inf]
     ??
    
     # 6. other arguments to control output
     DT1[DT2, mult = "first"]
     ??
    

  • Some might find a separate function for each join much nicer (left, right, inner, anti, semi etc), whereas others might like data.table's DT[i, j, by], or merge() which is similar to base R.

  • However dplyr joins do just that. Nothing more. Nothing less.

  • data.tables can select columns while joining (2); in dplyr you need to select() first on both data.frames before joining, as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later, which is inefficient.

  • data.tables can aggregate while joining using by = .EACHI feature (3) and also update while joining (4). Why materialize the entire join result to add/update just a few columns?

  • data.table is capable of rolling joins (5) - roll forward (LOCF), roll backward (NOCB), and nearest (a small sketch follows this list).

  • data.table also has the mult = argument, which selects first, last or all matches (6).

  • data.table has allow.cartesian = TRUE argument to protect from accidental invalid joins.
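
As an illustration of the rolling join in (5), a minimal sketch with hypothetical quotes/trades data:

    quotes <- data.table(t = c(1L, 5L, 9L), bid = c(10, 11, 12), key = "t")
    trades <- data.table(t = c(2L, 6L), key = "t")
    quotes[trades, roll = TRUE]  # LOCF: t=2 gets bid 10, t=6 gets bid 11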

Once again, the syntax is consistent with DT[i, j, by] with additional arguments allowing for controlling the output further.

  4. do()...

    dplyr's summarise is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand what all your functions return.

     DT[, list(x[1], y[1]), by = z]                 ## data.table syntax
     DF %>% group_by(z) %>% summarise(x[1], y[1]) ## dplyr syntax
     DT[, list(x[1:2], y[1]), by = z]
     DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))
    
     DT[, quantile(x, 0.25), by = z]
     DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
     DT[, quantile(x, c(0.25, 0.75)), by = z]
     DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))
    
     DT[, as.list(summary(x)), by = z]
     DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
    

  • .SD's equivalent is .

  • In data.table, you can throw pretty much anything in j - the only thing to remember is for it to return a list so that each element of the list gets converted to a column.

  • In dplyr, you cannot do that. You have to resort to do(), depending on how sure you are of whether your function will always return a single value. And it is quite slow.

Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.

Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforwardly using dplyr's syntax...

To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about its "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about the most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I have yet to see a post about them.

data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here.

But one should also consider the number of features that dplyr lacks in comparison to data.table.

4. Features

I have pointed out most of the features here and also in this post. In addition:

  • fread - a fast file reader - has been available for a long time now.

  • fwrite - a parallelised fast file writer is now available. See this post for a detailed explanation on the implementation and #1664 for keeping track of further developments.

  • Automatic indexing - another handy feature to optimise base R syntax as is, internally.

  • Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not be always desirable.

  • Numerous advantages in data.table joins (for speed / memory efficiency and syntax) mentioned above.

  • Non-equi joins: allows joins using other operators <=, <, >, >=, along with all the other advantages of data.table joins (a small sketch follows this list).

  • Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.

  • The setorder() function in data.table allows really fast reordering of data.tables by reference.

  • dplyr provides an interface to databases using the same syntax, which data.table does not at the moment.

  • data.table provides faster equivalents of set operations (written by Jan Gorecki) - fsetdiff, fintersect, funion and fsetequal with additional all argument (as in SQL).

  • data.table loads cleanly with no masking warnings and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes base functions filter, lag and [ which can cause problems; e.g. here and here.
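
A minimal non-equi join sketch (hypothetical data, assuming data.table 1.9.8+): for each [lo, hi] range, pull the readings whose value falls inside it:

    readings <- data.table(val = c(1, 5, 9))
    ranges   <- data.table(lo = c(0, 4), hi = c(6, 10))
    readings[ranges, .(lo, hi, val = x.val), on = .(val >= lo, val <= hi)]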


Finally:

  • On databases - there is no reason why data.table cannot provide a similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature... not sure.

  • On parallelism - Everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe).

    • Progress is being made currently (in v1.9.7 devel) towards parallelising known time consuming parts for incremental performance gains using OpenMP.
