data.table vs dplyr: can one do something well the other can't or does poorly?


Overview

I'm relatively familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:

  1. data.table and dplyr are comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
  2. dplyr has more accessible syntax
  3. dplyr abstracts (or will) potential DB interactions
  4. There are some minor functionality differences (see "Examples/Usage" below)

In my mind 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).

Question

What I want to know is:

  1. Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).
  2. Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another.

One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):

dat %.%   ## %.% was the pipe operator in early dplyr; later versions use %>%
  group_by(name, job) %.%
  filter(job != "Boss" | year == min(year)) %.%
  mutate(cumu_job2 = cumsum(job2))

Which was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert and Arun; note that here I favored a single statement over the strictly most optimal solution):

setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], 
  by=list(id, job)
]

The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (i.e. doesn't use some of the more esoteric tricks).

Ideally what I'd like to see is some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.

Examples

Usage
  • dplyr does not allow grouped operations that return an arbitrary number of rows (from eddi's question; note: this looks like it will be implemented in dplyr 0.5; also, @beginneR shows a potential work-around using do in the answer to @eddi's question).
  • data.table supports rolling joins (thanks @dholstius) as well as overlap joins
  • data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing, which uses binary search while keeping the same base R syntax (a small sketch follows this list). See here for some more details and a tiny benchmark.
  • dplyr offers standard evaluation versions of functions (e.g. regroup, summarize_each_) that can simplify the programmatic use of dplyr (note programmatic use of data.table is definitely possible, just requires some careful thought, substitution/quoting, etc, at least to my knowledge)
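
A quick sketch of the automatic-indexing point (toy data; the speed-up is the package's documented behaviour, not measured here):

library(data.table)
DT <- data.table(col = sample(1e7L))
DT[col == 42L]   # first subset on 'col' builds an index as a side effect
DT[col == 99L]   # later subsets on 'col' reuse the index: binary search instead of a full vector scan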
Benchmarks

Data

This is for the first example I showed in the question section.

dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane", 
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", 
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L, 
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager", 
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", 
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L, 
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id", 
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA, 
-16L))

Solution

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.

My intent is to cover each one of these as clearly as possible from a data.table perspective.

Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface whose internals are in C++ using Rcpp.


The data.table syntax is consistent in its form - DT[i, j, by]. Keeping i, j and by together is by design. By keeping related operations together, it is possible to easily optimise operations for speed and, more importantly, memory usage, and also to provide some powerful features, all while maintaining consistency in syntax.

1. Speed

Quite a few benchmarks (though mostly on grouping operations) have been added to the question already, showing that data.table gets faster than dplyr as the number of groups and/or rows to group increases, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 to 10 million groups and varying grouping columns, which also compare pandas. See also the updated benchmarks, which include Spark and pydatatable as well.

On benchmarks, it would be great to cover these remaining aspects as well (a sketch of how one might set up such a benchmark follows this list):

  • Grouping operations involving a subset of rows - i.e., DT[x > val, sum(y), by = z] type operations.

  • Benchmark other operations such as update and joins.

  • Also benchmark memory footprint for each operation in addition to runtime.
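
A minimal sketch of how the first aspect could be benchmarked, using the bench package (package choice, data sizes and the 0.5 cutoff are my assumptions; results are machine-dependent):

library(data.table)
library(dplyr)

N  <- 1e7L
DT <- data.table(x = runif(N), y = runif(N), z = sample(1e5L, N, TRUE))
DF <- as.data.frame(DT)

# grouped aggregation on a row subset - the DT[x > val, sum(y), by = z] pattern
bench::mark(
  data.table = DT[x > 0.5, sum(y), by = z],
  dplyr      = DF %>% filter(x > 0.5) %>% group_by(z) %>% summarise(sum(y)),
  check = FALSE  # outputs differ in class and row order; compare the time and mem_alloc columns
)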

2. Memory usage

  1. Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.

    Note that Hadley's comment talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory.

  2. The data.table interface currently allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).

     # sub-assign by reference, updates 'y' in-place
     DT[x >= 1L, y := NA]
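
    One way to see that no copy is made (a small sketch using data.table's exported address() helper; the printed addresses are illustrative):

     DT <- data.table(x = 1:5, y = 1:5)
     address(DT$y)         ## note the memory address of column 'y'
     DT[x >= 3L, y := NA]  ## sub-assign by reference
     address(DT$y)         ## same address: 'y' was updated in place, no copy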
    

    But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):

     # copies the entire 'y' column
     ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
    

    A concern here is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this and this post for interesting cases. And we want to keep it.

    Therefore we are working towards exporting a shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable not to modify the input data.table within a function, one can then do:

     foo <- function(DT) {
         DT = shallow(DT)          ## shallow copy DT
         DT[, newcol := 1L]        ## does not affect the original DT 
         DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
         DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise original DT will 
                                   ## also get modified.
     }
    

    By not using shallow(), the old functionality is retained:

     bar <- function(DT) {
         DT[, newcol := 1L]        ## old behaviour, original DT gets updated by reference
         DT[x > 2L, x := 3L]       ## old behaviour, update column x in original DT.
     }
    

    By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to ensure that the columns you modify are copied only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilities.

    Also, once shallow() is exported, dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables.

    But it will still lack many features that data.table provides, including (sub)-assignment by reference.
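
    Until shallow() is exported, the conservative way to protect the caller's object is data.table::copy(), which deep-copies every column (a sketch; it costs more memory than shallow() would):

     foo_safe <- function(DT) {
         DT <- copy(DT)      ## deep copy: the caller's data.table is untouched
         DT[, newcol := 1L]  ## modify the copy freely
         DT[]                ## return (and print) the modified copy
     }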

  3. Aggregate while joining:

    Suppose you have two data.tables as follows:

     DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
     #    x y z
     # 1: 1 a 1
     # 2: 1 a 2
     # 3: 1 b 3
     # 4: 1 b 4
     # 5: 2 a 5
     # 6: 2 a 6
     # 7: 2 b 7
     # 8: 2 b 8
     DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
     #    x y mul
     # 1: 1 a   4
     # 2: 2 b   3
    

    And you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y. We can either:

      1. aggregate DT1 to get sum(z), then perform a join, and then multiply; or

        data.table way

        DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]

        dplyr equivalent

        DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% right_join(DF2) %>% mutate(z = z * mul)

      2. do it all in one go (using the by = .EACHI feature):

        DT1[DT2, list(z=sum(z) * mul), by = .EACHI]
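
        For the toy tables above, the one-shot version returns (output reconstructed from the definitions of DT1 and DT2):

        #    x y  z
        # 1: 1 a 12
        # 2: 2 b 45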

    What is the advantage?

    • We don't have to allocate memory for the intermediate result.

    • We don't have to group/hash twice (once for aggregation and again for joining).

    • And more importantly, the operation we want to perform is clear just by looking at j in (2).

    Check this post for a detailed explanation of by = .EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go.

    Have a look at this, this and this post for real usage scenarios.

    In dplyr you would have to join and aggregate, or aggregate first and then join, neither of which is as efficient in terms of memory (which in turn translates to speed).

  4. Update and joins:

    Consider the data.table code shown below:

     DT1[DT2, col := i.mul]
    

    adds/updates DT1's column col with mul from DT2 on those rows where DT2's key columns match DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., without resorting to a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary.

    Check this post for a real usage scenario.
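
    The closest dplyr rendering I can think of (a sketch, not an exact equivalent; it materialises a full copy of DF1 instead of updating in place):

     DF1 <- DF1 %>%
         left_join(select(DF2, x, y, col = mul), by = c("x", "y"))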

To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!

3. Syntax

Let's now look at syntax. Hadley commented here:

Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...

I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.

We will work with the dummy data shown below:

DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)

  1. Basic aggregation/update operations.

     # case (a)
     DT[, sum(y), by = z]                       ## data.table syntax
     DF %>% group_by(z) %>% summarise(sum(y)) ## dplyr syntax
     DT[, y := cumsum(y), by = z]
     ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))
    
     # case (b)
     DT[x > 2, sum(y), by = z]
     DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
     DT[x > 2, y := cumsum(y), by = z]
     ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y)))
    
     # case (c)
     DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
     DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
     DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
     DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
    

    • data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a).

    • In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table, however, we express both operations with the same logic - operate on rows where x > 2, but in the first case get sum(y), whereas in the second case update those rows of y with its cumulative sum.

      This is what we mean when we say the DT[i, j, by] form is consistent.

    • Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied and skip otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise, because summarise() always expects a single value.

      While it returns the same result, using filter() here makes the actual operation less obvious.

      It might very well be possible to use filter() in the first case as well (does not seem obvious to me), but my point is that we should not have to.

  2. Aggregation / update on multiple columns

     # case (a)
     DT[, lapply(.SD, sum), by = z]                     ## data.table syntax
     DF %>% group_by(z) %>% summarise_each(funs(sum)) ## dplyr syntax
     cols = c("x", "y")                        ## the columns to update by reference
     DT[, (cols) := lapply(.SD, sum), by = z]
     ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))
    
     # case (b)
     DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
     DF %>% group_by(z) %>% summarise_each(funs(sum, mean))
    
     # case (c)
     DT[, c(.N, lapply(.SD, sum)), by = z]     
     DF %>% group_by(z) %>% summarise_each(funs(n(), mean))
    

    • In case (a), the code is more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to pass to funs().

    • data.table's := requires column names to be provided, whereas dplyr generates them automatically.

    • In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.

    • In case (c) though, dplyr would return n() once for each column, instead of just once. In data.table, all we need to do is return a list in j. Each element of the list will become a column in the result. So we can, once again, use the familiar base function c() to concatenate .N to a list, which returns a list.

    Note: once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in the result. You can use the base functions c(), as.list(), lapply(), list() etc. to accomplish this, without having to learn any new functions.

    You will need to learn just the special variables .N and .SD, at the very least. The equivalents in dplyr are n() and . respectively.

  3. Joins

    dplyr provides separate functions for each type of join, whereas data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.

     setkey(DT1, x, y)
    
     # 1. normal join
     DT1[DT2]            ## data.table syntax
     left_join(DT2, DT1) ## dplyr syntax
    
     # 2. select columns while join    
     DT1[DT2, .(z, i.mul)]
     left_join(select(DT2, x, y, mul), select(DT1, x, y, z))
    
     # 3. aggregate while join
     DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
     DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>% 
         inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)
    
     # 4. update while join
     DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
     ??
    
     # 5. rolling join
     DT1[DT2, roll = -Inf]
     ??
    
     # 6. other arguments to control output
     DT1[DT2, mult = "first"]
     ??
    

  • Some might find a separate function for each join much nicer (left, right, inner, anti, semi etc.), whereas others might like data.table's DT[i, j, by], or merge(), which is similar to base R.

  • However dplyr joins do just that. Nothing more. Nothing less.

  • data.tables can select columns while joining (2); in dplyr you would need to select() on both data.frames first, before joining, as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later, and that is inefficient.

  • data.tables can aggregate while joining using the by = .EACHI feature (3) and also update while joining (4). Why materialise the entire join result just to add/update a few columns?

  • data.table is capable of rolling joins (5) - roll forward (LOCF), roll backward (NOCB) and nearest (a tiny sketch follows the summary below).

  • data.table also has mult = argument which selects first, last or all matches (6).

  • data.table has allow.cartesian = TRUE argument to protect from accidental invalid joins.

Once again, the syntax is consistent with DT[i, j, by] with additional arguments allowing for controlling the output further.
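
A tiny rolling-join sketch for (5), on made-up sensor data (column names and values are purely illustrative):

     readings <- data.table(time = c(1, 5, 10), temp = c(20, 22, 25), key = "time")
     queries  <- data.table(time = c(3, 7))
     readings[queries, roll = TRUE]  ## LOCF: each query time gets the last reading at or before it
     #    time temp
     # 1:    3   20
     # 2:    7   22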

  4. do()...

    dplyr's summarise is specially designed for functions that return a single value. If your function returns multiple or an unequal number of values, you will have to resort to do(). You have to know beforehand what all your functions return.

     DT[, list(x[1], y[1]), by = z]                 ## data.table syntax
     DF %>% group_by(z) %>% summarise(x[1], y[1]) ## dplyr syntax
     DT[, list(x[1:2], y[1]), by = z]
     DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))
    
     DT[, quantile(x, 0.25), by = z]
     DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
     DT[, quantile(x, c(0.25, 0.75)), by = z]
     DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))
    
     DT[, as.list(summary(x)), by = z]
     DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
    

  • .SD's equivalent is .

  • In data.table, you can throw pretty much anything in j - the only thing to remember is for it to return a list so that each element of the list gets converted to a column.

  • In dplyr, you cannot do that. You have to resort to do(), depending on how sure you are about whether your function will always return a single value. And it is quite slow.

Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.

Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforwardly using dplyr's syntax...

To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited, or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about its "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about the most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I have yet to see a post on them.

data.table has its quirks as well (I have pointed out some that we are attempting to fix). We are also attempting to improve data.table's joins, as I have highlighted here.

But one should also consider the number of features that dplyr lacks in comparison to data.table.

4. Features

I have pointed out most of the features here and also in this post. In addition:

  • fread - a fast file reader - has been available for a long time now.

  • fwrite - a parallelised fast file writer is now available. See this post for a detailed explanation on the implementation and #1664 for keeping track of further developments.

  • Automatic indexing - another handy feature to optimise base R syntax as is, internally.

  • Ad-hoc grouping: dplyr automatically sorts the results by the grouping variables during summarise(), which may not always be desirable (a small sketch follows this list).

  • Numerous advantages in data.table joins (for speed / memory efficiency and syntax) mentioned above.

  • Non-equi joins: allows joins using other operators - <=, <, >, >= - along with all the other advantages of data.table joins.

  • Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.

  • The setorder() function in data.table allows really fast reordering of data.tables by reference.

  • dplyr provides an interface to databases using the same syntax, which data.table does not at the moment.

  • data.table provides faster equivalents of set operations (written by Jan Gorecki) - fsetdiff, fintersect, funion and fsetequal - with an additional all argument (as in SQL).

  • data.table loads cleanly with no masking warnings, and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes the base functions filter, lag and [, which can cause problems; e.g. here and here.
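
A small sketch of the ad-hoc grouping point above (toy data; dplyr's sorting behaviour is as of the versions discussed here):

     DT = data.table(g = c("b", "a", "b", "a"), v = 1:4)
     DT[, sum(v), by = g]      ## data.table keeps groups in order of first appearance: b, then a
     DT[, sum(v), keyby = g]   ## sorted grouping is explicit and opt-in, via keyby
     as.data.frame(DT) %>% group_by(g) %>% summarise(sum(v))  ## dplyr returns groups sorted: a, then b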


Finally:

  • On databases - there is no reason why data.table cannot provide a similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature... not sure.

  • On parallelism - everything is difficult until someone goes ahead and does it. Of course it will take effort (being thread-safe).

    • Progress is currently being made (in v1.9.7 devel) towards parallelising known time-consuming parts for incremental performance gains using OpenMP.
