dplyr on data.table, am I really using data.table?


Problem description


If I use dplyr syntax on top of a data.table, do I get all the speed benefits of data.table while still using the syntax of dplyr? In other words, do I misuse the data.table if I query it with dplyr syntax? Or do I need to use pure data.table syntax to harness all of its power?

Thanks in advance for any advice. Code Example:

library(data.table)
library(dplyr)

diamondsDT <- data.table(ggplot2::diamonds)
setkey(diamondsDT, cut) 

diamondsDT %>%
    filter(cut != "Fair") %>%
    group_by(cut) %>%
    summarize(AvgPrice = mean(price),
              MedianPrice = as.numeric(median(price)),
              Count = n()) %>%
    arrange(desc(Count))

Results:

#         cut AvgPrice MedianPrice Count
# 1     Ideal 3457.542      1810.0 21551
# 2   Premium 4584.258      3185.0 13791
# 3 Very Good 3981.760      2648.0 12082
# 4      Good 3928.864      3050.5  4906

Here is the data.table equivalent I came up with. Not sure if it complies with DT good practice. But I wonder whether the code is really more efficient than the dplyr syntax behind the scenes:

diamondsDT[cut != "Fair"
          ][, .(AvgPrice = mean(price),
                MedianPrice = as.numeric(median(price)),
                Count = .N), by = cut
          ][order(-Count)]
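
For reference, the two versions can be timed directly, e.g. with the microbenchmark package (a sketch; absolute timings will vary by machine and package versions):

library(microbenchmark)

microbenchmark(
    dplyr_syntax = diamondsDT %>%
        filter(cut != "Fair") %>%
        group_by(cut) %>%
        summarize(AvgPrice = mean(price),
                  MedianPrice = as.numeric(median(price)),
                  Count = n()) %>%
        arrange(desc(Count)),
    dt_syntax = diamondsDT[cut != "Fair"
                ][, .(AvgPrice = mean(price),
                      MedianPrice = as.numeric(median(price)),
                      Count = .N), by = cut
                ][order(-Count)],
    times = 10L
)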

Solution

There is no straightforward/simple answer because the philosophies of both these packages differ in certain aspects. So some compromises are unavoidable. Here are some of the concerns you may need to address/consider.

Operations involving i (== filter() and slice() in dplyr)

Assume a DT with, say, 10 columns. Consider these data.table expressions:

DT[a > 1, .N]                    ## --- (1)
DT[a > 1, mean(b), by=.(c, d)]   ## --- (2)

(1) gives the number of rows in DT where column a > 1. (2) returns mean(b) grouped by c,d for the same expression in i as (1).

Commonly used dplyr expressions would be:

DT %>% filter(a > 1) %>% summarise(n())                        ## --- (3) 
DT %>% filter(a > 1) %>% group_by(c, d) %>% summarise(mean(b)) ## --- (4)

Clearly, the data.table code is shorter. In addition, it is also more memory efficient (1). Why? Because in both (3) and (4), filter() first returns rows for all 10 columns, when in (3) we just need the number of rows, and in (4) we just need columns b, c, d for the successive operations. To overcome this, we have to select() columns a priori:

DT %>% select(a) %>% filter(a > 1) %>% summarise(n()) ## --- (5)
DT %>% select(a, b, c, d) %>% filter(a > 1) %>% group_by(c, d) %>% summarise(mean(b)) ## --- (6)

It is essential to highlight a major philosophical difference between the two packages:

  • In data.table, we like to keep these related operations together, which allows us to look at the j-expression (from the same function call) and realise that no columns are needed in (1). The expression in i gets computed, and .N is just the sum of that logical vector, which gives the number of rows; the entire subset is never materialised. In (2), just columns b,c,d are materialised in the subset; the other columns are ignored.

  • But in dplyr, the philosophy is to have a function do precisely one thing well. There is (at least currently) no way to tell if the operation after filter() needs all the columns we filtered. You'll need to think ahead if you want to perform such tasks efficiently. I personally find it counter-intuitive in this case.

Note that in (5) and (6), we still subset column a, which we don't require. But I'm not sure how to avoid that. If the filter() function had an argument to select the columns to return, we could avoid this issue, but then the function would not do just one task (which is also a dplyr design choice).

Sub-assign by reference

dplyr will never update by reference. This is another huge (philosophical) difference between the two packages.

For example, in data.table you can do:

DT[a %in% some_vals, a := NA]

which updates column a by reference on just those rows that satisfy the condition. At the moment dplyr deep copies the entire data.table internally to add a new column. @BrodieG already mentioned this in his answer.
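
You can see the difference with data.table's address() helper (a small sketch with made-up data):

library(data.table)
library(dplyr)

DT <- data.table(a = c(1L, 5L, 3L, 2L), b = 11:14)
some_vals <- c(1L, 3L)

address(DT)                     # address of DT before the update
DT[a %in% some_vals, a := NA]   # sub-assign by reference
address(DT)                     # same address: no copy was made

# the dplyr counterpart returns a modified copy instead:
DT2 <- DT %>% mutate(b = b * 2L)
address(DT2)                    # different address: the data was copied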

But the deep copy can be replaced by a shallow copy when FR #617 is implemented. Also relevant: dplyr: FR#614. Note that, even then, the column you modify will always be copied (and is therefore a tad slower / less memory efficient). There will be no way to update columns by reference.

Other functionalities

  • In data.table, you can aggregate while joining, which is more straightforward to understand and memory efficient, since the intermediate join result is never materialised. Check this post for an example; a minimal sketch follows below. You can't (at the moment?) do that using dplyr's data.table/data.frame syntax.
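
    For illustration, a minimal sketch of aggregating while joining with by = .EACHI (the tables here are made up):

    library(data.table)
    sales  <- data.table(id = c(1L, 1L, 2L, 2L, 3L), amt = c(10, 20, 30, 40, 50))
    lookup <- data.table(id = 1:2)
    setkey(sales, id)

    # mean(amt) is computed per row of 'lookup' during the join itself;
    # the intermediate join result is never materialised
    sales[lookup, .(avg_amt = mean(amt)), by = .EACHI]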

  • data.table's rolling joins feature is not supported in dplyr's syntax either.
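
    For illustration, a rolling join looks like this (a made-up sketch matching each trade to the last quote at or before its time):

    library(data.table)
    quotes <- data.table(time = c(1L, 5L, 10L), bid = c(100, 101, 102))
    trades <- data.table(time = c(4L, 10L, 12L))
    setkey(quotes, time)

    # roll = TRUE carries the last observation forward: each trade
    # gets the most recent quote at or before its own time
    quotes[trades, roll = TRUE]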

  • We recently implemented overlap joins in data.table to join over interval ranges (here's an example), which is a separate function foverlaps() at the moment, and therefore could be used with the pipe operators (magrittr / pipeR? - never tried it myself).

    But ultimately, our goal is to integrate it into [.data.table so that we can harvest the other features like grouping, aggregating while joining, etc., which will have the same limitations outlined above.
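
    A minimal foverlaps() sketch (made-up interval data):

    library(data.table)
    x <- data.table(start = c(1L, 5L), end = c(4L, 9L))
    y <- data.table(start = c(3L, 10L), end = c(6L, 12L))
    setkey(y, start, end)   # y must be keyed by its interval columns

    # find all rows of x whose [start, end] range overlaps a range in y
    foverlaps(x, y, by.x = c("start", "end"), type = "any")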

  • Since 1.9.4, data.table implements automatic indexing using secondary keys for fast binary-search-based subsets with regular R syntax. For example, DT[x == 1] and DT[x %in% some_vals] will automatically create an index on the first run, which will then be used on successive subsets of the same column for fast subsetting using binary search. This feature will continue to evolve. Check this gist for a short overview of this feature.

    From the way filter() is implemented for data.tables, it doesn't take advantage of this feature.
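
    You can watch the automatic indexing happen by switching on data.table's verbose output (a small sketch with made-up data):

    library(data.table)
    DT <- data.table(x = sample(1e6), y = runif(1e6))

    options(datatable.verbose = TRUE)
    DT[x == 1L]   # first run: an index on 'x' is created, then used
    DT[x == 1L]   # second run: the existing index is reused, no vector scan
    options(datatable.verbose = FALSE)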

  • A dplyr feature is that it also provides an interface to databases using the same syntax, which data.table doesn't at the moment.
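
    For example, the same verbs can target a database table (a minimal sketch using the dplyr of that era; the SQLite file name here is hypothetical):

    library(dplyr)

    # connect to an SQLite database; dplyr translates the pipeline to SQL
    db <- src_sqlite("diamonds.sqlite3")   # hypothetical database file
    diamonds_db <- tbl(db, "diamonds")

    diamonds_db %>%
        filter(cut != "Fair") %>%
        group_by(cut) %>%
        summarise(AvgPrice = mean(price))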

So, you will have to weigh these (and probably other) points and decide based on whether these trade-offs are acceptable to you.

HTH


(1) Note that being memory efficient directly impacts speed (especially as data gets larger), as the bottleneck in most cases is moving data from main memory into cache (and making use of data in cache as much as possible - reducing cache misses - so as to reduce trips to main memory). Not going into details here.
