dplyr on data.table, am I really using data.table?


Problem Description

If I use dplyr syntax on top of a data.table, do I get all the speed benefits of data.table while still using the syntax of dplyr? In other words, do I misuse the data.table if I query it with dplyr syntax? Or do I need to use pure data.table syntax to harness all of its power?

Thanks in advance for any advice. Code Example:

library(data.table)
library(dplyr)

diamondsDT <- data.table(ggplot2::diamonds)
setkey(diamondsDT, cut) 

diamondsDT %>%
    filter(cut != "Fair") %>%
    group_by(cut) %>%
    summarize(AvgPrice = mean(price),
                 MedianPrice = as.numeric(median(price)),
                 Count = n()) %>%
    arrange(desc(Count))

Results:

#         cut AvgPrice MedianPrice Count
# 1     Ideal 3457.542      1810.0 21551
# 2   Premium 4584.258      3185.0 13791
# 3 Very Good 3981.760      2648.0 12082
# 4      Good 3928.864      3050.5  4906

Here is the data.table equivalent I came up with. I'm not sure whether it complies with DT good practice. But I wonder if the code is really more efficient than the dplyr syntax behind the scenes:

diamondsDT[cut != "Fair"
           ][, .(AvgPrice = mean(price),
                 MedianPrice = as.numeric(median(price)),
                 Count = .N), by = cut
           ][order(-Count)]

Solution

There is no straightforward/simple answer because the philosophies of both these packages differ in certain aspects. So some compromises are unavoidable. Here are some of the concerns you may need to address/consider.

Operations involving i (== filter() and slice() in dplyr)

Assume DT with say 10 columns. Consider these data.table expressions:

DT[a > 1, .N]                    ## --- (1)
DT[a > 1, mean(b), by=.(c, d)]   ## --- (2)

(1) gives the number of rows in DT where column a > 1. (2) returns mean(b) grouped by c,d for the same expression in i as (1).

Commonly used dplyr expressions would be:

DT %>% filter(a > 1) %>% summarise(n())                        ## --- (3) 
DT %>% filter(a > 1) %>% group_by(c, d) %>% summarise(mean(b)) ## --- (4)

Clearly, the data.table code is shorter. In addition, it is also more memory efficient (1). Why? Because in both (3) and (4), filter() returns rows for all 10 columns first, when in (3) we just need the number of rows, and in (4) we just need columns b, c, d for the successive operations. To overcome this, we have to select() columns a priori:

DT %>% select(a) %>% filter(a > 1) %>% summarise(n()) ## --- (5)
DT %>% select(a,b,c,d) %>% filter(a > 1) %>% group_by(c,d) %>% summarise(mean(b)) ## --- (6)

It is essential to highlight a major philosophical difference between the two packages:

  • In data.table, we like to keep these related operations together, and that allows us to look at the j-expression (from the same function call) and realise there's no need for any columns in (1). The expression in i gets computed, and .N is just the sum of that logical vector, which gives the number of rows; the entire subset is never realised. In (2), just columns b, c, d are materialised in the subset; other columns are ignored.

  • But in dplyr, the philosophy is to have a function do precisely one thing well. There is (at least currently) no way to tell if the operation after filter() needs all those columns we filtered. You'll need to think ahead if you want to perform such tasks efficiently. I personally find it counter-intuitive in this case.

Note that in (5) and (6), we still subset column a, which we don't require. But I'm not sure how to avoid that. If the filter() function had an argument to select the columns to return, we could avoid this issue, but then the function would not do just one task (which is also a dplyr design choice).
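
If you want to see the effect yourself, here is a minimal benchmarking sketch (not from the original answer; it assumes the microbenchmark package is installed and a dplyr version that operates on data.tables directly, as in the question; timings will vary by machine):

library(data.table)
library(dplyr)
library(microbenchmark)

set.seed(1L)
N  <- 1e6L
DT <- data.table(a = rnorm(N), b = rnorm(N),
                 c = sample(letters[1:4], N, replace = TRUE),
                 d = sample(letters[1:4], N, replace = TRUE))

microbenchmark(
  data.table = DT[a > 1, .N],                            # expression (1)
  dplyr      = DT %>% filter(a > 1) %>% summarise(n()),  # expression (3)
  times      = 10L
)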

Sub-assign by reference

dplyr will never update by reference. This is another huge (philosophical) difference between the two packages.

For example, in data.table you can do:

DT[a %in% some_vals, a := NA]

which updates column a by reference on just those rows that satisfy the condition. At the moment dplyr deep copies the entire data.table internally to add a new column. @BrodieG already mentioned this in his answer.
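
For contrast, a copy-based dplyr analogue of the update above (a sketch only; replace() is base R, and some_vals is assumed to be defined as in the snippet):

library(dplyr)

# mutate() returns a modified copy; DT itself is left untouched
DT2 <- DT %>% mutate(a = replace(a, a %in% some_vals, NA))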

But the deep copy can be replaced by a shallow copy when FR #617 is implemented. Also relevant: dplyr: FR#614. Note that even then, the column you modify will always be copied (therefore a tad slower / less memory efficient). There will be no way to update columns by reference.

Other functionalities

  • In data.table, you can aggregate while joining, and this is more straightforward to understand and is memory efficient since the intermediate join result is never materialised. Check this post for an example, and see the sketch after this list. You can't (at the moment?) do that using dplyr's data.table/data.frame syntax.

  • data.table's rolling joins feature is not supported in dplyr's syntax either.

  • We recently implemented overlap joins in data.table to join over interval ranges (here's an example, and a sketch follows this list), which is a separate function foverlaps() at the moment, and therefore could be used with the pipe operators (magrittr / pipeR? - never tried it myself).

    But ultimately, our goal is to integrate it into [.data.table so that we can harvest the other features like grouping, aggregating while joining etc., which would have the same limitations outlined above.

  • Since 1.9.4, data.table implements automatic indexing using secondary keys for fast binary-search-based subsets with regular R syntax. For example, DT[x == 1] and DT[x %in% some_vals] will automatically create an index on the first run, which will then be used on successive subsets of the same column to subset quickly using binary search. This feature will continue to evolve. Check this gist for a short overview of this feature.

    From the way filter() is implemented for data.tables, it doesn't take advantage of this feature.

  • A dplyr feature is that it also provides an interface to databases using the same syntax, which data.table doesn't at the moment.
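
As referenced in the list above, here is a hedged sketch of aggregating while joining, using by = .EACHI; the tables and column names are made up purely for illustration:

library(data.table)

# hypothetical tables, for illustration only
sales <- data.table(id = c(1L, 1L, 2L, 3L), amount = c(10, 20, 30, 40))
ids   <- data.table(id = c(1L, 2L))
setkey(sales, id)

# join ids to sales and aggregate per row of i in one step;
# the intermediate join result is never materialised
sales[ids, .(total = sum(amount)), by = .EACHI]

And a minimal foverlaps() sketch over interval ranges, again with made-up data; y must be keyed on its interval columns:

x <- data.table(start = c(1, 6), end = c(5, 10))
y <- data.table(start = c(4, 20), end = c(7, 30))
setkey(y, start, end)          # required: y keyed on its start/end columns

foverlaps(x, y, type = "any")  # overlap join; type = "any" is the default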

So, you will have to weigh these (and probably other) points and decide whether these trade-offs are acceptable to you.

HTH


(1) Note that being memory efficient directly impacts speed (especially as data gets larger), as the bottleneck in most cases is moving the data from main memory onto cache (and making use of data in cache as much as possible - reduce cache misses - so as to reduce accessing main memory). Not going into details here.
