dplyr on data.table, am I really using data.table?

Question

If I use dplyr syntax on top of a data.table, do I get all the speed benefits of data.table while still using the syntax of dplyr? In other words, do I misuse the data.table if I query it with dplyr syntax? Or do I need to use pure data.table syntax to harness all of its power?

Thanks in advance for any advice. Code Example:

library(data.table)
library(dplyr)

diamondsDT <- data.table(ggplot2::diamonds)
setkey(diamondsDT, cut) 

diamondsDT %>%
    filter(cut != "Fair") %>%
    group_by(cut) %>%
    summarize(AvgPrice = mean(price),
              MedianPrice = as.numeric(median(price)),
              Count = n()) %>%
    arrange(desc(Count))

Results:

#         cut AvgPrice MedianPrice Count
# 1     Ideal 3457.542      1810.0 21551
# 2   Premium 4584.258      3185.0 13791
# 3 Very Good 3981.760      2648.0 12082
# 4      Good 3928.864      3050.5  4906

Here is the data.table equivalent I came up with. Not sure if it complies with DT good practice. But I wonder whether this code is really more efficient than the dplyr syntax behind the scenes:

diamondsDT[cut != "Fair"
         ][, .(AvgPrice = mean(price),
               MedianPrice = as.numeric(median(price)),
               Count = .N),
           by = cut
         ][order(-Count)]

Solution

There is no straightforward/simple answer because the philosophies of both these packages differ in certain aspects. So some compromises are unavoidable. Here are some of the concerns you may need to address/consider.

Operations involving i (== filter() and slice() in dplyr)

Assume a DT with, say, 10 columns. Consider these data.table expressions:

DT[a > 1, .N]                    ## --- (1)
DT[a > 1, mean(b), by=.(c, d)]   ## --- (2)

(1) gives the number of rows in DT where column a > 1. (2) returns mean(b) grouped by c,d for the same expression in i as (1).

Commonly used dplyr expressions would be:

DT %>% filter(a > 1) %>% summarise(n())                        ## --- (3) 
DT %>% filter(a > 1) %>% group_by(c, d) %>% summarise(mean(b)) ## --- (4)

Clearly, the data.table code is shorter. In addition, it is also more memory efficient (1). Why? Because in both (3) and (4), filter() returns rows for all 10 columns first, whereas in (3) we just need the number of rows, and in (4) we just need columns b, c, d for the successive operations. To overcome this, we have to select() columns a priori:

DT %>% select(a) %>% filter(a > 1) %>% summarise(n()) ## --- (5)
DT %>% select(a,b,c,d) %>% filter(a > 1) %>% group_by(c,d) %>% summarise(mean(b)) ## --- (6)
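
If you want to run (1) through (6) yourself, a DT of roughly this shape can be built as follows. This is just an illustrative sketch; the column names a-d, the filler columns and the sizes are all assumptions:

library(data.table)
set.seed(42)
n  <- 1e6
DT <- data.table(a = rnorm(n), b = rnorm(n),
                 c = sample(letters[1:4], n, TRUE),
                 d = sample(1:5, n, TRUE))
# add six filler columns so that DT has 10 columns in total
for (nm in paste0("filler", 1:6)) DT[, (nm) := rnorm(n)]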

It is essential to highlight a major philosophical difference between the two packages:

  • In data.table, we like to keep these related operations together, and that allows us to look at the j-expression (from the same function call) and realise that no columns are needed in (1). The expression in i gets computed, and .N is just the sum of that logical vector, which gives the number of rows; the entire subset is never materialised. In (2), just columns b, c, d are materialised in the subset; other columns are ignored.

  • But in dplyr, the philosophy is to have a function do precisely one thing well. There is (at least currently) no way to tell if the operation after filter() needs all those columns we filtered. You'll need to think ahead if you want to perform such tasks efficiently. I personally find it counter-intuitive in this case.

Note that in (5) and (6), we still subset column a, which we don't require. But I'm not sure how to avoid that. If the filter() function had an argument to select the columns to return, we could avoid this issue, but then the function would not do just one task (which is also a dplyr design choice).

Sub-assign by reference

dplyr will never update by reference. This is another huge (philosophical) difference between the two packages.

For example, in data.table you can do:

DT[a %in% some_vals, a := NA]

which updates column a by reference on just those rows that satisfy the condition. At the moment dplyr deep copies the entire data.table internally to add a new column. @BrodieG already mentioned this in his answer.
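
To see the contrast concretely, here is a minimal sketch; the data and values are made up for illustration:

library(data.table)
library(dplyr)

DT <- data.table(a = 1:4, b = letters[1:4])
DF <- as.data.frame(DT)
some_vals <- c(2L, 4L)

# data.table: updates column a by reference, touching only the matching rows
DT[a %in% some_vals, a := NA]

# dplyr: builds and returns a modified copy; DF itself is left untouched
DF2 <- DF %>% mutate(a = replace(a, a %in% some_vals, NA))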

But the deep copy can be replaced by a shallow copy when FR #617 is implemented. Also relevant: dplyr: FR#614. Note that, still, the column you modify will always be copied (therefore a tad slower / less memory efficient). There will be no way to update columns by reference.

Other functionalities

  • In data.table, you can aggregate while joining, and this is more straightforward to understand and is memory efficient since the intermediate join result is never materialised. Check this post for an example, and see the sketch after this list. You can't (at the moment?) do that using dplyr's data.table/data.frame syntax.

  • data.table's rolling joins feature is not supported in dplyr's syntax either.

  • We recently implemented overlap joins in data.table to join over interval ranges (here's an example), which is a separate function foverlaps() at the moment, and therefore could be used with the pipe operators (magrittr / pipeR? - never tried it myself).

    But ultimately, our goal is to integrate it into [.data.table so that we can harvest the other features like grouping, aggregating while joining etc., which will have the same limitations outlined above.

  • Since 1.9.4, data.table implements automatic indexing using secondary keys for fast binary-search-based subsets with regular R syntax. Ex: DT[x == 1] and DT[x %in% some_vals] will automatically create an index on the first run, which will then be used on successive subsets of the same column for fast subsetting using binary search. This feature will continue to evolve. Check this gist for a short overview of this feature, and see the sketch after this list.

    From the way filter() is implemented for data.tables, it doesn't take advantage of this feature.

  • A dplyr feature is that it also provides an interface to databases using the same syntax, which data.table doesn't at the moment.
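
Here is the aggregate-while-joining sketch promised above; the table and column names (sales, lookup, id, amount) are made up for illustration:

library(data.table)
sales  <- data.table(id = c(1L, 1L, 2L, 3L), amount = c(10, 20, 5, 8))
lookup <- data.table(id = 1:2)
setkey(sales, id)

# for each row of lookup, j is evaluated on the matching rows of sales;
# the intermediate join result is never materialised as a full table
sales[lookup, .(total = sum(amount)), by = .EACHI]
#    id total
# 1:  1    30
# 2:  2     5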

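And the automatic-indexing sketch, again with made-up data (the names x, y and the sizes are assumptions):

library(data.table)
set.seed(1)
DT <- data.table(x = sample(1e4L, 1e6, TRUE), y = rnorm(1e6))

DT[x == 1L]      # first run: an index on x is created automatically
DT[x %in% 1:3]   # later subsets on x reuse it for binary search
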
So, you will have to weigh these (and probably other) points and decide whether these trade-offs are acceptable to you.

HTH


(1) Note that being memory efficient directly impacts speed (especially as data gets larger), as the bottleneck in most cases is moving the data from main memory onto cache (and making use of data in cache as much as possible - reduce cache misses - so as to reduce accessing main memory). Not going into details here.
