Unique rows in dplyr: row_number() from tbl_dt inconsistent with tbl_df


Question

I am wondering how to get unique rows from a data.table somewhere along a dplyr workflow. Since v0.2 I can use row_number() == 1 (see: Remove duplicated rows using dplyr).

However,

tbl_df(data) %>% group_by(Var1, Var2) %>% filter(row_number() == 1)

works, but

tbl_dt(data) %>% group_by(Var1, Var2) %>% filter(row_number() == 1)

does not. Is this a bug?

library(dplyr)
library(data.table)
library(microbenchmark)

little <- expand.grid(rep(letters,121),rep(letters,121)) # my 10M row dataset.
tbl_dt(little) %>% group_by(Var1,Var2) %>% filter(row_number() == 1)



Result:

> Error in rank(x, ties.method = "first") : 
> argument "x" is missing, with no default
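
For contrast, the tbl_df version of the same pipeline, which works as reported above, looks like this on the same data (a minimal sketch, not from the original post):

tbl_df(little) %>%
  group_by(Var1, Var2) %>%
  filter(row_number() == 1)   # runs without error; returns the 676 unique Var1/Var2 pairs
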

This is actually how I found it. What I am asking is:

I can use the unique.data.table method:

 dt_u <- function() {
           tbl_dt(little) %>% 
           group_by(Var1,Var2) %>% 
           unique(.) %>% 
           tbl_dt(.) }

I can use summarise and then select away the new column:

dt_ss <- function() {
           tbl_dt(little) %>% 
           group_by(Var1,Var2) %>% 
           summarise( n = n() ) %>% 
           select( -(n) ) }

I can use row_number() == 1 (not supported for tbl_dt!):

 dt_rn <- function() {
           tbl_dt(little) %>% 
           group_by(Var1,Var2) %>% 
           filter( row_number() == 1 ) }

And so on with the tbl_df() equivalents (sketched below).
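
The df_*() functions aren't spelled out in the post; presumably they simply mirror the dt_*() versions with tbl_df() in place of tbl_dt(), and the timings below come from a microbenchmark call along these lines (a sketch, not the author's exact code):

 df_u  <- function() {
           tbl_df(little) %>%
           group_by(Var1, Var2) %>%
           unique(.) }

 df_ss <- function() {
           tbl_df(little) %>%
           group_by(Var1, Var2) %>%
           summarise( n = n() ) %>%
           select( -(n) ) }

 df_rn <- function() {
           tbl_df(little) %>%
           group_by(Var1, Var2) %>%
           filter( row_number() == 1 ) }

microbenchmark(dt_ss(), dt_u(), df_ss(), df_u(), df_rn(), times = 20)
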

> Unit: milliseconds
>     expr       min        lq    median        uq       max neval
>  dt_ss()  579.0385  618.0002  661.9056  694.0705  764.2221    20
>  dt_u()   690.1284  729.8723  756.5505  783.7379  897.4799    20
>  df_ss()  419.7841  436.9871  448.1717  461.7023  523.2798    20
>  df_u()  3971.1699 4044.3663 4097.9848 4168.3468 4245.8346    20
>  df_rn()  646.1497  687.3472  711.3924  724.6235  754.3166    20


Answer

Interesting. Your benchmarks piqued my interest. I find it a bit odd that you don't compare against data.table's unique.data.table directly, so here are the results with that included as well, on my system.

# extra function with which the benchmark shown below was run
dt_direct <- function() unique(dt) # where dt = as.data.table(little)

# Unit: milliseconds
#         expr       min        lq    median        uq       max neval
#       dt_u() 1472.2460 1571.0871 1664.0476 1742.5184 2647.2118    20
#       df_u() 6084.2877 6303.9058 6490.1686 6844.8767 7370.3322    20
#      dt_ss() 1340.8479 1485.4064 1552.8756 1586.6706 1810.2979    20
#      df_ss()  799.5289  835.8599  884.6501  957.2208 1251.5994    20
#      df_rn() 1410.0145 1576.2033 1660.1124 1770.2645 2442.7578    20
#  dt_direct()  452.6010  463.6116  486.5015  568.0451  670.3673    20

It's 1.8x faster than the fastest solution from all your runs.

Now, let's increase the number of unique values from 676 to about 10,000 and see what happens.

val = paste0("V", 1:100)
little <- data.frame(Var1=sample(val, 1e7, TRUE), Var2=sample(val, 1e7, TRUE))
dt <- as.data.table(little)
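
The dt_*() and df_*() functions above look up little (and dt_direct() uses dt) when they are called, so repeating the same microbenchmark call on the regenerated data is presumably what produced the timings below (a sketch):

microbenchmark(dt_u(), df_u(), dt_ss(), df_ss(), df_rn(), dt_direct(), times = 20)
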

# Unit: milliseconds
#         expr      min        lq    median        uq       max neval
#       dt_u() 1709.458 1776.3510 1892.7761 1991.6339 2562.9171    20
#       df_u() 7541.364 7735.4725 7981.3483 8462.9093 9552.8629    20
#      dt_ss() 1555.110 1627.6519 1791.5219 1911.3594 2299.2864    20
#      df_ss() 1436.355 1500.1043 1528.1319 1649.3043 1961.9945    20
#      df_rn() 2001.396 2189.5164 2393.8861 2550.2198 3047.7019    20
#  dt_direct()  508.596  525.7299  577.6982  674.2288  893.2116    20

And here it's 2.6x faster.


Note: I don't time the creation of dt here because, in real use cases, you can either use fread to get a data.table directly, use setDT to convert a data.frame to a data.table by reference, or use data.table(.) directly instead of data.frame(.) - and that isn't timed either.
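
For illustration, those alternatives look roughly like this (a sketch; the CSV file name is hypothetical):

dt <- fread("little.csv")      # read straight into a data.table (hypothetical file)
setDT(little)                  # or convert the existing data.frame by reference, no copy
dt <- data.table(Var1 = sample(val, 1e7, TRUE),   # or build it as a data.table
                 Var2 = sample(val, 1e7, TRUE))   # from the start
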

But why are dt_u and dt_ss slower, then?

Looking at the files grouped-dt.r and manip-grouped-dt.r, this is happening because of 1) copies and 2) setting keys. (1) is basically a consequence of having to do (2). If you do a summarise operation using dplyr, it's equivalent to:

DT <- copy(DT);
setkey(DT, <group_cols>)  ## these two are in grouped_dt
DT[, j, by=<group_cols>]  ## this is in summarise.grouped_dt
DT <- copy(DT)            ## because it calls grouped_dt AGAIN!
## and sets the key again - which is O(n) now, as DT is checked if sorted first..

I'm not sure why, after this discussion under Hadley's answer, the equivalent ad-hoc grouping wasn't used instead:

## equivalent ad-hoc by
DT[, j, by=<group_cols>] ## no copy, no setkey

It avoids both copies and setting the key.
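
On the example data, that ad-hoc grouping corresponds to something like (a sketch, with n added purely for illustration):

dt[, .(n = .N), by = .(Var1, Var2)]   # one grouped pass: no copy, no key is set
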

It is even worse if you mutate. It's effectively doing:

DT <- copy(DT)
setkey(DT, <group_cols>) ## these two are in grouped_dt
DT <- copy(DT)           ## mutate.grouped_dt copies copied data again
DT[, `:=`(...), by=<group_cols>] ## this is in mutate.grouped_dt
DT = copy(DT) ## because of another call to grouped_dt!!!
## and sets key again - which is O(n) now as DT is checked if sorted first..

Here again, the ad-hoc solution is simply:

DT   = copy(DT)
DT[, `:=`(...), by=group_cols]

It avoids two copies and setting the key. The only remaining copy is there to satisfy dplyr's philosophy of not modifying objects in place. So this will always be slower, and take up twice the memory, in dplyr.
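
Again on the example data, the ad-hoc mutate-style update would look roughly like this (a sketch; the column n is just for illustration):

dt[, n := .N, by = .(Var1, Var2)]   # add a per-group count by reference, single grouped pass
dt[, n := NULL]                     # drop the illustrative column again
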

Similarly, copies on some joins can be avoided, as I've commented here.

The NEWS item from dplyr v0.2 says:



  • dplyr is more careful when setting the keys of data tables, so it never accidentally modifies an object it doesn't own. It also avoids unnecessary key setting which negatively affected performance. (#193, #255).

But clearly quite a few of the discussed cases haven't made it in.

So far I've written about the performance tag under your question. That is, if you're looking for performance, you should avoid all cases that make (unnecessary) copies (and set keys), until this is fixed.

In that vein, in this particular case, the best answer I could come up with is to just call unique.data.table directly, in a dplyr-ish way:

tbl_dt(little) %>% unique(.)
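
As a quick sanity check (a sketch, not part of the original answer), the result can be compared against the summarise route:

u1 <- tbl_dt(little) %>% unique(.)
u2 <- tbl_dt(little) %>% group_by(Var1, Var2) %>% summarise(n = n()) %>% select(-(n))
nrow(u1) == nrow(u2)   # both keep exactly one row per unique (Var1, Var2) pair
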
