data.table - 选择组内的前 n 行 [英] data.table - select first n rows within group

查看:19
本文介绍了data.table - 选择组内的前 n 行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

虽然很简单,但我不知道一个 data.table 解决方案来选择数据表中组中的前 n 行.你能帮帮我吗?

As simple as it is, I don't know a data.table solution to select the first n rows in groups in a data table. Can you please help me out?

推荐答案

作为替代方案:

dt[, .SD[1:3], cyl]

当您查看示例数据集的速度时,head 方法与 .I @eddi 的方法.与 microbenchmark 包比较:

When you look at speed on the example dataset, the head method is on par with the .I method of @eddi. Comparing with the microbenchmark package:

microbenchmark(head = dt[, head(.SD, 3), cyl],
               SD = dt[, .SD[1:3], cyl], 
               I = dt[dt[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

结果:

Unit: relative
 expr      min       lq     mean   median       uq       max neval cld
 head 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000    10  a 
   SD 2.156562 2.319538 2.306065 2.365190 2.318540 2.1908401    10   b
    I 1.001810 1.029511 1.007371 1.018514 1.016583 0.9442973    10  a 

但是,data.table 是专门为大型数据集设计的.所以,再次运行这个比较:

However, data.table is specifically designed for large datasets. So, running this comparison again:

# creating a 30 million dataset
largeDT <- dt[,.SD[sample(.N, 1e7, replace = TRUE)], cyl]
# running the benchmark on the large dataset
microbenchmark(head = largeDT[, head(.SD, 3), cyl],
               SD = largeDT[, .SD[1:3], cyl], 
               I = largeDT[largeDT[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

结果:

Unit: relative
 expr      min       lq     mean   median       uq     max neval cld
 head 2.279753 2.194702 2.221330 2.177774 2.276986 2.33876    10   b
   SD 2.060959 2.187486 2.312009 2.236548 2.568240 2.55462    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000    10  a 

现在 .I 方法显然是最快的.

Now the .I method is clearly the fastest one.

2016 年 2 月 12 日更新:

使用 data.table 包的最新开发版本,.I 方法仍然胜出..SD 方法或 head() 方法是否更快似乎取决于数据集的大小.现在基准给出:

With the most recent development version of the data.table package, the .I method still wins. Whether the .SD method or the head() method is faster seems to depend on the size of the dataset. Now the benchmark gives:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 2.093240 3.166974 3.473216 3.771612 4.136458 3.052213    10   b
   SD 1.840916 1.939864 2.658159 2.786055 3.112038 3.411113    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a 

但是,如果数据集稍微小一些(但仍然很大),几率会发生变化:

However with a somewhat smaller dataset (but still quite big), the odds change:

largeDT2 <- dt[,.SD[sample(.N, 1e6, replace = TRUE)], cyl]

基准测试现在稍微支持 head 方法而不是 .SD 方法:

the benchmark is now slightly in favor of the head method over the .SD method:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 1.808732 1.917790 2.087754 1.902117 2.340030 2.441812    10   b
   SD 1.923151 1.937828 2.150168 2.040428 2.413649 2.436297    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a 

这篇关于data.table - 选择组内的前 n 行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆