data.table - select first n rows within group
Question
As simple as it is, I don't know a data.table solution to select the first n rows within each group of a data table. Can you please help me out?
Recommended answer
As an alternative:
dt[, .SD[1:3], cyl]
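For context, here is a minimal sketch of the three variants that this answer goes on to benchmark. The original post does not show how dt was created; the grouping column cyl suggests the built-in mtcars data, so that is assumed here.

```r
library(data.table)

# Assumption: dt is mtcars as a data.table (not shown in the original post)
dt <- as.data.table(mtcars)

# Variant 1: subset .SD (the per-group subset of the data) by row position
first_sd <- dt[, .SD[1:3], by = cyl]

# Variant 2: head() applied to .SD
first_head <- dt[, head(.SD, 3), by = cyl]

# Variant 3: collect the row numbers (.I) per group, then subset dt once
first_i <- dt[dt[, .I[1:3], by = cyl]$V1]

# Variants 1 and 2 put the grouping column first; variant 3 keeps the
# original column order, so align the columns before comparing
setcolorder(first_i, names(first_sd))
all.equal(first_sd, first_head)
all.equal(first_sd, first_i)
```

Every cyl group in mtcars has at least three rows, so the three results should contain the same rows here. When a group has fewer than three rows, the 1:3 index pads that group with NA rows, whereas head() returns only the rows that exist.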
On the example dataset, the head method is on par with the .I method of @eddi in terms of speed. Comparing the three with the microbenchmark package:
microbenchmark(head = dt[, head(.SD, 3), cyl],
SD = dt[, .SD[1:3], cyl],
I = dt[dt[, .I[1:3], cyl]$V1],
times = 10, unit = "relative")
Result:
Unit: relative
expr min lq mean median uq max neval cld
head 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 10 a
SD 2.156562 2.319538 2.306065 2.365190 2.318540 2.1908401 10 b
I 1.001810 1.029511 1.007371 1.018514 1.016583 0.9442973 10 a
However, data.table is specifically designed for large datasets. So, running this comparison again:
# creating a dataset of 30 million rows (1e7 sampled rows per cyl group)
largeDT <- dt[,.SD[sample(.N, 1e7, replace = TRUE)], cyl]
# running the benchmark on the large dataset
microbenchmark(head = largeDT[, head(.SD, 3), cyl],
SD = largeDT[, .SD[1:3], cyl],
I = largeDT[largeDT[, .I[1:3], cyl]$V1],
times = 10, unit = "relative")
Result:
Unit: relative
expr min lq mean median uq max neval cld
head 2.279753 2.194702 2.221330 2.177774 2.276986 2.33876 10 b
SD 2.060959 2.187486 2.312009 2.236548 2.568240 2.55462 10 b
I 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 10 a
Now the .I method is clearly the fastest one.
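Since the .I method comes out fastest on large data, it can be wrapped for an arbitrary n. The following is only a sketch: first_n_by and its arguments are hypothetical names, not part of data.table, and dt is again assumed to be mtcars as a data.table.

```r
library(data.table)

# Hypothetical helper (not a data.table function): first n rows per group,
# using the row-index (.I) approach
first_n_by <- function(DT, n, by_cols) {
  # seq_len(min(.N, n)) caps the index at the group size, so groups with
  # fewer than n rows do not produce NA rows.
  # Note: this assumes DT has no column named "n", which would shadow the argument.
  idx <- DT[, .I[seq_len(min(.N, n))], by = by_cols]$V1
  DT[idx]
}

# Assumption: dt is mtcars as a data.table (not shown in the original post)
dt <- as.data.table(mtcars)
first_n_by(dt, 3, "cyl")
```

The index is capped at .N because a bare .I[1:n] would yield NA indices for groups smaller than n.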
Update, 12 February 2016:
With the most recent development version of the data.table package, the .I method still wins. Whether the .SD method or the head() method is faster seems to depend on the size of the dataset. Now the benchmark gives:
Unit: relative
expr min lq mean median uq max neval cld
head 2.093240 3.166974 3.473216 3.771612 4.136458 3.052213 10 b
SD 1.840916 1.939864 2.658159 2.786055 3.112038 3.411113 10 b
I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a
However, with a somewhat smaller dataset (but still quite big), the ranking changes:
largeDT2 <- dt[,.SD[sample(.N, 1e6, replace = TRUE)], cyl]
The benchmark is now slightly in favor of the head method over the .SD method:
Unit: relative
expr min lq mean median uq max neval cld
head 1.808732 1.917790 2.087754 1.902117 2.340030 2.441812 10 b
SD 1.923151 1.937828 2.150168 2.040428 2.413649 2.436297 10 b
I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a