Use `data.table` to get first of subgroup based on a variable
Question
Consider a data set consisting of a grouping variable (here `id`) and an ordered variable (here `date`):
(df <- data.frame(
id = rep(1:2,2),
date = 4:1
))
# id date
# 1 1 4
# 2 2 3
# 3 1 2
# 4 2 1
I'm wondering what the easiest way is in `data.table` to do the equivalent of this `dplyr` code:
library(dplyr)
df %>%
group_by(id) %>%
filter(min_rank(date)==1)
# Source: local data frame [2 x 2]
# Groups: id
#
# id date
# 1 1 2
# 2 2 1
i.e., for each `id`, get the first row according to `date`.
Based on a similar Stack Overflow question (Create an "index" for each element of a group with data.table), I came up with this:
library(data.table)
dt <- data.table(df)
setkey(dt, id, date)
for(k in unique(dt$id)){
dt[id==k, index := 1:.N]
}
dt[index==1,]
But it seems like there should be a one-liner for this. Being unfamiliar with `data.table`, I thought something like this
dt[,,mult="first", by=id]
should work, but alas, it does not. The last bit of code seems like it should group by `id` and then take the first row (which, within `id`, would be determined by `date`, since I've set the keys in this way).
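One likely reason the attempt does not do what was hoped is that `mult` only takes effect when `i` performs a join; with an empty `i` there are no matches for it to choose among. A minimal sketch, assuming the same `df` as above and a key on `id` and `date`, joins on the distinct `id` values so that `mult = "first"` picks the first matching row in key order:

```r
library(data.table)

df <- data.frame(id = rep(1:2, 2), date = 4:1)
dt <- as.data.table(df)
setkey(dt, id, date)  # sorts by id, then by date within id

# Join on each distinct id; mult = "first" keeps the first match in key order
first_rows <- dt[.(unique(dt$id)), mult = "first"]
first_rows
```

This relies on the key for ordering, so the result changes if the key is set differently.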
EDIT
Thanks to Ananda Mahto, this one-liner will now be in my `data.table` repertoire:
dt[,.SD[1], by=id]
# id date
# 1: 1 2
# 2: 2 1
Recommended answer
Working directly with your source `data.frame`, you can try:
setkey(as.data.table(df), id, date)[, .SD[1], by = id]
# id date
# 1: 1 2
# 2: 2 1
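Note that `.SD[1]` returns exactly one row per group, whereas `min_rank(date) == 1` in the `dplyr` version keeps every row tied for the earliest date. If ties matter, a sketch along these lines keeps all of them and does not need a key. The tied data in `df_ties` is made up purely for illustration:

```r
library(data.table)

# Hypothetical data with a tie on the earliest date for id 1
df_ties <- data.frame(id = c(1, 1, 1, 2), date = c(2, 2, 4, 1))
dt_ties <- as.data.table(df_ties)

# Keep every row whose date equals the minimum date of its group
earliest <- dt_ties[, .SD[date == min(date)], by = id]
earliest
```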
Extending your original idea, you can just do:
dt <- data.table(df)
setkey(dt, id, date)
dt[, index := sequence(.N), by = id][index == 1]
# id date index
# 1: 1 2 1
# 2: 2 1 1
It might be that at a certain scale, David is correct about `head` vs `[1]`, but I'm not sure what scale that would be.
set.seed(1)
nrow <- 10000
ncol <- 20
df <- data.frame(matrix(sample(10, nrow * ncol, TRUE), nrow = nrow, ncol = ncol))
fun1 <- function() setkey(as.data.table(df), X1, X2)[, head(.SD, 1), by = X1]
fun2 <- function() setkey(as.data.table(df), X1, X2)[, .SD[1], by = X1]
library(microbenchmark)
microbenchmark(fun1(), fun2())
# Unit: milliseconds
# expr min lq mean median uq max neval
# fun1() 12.178189 12.496777 13.400905 12.808523 13.483545 30.28425 100
# fun2() 4.474345 4.554527 4.948255 4.620596 4.965912 8.17852 100
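If per-group subsetting of `.SD` becomes a bottleneck, another commonly used idiom is to collect the row numbers with `.I` and subset the table once. The sketch below assumes the small example `df` from the question rather than the benchmark data, and it has not been timed here, so any speed difference is something to measure on your own data:

```r
library(data.table)

df <- data.frame(id = rep(1:2, 2), date = 4:1)
dt <- as.data.table(df)
setkey(dt, id, date)  # sorts by id, then by date within id

# .I holds row numbers in dt; .I[1] is the first row of each group
idx <- dt[, .I[1], by = id]$V1
dt[idx]
```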