Use `data.table` to get first of subgroup based on a variable

Question

Consider a data set consisting of a grouping variable (here id) and an ordered variable (here date):

(df <- data.frame(
  id = rep(1:2,2),
  date = 4:1
))
#   id date
# 1  1    4
# 2  2    3
# 3  1    2
# 4  2    1

I'm wondering what the easiest way is to do the equivalent of this dplyr code in data.table:

library(dplyr)
df %>%
  group_by(id) %>%
  filter(min_rank(date)==1)
# Source: local data frame [2 x 2]
# Groups: id
# 
#   id date
# 1  1    2
# 2  2    1

That is, for each id, get the first row according to date.

Based on a similar Stack Overflow question (Create an "index" for each element of a group with data.table), I came up with this:

library(data.table)
dt <- data.table(df)
setkey(dt, id, date)
for(k in unique(dt$id)){
  dt[id==k, index := 1:.N]
}
dt[index==1,]

But it seems like there should be a one-liner for this. Being unfamiliar with data.table, I thought something like this

dt[,,mult="first", by=id]

should work, but alas! It seems like the last bit of code should group by id and then take the first row, which within id would be determined by date since I've set the keys that way.
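
As far as I understand the data.table documentation, mult only takes effect when i performs a join, which may be why the attempt above does not do what was hoped. A minimal sketch of a join-based variant follows, assuming the same example data and key as above; the name first_rows is only illustrative and not part of the original post.

```r
library(data.table)

# Rebuild the example table from the question and key it by id, then date.
dt <- data.table(id = rep(1:2, 2), date = 4:1)
setkey(dt, id, date)

# Join the keyed table against its own unique id values. For each id,
# mult = "first" keeps only the first matching row, which is the earliest
# date because the key sorts rows by date within id.
first_rows <- dt[.(unique(dt$id)), mult = "first"]
first_rows
# expected: one row per id, with date 2 for id 1 and date 1 for id 2
```

This relies on the key order, so it only returns the earliest date if setkey(dt, id, date) has been called first.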

EDIT

Thanks to Ananda Mahto, this one-liner is now in my data.table repertoire:

dt[,.SD[1], by=id]
#    id date
# 1:  1    2
# 2:  2    1


Recommended answer

Working directly with your source data.frame, you can try:

setkey(as.data.table(df), id, date)[, .SD[1], by = id]
#    id date
# 1:  1    2
# 2:  2    1
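
Two other ways to get the same rows are sketched below, as possibilities rather than as part of the original answer: ordering inside i instead of setting a key, and calling unique() with by = "id" on the keyed table. The names by_order and by_unique are made up for this illustration.

```r
library(data.table)

df <- data.frame(id = rep(1:2, 2), date = 4:1)

# Variant 1: sort in i, then take the first row of each group (no key is set).
by_order <- as.data.table(df)[order(id, date), .SD[1], by = id]

# Variant 2: key by id and date, then keep only the first row seen per id.
by_unique <- unique(setkey(as.data.table(df), id, date), by = "id")

by_order
by_unique
# expected for both: date 2 for id 1 and date 1 for id 2
```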

Extending your original idea, you can just do:

dt <- data.table(df)
setkey(dt, id, date)
dt[, index := sequence(.N), by = id][index == 1]
#    id date index
# 1:  1    2     1
# 2:  2    1     1

It might be that at a certain scale, David is correct about head(.SD, 1) vs .SD[1], but I'm not sure what scale that would be. On the test data below, .SD[1] is the faster of the two:

set.seed(1)
nrow <- 10000
ncol <- 20

df <- data.frame(matrix(sample(10, nrow * ncol, TRUE), nrow = nrow, ncol = ncol))

fun1 <- function() setkey(as.data.table(df), X1, X2)[, head(.SD, 1), by = X1]
fun2 <- function() setkey(as.data.table(df), X1, X2)[, .SD[1], by = X1]

library(microbenchmark)
microbenchmark(fun1(), fun2())
# Unit: milliseconds
#    expr       min        lq      mean    median        uq      max neval
#  fun1() 12.178189 12.496777 13.400905 12.808523 13.483545 30.28425   100
#  fun2()  4.474345  4.554527  4.948255  4.620596  4.965912  8.17852   100
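
A timing comparison is only meaningful if both functions return the same rows. A quick check along these lines could be run first; it rebuilds the same test data, with n_rows and n_cols as illustrative names chosen to avoid masking base R's nrow() and ncol().

```r
library(data.table)

set.seed(1)
n_rows <- 10000
n_cols <- 20
df <- data.frame(matrix(sample(10, n_rows * n_cols, TRUE),
                        nrow = n_rows, ncol = n_cols))

fun1 <- function() setkey(as.data.table(df), X1, X2)[, head(.SD, 1), by = X1]
fun2 <- function() setkey(as.data.table(df), X1, X2)[, .SD[1], by = X1]

res1 <- fun1()
res2 <- fun2()

# Both approaches should give one row per distinct X1, with equal contents.
same_result <- isTRUE(all.equal(res1, res2))
same_result
```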
