Filter dataframe by maximum values in each group
Problem description
I have a 180,000 x 400 dataframe where the rows correspond to users, and every user has exactly two rows.
id date ...
1 2012 ...
3 2010 ...
2 2013 ...
2 2014 ...
1 2011 ...
3 2014 ...
I want to subset the data so that only the most recent row for each user is retained (i.e. the row with the highest value of date for each id).
I first tried using which() while looping over the ids with an ifelse() statement in sapply(), which was painfully slow (O(n^2), I believe).
Then I tried sorting the df by id, looping through it in increments of two, and comparing adjacent dates, but this was also slow (I guess because explicit loops in R are slow). The comparison of the two dates is the bottleneck, as the sort was pretty much instant.
Is there a way to vectorize the comparison?
aa <- df[order(df$id, -df$date), ] #sort by id and reverse of date
aa[!duplicated(aa$id),]
This runs very quickly!
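To see why the sort-and-deduplicate approach above works, here is a minimal reproducible sketch. The toy data frame mirrors the six rows shown in the question and is an assumption for illustration; the real data has many more rows and columns.

```r
# Toy data mirroring the example in the question (illustrative only).
df <- data.frame(
  id   = c(1, 3, 2, 2, 1, 3),
  date = c(2012, 2010, 2013, 2014, 2011, 2014)
)

# Sort by id, then by date descending, so each user's newest row comes first.
aa <- df[order(df$id, -df$date), ]

# duplicated() flags every repeat of an id after its first occurrence,
# so negating it keeps exactly one row per id: the most recent one.
latest <- aa[!duplicated(aa$id), ]
print(latest)
```

Both order() and duplicated() are vectorized, so no explicit loop over users is needed. Note that negating the date only works when date is numeric.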
Recommended answer
Here's a simple and fast approach using the data.table package:
library(data.table)
setDT(df)[, .SD[which.max(date)], id]
# id date
# 1: 1 2012
# 2: 3 2014
# 3: 2 2014
Or, setting a key on id first (grouping by a keyed column may be a bit faster):
setkey(setDT(df), id)[, .SD[which.max(date)], id]
Or, using the sort-and-deduplicate idea with the data.table package:
unique(setorder(setDT(df), id, -date), by = "id")
Or
setorder(setDT(df), id, -date)[!duplicated(id)]
Or a base R solution (note that this returns only the maximum date for each id, not the full rows):
with(df, tapply(date, id, function(x) x[which.max(x)]))
## 1 2 3
## 2012 2014 2014
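If the full rows are needed in base R rather than just the maximum dates, one possible sketch uses ave() to broadcast each group's maximum back to every row and then compares in a single vectorized step. The score column here is hypothetical and stands in for the other columns of the real data.

```r
# Toy data; `score` is a hypothetical extra column standing in for the
# remaining columns of the real data frame.
df <- data.frame(
  id    = c(1, 3, 2, 2, 1, 3),
  date  = c(2012, 2010, 2013, 2014, 2011, 2014),
  score = c(10, 20, 30, 40, 50, 60)
)

# ave() returns, for every row, the maximum date within that row's id group.
group_max <- ave(df$date, df$id, FUN = max)

# Vectorized comparison: keep the rows whose date equals their group's maximum.
latest <- df[df$date == group_max, ]
print(latest)
```

This keeps all columns and preserves the original row order. If two rows of the same id share the maximum date, both are kept.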
Another way, using dplyr:
library(dplyr)
df %>%
group_by(id) %>%
filter(date == max(date)) # Will keep all existing columns but allow multiple rows in case of ties
# Source: local data table [3 x 2]
# Groups: id
#
# id date
# 1 1 2012
# 2 2 2014
# 3 3 2014
Or
df %>%
group_by(id) %>%
slice(which.max(date)) # Will keep all columns but won't return multiple rows in case of ties
Or
df %>%
group_by(id) %>%
summarise(max(date)) # Will remove all other columns and won't return multiple rows in case of ties
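The different behavior on ties noted in the comments above comes from how the maximum is located. A small base R sketch, using a made-up vector of dates for a single user with a tie:

```r
# Hypothetical dates for a single user, with a tie for the maximum.
x <- c(2013, 2014, 2014)

# which.max() returns only the position of the first maximum,
# which is why slicing with it yields at most one row per group.
first_max <- which.max(x)

# Comparing against max() marks every tied position,
# which is why filtering with == can return several rows per group.
all_max <- which(x == max(x))

print(first_max)
print(all_max)
```

With exactly two rows per user and distinct dates, as in the question, both variants return the same single row per id.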