Filter dataframe by maximum values in each group


Question

I have a 180,000 x 400 dataframe where the rows correspond to users but every user has exactly two rows.

id   date  ...
1    2012    ...
3    2010    ...
2    2013    ...
2    2014    ...
1    2011    ...
3    2014    ...

I want to subset the data so that only the most recent row for each user is retained (i.e. the row with the highest value for date for each id).

I first tried looping over the ids with which() and an ifelse() statement inside sapply(), which was painfully slow (O(n^2), I believe).

Then I tried sorting the df by id, looping through in increments of two, and comparing adjacent dates, but this was also slow (I guess because loops in R are hopeless). The comparison of the two dates is the bottleneck, as the sort was pretty much instant.

Is there a way to vectorize the comparison?
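As a rough illustration of what a vectorized version of the sort-and-compare idea could look like, here is a minimal base R sketch. It assumes, as stated above, that every id has exactly two rows; the toy data and the names sorted, first, second and keep are made up for the example.

```r
# Toy data with the same shape as the question: exactly two rows per id.
df <- data.frame(
  id   = c(1, 3, 2, 2, 1, 3),
  date = c(2012, 2010, 2013, 2014, 2011, 2014)
)

# Sort by id so that each user's two rows sit next to each other.
sorted <- df[order(df$id), ]

# Positions of the first and second row of every pair.
first  <- seq(1, nrow(sorted), by = 2)
second <- first + 1

# One vectorized comparison across all pairs instead of a loop.
keep <- ifelse(sorted$date[first] >= sorted$date[second], first, second)

latest <- sorted[keep, ]
```

This only works because the pairs are guaranteed; with a varying number of rows per id, the positions in first and second would no longer line up.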

The solution from "Remove duplicates keeping entry with largest absolute value":

aa <- df[order(df$id, -df$date), ] #sort by id and reverse of date
aa[!duplicated(aa$id),]

Runs very quickly!!
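One caveat with the snippet above: negating the column with -df$date relies on date being numeric. If the column were stored as a Date (an assumption, since the question only shows years), unary minus would fail. A possible workaround is to negate xtfrm(), which returns a numeric sort key. A small sketch on made-up data:

```r
# Toy data where date is a Date column rather than a number (assumed for illustration).
df <- data.frame(
  id   = c(1, 3, 2, 2, 1, 3),
  date = as.Date(c("2012-01-01", "2010-01-01", "2013-01-01",
                   "2014-01-01", "2011-01-01", "2014-01-01"))
)

# xtfrm() turns the dates into a numeric sort key, which can be negated
# to sort by date descending within each id.
aa <- df[order(df$id, -xtfrm(df$date)), ]

# The first row of each id is now its most recent one.
latest <- aa[!duplicated(aa$id), ]
```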

Recommended answer

Here's a simple and fast approach using the data.table package

library(data.table)
setDT(df)[, .SD[which.max(date)], id]
#    id date
# 1:  1 2012
# 2:  3 2014
# 3:  2 2014

Or, with a key set on id first (because of by)

setkey(setDT(df), id)[, .SD[which.max(date)], id]

Or via data.table (either of the following)

unique(setorder(setDT(df), id, -date), by = "id")

setorder(setDT(df), id, -date)[!duplicated(id)]

Or a base R solution

with(df, tapply(date, id, function(x) x[which.max(x)]))
##    1    2    3 
## 2012 2014 2014 
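Note that the tapply() call returns only the maximum date per id, not the full rows. If the other columns are needed, one possible base R variant is to compare each date against its group maximum computed with ave(). This is a sketch on toy data; the value column is hypothetical and stands in for the remaining columns.

```r
# Toy data; `value` is a made-up stand-in for the other ~400 columns.
df <- data.frame(
  id    = c(1, 3, 2, 2, 1, 3),
  date  = c(2012, 2010, 2013, 2014, 2011, 2014),
  value = c(10, 20, 30, 40, 50, 60)
)

# ave() returns, for every row, the maximum date within that row's id group.
group_max <- ave(df$date, df$id, FUN = max)

# Keep the rows whose date equals their group's maximum; all columns survive.
latest <- df[df$date == group_max, ]
```

The rows come back in their original order rather than sorted by id.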

Another way

library(dplyr)
df %>%
  group_by(id) %>%
  filter(date == max(date)) # Will keep all existing columns but allow multiple rows in case of ties
# Source: local data table [3 x 2]
# Groups: id
# 
#   id date
# 1  1 2012
# 2  2 2014
# 3  3 2014

df %>%
  group_by(id) %>%
  slice(which.max(date)) # Will keep all columns but won't return multiple rows in case of ties

df %>%
  group_by(id) %>%
  summarise(max(date)) # Will remove all other columns and won't return multiple rows in case of ties
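The tie behaviour described in the comments above can be checked on a small table. The sketch below mimics the two behaviours in base R rather than dplyr so that it runs without extra packages; the data and the names all_max and one_max are invented for the example.

```r
# Toy data where id 1 has two rows sharing the same maximum date.
df <- data.frame(
  id   = c(1, 1, 2, 2),
  date = c(2012, 2012, 2013, 2014)
)

# Equality with the group maximum keeps every tied row
# (analogous to filter(date == max(date))).
all_max <- df[df$date == ave(df$date, df$id, FUN = max), ]

# which.max() returns only the first maximum, so each id yields one row
# (analogous to slice(which.max(date))).
rows_by_id <- split(seq_len(nrow(df)), df$id)
first_max  <- unname(unlist(lapply(
  rows_by_id,
  function(rows) rows[which.max(df$date[rows])]
)))
one_max <- df[first_max, ]
```

Here all_max has three rows (both tied rows for id 1 plus one for id 2), while one_max has two.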
