Select rows with min value by group
Problem description
I've got a problem that has been bugging me for some time… hopefully somebody here can help me.
I have the following data frame:
f <- c('a','a','b','b','b','c','d','d','d','d')
v1 <- c(1.3,10,2,10,10,1.1,10,3.1,10,10)
v2 <- c(1:10)
df <- data.frame(f,v1,v2)
f is a factor; v1 and v2 are values. For each level of f, I want to keep only one row: the one that has the lowest value of v1 within that factor level.
f v1 v2
a 1.3 1
b 2 3
c 1.1 6
d 3.1 8
I tried various things with aggregate, ddply, by, tapply… but nothing seems to work. I would be very thankful for any suggestions.
Recommended answer
Using DWin's solution, tapply can be avoided by using ave:
df[ df$v1 == ave(df$v1, df$f, FUN=min), ]
This gives another speed-up, as shown below. Mind you, this also depends on the number of levels. I mention it because I notice that ave is far too often forgotten, although it is one of the more powerful functions in R.
f <- rep(letters[1:20],10000)
v1 <- rnorm(20*10000)
v2 <- 1:(20*10000)
df <- data.frame(f,v1,v2)
> system.time(df[ df$v1 == ave(df$v1, df$f, FUN=min), ])
user system elapsed
0.05 0.00 0.05
> system.time(df[ df$v1 %in% tapply(df$v1, df$f, min), ])
user system elapsed
0.25 0.03 0.29
> system.time(lapply(split(df, df$f), FUN = function(x) {
+ vec <- which(x[3] == min(x[3]))
+ return(x[vec, ])
+ })
+ .... [TRUNCATED]
user system elapsed
0.56 0.00 0.58
> system.time(df[tapply(1:nrow(df),df$f,function(i) i[which.min(df$v1[i])]),]
+ )
user system elapsed
0.17 0.00 0.19
> system.time( ddply(df, .var = "f", .fun = function(x) {
+ return(subset(x, v1 %in% min(v1)))
+ }
+ )
+ )
user system elapsed
0.28 0.00 0.28
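To see why the ave approach works, here is a minimal sketch on the small data frame from the question. The names group_min and result are only illustrative and do not appear in the original answer. The idea is that ave returns a vector of the same length as v1, holding each row's group minimum, so a row-by-row comparison picks out the rows that attain it.

```r
# Data frame from the question
f  <- c('a','a','b','b','b','c','d','d','d','d')
v1 <- c(1.3,10,2,10,10,1.1,10,3.1,10,10)
v2 <- 1:10
df <- data.frame(f, v1, v2)

# ave() applies FUN within each level of f and returns a vector
# aligned with the rows of df (one group minimum per row)
group_min <- ave(df$v1, df$f, FUN = min)
print(group_min)

# Keep the rows whose v1 equals the minimum of their own group
# (illustrative names; same expression as in the answer, split in two steps)
result <- df[df$v1 == group_min, ]
print(result)
```

Note that this comparison keeps every row that ties for the minimum within its group, so a group can contribute more than one row. It also compares each row only against its own group's minimum, whereas the `%in% tapply(...)` variant tests membership in the set of all group minima.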