Extract rows with highest and lowest values from a data frame
Question
I'm quite new to R and use it mainly for visualising statistics with the ggplot2 library. I have now run into a problem with data preparation.
I need to write a function that removes a given number of rows (2, 5 or 10) with the highest and lowest values in a specified column from a data frame, puts them into another data frame, and does this for each combination of two factors (in my case, for each day and server).
So far I have done the following steps (MWE using the esoph example dataset).
I have sorted the data frame by the desired column (ncontrols in this example):
esoph <- esoph[with(esoph, order(-ncontrols)), ]
I can display the first and last records for each factor value (in this example, for each age range):
by(data = esoph, INDICES = esoph$agegp, FUN = head, 3)
by(data = esoph, INDICES = esoph$agegp, FUN = tail, 3)
So basically I can see the highest and lowest values, but I don't know how to extract them into another data frame or how to remove them from the main one.
Also, the example above shows the top and bottom records for each value of one factor (age range), but in reality I need the highest and lowest records for each combination of two factors. In this example they could be agegp and alcgp.
I am not even sure whether the steps above are the right approach. Perhaps plyr would work better? I'd appreciate any hints.
Recommended answer
Yes, you can use plyr as follows:
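library(plyr)

# Random sample data standing in for the real data set
esoph <- data.frame(agegp = sample(letters[1:2], 20, replace = TRUE),
                    alcgp = sample(LETTERS[1:2], 20, replace = TRUE),
                    ncontrols = runif(20))

# Keep only the rows with the lowest and highest ncontrols in each group
ddply(esoph, c("agegp", "alcgp"),
      function(x) {
        idx <- c(which.min(x$ncontrols), which.max(x$ncontrols))
        x[idx, , drop = FALSE]
      })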
# agegp alcgp ncontrols
# 1 a A 0.03091483
# 2 a A 0.88529790
# 3 a B 0.51265447
# 4 a B 0.86111649
# 5 b A 0.28372232
# 6 b A 0.61698401
# 7 b B 0.05618841
# 8 b B 0.89346943
# Drop those same rows, keeping the remainder of each group
ddply(esoph, c("agegp", "alcgp"),
      function(x) {
        idx <- c(which.min(x$ncontrols), which.max(x$ncontrols))
        x[-idx, , drop = FALSE]
      })
# agegp alcgp ncontrols
# 1 a A 0.3745029
# 2 a B 0.7621474
# 3 a B 0.6319013
# 4 b A 0.3055078
# 5 b A 0.5146028
# 6 b B 0.3735615
# 7 b B 0.2528612
# 8 b B 0.4415205
# 9 b B 0.6868219
# 10 b B 0.3750102
# 11 b B 0.2279462
# 12 b B 0.1891052
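The question asks for several rows (2, 5 or 10) at each end, whereas which.min and which.max return a single position each. A minimal sketch of one way to generalise the same ddply pattern is shown here; the helper extreme_idx, the sample data and the variable names are illustrative assumptions, not part of plyr or of the original answer.

```r
library(plyr)

# Hypothetical helper (not part of plyr): positions of the n smallest and
# n largest values of v, without duplicates when the two sets overlap.
extreme_idx <- function(v, n) {
  ord <- order(v)
  unique(c(head(ord, n), tail(ord, n)))
}

# Made-up sample data, seeded so the run is repeatable
set.seed(1)
df <- data.frame(agegp = sample(letters[1:2], 40, replace = TRUE),
                 alcgp = sample(LETTERS[1:2], 40, replace = TRUE),
                 ncontrols = runif(40))

n <- 2  # rows to take from each end of every group

# Rows with the n lowest and n highest ncontrols per agegp/alcgp combination
extremes <- ddply(df, c("agegp", "alcgp"), function(x) {
  x[extreme_idx(x$ncontrols, n), , drop = FALSE]
})

# Everything else; idx is never empty here because each group has >= 1 row
remainder <- ddply(df, c("agegp", "alcgp"), function(x) {
  idx <- extreme_idx(x$ncontrols, n)
  x[-idx, , drop = FALSE]
})
```

Groups with 2 * n rows or fewer end up entirely in extremes and contribute nothing to remainder, which may or may not be what you want for small groups.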
There are probably many alternatives, e.g. using head and tail if your data is already sorted, but this should work.
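The head and tail alternative can be sketched in base R without plyr. This is only one possible arrangement: the row_id column and the names pieces, extremes and remainder are assumptions introduced for illustration, and n is set to 1 because each agegp/alcgp combination in esoph has only a handful of rows.

```r
# esoph ships with R's datasets package; sort it by the column of interest
df <- esoph[order(esoph$ncontrols), ]
df$row_id <- seq_len(nrow(df))  # hypothetical helper column to track rows

n <- 1  # rows to take from each end of every group

# One sorted piece per agegp/alcgp combination that actually occurs
pieces <- split(df, list(df$agegp, df$alcgp), drop = TRUE)

# head() gives the lowest n rows, tail() the highest n; unique() removes
# rows picked twice when a group has fewer than 2 * n rows
extremes <- do.call(rbind, lapply(pieces, function(x) {
  unique(rbind(head(x, n), tail(x, n)))
}))

# The main data frame with the extreme rows removed
remainder <- df[!df$row_id %in% extremes$row_id, ]
```

Tracking rows through row_id keeps the removal step independent of row names, which do.call(rbind, ...) rewrites with the group labels.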