Selecting and identifying a subset of elements based on criteria

Problem description

I would like to select a subset of elements from a whole that satisfy certain conditions. There are about 20 elements, each having multiple attributes. I would like to select the five elements that show the least discrepancy from a fixed criterion on one attribute and offer the highest average value on another attribute.

Lastly, I would like to apply the function over multiple sets of 20 elements.

Thus far, I have been able to identify the subsets "by hand," but I'd like to be able to return the index of the values in addition to returning the values themselves.

Objectives:

  1. I would like to find the set of five values for X1 that are the least discrepant from a fixed value (55), and provide the largest value for the average of X2.

  2. I would like to do this for multiple sets.


#####  generating example data
#####  this has five groups, each with two variables x1 and x2
set.seed(271828)

grp <- gl(5,20)
x1 <- round(rnorm(100,45, 12), digits=0)
x2 <- round(rbeta(100,2,4), digits = 2)
id <- seq(1,100,1)

#####  this is how the data would arrive for me to analyze
dat <- as.data.frame(cbind(id,grp,x1,x2))


The data would arrive in this format, with id as a unique identifier for each element.


#####  pulling out the first group for demonstration
dat.grp.1 <- dat[ which(grp == 1), ]

crit <- 55
x <- t(combn(dat.grp.1$x1, 5))
y <- t(combn(dat.grp.1$x2, 5))

mean.x <- rowMeans(x)
mean.y <- rowMeans(y)
k <- (mean.x - crit)^2

out <- cbind(x, mean.x, k, y, mean.y)

#####  finding the sets with the least amount of discrepancy
pick <- out[ which(k == min(k)), ]
pick

#####  finding the sets with low discrepancy and high values of y (means of X2) by "hand"
sorted <- out[order(k), ]
head(sorted, n=20)


With respect to the values in pick, I can see that the values of X1 are:

> pick
                    mean.x  k                          mean.y
[1,] 55 47 48 48 52     50 25 0.62 0.08 0.31 0.18 0.54  0.346
[2,] 55 48 48 47 52     50 25 0.62 0.31 0.18 0.48 0.54  0.426

I would like to return the id values for these elements, so that I know that I picked elements 3, 8, 10, 11, and 18 (choosing set 2, since the discrepancy k is the same but the mean of y is higher).

> dat.grp.1 
    id grp x1   x2
 1   1   1 45 0.12
 2   2   1 27 0.34
 3   3   1 55 0.62
 4   4   1 39 0.32
 5   5   1 41 0.18
 6   6   1 29 0.47
 7   7   1 47 0.08
 8   8   1 48 0.31
 9   9   1 35 0.48
10  10   1 48 0.18
11  11   1 47 0.48
12  12   1 31 0.29
13  13   1 39 0.15
14  14   1 36 0.54
15  15   1 36 0.20
16  16   1 38 0.40
17  17   1 30 0.31
18  18   1 52 0.54
19  19   1 44 0.37
20  20   1 31 0.20

Doing this "by hand" works for now, but it would be good to make this as "hands-off" as possible.

Any help is greatly appreciated.

Solution

You are almost there. You can change your definition of sorted to

sorted <- out[order(k, -mean.y), ]

And then sorted[1,] (or if you prefer sorted[1,,drop=FALSE]) is your selected set.
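
As a small illustration of how order(k, -mean.y) breaks ties, here is a sketch on made-up values (the numbers are hypothetical and not taken from the data above). Rows are sorted by ascending k first, and rows with equal k are then sorted by descending mean.y.

```r
# Hypothetical discrepancies and second-attribute means for five candidate sets.
k      <- c(25, 0, 25, 25, 0)
mean.y <- c(0.35, 0.20, 0.43, 0.10, 0.31)

# Ascending k; within equal k, descending mean.y (via the negated key).
ord <- order(k, -mean.y)
ord

# Equivalent spelling without negation (needs method = "radix", R >= 3.3).
ord2 <- order(k, mean.y, decreasing = c(FALSE, TRUE), method = "radix")

# The first entry is the position of the preferred set.
best <- ord[1]
```

Negating the key only works for numeric columns; the radix form may be handy if the tie-breaker is not numeric.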

If you want the indexes rather than/in addition to the points, then you can include that earlier. Replace:

x <- t(combn(dat.grp.1$x1, 5))
y <- t(combn(dat.grp.1$x2, 5))

with

idx <- t(combn(1:nrow(dat.grp.1), 5))
x <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x1"]}))
y <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x2"]}))

and include idx in out later.

Putting it all together:

#####  pulling out the first group for demonstration
dat.grp.1 <- dat[ which(grp == 1), ]

crit <- 55
idx <- t(combn(1:nrow(dat.grp.1), 5))
x <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x1"]}))
y <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x2"]}))

mean.x <- rowMeans(x)
mean.y <- rowMeans(y)
k <- (mean.x - crit)^2

out <- cbind(idx, x, mean.x, k, y, mean.y)

#####  finding the sets with the least amount of discrepancy and among
##### those the largest second mean
pick <- out[order(k, -mean.y)[1],,drop=FALSE]
pick

which gives

                                 mean.x  k                          mean.y
[1,] 3 8 10 11 18 55 48 48 47 52     50 25 0.62 0.31 0.18 0.48 0.54  0.426

EDIT: a description of applying over idx was requested; I want more options than what I can do in a comment, so I'm adding it to my answer. I will also address looping over subsets.

idx is a matrix (15504 x 5, since choose(20, 5) is 15504), each row of which is a set of five row indexes into the data frame. apply goes through it row by row (margin 1) and does something with each row. That something is to take the five values, use them to index the desired rows of dat.grp.1, and pull out the corresponding x1 values. I could have written dat.grp.1[i,"x1"] as dat.grp.1$x1[i]. apply returns the result for each row of idx as a column, so the whole thing needs to be transposed to get one row per set.

You can break the loop apart to see how each step works if you like. Make the function into a non-anonymous function.

f <- function(i) {dat.grp.1[i,"x1"]}

and pass one row of idx at a time to it.

> f(idx[1,])
[1] 45 27 55 39 41
> f(idx[2,])
[1] 45 27 55 39 29
> f(idx[3,])
[1] 45 27 55 39 47
> f(idx[4,])
[1] 45 27 55 39 48

These are what get bundled into x:

> head(x,4)
     [,1] [,2] [,3] [,4] [,5]
[1,]   45   27   55   39   41
[2,]   45   27   55   39   29
[3,]   45   27   55   39   47
[4,]   45   27   55   39   48
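
As an aside, the apply/transpose step can probably be replaced by indexing the column with the whole idx matrix at once. The sketch below is not part of the original approach; it uses the x1 values of the first group as listed in the question, and the names x.apply and x.vec are made up for the comparison.

```r
# x1 values of group 1, copied from the dat.grp.1 listing in the question.
x1 <- c(45, 27, 55, 39, 41, 29, 47, 48, 35, 48,
        47, 31, 39, 36, 36, 38, 30, 52, 44, 31)

# One row of positions per candidate set of five.
idx <- t(combn(length(x1), 5))

# Row-by-row version, as in the answer: apply returns columns, hence t().
x.apply <- t(apply(idx, 1, function(i) x1[i]))

# Vectorised version: a plain vector indexed by a matrix is read in
# column-major order, so refilling a matrix with the same number of rows
# restores the layout of idx.
x.vec <- matrix(x1[idx], nrow = nrow(idx))

dim(x.vec)
head(x.vec, 4)
```

Both versions should hold the same values; the vectorised one avoids calling a function once per row, which may matter when the number of combinations is large.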

As for looping over subsets, the plyr library is very handy for this. The way you have set it up (assign the subset of interest to a variable and work with that) makes the transformation easy. Everything you do to create the answer for one subset goes into a function with that part as a parameter.

find.best.set <- function(dat.grp.1) {
    crit <- 55
    idx <- t(combn(1:nrow(dat.grp.1), 5))
    x <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x1"]}))
    y <- t(apply(idx, 1, function(i) {dat.grp.1[i,"x2"]}))

    mean.x <- rowMeans(x)
    mean.y <- rowMeans(y)
    k <- (mean.x - crit)^2

    out <- cbind(idx, x, mean.x, k, y, mean.y)

    out[order(k, -mean.y)[1],,drop=FALSE]
}

This is basically what you had before, but getting rid of some unnecessary assignments.

Now wrap this in a plyr call.

library("plyr")
ddply(dat, .(grp), find.best.set)

which gives

  grp V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12  V13  V14  V15  V16  V17   V18
1   1  3  8 10 11 18 55 48 48 47  52  50  25 0.62 0.31 0.18 0.48 0.54 0.426
2   2  8 10 12 15 16 53 35 55 76  56  55   0 0.71 0.20 0.43 0.50 0.70 0.508
3   3  4 10 15 17 20 47 48 73 55  52  55   0 0.67 0.54 0.28 0.42 0.31 0.444
4   4  2 11 13 17 19 47 46 70 62  50  55   0 0.35 0.47 0.18 0.13 0.47 0.320
5   5  3  6 10 17 19 72 40 58 66  39  55   0 0.33 0.42 0.32 0.32 0.51 0.380

I don't know whether that is the best format for your results, but it mirrors the example you gave.
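
Note that V1 to V5 are row positions within each group, which coincide with id only for the first group. If the id values themselves are wanted, and plyr is not available, something along the following lines might work. This is a sketch under assumptions: find.best.ids, its crit and m arguments, and the toy data frame are all invented for illustration, and the grouping is done with base R split and lapply instead of ddply.

```r
# Variant of find.best.set that reports id values instead of positions.
# `crit` and `m` are arguments added here for illustration only.
find.best.ids <- function(d, crit = 55, m = 5) {
  idx <- t(combn(nrow(d), m))                # one row of positions per candidate set
  x <- matrix(d$x1[idx], nrow = nrow(idx))
  y <- matrix(d$x2[idx], nrow = nrow(idx))
  k <- (rowMeans(x) - crit)^2
  best <- order(k, -rowMeans(y))[1]          # least discrepancy, then largest mean of x2
  pos <- idx[best, ]                         # positions within this group
  data.frame(grp = d$grp[1],
             ids = paste(d$id[pos], collapse = ","),  # translate positions to ids
             mean.x = mean(d$x1[pos]),
             k = k[best],
             mean.y = mean(d$x2[pos]),
             stringsAsFactors = FALSE)
}

# Hypothetical data: two groups of six elements, ids not starting at 1.
toy <- data.frame(
  id  = 21:32,
  grp = rep(1:2, each = 6),
  x1  = c(50, 60, 55, 40, 70, 54,  55, 55, 55, 10, 20, 30),
  x2  = c(0.1, 0.5, 0.3, 0.2, 0.6, 0.4,  0.2, 0.2, 0.2, 0.9, 0.9, 0.9)
)

# Split by group, pick the best set of three in each, and stack the results.
res <- do.call(rbind, lapply(split(toy, toy$grp), find.best.ids, m = 3))
res
```

In the first toy group two sets of three have a mean of exactly 55, and the one with the larger mean of x2 is kept, so the ids column reads 23,24,25 rather than 21,22,23.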
