Flagging groups in which all members fulfill a certain requirement in R


Question

Assume the following data:

GroupId <-          c(1,1,1,1,2,2,2,3,3)
IndId <-            c(1,1,2,2,3,4,4,5,5)
IndGroupProperty <- c(1,2,1,2,3,3,4,5,6)
PropertyType <-     c(1,2,1,2,2,2,1,2,2)

df <- data.frame(GroupId, IndId, IndGroupProperty, PropertyType)
df

These are multi-level data, where each group GroupId consists of one or multiple individuals IndId having access to one or more properties IndGroupProperty, which are unique to their respective group (i.e. property 1 belongs to group 1 and no other group). These properties each belong to a type PropertyType.
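
The nesting described above can be checked directly. This is a small base R sketch that rebuilds the sample data under the name `d`; both `d` and the helper name `groups_per_property` are made up for illustration.

```r
# Sample data from the question, rebuilt under a separate name
d <- data.frame(
  GroupId          = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  IndId            = c(1, 1, 2, 2, 3, 4, 4, 5, 5),
  IndGroupProperty = c(1, 2, 1, 2, 3, 3, 4, 5, 6),
  PropertyType     = c(1, 2, 1, 2, 2, 2, 1, 2, 2)
)

# For each property, count how many distinct groups it appears in
groups_per_property <- tapply(d$GroupId, d$IndGroupProperty,
                              function(g) length(unique(g)))

# TRUE if every property belongs to exactly one group
all(groups_per_property == 1)
```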

The task is to flag each row with a dummy variable that indicates whether every individual in the row's group has access to at least one type-1 property.

For our sample data, this simply is:

ValidGroup <-       c(1,1,1,1,0,0,0,0,0)
df <- data.frame(df, ValidGroup)
df

The first four rows are flagged with a 1, because each individual (1, 2) of group (1) has access to a type-1 property (1). The three subsequent rows belong to group (2), in which only individual (4) has access to a type-1 property (4). Thus these are not flagged (0). The last two rows also receive no flag. Group (3) consists only of a single individual (5) with access to two type-2 properties (5, 6).
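
The flag can also be derived from the rule rather than typed in by hand. This is a minimal base R sketch, assuming the column names of the sample data; the names `d`, `has_type1`, `group_ok` and `expected` are made up for illustration.

```r
# Sample data from the question, rebuilt under a separate name
d <- data.frame(
  GroupId          = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  IndId            = c(1, 1, 2, 2, 3, 4, 4, 5, 5),
  IndGroupProperty = c(1, 2, 1, 2, 3, 3, 4, 5, 6),
  PropertyType     = c(1, 2, 1, 2, 2, 2, 1, 2, 2)
)

# Step 1: one row per (group, individual); TRUE if that individual
# holds at least one type-1 property
has_type1 <- aggregate(PropertyType ~ GroupId + IndId, data = d,
                       FUN = function(x) any(x == 1))

# Step 2: a group is valid only if all of its individuals passed step 1
group_ok <- tapply(has_type1$PropertyType, has_type1$GroupId, all)

# Step 3: map the group-level result back onto the rows as a 0/1 dummy
expected <- as.integer(group_ok[as.character(d$GroupId)])
expected
```

Splitting the rule into an individual-level step and a group-level step mirrors its wording: "at least one" is decided per individual, "every individual" per group.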

I have looked into several commands: levels seems to lack group support; getGroups in the nlme package does not like the input of my real data; I guess that there might be something useful in doBy, but summaryBy does not seem to take levels as a function.

Solution (edit): the dplyr solution by Henrik, wrapped into a function:

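library(dplyr)

foobar <- function(object, group, ind, type){
  # capture the unquoted column names as strings for subsetting
  groupvar <- deparse(substitute(group))
  indvar <- deparse(substitute(ind))
  typevar <- deparse(substitute(type))
  # substitute() swaps group/ind/type for the supplied column names before evaluation
  # (%>% replaces the %.% operator of early dplyr versions)
  eval(substitute(
    object[, c(groupvar, indvar, typevar)] %>%
      group_by(group, ind) %>%
      mutate(type1 = any(type == 1)) %>%      # does this individual have a type-1 property?
      group_by(group) %>%                     # regroup by group only
      mutate(ValidGroup = all(type1) * 1) %>%
      select(-type1)
  ))
}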


Recommended answer

You may also try ave:

# for each individual within group, calculate number of 1s in PropertyType
v1 <- with(df, ave(PropertyType, list(GroupId, IndId), FUN = function(x) sum(x == 1)))

# within each group, check that every individual has at least one type-1 property.
# The boolean result is coerced to 1 and 0 by ave.
df$ValidGroup <- ave(v1, df$GroupId, FUN = function(x) all(x >= 1))

#   GroupId IndId IndGroupProperty PropertyType ValidGroup
# 1       1     1                1            1          1
# 2       1     1                2            2          1
# 3       1     2                1            1          1
# 4       1     2                2            2          1
# 5       2     3                3            2          0
# 6       2     4                3            2          0
# 7       2     4                4            1          0
# 8       3     5                5            2          0
# 9       3     5                6            2          0
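
Because the requirement is "at least one" type-1 property, the per-individual count is best tested with `>= 1` (or, equivalently, `any(x == 1)`) rather than `== 1`. The sample data contain at most one type-1 property per individual, so the difference does not show there. This small sketch uses made-up data (`d2` is hypothetical, not from the question) in which individual 1 holds two type-1 properties:

```r
# Hypothetical data: one group, two individuals; individual 1 has two type-1 properties
d2 <- data.frame(
  GroupId      = c(1, 1, 1),
  IndId        = c(1, 1, 2),
  PropertyType = c(1, 1, 1)
)

# Number of type-1 properties per individual, repeated on each row
n1 <- with(d2, ave(PropertyType, GroupId, IndId, FUN = function(x) sum(x == 1)))

# Comparing with == 1 rejects the group, because individual 1 has a count of 2
strict <- ave(n1, d2$GroupId, FUN = function(x) all(x == 1))

# Comparing with >= 1 accepts the group, which matches "at least one"
atleast <- ave(n1, d2$GroupId, FUN = function(x) all(x >= 1))

cbind(d2, n1, strict, atleast)
```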

Edit: Added a dplyr alternative and benchmarks for data sets of different sizes: the original data, and data that are 10 and 100 times larger than the original.

First wrap up the alternatives in functions:

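fun_ave <- function(df){
  # number of type-1 properties per individual within group
  v1 <- with(df, ave(PropertyType, list(GroupId, IndId), FUN = function(x) sum(x == 1)))
  # group is valid if every individual has at least one type-1 property
  df$ValidGroup <- ave(v1, list(df$GroupId), FUN = function(x) all(x >= 1))
  df
}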

library(dplyr)
fun_dp <- function(df){
  # %>% replaces the %.% operator of early dplyr versions
  df %>%
    group_by(GroupId, IndId) %>%
    mutate(
      type1 = any(PropertyType == 1)) %>%
    group_by(GroupId) %>%
    mutate(
      ValidGroup = all(type1) * 1) %>%
    select(-type1)
}


fun_by <- function(df){
  bar <- by(data=df,INDICES=df$GroupId,FUN=function(xx){
    foo <- table(xx$IndId,xx$PropertyType)
    if ( !("1" %in% colnames(foo)) ) {
      return(FALSE)   # no PropertyType=1 at all in this group
    } else {
      return(all(foo[,"1"]>0))    # return whether all IndId have an 1 entry
    }})
  cbind(df,ValidGroup = as.integer(bar[as.character(df$GroupId)]))
}

Benchmark

Original data:

library(microbenchmark)
microbenchmark(
  fun_ave(df),
  fun_dp(df),
  fun_by(df))

# Unit: microseconds
#        expr      min        lq    median        uq       max neval
# fun_ave(df)  497.964  519.8215  538.8275  563.5355   651.535   100
#  fun_dp(df)  851.861  870.6765  931.1170  968.5590  1760.360   100
#  fun_by(df) 1343.743 1412.5455 1464.6225 1581.8915 12588.607   100

On a tiny data set ave is about twice as fast as dplyr and more than 2.5 times faster than by.

Generate some larger data; 10 times the number of groups and individuals

GroupId <- sample(1:30, 100, replace = TRUE)
IndId <- sample(1:50, 100, replace = TRUE)
PropertyType <- sample(1:2, 100, replace = TRUE)
df2 <- data.frame(GroupId, IndId, PropertyType)

microbenchmark(
  fun_ave(df2),
  fun_dp(df2),
  fun_by(df2))
# Unit: milliseconds
#          expr      min       lq    median        uq       max neval
#  fun_ave(df2) 2.928865 3.185259  3.270978  3.435002  5.151457   100
#   fun_dp(df2) 1.079176 1.231226  1.273610  1.352866  2.717896   100
#   fun_by(df2) 9.464359 9.855317 10.137180 10.484994 12.445680   100

dplyr is three times faster than ave and nearly 10 times faster than by.
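
The larger data sets are drawn with `sample()`, so each run produces different data and slightly different timings. This is a minimal sketch of making the simulated data reproducible; the seed value 42 is arbitrary and not from the original post.

```r
# Fix the random number generator state so the same data are drawn every time
set.seed(42)  # arbitrary seed, chosen for illustration only
GroupId      <- sample(1:30, 100, replace = TRUE)
IndId        <- sample(1:50, 100, replace = TRUE)
PropertyType <- sample(1:2, 100, replace = TRUE)
df2 <- data.frame(GroupId, IndId, PropertyType)

# Quick look at the size of the simulated data
dim(df2)
length(unique(df2$GroupId))
```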

100 times the number of groups and individuals

GroupId <- sample(1:300, 1000, replace = TRUE)
IndId <- sample(1:500, 1000, replace = TRUE)
PropertyType <- sample(1:2, 1000, replace = TRUE)
df2 <- data.frame(GroupId, IndId, PropertyType)

microbenchmark(
  fun_ave(df2),
  fun_dp(df2),
  fun_by(df2))

# Unit: milliseconds
#          expr        min         lq    median        uq      max neval
#  fun_ave(df2) 337.889895 392.983915 413.37554 441.58179 549.5516   100
#   fun_dp(df2)   3.253872   3.477195   3.58173   3.73378  75.8730   100
#   fun_by(df2)  92.248791 102.122733 104.09577 109.99285 186.6829   100

ave is really losing ground now. dplyr is nearly 30 times faster than by, and more than 100 times faster than ave.
