Flagging groups in which all members fulfill a certain requirement in R
Problem description
Assume the following data:
GroupId <- c(1,1,1,1,2,2,2,3,3)
IndId <- c(1,1,2,2,3,4,4,5,5)
IndGroupProperty <- c(1,2,1,2,3,3,4,5,6)
PropertyType <- c(1,2,1,2,2,2,1,2,2)
df <- data.frame(GroupId, IndId, IndGroupProperty, PropertyType)
df
These are multi-level data, where each group GroupId consists of one or more individuals IndId having access to one or more properties IndGroupProperty, which are unique to their respective group (i.e. property 1 belongs to group 1 and to no other group). Each property belongs to a type PropertyType.
The task is to flag each row with a dummy variable that is 1 if every individual in the row's group has access to at least one type-1 property, and 0 otherwise.
For our sample data, this is simply:
ValidGroup <- c(1,1,1,1,0,0,0,0,0)
df <- data.frame(df, ValidGroup)
df
The first four rows are flagged with a 1, because each individual (1, 2) of group (1) has access to a type-1 property (1). The three subsequent rows belong to group (2), in which only individual (4) has access to a type-1 property (4). Thus these are not flagged (0). The last two rows also receive no flag: group (3) consists of a single individual (5) with access to two type-2 properties (5, 6).
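The rule described above can be spelled out as an explicit loop. This is only a sketch for checking the hand-coded flags on the sample data, not a recommended solution; the names expected, in_group, inds and ok are made up for illustration.

```r
# Sample data from the question
GroupId      <- c(1, 1, 1, 1, 2, 2, 2, 3, 3)
IndId        <- c(1, 1, 2, 2, 3, 4, 4, 5, 5)
PropertyType <- c(1, 2, 1, 2, 2, 2, 1, 2, 2)

expected <- integer(length(GroupId))
for (g in unique(GroupId)) {
  in_group <- GroupId == g
  inds <- unique(IndId[in_group])
  # TRUE only if every individual of the group has at least one type-1 property
  ok <- all(vapply(
    inds,
    function(i) any(PropertyType[in_group & IndId == i] == 1),
    logical(1)
  ))
  expected[in_group] <- as.integer(ok)
}
expected
# [1] 1 1 1 1 0 0 0 0 0
```

The loop is quadratic in the worst case, so it is useful as a reference result rather than as something to run on large data.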
I have looked into several functions: levels seems to lack group support; getGroups in the nlme package does not like the input of my real data; I guess there might be something useful in doBy, but summaryBy does not seem to take levels as a function.
Edit: the dplyr solution by Henrik, wrapped into a function:
foobar <- function(object, group, ind, type){
groupvar <- deparse(substitute(group))
indvar <- deparse(substitute(ind))
typevar <- deparse(substitute(type))
eval(substitute(
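object[, c(groupvar, indvar, typevar)] %>% # %>% replaces the defunct %.% operator
group_by(group, ind) %>%
mutate(type1 = any(type == 1)) %>% # TRUE if the individual has a type-1 property
group_by(group) %>% # regrouping replaces the previous grouping by default
mutate(ValidGroup = all(type1) * 1) %>%
select(-type1)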
))
}
Recommended answer
You can also try ave:
# for each individual within group, calculate number of 1s in PropertyType
v1 <- with(df, ave(PropertyType, list(GroupId, IndId), FUN = function(x) sum(x == 1)))
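# within each group, check that every individual has at least one type-1 property
# (>= 1 rather than == 1, so that several type-1 properties also count).
# The boolean result is coerced to 1 and 0 by ave.
df$ValidGroup <- ave(v1, df$GroupId, FUN = function(x) all(x >= 1))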
#   GroupId IndId IndGroupProperty PropertyType ValidGroup
# 1       1     1                1            1          1
# 2       1     1                2            2          1
# 3       1     2                1            1          1
# 4       1     2                2            2          1
# 5       2     3                3            2          0
# 6       2     4                3            2          0
# 7       2     4                4            1          0
# 8       3     5                5            2          0
# 9       3     5                6            2          0
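One edge case is worth checking with this count-based approach: an individual who holds more than one type-1 property. The task says "at least one", so the group-level test should accept counts above 1. The following sketch uses made-up data (not from the question) to compare an "exactly one" test with an "at least one" test; the names n_type1, exactly_one and at_least_one are illustrative.

```r
# Made-up data: one group, individual 1 holds two type-1 properties
GroupId      <- c(1, 1, 1)
IndId        <- c(1, 1, 2)
PropertyType <- c(1, 1, 1)

# Number of type-1 properties per individual within the group
n_type1 <- ave(PropertyType, GroupId, IndId, FUN = function(x) sum(x == 1))
n_type1
# [1] 2 2 1

# "Exactly one" rejects the group, "at least one" accepts it
exactly_one  <- ave(n_type1, GroupId, FUN = function(x) all(x == 1))
at_least_one <- ave(n_type1, GroupId, FUN = function(x) all(x >= 1))
exactly_one
# [1] 0 0 0
at_least_one
# [1] 1 1 1
```

On the sample data from the question no individual has more than one type-1 property, so both tests give the same flags there.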
Edit: Added a dplyr alternative and a benchmark on data sets of different sizes: the original data, and data that are 10 and 100 times larger than the original.
First wrap up the alternatives in functions:
fun_ave <- function(df){
v1 <- with(df, ave(PropertyType, list(GroupId, IndId), FUN = function(x) sum(x == 1)))
# >= 1 so that individuals with several type-1 properties also count
df$ValidGroup <- ave(v1, list(df$GroupId), FUN = function(x) all(x >= 1))
df
}
library(dplyr)
fun_dp <- function(df){
df %>% # %>% replaces the defunct %.% operator
group_by(GroupId, IndId) %>%
mutate(
type1 = any(PropertyType == 1)) %>%
group_by(GroupId) %>% # regrouping replaces the previous grouping by default
mutate(
ValidGroup = all(type1) * 1) %>%
select(-type1)
}
fun_by <- function(df){
bar <- by(data=df,INDICES=df$GroupId,FUN=function(xx){
foo <- table(xx$IndId,xx$PropertyType)
if ( !("1" %in% colnames(foo)) ) {
return(FALSE) # no PropertyType=1 at all in this group
} else {
return(all(foo[,"1"]>0)) # return whether all IndId have an 1 entry
}})
cbind(df,ValidGroup = as.integer(bar[as.character(df$GroupId)]))
}
Benchmark
Original data:
library(microbenchmark)
microbenchmark(
fun_ave(df),
fun_dp(df),
fun_by(df))
# Unit: microseconds
#         expr      min        lq    median        uq       max neval
#  fun_ave(df)  497.964  519.8215  538.8275  563.5355   651.535   100
#   fun_dp(df)  851.861  870.6765  931.1170  968.5590  1760.360   100
#   fun_by(df) 1343.743 1412.5455 1464.6225 1581.8915 12588.607   100
On a tiny data set ave is the fastest: by median time it is roughly 1.7 times as fast as dplyr and about 2.7 times as fast as by.
Generate some larger data, with 10 times the number of groups and individuals:
GroupId <- sample(1:30, 100, replace = TRUE)
IndId <- sample(1:50, 100, replace = TRUE)
PropertyType <- sample(1:2, 100, replace = TRUE)
df2 <- data.frame(GroupId, IndId, PropertyType)
microbenchmark(
fun_ave(df2),
fun_dp(df2),
fun_by(df2))
# Unit: milliseconds
#          expr      min       lq    median        uq       max neval
#  fun_ave(df2) 2.928865 3.185259  3.270978  3.435002  5.151457   100
#   fun_dp(df2) 1.079176 1.231226  1.273610  1.352866  2.717896   100
#   fun_by(df2) 9.464359 9.855317 10.137180 10.484994 12.445680   100
By median time, dplyr is now about 2.5 times faster than ave and about 8 times faster than by.
Next, 100 times the number of groups and individuals:
GroupId <- sample(1:300, 1000, replace = TRUE)
IndId <- sample(1:500, 1000, replace = TRUE)
PropertyType <- sample(1:2, 1000, replace = TRUE)
df2 <- data.frame(GroupId, IndId, PropertyType)
microbenchmark(
fun_ave(df2),
fun_dp(df2),
fun_by(df2))
# Unit: milliseconds
#          expr        min         lq    median        uq      max neval
#  fun_ave(df2) 337.889895 392.983915 413.37554 441.58179 549.5516   100
#   fun_dp(df2)   3.253872   3.477195   3.58173   3.73378  75.8730   100
#   fun_by(df2)  92.248791 102.122733 104.09577 109.99285 186.6829   100
ave is really losing ground now: dplyr is nearly 30 times faster than by, and more than 100 times faster than ave.
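A plausible explanation for the slowdown of ave, not verified against the timings above: ave builds its groups with interaction(), which by default keeps every combination of the grouping levels. With 300 groups and 500 individuals that is up to 150,000 combinations, most of them empty. The sketch below passes a single pre-combined key with unused combinations dropped and checks that the result is unchanged. The seed and the names key, v1_key and v1_two are assumptions made for illustration, and no speed-up is claimed here.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
GroupId      <- sample(1:300, 1000, replace = TRUE)
IndId        <- sample(1:500, 1000, replace = TRUE)
PropertyType <- sample(1:2, 1000, replace = TRUE)

# One level per observed (group, individual) pair;
# drop = TRUE discards combinations that never occur in the data
key <- interaction(GroupId, IndId, drop = TRUE)

# Count of type-1 properties per individual, computed both ways
v1_key <- ave(PropertyType, key, FUN = function(x) sum(x == 1))
v1_two <- ave(PropertyType, GroupId, IndId, FUN = function(x) sum(x == 1))

identical(v1_key, v1_two)  # same counts, far fewer groups to iterate over
nlevels(key)               # at most 1000, one per observed pair
```

Wrapping both variants in microbenchmark would show whether the pre-combined key actually narrows the gap to dplyr on a given R version.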