Function similar to group_by when groups are not mutually exclusive
I would like to create a function in R, similar to dplyr's group_by function, that when combined with summarise can give summary statistics for a dataset where group membership is not mutually exclusive. I.e., observations can belong to multiple groups. One way to think about it might be to consider tags; observations may belong to one or more tags which might overlap.
For example, take R's esoph dataset (https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/esoph.html) documenting a case-control study of esophageal cancer. Suppose I'm interested in the number and proportion of cancer cases overall and per 'tag', where the tags are: 65+ years old; 80+ gm/day alcohol; 20+ gm/day tobacco; and a 'high risk' group where the previous 3 criteria are met.
Let's transform the dataset to long format (one participant per row) and then add these tags (logical columns) to the dataset:
library('dplyr')
data(esoph)
esophlong = bind_rows(esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1),
                      esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0)
  ) %>%
  mutate(highage=(agegp %in% c('65-74','75+')),
         highalc=(alcgp %in% c('80-119','120+')),
         hightob=(tobgp %in% c('20-29','30+')),
         highrisk=(highage & highalc & hightob)
  )
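The row expansion in the first step can be hard to read inside a pipe. The following is a minimal base R sketch of the same idea on a made-up two-row table (the data frame `agg` and its `grp` column are hypothetical and not part of esoph): `rep()` repeats each row index as many times as its count, so indexing with the result yields one row per participant.

```r
# Toy aggregated table (hypothetical): one row per stratum with counts.
agg <- data.frame(
  grp       = c("a", "b"),
  ncases    = c(2, 0),
  ncontrols = c(1, 3)
)

# rep(1:2, c(2, 0)) gives 1 1: row 1 is repeated twice, row 2 is dropped.
case_rows    <- agg[rep(seq_len(nrow(agg)), agg$ncases), "grp", drop = FALSE]
control_rows <- agg[rep(seq_len(nrow(agg)), agg$ncontrols), "grp", drop = FALSE]

# Mark the outcome; rep() keeps this safe even if a piece has zero rows.
case_rows$case    <- rep(1, nrow(case_rows))
control_rows$case <- rep(0, nrow(control_rows))

# Stack cases and controls into one participant-level table.
long <- rbind(case_rows, control_rows)
rownames(long) <- NULL
print(long)
```

The long table has as many rows as the sum of both count columns, which is why esophlong ends up with one row per participant.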
My usual approach is to create a dataset where each observation is duplicated for every tag it belongs to, and then summarise this dataset:
esophdup = bind_rows(esophlong %>% filter(highage) %>% mutate(tag='age>=65'),
                     esophlong %>% filter(highalc) %>% mutate(tag='alc>=80'),
                     esophlong %>% filter(hightob) %>% mutate(tag='tob>=20'),
                     esophlong %>% filter(highrisk) %>% mutate(tag='high risk'),
                     esophlong %>% filter() %>% mutate(tag='all')
  ) %>%
  mutate(tag=factor(tag, levels = unique(.$tag)))
summary = esophdup %>%
  group_by(tag) %>%
  summarise(n=n(), ncases=sum(case), case.rate=mean(case))
This approach is inefficient for large datasets or for a large number of tags and I will often run out of memory to store it.
An alternative is to summarise each tag separately and then bind these summary datasets afterwards, as follows:
summary.age = esophlong %>%
  filter(highage) %>%
  summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
  mutate(tag='age>=65')
summary.alc = esophlong %>%
  filter(highalc) %>%
  summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
  mutate(tag='alc>=80')
summary.tob = esophlong %>%
  filter(hightob) %>%
  summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
  mutate(tag='tob>=20')
summary.highrisk = esophlong %>%
  filter(highrisk) %>%
  summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
  mutate(tag='high risk')
summary.all = esophlong %>%
  summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
  mutate(tag='all')
summary = bind_rows(summary.age, summary.alc, summary.tob, summary.highrisk, summary.all)
This approach is time-consuming and tedious when I have a large number of tags or I want to reuse the tags often for different summary measures throughout a project.
The function I have in mind, say group_by_tags(data, key, ...), which includes an argument to specify the name of the grouping column, should work something like this:
summary = esophlong %>%
  group_by_tags(key='tags',
                'age>=65'=highage,
                'alc>=80'=highalc,
                'tob>=20'=hightob,
                'high risk'=highrisk,
                'all ages'=1
  ) %>%
  summarise(n=n(), ncases=sum(case), case.rate=mean(case))
with the summary dataset looking like this:
> summary
       tags    n ncases case.rate
1   age>=65  273     68 0.2490842
2   alc>=80  301     96 0.3189369
3   tob>=20  278     64 0.2302158
4 high risk   11      5 0.4545455
5  all ages 1175    200 0.1702128
Even better, it could take variables of type "factor" as well as "logical" so that it could summarise, say, each age group individually, the 65+ year olds, and everybody:
summaryage = esophlong %>%
  group_by_tags(key='Age.group',
                agegp,
                '65+'=(agegp %in% c('65-74','75+')),
                'all'=1
  ) %>%
  summarise(n=n(), ncases=sum(case), case.rate=mean(case))
> summaryage
  Age.group    n ncases case.rate
1     25-34  117      1 0.0085470
2     35-44  208      9 0.0432692
3     45-54  259     46 0.1776062
4     55-64  318     76 0.2389937
5     65-74  216     55 0.2546296
6       75+   57     13 0.2280702
7       65+  273     68 0.2490842
8       all 1175    200 0.1702128
Perhaps it's not possible with ... and instead you might need to pass a vector/list of column names for the tags.
Any ideas?
EDIT: to be clear, the solution should take tag/group definitions and the required summary statistics as arguments, rather than being built into the function itself. Either as a two-step data %>% group_by_tags(tags) %>% summarise_tags(stats) or a one-step data %>% summary_tags(tags,stats) process.
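One possible shape for the one-step variant, as a rough sketch only: the name summary_tags comes from the wish above, but its body, the `key` argument default, and the toy data frame are assumptions made for illustration. It handles logical tag definitions only (not factor columns), takes tags and statistics as quosures built with rlang::quos(), and summarises one tag at a time so the full data is never duplicated.

```r
library(dplyr)

# Hypothetical helper: `tags` and `stats` are named lists of quosures.
summary_tags <- function(data, tags, stats, key = "tag") {
  per_tag <- lapply(tags, function(tag) {
    data %>%
      filter(!!tag) %>%       # keep rows where the tag condition holds
      summarise(!!!stats)     # splice in the requested statistics
  })
  # The list names become the values of the key column.
  bind_rows(per_tag, .id = key)
}

# Toy stand-in for esophlong (made-up values).
toy <- data.frame(
  case    = c(1, 0, 1, 0, 1, 0),
  highage = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

tags  <- rlang::quos("age>=65" = highage, "all" = TRUE)
stats <- rlang::quos(n = n(), ncases = sum(case), case.rate = mean(case))

result <- summary_tags(toy, tags, stats)
print(result)
```

Because each tag is filtered and summarised separately, only the small per-tag summaries are held in memory at the end; the cost is one pass over the data per tag.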
This is a variation on @eddi's answer. I am taking the definitions of highage et al. as part of the function's job:
library(data.table)
custom_summary = function(DT, tags, stats){
  setDT(DT)
  rows = stack(lapply(tags[-1], function(x) DT[eval(x), which=TRUE]))
  DT[rows$values, eval(stats), by=.(tag = rows$ind)]
}
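To make the role of stack() in this function easier to follow, here is a small base R sketch with made-up row indices (the list `idx` and the vector `case` are hypothetical). stack() turns a named list of index vectors into a two-column data frame, `values` (the row numbers) and `ind` (the tag each row number belongs to), so a row that matches several tags simply appears several times in the index rather than in a copied dataset.

```r
# Hypothetical outcome vector and row indices for two overlapping tags.
case <- c(1, 0, 1, 0, 1, 0)
idx  <- list("age>=65" = c(1L, 2L, 5L), "alc>=80" = c(1L, 3L))

# One row per (tag, row index) pair; row 1 appears under both tags.
rows <- stack(idx)
print(rows)

# Aggregate the outcome by tag using only the index, not a duplicated table.
n_by_tag      <- table(rows$ind)
ncases_by_tag <- tapply(case[rows$values], rows$ind, sum)
print(ncases_by_tag)
```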
And some example usage:
data(esoph)
library(dplyr)
esophlong = bind_rows(esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1),
                      esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0)
)
custom_summary(
  DT = esophlong,
  tags = quote(list(
    'age>=65' = agegp %in% c('65-74','75+'),
    'alc>=80' = alcgp %in% c('80-119','120+'),
    'tob>=20' = tobgp %in% c('20-29','30+'),
    'high risk' = eval(substitute(`age>=65` & `alc>=80` & `tob>=20`, as.list(tags))),
    'all ages' = TRUE
  )),
  stats = quote(list(
    n = .N,
    n_cases = sum(case),
    case.rate = mean(case)
  ))
)
         tag    n n_cases case.rate
1:   age>=65  273      68 0.2490842
2:   alc>=80  301      96 0.3189369
3:   tob>=20  278      64 0.2302158
4: high risk   11       5 0.4545455
5:  all ages 1175     200 0.1702128
The technique of using eval inside DT[...] is explained in the data.table FAQ.
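As a minimal illustration of that technique, assuming a toy table with made-up values (the column `old` and the variables `cond` and `stats` are hypothetical): an expression is captured once with quote() and then evaluated inside DT[...] with eval(), where it can see the table's columns and special symbols such as .N.

```r
library(data.table)

# Toy table (hypothetical values).
DT <- data.table(
  case = c(1, 0, 1, 0, 1, 0),
  old  = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Capture a row filter and a set of statistics without evaluating them yet.
cond  <- quote(old)
stats <- quote(list(n = .N, ncases = sum(case)))

# eval() inside [ ] evaluates the stored expressions against DT's columns.
res <- DT[eval(cond), eval(stats)]
print(res)
```

Storing the expressions in variables is what lets a function such as custom_summary receive tag definitions and statistics as arguments.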