Function similar to group_by when groups are not mutually exclusive


Problem description


I would like to create a function in R, similar to dplyr's group_by function, that when combined with summarise can give summary statistics for a dataset where group membership is not mutually exclusive. I.e., observations can belong to multiple groups. One way to think about it might be to consider tags; observations may belong to one or more tags which might overlap.

For example, take R's esoph dataset (https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/esoph.html) documenting a case-control study of esophageal cancer. Suppose I'm interested in the number and proportion of cancer cases overall and per 'tag', where the tags are: 65+ years old; 80+ gm/day alcohol; 20+ gm/day tobacco; and a 'high risk' group where the previous 3 criteria are met. Let's transform the dataset to long format (one participant per row) and then add these tags (logical columns) to the dataset:

library('dplyr')
data(esoph)
esophlong = bind_rows(esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1),
                      esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0)
            ) %>% 
            mutate(highage=(agegp %in% c('65-74','75+')),
                   highalc=(alcgp %in% c('80-119','120+')),
                   hightob=(tobgp %in% c('20-29','30+')),
                   highrisk=(highage & highalc & hightob)
            )

My usual approach is to create a dataset where each observation is duplicated for every tag it belongs to, and then summarise this dataset:

esophdup = bind_rows(esophlong %>% filter(highage) %>% mutate(tag='age>=65'),
                     esophlong %>% filter(highalc) %>% mutate(tag='alc>=80'),
                     esophlong %>% filter(hightob) %>% mutate(tag='tob>=20'),
                     esophlong %>% filter(highrisk) %>% mutate(tag='high risk'),
                     esophlong %>% filter() %>% mutate(tag='all')
           ) %>%
           mutate(tag=factor(tag, levels = unique(.$tag)))

summary = esophdup %>%
          group_by(tag) %>%
          summarise(n=n(), ncases=sum(case), case.rate=mean(case))

This approach is inefficient for large datasets or a large number of tags, because every observation is copied once for each tag it belongs to, and I often run out of memory storing the duplicated dataset.

An alternative is to summarise each tag separately and then bind these summary datasets afterwards, as follows:

summary.age = esophlong %>%
              filter(highage) %>%
              summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
              mutate(tag='age>=65')

summary.alc = esophlong %>%
              filter(highalc) %>%
              summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
              mutate(tag='alc>=80')

summary.tob = esophlong %>%
              filter(hightob) %>%
              summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
              mutate(tag='tob>=20')

summary.highrisk = esophlong %>%
              filter(highrisk) %>%
              summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
              mutate(tag='high risk')

summary.all = esophlong %>%
              summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
              mutate(tag='all')

summary=bind_rows(summary.age,summary.alc,summary.tob,summary.highrisk,summary.all)  

This approach is time-consuming and tedious when I have a large number of tags or I want to reuse the tags often for different summary measures throughout a project.
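The repetition above can be reduced by looping over a named vector of tag columns. The following is only a minimal sketch, assuming the tags already exist as logical columns; the `toy` data frame, the `everyone` column and the `tag_cols` vector are made up for illustration and are not part of the esoph example.

```r
library(dplyr)

# Toy data standing in for esophlong (values are invented for illustration)
toy <- tibble(
  case    = c(1, 0, 1, 0, 1, 0),
  highage = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  highalc = c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
) %>%
  mutate(everyone = TRUE)  # helper column so that "all" is just another tag

# Names are the tag labels, values are the logical columns that define them
tag_cols <- c("age>=65" = "highage", "alc>=80" = "highalc", "all" = "everyone")

# Summarise each tag separately, then stack the one-row results;
# .id turns the list names into the tag column
tag_summary <- bind_rows(
  lapply(tag_cols, function(col) {
    toy %>%
      filter(.data[[col]]) %>%
      summarise(n = n(), ncases = sum(case), case.rate = mean(case))
  }),
  .id = "tag"
)

print(tag_summary)
```

Only one filtered copy of the data exists at a time, so this avoids the duplicated dataset, but the summary statistics are still hard-coded inside the loop.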

The function I have in mind, say group_by_tags(data, key, ...), which includes an argument to specify the name of the grouping column, should work something like this:

summary = esophlong %>% 
          group_by_tags(key='tags',
                        'age>=65'=highage,
                        'alc>=80'=highalc,
                        'tob>=20'=hightob,
                        'high risk'=highrisk,
                        'all ages'=1
          ) %>%
          summarise(n=n(), ncases=sum(case), case.rate=mean(case))

with the summary dataset looking like this:

> summary
       tags     n ncases case.rate
1   age>=65   273     68 0.2490842
2   alc>=80   301     96 0.3189369
3   tob>=20   278     64 0.2302158
4 high risk    11      5 0.4545455
5  all ages  1175    200 0.1702128

Even better, it could take variables of type "factor" as well as "logical" so that it could summarise, say, each age group individually, the 65+ year olds, and everybody:

summaryage = esophlong %>% 
          group_by_tags(key='Age.group',
                        agegp,
                        '65+'=(agegp %in% c('65-74','75+')),
                        'all'=1                 
          ) %>%
          summarise(n=n(), ncases=sum(case), case.rate=mean(case))

> summaryage
  Age.group     n ncases case.rate
1     25-34   117      1 0.0085470
2     35-44   208      9 0.0432692
3     45-54   259     46 0.1776062
4     55-64   318     76 0.2389937
5     65-74   216     55 0.2546296
6       75+    57     13 0.2280702
7       65+   273     68 0.2490842
8       all  1175    200 0.1702128

Perhaps it's not possible with ... and instead you might need to pass a vector/list of column names for the tags.

Any ideas?

EDIT: to be clear, the solution should take tag/group definitions and the required summary statistics as arguments, rather than being built into the function itself. Either as a two-step data %>% group_by_tags(tags) %>% summarise_tags(stats) or a one-step data %>% summary_tags(tags,stats) process.
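For what it's worth, the one-step interface described in the edit might be sketched with tidy evaluation along these lines. `summarise_tags()` is a hypothetical helper, not an existing dplyr function, and the `toy` data and its `age` and `alc` columns are invented for the example.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: tag definitions and summary statistics are both arguments.
# `tags` is a named list of quosures (built with quos()), `...` are the statistics.
summarise_tags <- function(data, tags, ..., key = "tag") {
  stats <- enquos(...)                 # capture the summary expressions unevaluated
  per_tag <- lapply(tags, function(cond) {
    data %>%
      filter(!!cond) %>%               # keep the rows that belong to this tag
      summarise(!!!stats)              # apply the same statistics to every tag
  })
  bind_rows(per_tag, .id = key)        # list names become the key column
}

# Toy data (invented values)
toy <- tibble(
  case = c(1, 0, 1, 0, 1, 0),
  age  = c(70, 68, 40, 35, 80, 50),
  alc  = c(90, 10, 100, 20, 30, 0)
)

tags <- quos(
  "age>=65"   = age >= 65,
  "alc>=80"   = alc >= 80,
  "high risk" = age >= 65 & alc >= 80,
  "all"       = TRUE
)

result <- summarise_tags(toy, tags,
                         n = n(), ncases = sum(case), case.rate = mean(case))
print(result)
```

Because the tags are stored as quosures, the same `tags` object can be reused with different statistics elsewhere in a project. Each tag still triggers its own pass over the data.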

Solution

This is a variation on @eddi's answer. I treat the definitions of highage and the other tags as part of the function's job:

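library(data.table)
custom_summary = function(DT, tags, stats){
    setDT(DT)  # converts DT to a data.table by reference, so the caller's object changes too
    # tags[-1] drops the leading `list` symbol; each remaining expression is evaluated
    # as a row filter, and which=TRUE returns the matching row numbers
    rows = stack(lapply(tags[-1], function(x) DT[eval(x), which=TRUE]))
    # rows$values holds the row numbers and rows$ind the tag each one belongs to
    DT[rows$values, eval(stats), by=.(tag = rows$ind)]
}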

And some example usage:

data(esoph)
library(dplyr)
esophlong = bind_rows(esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1),
                      esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0)
            )

custom_summary(
    DT = esophlong, 
    tags = quote(list(
        'age>=65'   = agegp %in% c('65-74','75+'),
        'alc>=80'   = alcgp %in% c('80-119','120+'),
        'tob>=20'   = tobgp %in% c('20-29','30+'),
        'high risk' = eval(substitute(`age>=65` & `alc>=80` & `tob>=20`, as.list(tags))),
        'all ages'  = TRUE
    )),
    stats = quote(list(
        n           = .N, 
        n_cases     = sum(case), 
        case.rate   = mean(case)
    ))
)

         tag    n n_cases case.rate
1:   age>=65  273      68 0.2490842
2:   alc>=80  301      96 0.3189369
3:   tob>=20  278      64 0.2302158
4: high risk   11       5 0.4545455
5:  all ages 1175     200 0.1702128

The technique of using eval inside DT[...] is explained in the data.table FAQ.
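To make the moving parts easier to follow, here is a small standalone sketch of the same three ingredients: `eval()` on a quoted expression inside `DT[...]`, `which = TRUE`, and `stack()`. The table `DT` and the names `cond`, `stats`, `big` and `all` are invented for the illustration and have nothing to do with esoph.

```r
library(data.table)

DT <- data.table(x = 1:6, g = c("a", "a", "b", "b", "c", "c"))

# Expressions stored unevaluated, to be run later inside DT[...]
cond  <- quote(x > 2)
stats <- quote(list(n = .N, total = sum(x)))

# eval() in i: the filter is evaluated using DT's columns;
# which = TRUE returns the matching row numbers rather than the rows
idx <- DT[eval(cond), which = TRUE]

# stack() reshapes a named list of row-number vectors into a long data frame
# with columns `values` (the row numbers) and `ind` (the list name)
rows <- stack(list(big = idx, all = seq_len(nrow(DT))))

# Rows may appear under several tags; grouping by the tag label
# gives one summary line per tag
res <- DT[rows$values, eval(stats), by = .(tag = rows$ind)]
print(res)
```

Only row numbers are duplicated across tags before the grouped step, which is lighter than binding together filtered copies of every column up front.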
