R, dplyr: Collapse character variable elements by group if there is only one unique non-NA element per group

Question

Say I have the following data.frame df of patient hometowns and one arbitrary clinical metric, heart rate:

id          <- c(rep(1:3, each = 2), rep(4, 3))
pt_hometown <- c("Atlanta", NA, 
                 NA, "San Diego", 
                 NA, NA, 
                 "San Francisco", "Seattle", NA)
pt_heartrate <- c(NA, 82,
                  NA, NA,
                  76, 76,
                  90, 93, NA)

df <- data.frame(id = id, 
                 pt_hometown = pt_hometown,
                 pt_heartrate = pt_heartrate,
                 stringsAsFactors = FALSE)
df

Which gives

id   pt_hometown pt_heartrate
 1       Atlanta           NA
 1          <NA>           82
 2          <NA>           NA
 2     San Diego           NA
 3          <NA>           76
 3          <NA>           76
 4 San Francisco           90
 4       Seattle           93
 4          <NA>           NA

As I've learned here, summarise_each can apply one or more functions to a grouped dataframe to collapse records to one per group. The simplest case might be selecting the first non-NA value from all variables within df and collapsing them down to one per group.

df1 <- df %>%
  group_by(id) %>%
  summarise_each(funs(first(.[!is.na(.)])))

df1

id   pt_hometown pt_heartrate
 1       Atlanta           82
 2     San Diego           NA
 3            NA           76
 4 San Francisco           90

Of course, for practical applications, one might want to collapse with a bit more specificity. I know how to group df's variables by type and, for instance, select the max heart rate per id and collapse to one record, but what I do not know how to do is conditionally collapse character variables to one record per group, given there is only one unique non-NA value.

More concretely, consider the patient with id number 4. They have two unique values for pt_hometown, "San Francisco" and "Seattle". Obviously both cannot be correct. So I would like to collapse records for each group where there is only one non-NA value, but retain rows where multiple non-NA elements exist and then bring it to the attention of our group to decide how to correct the mistake in the original dataset.

So I'd like df1 to look like this:

id   pt_hometown pt_heartrate
 1       Atlanta           82
 2     San Diego           NA
 3          <NA>           76
 4 San Francisco           90
 4       Seattle           93

This is what I've tried:

df1 <- df %>%  
  group_by(id) %>%
  summarise_each_(funs(first(.[!is.na(.)])), df[length(unique(.[!is.na(.)])) == 1])

Solution

As commented above, there is currently no way to use dplyr::summarise_each when a variable number of rows is to be returned per group.

If you want to go on using dplyr, you could circumvent this by using mutate_each and distinct.

Here's an example:

f <- function(.) if(length(unique(.[!is.na(.)])) > 1L) . else first(.[!is.na(.)]) 

df %>% 
  group_by(id) %>%
  mutate_each(funs(f)) %>%
  ungroup() %>%
  distinct() %>% 
  filter(rowSums(is.na(.)) < 2L)     # assuming you don't have NAs in the ID column

#Source: local data frame [5 x 3]
#
#  id   pt_hometown pt_heartrate
#1  1       Atlanta           82
#2  2     San Diego           NA
#3  3            NA           76
#4  4 San Francisco           90
#5  4       Seattle           93
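
Note that mutate_each() and funs() have since been deprecated in dplyr. The same idea can be sketched with across(), assuming dplyr 1.0.0 or later is installed. The helper name collapse_if_unique is made up for this illustration and is not part of dplyr:

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(
  id           = c(rep(1:3, each = 2), rep(4, 3)),
  pt_hometown  = c("Atlanta", NA, NA, "San Diego", NA, NA,
                   "San Francisco", "Seattle", NA),
  pt_heartrate = c(NA, 82, NA, NA, 76, 76, 90, 93, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: if a group has more than one distinct non-NA value,
# leave the column untouched; otherwise fill the whole group with the single
# non-NA value (or with NA when the group has no non-NA value at all).
collapse_if_unique <- function(x) {
  vals <- unique(x[!is.na(x)])
  if (length(vals) > 1L) x else rep(vals[1L], length(x))
}

result <- df %>%
  group_by(id) %>%
  mutate(across(everything(), collapse_if_unique)) %>%  # grouping column is skipped
  ungroup() %>%
  distinct() %>%                       # identical rows collapse to one
  filter(rowSums(is.na(.)) < 2L)       # assumes no NAs in the id column

result
```

On the example data this should leave one row each for ids 1 to 3 and two rows for id 4, matching the output above.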

However, the data.table approach in my answer to your previous question or that by eddi would probably be more efficient.
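
Since the goal is to bring the conflicting records to someone's attention, it may also help to pull those groups out on their own. The following is only a sketch under the assumption that a conflict means more than one distinct non-NA pt_hometown per id; the name flagged is illustrative:

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(
  id           = c(rep(1:3, each = 2), rep(4, 3)),
  pt_hometown  = c("Atlanta", NA, NA, "San Diego", NA, NA,
                   "San Francisco", "Seattle", NA),
  pt_heartrate = c(NA, 82, NA, NA, 76, 76, 90, 93, NA),
  stringsAsFactors = FALSE
)

# Keep every row of each id that has more than one distinct non-NA hometown
flagged <- df %>%
  group_by(id) %>%
  filter(n_distinct(pt_hometown, na.rm = TRUE) > 1L) %>%
  ungroup()

flagged
```

Here only the rows for id 4 are kept, because that is the only patient with two different hometowns recorded.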
