在具有分组数据的行和列中选择最大值 [英] Select a maximum value across rows and columns with grouped data

查看:57
本文介绍了在具有分组数据的行和列中选择最大值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

下面的数据具有一个 IndID 字段以及三列,其中包含数字,某些情况下还包括NA,每个 IndID 的行数都不同.

The data below have an IndID field as well as three columns containing numbers, including NA in some instances, with a varying number of rows for each IndID.

library(dplyr)
n = 10
set.seed(123)
dat <- data.frame(IndID = sample(c("AAA", "BBB", "CCC", "DDD"), n, replace = T),
                  Num1 = c(2,4,2,4,4,1,3,4,3,2),
                  Num2 = sample(c(1,2,5,8,7,8,NA), n, replace = T),
                  Num3 = sample(c(NA, NA,NA,8,7,9,NA), n, replace = T)) %>%
  arrange(IndID)

 head(dat)
  IndID Num1 Num2 Num3
1   AAA    1   NA    7
2   BBB    2   NA   NA
3   BBB    2    7    7
4   BBB    2   NA   NA
5   CCC    3    2    8
6   CCC    3    5   NA

对于每个 IndID ,我想创建一个新列 Max ,其中包含 Num1 的最大值: Num3 .在大多数情况下,这涉及跨多个行和多个列查找最大值.在 dplyr 中,我错过了最后一步(如下),并且希望您提出任何建议.

For each IndID, I would like to make a new column Max that contains the maximum value for Num1:Num3. In most instances this involves finding the max value across multiple rows and columns. Within dplyr I am missing the final step (below) and would appreciate any suggestions.

dat %>%
  group_by(IndID) %>%
  mutate(Max = "???")

推荐答案

一个选项是 pmax 以获取行最大数

An option is pmax to get the rowwise maxs

dat %>%        
   mutate(Max = pmax(Num1, Num2, Num3, na.rm = TRUE))

如果有很多列,我们可以获取列名,将其转换为符号,然后求值( !!! )

If there are many columns, we can get the column names, convert it to symbol and then evaluate (!!!)

dat %>%        
    mutate(Max = pmax(!!! rlang::syms(names(.)[-1]), na.rm = TRUE))
# A tibble: 10 x 5
# Groups:   IndID [4]
#   IndID  Num1  Num2  Num3   Max
#   <fct> <dbl> <dbl> <dbl> <dbl>
# 1 AAA       1    NA     7     7
# 2 BBB       2    NA    NA     2
# 3 BBB       2     7     7     7
# 4 BBB       2    NA    NA     2
# 5 CCC       3     2     8     8
# 6 CCC       3     5    NA     5
# 7 DDD       4     8     7     8
# 8 DDD       4     7    NA     7
# 9 DDD       4     1     7     7
#10 DDD       4     1     7     7


如果要获取按"IndID"分组的所有"Num"列的最大值,则有多种方法.


If this is to get the max of all 'Num' column grouped by 'IndID', there are multiple ways.

1)在上述步骤中,我们可以将其扩展为按"IndID"分组,然后采用行最大值('Max')的 max

1) From the above step, we can extend it to group by 'IndID' and then take the max of row maxs ('Max')

dat %>%        
    mutate(Max = pmax(!!! rlang::syms(names(.)[-1]), na.rm = TRUE)) %>% 
    group_by(IndID) %>% 
    mutate(Max = max(Max))

2)另一个选项是使用 gather 将宽"格式转换为长",然后按"IndID"分组,获得 max 和 right_join 与原始数据集

2) Another option is to convert the 'wide' format to 'long' with gather, then grouped by 'IndID', get the max of 'val' column and right_join with the original dataset

library(tidyverse)
gather(dat, key, val, -IndID) %>% 
    group_by(IndID) %>% 
    summarise(Max = max(val,na.rm = TRUE)) %>% 
    right_join(dat)


3)或不重塑为'long'格式的另一种选择是按'IndID', unlist 分组后对数据集进行 nest 并获取数字"列的 max


3) Or another option without reshaping into 'long' format would be to nest the dataset after grouping by 'IndID', unlist and get the max of the 'Num' columns

dat %>%
   group_by(IndID) %>%
   nest %>%
   mutate(data =  map(data,  ~ .x %>% 
                                mutate(Max = max(unlist(.), na.rm = TRUE)))) %>% 
   unnest

这篇关于在具有分组数据的行和列中选择最大值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆