在具有分组数据的行和列中选择最大值 [英] Select a maximum value across rows and columns with grouped data
问题描述
下面的数据具有一个 IndID
字段以及三列,其中包含数字,某些情况下还包括NA,每个 IndID
的行数都不同.
The data below have an IndID
field as well as three columns containing numbers, including NA in some instances, with a varying number of rows for each IndID
.
library(dplyr)
n = 10
set.seed(123)
dat <- data.frame(IndID = sample(c("AAA", "BBB", "CCC", "DDD"), n, replace = T),
Num1 = c(2,4,2,4,4,1,3,4,3,2),
Num2 = sample(c(1,2,5,8,7,8,NA), n, replace = T),
Num3 = sample(c(NA, NA,NA,8,7,9,NA), n, replace = T)) %>%
arrange(IndID)
head(dat)
IndID Num1 Num2 Num3
1 AAA 1 NA 7
2 BBB 2 NA NA
3 BBB 2 7 7
4 BBB 2 NA NA
5 CCC 3 2 8
6 CCC 3 5 NA
对于每个 IndID
,我想创建一个新列 Max
,其中包含 Num1
的最大值: Num3
.在大多数情况下,这涉及跨多个行和多个列查找最大值.在 dplyr
中,我错过了最后一步(如下),并且希望您提出任何建议.
For each IndID
, I would like to make a new column Max
that contains the maximum value for Num1
:Num3
. In most instances this involves finding the max value across multiple rows and columns. Within dplyr
I am missing the final step (below) and would appreciate any suggestions.
dat %>%
group_by(IndID) %>%
mutate(Max = "???")
推荐答案
一个选项是 pmax
以获取行最大数
An option is pmax
to get the rowwise maxs
dat %>%
mutate(Max = pmax(Num1, Num2, Num3, na.rm = TRUE))
如果有很多列,我们可以获取列名,将其转换为符号,然后求值( !!!
)
If there are many columns, we can get the column names, convert it to symbol and then evaluate (!!!
)
dat %>%
mutate(Max = pmax(!!! rlang::syms(names(.)[-1]), na.rm = TRUE))
# A tibble: 10 x 5
# Groups: IndID [4]
# IndID Num1 Num2 Num3 Max
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 AAA 1 NA 7 7
# 2 BBB 2 NA NA 2
# 3 BBB 2 7 7 7
# 4 BBB 2 NA NA 2
# 5 CCC 3 2 8 8
# 6 CCC 3 5 NA 5
# 7 DDD 4 8 7 8
# 8 DDD 4 7 NA 7
# 9 DDD 4 1 7 7
#10 DDD 4 1 7 7
如果要获取按"IndID"分组的所有"Num"列的最大值,则有多种方法.
If this is to get the max of all 'Num' column grouped by 'IndID', there are multiple ways.
1)在上述步骤中,我们可以将其扩展为按"IndID"分组,然后采用行最大值('Max')的 max
1) From the above step, we can extend it to group by 'IndID' and then take the max
of row maxs ('Max')
dat %>%
mutate(Max = pmax(!!! rlang::syms(names(.)[-1]), na.rm = TRUE)) %>%
group_by(IndID) %>%
mutate(Max = max(Max))
2)另一个选项是使用 gather
将宽"格式转换为长",然后按"IndID"分组,获得 max 和
right_join
与原始数据集
2) Another option is to convert the 'wide' format to 'long' with gather
, then grouped by 'IndID', get the max
of 'val' column and right_join
with the original dataset
library(tidyverse)
gather(dat, key, val, -IndID) %>%
group_by(IndID) %>%
summarise(Max = max(val,na.rm = TRUE)) %>%
right_join(dat)
3)或不重塑为'long'格式的另一种选择是按'IndID', unlist
分组后对数据集进行 nest
并获取数字"列的 max
3) Or another option without reshaping into 'long' format would be to nest
the dataset after grouping by 'IndID', unlist
and get the max
of the 'Num' columns
dat %>%
group_by(IndID) %>%
nest %>%
mutate(data = map(data, ~ .x %>%
mutate(Max = max(unlist(.), na.rm = TRUE)))) %>%
unnest
这篇关于在具有分组数据的行和列中选择最大值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!