Counting unique character values in grouped dataframe: dplyr::count(), stringr::str_count() and/or purrr::map()
Question
Building on "Error when using dplyr::count() within purrr::map()":
I want dataframes of counts of unique character values, computed by subsets of rows. The full dataset has 1000+ rows and many tumour types.
Toy example:
library(tidyverse)
df <- tibble::tribble(
  ~tumour, ~impact.on.surgery, ~impact.on.radiotherapy, ~impact.on.chemotherapy, ~impact.on.biologics, ~impact.on.immunotherapy,
  'Breast', NA, NA, NA, 'Interrupted', NA,
  'Breast', NA, NA, NA, 'As.planned', NA,
  'Breast', NA, NA, NA, 'Interrupted', NA,
  'Breast', NA, NA, 'As.planned', NA, NA,
  'Breast', NA, NA, NA, NA, NA,
  'Breast', NA, NA, NA, 'Interrupted', NA
)
> df
# A tibble: 6 x 6
  tumour impact.on.surgery impact.on.radiotherapy impact.on.chemotherapy impact.on.biologics impact.on.immunotherapy
  <chr>  <lgl>             <lgl>                  <chr>                  <chr>               <lgl>
1 Breast NA                NA                     NA                     Interrupted         NA
2 Breast NA                NA                     NA                     As.planned          NA
3 Breast NA                NA                     NA                     Interrupted         NA
4 Breast NA                NA                     As.planned             NA                  NA
5 Breast NA                NA                     NA                     NA                  NA
6 Breast NA                NA                     NA                     Interrupted         NA
Desired output:
Ideally a named list of dataframes by tumour type, so that I can later reduce(bind_rows, .id = 'tumour'), appending a .id column label.
$ Breast
# A tibble: 2 x 6
impact impact.on.surgery impact.on.radiotherapy impact.on.chemotherapy impact.on.biologics impact.on.immunotherapy
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Interrupted 0 0 0 3 0
2 As.planned 0 0 1 1 0
Tried so far:
# Gets single row tibble, but not sure how to `.id` label each row, map across all values & bind
df %>%
  summarise(across(starts_with('impact'), ~sum(str_count(.x, 'As.planned'), na.rm = TRUE)))
# A tibble: 1 x 5
impact.on.surgery impact.on.radiotherapy impact.on.chemotherapy impact.on.biologics impact.on.immunotherapy
<int> <int> <int> <int> <int>
1 0 0 1 1 0
# ?Counts all variable values (no need to specify), simpler code, but also counts `NAs` and I can't pivot that to a wide form as it has 'counted' the tumour
df %>%
map_dfr(~count(data.frame(x=.), x), .id = 'var')
var x n
1 tumour Breast 6
2 impact.on.surgery <NA> 6
3 impact.on.radiotherapy <NA> 6
4 impact.on.chemotherapy As.planned 1
5 impact.on.chemotherapy <NA> 5
6 impact.on.biologics As.planned 1
7 impact.on.biologics Interrupted 3
8 impact.on.biologics <NA> 2
9 impact.on.immunotherapy <NA> 6
Recommended answer
An option with map is to loop over the elements to be counted, i.e. "Interrupted" and "As.planned". For each element, group by 'tumour', then use summarise with across on the columns whose names start with 'impact' (selected with starts_with), and get the frequency count by taking the sum of the logical vector in each column.
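To see why summing a logical vector gives a frequency count, here is a minimal base R sketch. The vector x is made up for illustration and is not part of the question's data.

```r
# Hypothetical values for one 'impact' column (made up for illustration)
x <- c("Interrupted", NA, "As.planned", "Interrupted")

# The comparison returns a logical vector; a missing value stays NA
is_interrupted <- x == "Interrupted"

# sum() treats TRUE as 1 and FALSE as 0; na.rm = TRUE drops the NA
n_interrupted <- sum(is_interrupted, na.rm = TRUE)
n_interrupted
```

The answer's code applies this same comparison-and-sum step to every 'impact' column within each tumour group.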
library(dplyr)
library(purrr)
library(stringr)
map_dfr(dplyr::lst('Interrupted', 'As.planned'), ~
    df %>%
      group_by(tumour) %>%
      summarise(across(starts_with('impact'), function(x)
        sum(x == .x, na.rm = TRUE)), .groups = 'drop'), .id = 'impact') %>%
  mutate(impact = str_remove_all(impact, '"'))
# A tibble: 2 x 7
# impact tumour impact.on.surgery impact.on.radiotherapy impact.on.chemotherapy impact.on.biologics impact.on.immunotherapy
# <chr> <chr> <int> <int> <int> <int> <int>
#1 Interrupted Breast 0 0 0 3 0
#2 As.planned Breast 0 0 1 1 0
Or, to avoid the quotes around the values, use setNames instead of lst:
map_dfr(setNames(c('Interrupted', 'As.planned'),
                 c('Interrupted', 'As.planned')), ~
    df %>%
      group_by(tumour) %>%
      summarise(across(starts_with('impact'), function(x)
        sum(x == .x, na.rm = TRUE)), .groups = 'drop'), .id = 'impact')
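The question asked for a named list of dataframes by tumour type, which the code above does not produce directly. One possible way to get there, as a sketch rather than part of the original answer, is to split the data by tumour first and apply the same counting step to each piece. The helper count_values, the vector values, and the second tumour type 'Lung' are all assumptions introduced here for illustration.

```r
library(dplyr)
library(purrr)

# Reduced toy data; the 'Lung' rows are made up for illustration
df <- tibble::tribble(
  ~tumour,  ~impact.on.chemotherapy, ~impact.on.biologics,
  "Breast", NA,                      "Interrupted",
  "Breast", "As.planned",            NA,
  "Breast", NA,                      "As.planned",
  "Lung",   "Interrupted",           NA,
  "Lung",   "Interrupted",           "As.planned"
)

# Values to count (assumed to be known in advance)
values <- c("Interrupted", "As.planned")

# Hypothetical helper: one row of counts per value, for one subset of rows
count_values <- function(d) {
  map_dfr(set_names(values), function(v) {
    summarise(d, across(starts_with("impact"),
                        function(x) sum(x == v, na.rm = TRUE)))
  }, .id = "impact")
}

# split() returns a list named by tumour type; map() counts within each piece
by_tumour <- map(split(df, df$tumour), count_values)

by_tumour$Breast

# The list can later be stacked, with the list names stored in a 'tumour' column
combined <- bind_rows(by_tumour, .id = "tumour")
```

Splitting first keeps each list element free of the tumour column, which matches the shape of the desired output shown in the question.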
Or using base R:
lst1 <- lapply(c("Interrupted", "As.planned"),
function(x) aggregate(.~ tumour, df, FUN = function(y)
sum(y == x, na.rm = TRUE), na.action = NULL))
data.frame(impact = c("Interrupted", "As.planned"), do.call(rbind, lst1))
# impact tumour impact.on.surgery impact.on.radiotherapy impact.on.chemotherapy impact.on.biologics impact.on.immunotherapy
#1 Interrupted Breast 0 0 0 3 0
#2 As.planned Breast 0 0 1 1 0
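If the values to count are not known in advance, another possibility is to reshape to long form, count, and reshape back to wide form. This is a sketch using tidyr, not part of the original answer. The reduced toy data below keeps only character impact columns, and the column names treatment and impact are chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Reduced toy data with character 'impact' columns only
df <- tibble::tribble(
  ~tumour,  ~impact.on.chemotherapy, ~impact.on.biologics,
  "Breast", NA,                      "Interrupted",
  "Breast", NA,                      "As.planned",
  "Breast", NA,                      "Interrupted",
  "Breast", "As.planned",            NA
)

counts <- df %>%
  # One row per non-missing cell; NA cells are dropped rather than counted
  pivot_longer(starts_with("impact"), names_to = "treatment",
               values_to = "impact", values_drop_na = TRUE) %>%
  # Frequency of each value per tumour and treatment column
  count(tumour, impact, treatment) %>%
  # Back to one column per treatment; combinations never seen become 0
  pivot_wider(names_from = treatment, values_from = n, values_fill = 0)

counts
```

Because missing cells are dropped before counting, an impact column that contains only NA does not appear in the result at all. A named list by tumour type can then be obtained with split(counts, counts$tumour).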