仅使用 dplyr 在 3 列中应用条件时如何填充 Yes 和 No? [英] How to populate Yes and No when applying conditions accros 3 columns only with dplyr?

查看:28
本文介绍了仅使用 dplyr 在 3 列中应用条件时如何填充 Yes 和 No?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试创建一个新列,比如 test,基于 3 列的几个条件.我正在尝试仅使用 tidyverse 来实现这一目标.这是我的条件:

I am trying to create a new column, say test, with several conditions based on 3 columns. I am tryiing to achieve this with tidyverse only. Here are my conditions:

  • 如果我在某一列中有 Yes,不管其他类别 (No/Unknown/NA) 在同一列中的 id:previous_cabg、previous_pci、previous_ami 然后在测试变量中赋值 Yes
  • 如果我在所有列中都有 No 相同的 id,那么为测试变量分配 NO
  • 如果我有一列 NONA/Unknown 在其他列中具有相同的 id 然后在测试中分配 No变量
  • 如果我在所有列中都有 Yes 相同的 ID,则在测试变量中分配 Yes
  • 如果我有是在一列和不适用/未知对于每列中的相同ID然后分配Yes`在测试变量中
  • if I have Yes in one column, regardless of other categories (No/Unknown/NA) in the same id across columns: previous_cabg, previous_pci, previous_ami then assign Yes in test variable
  • if I have No in all columns for the same id then assign NO for the test variable
  • if I have NO for one column and NA/Unknown in the other columns for the same id then assign with No in the test variable
  • if I have Yes in all column for the same id then assign Yes in the test variable
  • if I haveYes in one column and NA/Unknownfor the same id in each column then assignYes`in test variable

这是我拥有的数据集类型:

This is the type of dataset I have:

structure(list(id = c(112139L, 43919L, 92430L, 87137L, 95417L, 
66955L, 16293L, 61396L, 25379L, 79229L, 27107L, 63243L, 50627L, 
17968L, 83015L, 96549L, 7332L, 4873L, 98131L, 93506L, 52894L, 
59327L, 85003L, 96623L, 82999L, 65769L, 67063L, 21744L, 62961L, 
2229L, 103673L, 9367L, 60215L, 74044L, 58422L, 57530L, 100399L, 
46483L, 108690L, 62017L, 46467L, 79562L, 4800L, 119158L, 103222L, 
32908L, 14491L, 30293L, 52558L, 122304L, 42281L, 1553L, 111771L, 
23087L, 30147L, 37842L, 51552L, 20148L, 28L, 7477L), previous_cabg = structure(c(1L, 
1L, 1L, NA, 1L, NA, NA, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, NA, 1L, 1L, NA, 1L, NA, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 3L, 
1L, 1L, NA, 1L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
"Unknown", "Yes"), class = "factor"), previous_pci = structure(c(1L, 
1L, 2L, NA, 1L, NA, NA, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
2L, NA, 2L, 1L, NA, 2L, NA, 1L, 2L, 1L, 1L, 1L, NA, 2L, 1L, 1L, 
2L, 2L, NA, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 2L, 1L, 1L), .Label = c("No", 
"Yes", "Unknown"), class = "factor"), previous_ami = structure(c(2L, 
2L, 1L, 2L, 2L, NA, 2L, 1L, 2L, 2L, NA, 1L, 2L, 2L, 2L, 2L, 2L, 
1L, NA, 1L, 2L, NA, 1L, NA, 2L, 1L, 2L, 2L, 2L, NA, 1L, 1L, 1L, 
2L, 1L, NA, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 3L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, NA, 2L, 2L, 2L, 1L, 2L), .Label = c("Yes", 
"No", "Unknown"), class = "factor")), row.names = c(NA, -60L), problems = structure(list(
    row = c(34136L, 121773L, 121779L), col = c("1.01 Hospital identifier", 
    "1.01 Hospital identifier", "1.01 Hospital identifier"), 
    expected = c("value in level set", "value in level set", 
    "value in level set"), actual = c("CMH", "CMH", "CMH"), file = c("'../../data/changed/minap_2020_2021_second.csv'", 
    "'../../data/changed/minap_2020_2021_second.csv'", "'../../data/changed/minap_2020_2021_second.csv'"
    )), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"
)), class = c("tbl_df", "tbl", "data.frame"))

这是它的外观,但只有前 10 行,如果你仔细看,我在 3 列中有不同的匹配组

And this is how it looks, but only first 10 rows, if you look in detail, I have different groups of matches across the 3 columns

# A tibble: 60 x 4
       id previous_cabg previous_pci previous_ami
    <int> <fct>         <fct>        <fct>       
 1 112139 No            No           No          
 2  43919 No            No           No          
 3  92430 No            Yes          Yes         
 4  87137 NA            NA           No          
 5  95417 No            No           No          
 6  66955 NA            NA           NA          
 7  16293 NA            NA           No          
 8  61396 No            Yes          Yes         
 9  25379 No            Yes          No          
10  79229 No            No           No        

我希望只能使用 tidyverse 或 tidyverse 和 r base 的混合来解决这个问题.

I am hoping to solve this only with tidyverse or a mix of tidyverse and r base.

这是我尝试过的,但我觉得它不是那么明智.我认为这是不明智的,因为此代码将成为自动化过程的一部分,如果我将获得其他类别,而不是 Yes 和 No,例如 Unknown,因为 this then 出现在后面下一个数据集提取,那么我希望代码能够避免我上面给出的条件中的所有其他情况.

This is what I have tried, yet I feel it is not so wise. I believe it is not wise, since this code will be part of automation process and if I will get other categories, than Yes and No, like Unknown as thisn appeared later in the next dataset extracts, then I wish the code will avoid all the other cases from the conditions I have given above.

dplyr::mutate(first_attack = 
                  dplyr::case_when(previous_cabg == 'No'  | previous_pci == 'No'  | previous_ami == 'Yes' ~ 'Yes',
                                   previous_cabg == 'No'  | previous_pci == 'Yes' | previous_ami == 'Yes' ~ 'Yes',
                                   previous_cabg == 'Yes' | previous_pci == 'No'  | previous_ami == 'Yes' ~ 'Yes', 
                                   previous_cabg == 'Yes' | previous_pci == 'Yes' | previous_ami == 'No' ~  'Yes', 
                                   previous_cabg == 'No'  | previous_pci == 'No'  | previous_ami == 'Yes' ~ 'Yes',
                                   previous_cabg == 'No'  | previous_pci == 'Yes' | previous_ami == 'Yes' ~ 'Yes',
                                   previous_cabg == 'Yes' | previous_pci == 'No'  | previous_ami == 'Yes' ~ 'Yes', 
                                   previous_cabg == 'Yes' | previous_pci == 'Yes' | previous_ami == 'No' ~  'Yes'
                                   
                                   # deal with the unknown category
                                   previous_cabg == 'Unknown'  | previous_pci == 'Yes' | previous_ami == 'Yes' ~ 'Yes',
                                   previous_cabg == 'Yes' | previous_pci == 'Unknown'  | previous_ami == 'Yes' ~ 'Yes', 
                                   previous_cabg == 'Yes' | previous_pci == 'Yes' | previous_ami == 'No' ~  'Yes', 
                                   previous_cabg == 'Unknown'  | previous_pci == 'Unknown'  | previous_ami == 'Yes' ~ 'Yes',
                                   previous_cabg == 'Unknown'  | previous_pci == 'Yes' | previous_ami == 'Yes' ~ 'Yes',
                                   previous_cabg == 'Yes' | previous_pci == 'Unknown' | previous_ami == 'Yes' ~ 'Yes', 
                                   previous_cabg == 'Yes' | previous_pci == 'Yes' | previous_ami == 'Unknown' ~  'Yes', 
                                   
                                   
                                   previous_cabg == 'Yes' |  previous_pci == 'No'  | previous_ami == 'Yes' ~  'Yes', 
                                   previous_cabg == 'Yes' |  previous_pci == 'No'  | previous_ami == 'No'  ~  'Yes',
                                   previous_cabg == 'No'  |  previous_pci == 'No'  | previous_ami == 'Yes' ~  'Yes',
                                   previous_cabg == 'No'  | previous_pci == 'Yes'  | previous_ami == 'No'  ~ 'Yes', 
                                   
                                   
                                   previous_cabg == 'Yes' |  previous_pci == 'Unknown'   | previous_ami == 'Yes' ~  'Yes', 
                                   previous_cabg == 'Yes' |  previous_pci == 'Unknown'   | previous_ami == 'Unknown'   ~  'Yes',
                                   previous_cabg == 'Unknown'   |  previous_pci == 'Unknown'   | previous_ami == 'Yes' ~  'Yes',
                                   previous_cabg == 'Unknown'   | previous_pci == 'Yes'  | previous_ami == 'Unknown'   ~ 'Yes', 
                                   
                                   
                                   previous_cabg == 'Yes' | previous_pci == 'Unknown' | previous_ami == 'Unknown' ~ 'Yes', 
                                   previous_cabg == 'Unknown'  | previous_pci == 'Yes'| previous_ami == 'Unknown' ~ 'Yes', 
                                   previous_cabg == 'Yes' | previous_pci == 'No' | previous_ami == 'Yes' ~ 'Yes', 
                                   previous_cabg == 'Unknown'  | previous_pci == 'Yes'| previous_ami == 'Yes' ~ 'Yes', 
                                   
                                   previous_cabg == 'Yes' | previous_pci == 'No' | previous_ami == 'No' ~ 'Yes', 
                                   previous_cabg == 'No'  | previous_pci == 'Yes'| previous_ami == 'No' ~ 'Yes', 
                                   previous_cabg == 'Yes' | previous_pci == 'No' | previous_ami == 'Yes' ~ 'Yes', 
                                   previous_cabg == 'No'  | previous_pci == 'Yes'| previous_ami == 'Yes' ~ 'Yes', 
                                   
                                   previous_cabg == 'Yes' | previous_pci == 'Unknown' | previous_ami == 'Unknown' ~ 'Yes', 
                                   previous_cabg == 'Unknown'  | previous_pci == 'Yes'| previous_ami == 'Unknown' ~ 'Yes', 
                                   previous_cabg == 'Yes' | previous_pci == 'Unknown' | previous_ami == 'Yes' ~ 'Yes', 
                                   previous_cabg == 'Unknown' | previous_pci == 'Yes'| previous_ami == 'Yes' ~ 'Yes', 
                                   
                                   
                                   previous_cabg == 'No'  | previous_pci == 'No'  |  previous_ami == 'No' ~ 'No', 
                                   previous_cabg == 'Yes' | previous_pci == 'Yes' |  previous_ami == 'Yes' ~'Yes'
                                   
                  ))

推荐答案

这些操作是rowwise(),所以效率不是很高,但是tidyverse中的这个解决方案代码>应该干净地实现你想要的.

These operations are rowwise(), so they're not very efficient, but this solution in the tidyverse should cleanly achieve what you want.

让我们将您的示例数据集命名为 dataset.然后是下面的工作流程

Let us call your sample dataset by the name dataset. Then the following workflow

library(tidyverse)


# ...
# Code to generate your 'dataset'.
# ...


# Define custom logic across a single row.
get_first_attack <- function(values_across_row) {
  # "Yes" overrides all other values.
  if(isTRUE(any(values_across_row == "Yes"))){
    return("Yes")
  }
  # "No" overrides all missing values: 'NA' and "Unknown".
  else if(isTRUE(any(values_across_row == "No"))) {
    return("No")
  }
  # "Unknown" overrides all other missing values: 'NA'.
  else if(isTRUE(any(values_across_row == "Unknown"))) {
    return("Unknown")
  }
  # All values are missing: 'NA'.
  else {
    return(as.character(NA))
  }
}


dataset %>%
  # Examine row by row.
  dplyr::rowwise() %>%
  # Compare values across each row according to the logic in 'get_first_attack()'.
  dplyr::mutate(first_attack = get_first_attack(across(previous_cabg:previous_ami))) %>%
  # Exit row-wise approach, to restore efficiency.
  dplyr::ungroup() %>%
  # Factor 'first_attack' exactly like its neighboring column.
  dplyr::mutate(first_attack = factor(first_attack, levels = levels(previous_ami)))

应该给你这些结果

# A tibble: 60 x 5
       id previous_cabg previous_pci previous_ami first_attack
    <int> <fct>         <fct>        <fct>        <fct>       
 1 112139 No            No           No           No          
 2  43919 No            No           No           No          
 3  92430 No            Yes          Yes          Yes         
 4  87137 NA            NA           No           No          
 5  95417 No            No           No           No          
 6  66955 NA            NA           NA           NA          
 7  16293 NA            NA           No           No          
 8  61396 No            Yes          Yes          Yes         
 9  25379 No            Yes          No           Yes         
10  79229 No            No           No           No          
# ... with 50 more rows

其中 first_attack 列被恰当地定义为具有三个级别的 factor:Yes"No";未知".

where the first_attack column is fittingly defined as a factor with three levels: "Yes", "No", and "Unknown".

这篇关于仅使用 dplyr 在 3 列中应用条件时如何填充 Yes 和 No?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆