Matching controls to cases using multiple conditions in R


Problem description

I want to match 2 controls to every case, subject to two conditions:

① the age difference should be within ±2;

② the income difference should be within ±2.

If a case has more than 2 eligible controls, I just need to select 2 of them at random. Here is an example:

dat = structure(list(id = c(1, 2, 3, 4, 111, 222, 333, 444, 555, 666, 
                     777, 888, 999, 1000), 
              age = c(10, 20, 44, 11, 12, 11, 8, 12,  11, 22, 21, 18, 21, 18), 
              income = c(35, 72, 11, 35, 37, 36, 33,  70, 34, 74, 70, 44, 76, 70), 
              group = c("case", "case", "case", "case", "control", "control", 
                        "control", "control", "control", "control", "control", 
                        "control", "control", "control")), 
         row.names = c(NA, -14L), class = c("tbl_df", "tbl", "data.frame"))

> dat
# A tibble: 14 x 4
      id   age income group  
   <dbl> <dbl>  <dbl> <chr>  
 1     1    10     35 case   
 2     2    20     72 case   
 3     3    44     11 case   
 4     4    11     35 case   
 5   111    12     37 control
 6   222    11     36 control
 7   333     8     33 control
 8   444    12     70 control
 9   555    11     34 control
10   666    22     74 control
11   777    21     70 control
12   888    18     44 control
13   999    21     76 control
14  1000    18     70 control

EXPECTED OUTCOME

For id = 1, the matched controls are listed below; I just need to select 2 of them at random.

|id|age|income|group|
|:----|:----|:----|:----|
|111|12|37|control|
|222|11|36|control|
|333|8|33|control|
|555|11|34|control|

For id = 2, the matched controls are listed below; I just need to select 2 of them at random.

|id|age|income|group|
|:----|:----|:----|:----|
|666|22|74|control|
|777|21|70|control|
|1000|18|70|control|

For id = 3, there are no matching controls in dat.

For id = 4, the matched controls are listed below; I just need to select 2 of them at random.

|id|age|income|group|
|:----|:----|:----|:----|
|111|12|37|control|
|222|11|36|control|
|555|11|34|control|

One thing to note here is that the controls for id = 1 and id = 4 overlap. I don't want two cases to share a control. If id = 1 chooses id = 111 and id = 222 as controls, then id = 4 can only choose id = 555; if id = 1 chooses id = 111 and id = 333, then id = 4 can only choose id = 222 and id = 555.
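
The candidate lists above can be reproduced by pairing every case with every control and keeping only the pairs that satisfy both conditions. The following is a minimal base R sketch, assuming the data from the question; the names `cases`, `controls`, `idx` and `candidates` are illustrative and not part of the original post.

```r
# Data from the question, rebuilt as a plain data.frame
dat <- data.frame(
  id     = c(1, 2, 3, 4, 111, 222, 333, 444, 555, 666, 777, 888, 999, 1000),
  age    = c(10, 20, 44, 11, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18),
  income = c(35, 72, 11, 35, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70),
  group  = rep(c("case", "control"), times = c(4, 10))
)
cases    <- dat[dat$group == "case", ]
controls <- dat[dat$group == "control", ]

# Every (case row, control row) index pair
idx <- expand.grid(case = seq_len(nrow(cases)),
                   control = seq_len(nrow(controls)))

# Keep pairs where BOTH differences are within +/- 2
ok <- abs(cases$age[idx$case] - controls$age[idx$control]) <= 2 &
  abs(cases$income[idx$case] - controls$income[idx$control]) <= 2

# Eligible control ids, grouped by case id (cases with no candidates are absent)
candidates <- split(controls$id[idx$control[ok]], cases$id[idx$case[ok]])
print(candidates)
```

This only enumerates the eligible controls; it does not yet sample 2 per case or prevent two cases from sharing a control.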

The final output may look like this (the ids in the control group are randomly selected from the ids that meet the conditions):

|id|age|income|group|
|:----|:----|:----|:----|
|1|10|35|case|
|2|20|72|case|
|3|44|11|case|
|4|11|35|case|
|111|12|37|control|
|222|11|36|control|
|333|8|33|control|
|555|11|34|control|
|777|21|70|control|
|1000|18|70|control|

NOTE

I've looked at some websites, but they don't meet my needs, and I don't know how to implement my requirements in R code.

Any help would be greatly appreciated!

1. https://stackoverflow.com/questions/56026700/is-there-any-package-for-case-control-matching-individual-1n-matching-in-r-n

2. Case control matching in R (or SPSS), based on age, sex and ethnicity?

3. Matching case-controls in R using the ccoptimalmatch package

4. Exact matching in R

Recommended answer

As per the modified requirement, I propose the following for loop:

library(dplyr, warn.conflicts = FALSE)

dat %>%
  split(.$group) %>%
  list2env(envir = .GlobalEnv)
#> <environment: R_GlobalEnv>

control$FILTER <- FALSE
control
#> # A tibble: 10 x 5
#>       id   age income group   FILTER
#>    <dbl> <dbl>  <dbl> <chr>   <lgl> 
#>  1   111    12     37 control FALSE 
#>  2   222    11     36 control FALSE 
#>  3   333     8     33 control FALSE 
#>  4   444    12     70 control FALSE 
#>  5   555    11     34 control FALSE 
#>  6   666    22     74 control FALSE 
#>  7   777    21     70 control FALSE 
#>  8   888    18     44 control FALSE 
#>  9   999    21     76 control FALSE 
#> 10  1000    18     70 control FALSE

set.seed(123)

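for(i in seq_len(nrow(case))){
  # controls within ±2 of this case on both age and income that are still unused
  x <- which(between(control$age, case$age[i] -2, case$age[i] +2) & 
               between(control$income, case$income[i] -2, case$income[i] + 2) & 
               !control$FILTER)
  # index into x rather than calling sample(x, ...): when x has length 1,
  # sample(x, 1) would draw from 1:x instead of returning x
  control$FILTER[x[sample.int(length(x), min(2, length(x)))]] <- TRUE
}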

control
#> # A tibble: 10 x 5
#>       id   age income group   FILTER
#>    <dbl> <dbl>  <dbl> <chr>   <lgl> 
#>  1   111    12     37 control TRUE  
#>  2   222    11     36 control TRUE  
#>  3   333     8     33 control TRUE  
#>  4   444    12     70 control FALSE 
#>  5   555    11     34 control TRUE  
#>  6   666    22     74 control FALSE 
#>  7   777    21     70 control TRUE  
#>  8   888    18     44 control FALSE 
#>  9   999    21     76 control FALSE 
#> 10  1000    18     70 control TRUE

bind_rows(case, control) %>% filter(FILTER | is.na(FILTER)) %>% select(-FILTER)
#> # A tibble: 10 x 4
#>       id   age income group  
#>    <dbl> <dbl>  <dbl> <chr>  
#>  1     1    10     35 case   
#>  2     2    20     72 case   
#>  3     3    44     11 case   
#>  4     4    11     35 case   
#>  5   111    12     37 control
#>  6   222    11     36 control
#>  7   333     8     33 control
#>  8   555    11     34 control
#>  9   777    21     70 control
#> 10  1000    18     70 control
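
The loop above flags which controls were used, but the final table no longer shows which case each control was matched to. If that pairing is needed, one option is to store the case id instead of a logical flag. The following is a base R sketch under that assumption; the column name `matched_to` and the objects `cases`, `controls` and `matched` are hypothetical and do not appear in the original answer.

```r
dat <- data.frame(
  id     = c(1, 2, 3, 4, 111, 222, 333, 444, 555, 666, 777, 888, 999, 1000),
  age    = c(10, 20, 44, 11, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18),
  income = c(35, 72, 11, 35, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70),
  group  = rep(c("case", "control"), times = c(4, 10))
)
cases    <- dat[dat$group == "case", ]
controls <- dat[dat$group == "control", ]

set.seed(123)
# Hypothetical column: id of the case that took this control (NA = unused)
controls$matched_to <- NA_real_

for (i in seq_len(nrow(cases))) {
  # Unused controls within +/- 2 on both age and income
  x <- which(abs(controls$age - cases$age[i]) <= 2 &
               abs(controls$income - cases$income[i]) <= 2 &
               is.na(controls$matched_to))
  # Index into x so that a single remaining candidate is handled correctly
  pick <- x[sample.int(length(x), min(2, length(x)))]
  controls$matched_to[pick] <- cases$id[i]
}

# One row per selected control, with the case it belongs to
matched <- controls[!is.na(controls$matched_to),
                    c("matched_to", "id", "age", "income")]
matched <- matched[order(matched$matched_to), ]
print(matched)
```

Because a control is only eligible while `matched_to` is still `NA`, no control can be assigned to two cases.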

Checking the results with a different seed

set.seed(234)
for(i in seq_len(nrow(case))){
  x <- which(between(control$age, case$age[i] -2, case$age[i] +2) & 
               between(control$income, case$income[i] -2, case$income[i] + 2) & 
               !control$FILTER)
  control$FILTER[sample(x, min(2, length(x)))] <- TRUE
}
control

bind_rows(case, control) %>% filter(FILTER | is.na(FILTER)) %>% select(-FILTER)

# A tibble: 10 x 4
      id   age income group  
   <dbl> <dbl>  <dbl> <chr>  
 1     1    10     35 case   
 2     2    20     72 case   
 3     3    44     11 case   
 4     4    11     35 case   
 5   111    12     37 control
 6   222    11     36 control
 7   333     8     33 control
 8   555    11     34 control
 9   777    21     70 control
10  1000    18     70 control
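
As written, the second loop reuses the FILTER column left over from the first run, so controls flagged under the first seed stay excluded. To compare seeds from a clean state, the loop could be wrapped in a function that creates fresh flags on every call. This is a rough base R sketch, not the answer's own code; the function name `match_controls` and its arguments `k` and `tol` are assumptions made for illustration.

```r
dat <- data.frame(
  id     = c(1, 2, 3, 4, 111, 222, 333, 444, 555, 666, 777, 888, 999, 1000),
  age    = c(10, 20, 44, 11, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18),
  income = c(35, 72, 11, 35, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70),
  group  = rep(c("case", "control"), times = c(4, 10))
)

# Hypothetical helper: up to k controls per case, each within +/- tol on
# age and income, and no control shared between cases
match_controls <- function(dat, k = 2, tol = 2) {
  cases    <- dat[dat$group == "case", ]
  controls <- dat[dat$group == "control", ]
  used <- rep(FALSE, nrow(controls))   # fresh flags on every call
  for (i in seq_len(nrow(cases))) {
    x <- which(abs(controls$age - cases$age[i]) <= tol &
                 abs(controls$income - cases$income[i]) <= tol &
                 !used)
    used[x[sample.int(length(x), min(k, length(x)))]] <- TRUE
  }
  rbind(cases, controls[used, ])
}

for (seed in c(123, 234)) {
  set.seed(seed)
  print(match_controls(dat))
}
```

Cases are processed in row order, so the procedure is greedy: a later case such as id = 4 can end up with fewer than `k` controls when an earlier case has already taken some of its candidates.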


dat modified before proceeding for id 3

  • split the data into two groups, case and control, using base R's `split`
  • save the two as separate data frames using list2env
  • using purrr::map_dfr, you can sample 2 rows for each case
    • once for age
    • and once for income

      library(tidyverse)
      
      dat = structure(list(id = c(1, 2, 3, 111, 222, 333, 444, 555, 666, 777, 888, 999, 1000), 
                           age = c(10, 20, 44, 12, 11, 8, 12, 11, 22, 21, 18, 21, 18), 
                           income = c(35, 72, 11, 37, 36, 33, 70, 34, 74, 70, 44, 76, 70), 
                           group = c("case", "case", "case", "control", "control", "control", 
                                     "control", "control", "control", "control", "control", 
                                     "control", "control")),
                      row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"))
      
      dat
      #> # A tibble: 13 x 4
      #>       id   age income group  
      #>    <dbl> <dbl>  <dbl> <chr>  
      #>  1     1    10     35 case   
      #>  2     2    20     72 case   
      #>  3     3    44     11 case   
      #>  4   111    12     37 control
      #>  5   222    11     36 control
      #>  6   333     8     33 control
      #>  7   444    12     70 control
      #>  8   555    11     34 control
      #>  9   666    22     74 control
      #> 10   777    21     70 control
      #> 11   888    18     44 control
      #> 12   999    21     76 control
      #> 13  1000    18     70 control
      
      dat %>%
        split(.$group) %>%
        list2env(envir = .GlobalEnv)
      #> <environment: R_GlobalEnv>
      
      set.seed(123)
      bind_rows(case, map_dfr(case$age, ~ control %>% filter(between(age, .x -2, .x +2) ) %>%
             sample_n(min(n(),2))) %>% sample_n(min(n(),2)),
             map_dfr(case$income, ~ control %>% filter(between(income, .x -2, .x +2)) %>%
                       sample_n(min(n(),2))) %>% sample_n(min(n(),2)))
      #> # A tibble: 7 x 4
      #>      id   age income group  
      #>   <dbl> <dbl>  <dbl> <chr>  
      #> 1     1    10     35 case   
      #> 2     2    20     72 case   
      #> 3     3    44     11 case   
      #> 4   222    11     36 control
      #> 5   777    21     70 control
      #> 6   111    12     37 control
      #> 7   333     8     33 control
      


The code below does the same without saving individual data frames:

      dat %>%
        split(.$group) %>%
        {bind_rows(.$case, 
                   map_dfr(.$case$age, \(.x) .$control %>% filter(between(age, .x -2, .x +2) ) %>%
                             sample_n(min(n(),2))) %>% sample_n(min(n(),2)),
                   map_dfr(.$case$income, \(.x) .$control %>% filter(between(income, .x -2, .x +2)) %>%
                             sample_n(min(n(),2))) %>% sample_n(min(n(),2)))}
      
