R, How to accumulate values in a list column, based on multiple criteria

Problem description

I have a dataset of patients receiving treatment in various hospitals (in-patient only), and some analysis has revealed several inconsistencies. One of them is that the software allowed a patient to be admitted without the previously opened case_id being closed.

To illustrate, consider the following sample dataset:

dput(df)

df <- structure(list(case_id = 1:22, patient_id = c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 7L, 
8L, 8L), pack_id = c(12L, 62L, 59L, 68L, 77L, 86L, 20L, 55L, 
86L, 72L, 7L, 54L, 75L, 26L, 21L, 12L, 49L, 35L, 51L, 31L, 10L, 
54L), hosp_id = c(1L, 1L, 2L, 2L, 1L, 1L, 2L, 3L, 3L, 4L, 2L, 
3L, 3L, 3L, 4L, 5L, 6L, 6L, 7L, 7L, 8L, 8L), admn_date = structure(c(18262, 
18264, 18265, 18266, 18277, 18279, 18283, 18262, 18264, 18277, 
18287, 18275, 18301, 18291, 18366, 18374, 18309, 18319, 18364, 
18303, 18328, 18341), class = "Date"), discharge_date = structure(c(18275, 
18276, 18271, 18275, 18288, 18280, 18286, 18275, 18276, 18288, 
18291, 18283, 18309, 18297, 18375, 18381, 18347, 18328, 18367, 
18309, 18341, 18344), class = "Date")), row.names = c(NA, -22L
), class = "data.frame")

> df
   case_id patient_id pack_id hosp_id  admn_date discharge_date
1       1          1      12       1 2020-01-01     2020-01-14
2       2          1      62       1 2020-01-03     2020-01-15
3       3          1      59       2 2020-01-04     2020-01-10
4       4          1      68       2 2020-01-05     2020-01-14
5       5          1      77       1 2020-01-16     2020-01-27
6       6          1      86       1 2020-01-18     2020-01-19
7       7          1      20       2 2020-01-22     2020-01-25
8       8          2      55       3 2020-01-01     2020-01-14
9       9          2      86       3 2020-01-03     2020-01-15
10     10          2      72       4 2020-01-16     2020-01-27
11     11          1       7       2 2020-01-26     2020-01-30
12     12          3      54       3 2020-01-14     2020-01-22
13     13          3      75       3 2020-02-09     2020-02-17
14     14          3      26       3 2020-01-30     2020-02-05
15     15          4      21       4 2020-04-14     2020-04-23
16     16          4      12       5 2020-04-22     2020-04-29
17     17          5      49       6 2020-02-17     2020-03-26
18     18          5      35       6 2020-02-27     2020-03-07
19     19          6      51       7 2020-04-12     2020-04-15
20     20          7      31       7 2020-02-11     2020-02-17
21     21          8      10       8 2020-03-07     2020-03-20
22     22          8      54       8 2020-03-20     2020-03-23

In the data above, the patient with id 1 was admitted to hospital 1 (row 1) on 1 January and discharged on 14 January. Before that discharge, the patient was admitted to the same hospital again (row 2), and then twice to hospital 2 (rows 3 and 4). All four records were finally closed by 15 January (row 2).

I have already flagged the records where a patient was admitted to multiple hospitals, or to the same hospital multiple times, with the following code:

library(dplyr)
library(tidyr)

df_2 <- df %>% arrange(patient_id, admn_date, discharge_date) %>%
  mutate(sort_key = row_number()) %>%
  # one row per event: an admission and a discharge for every case
  pivot_longer(c(admn_date, discharge_date), names_to = "activity", 
               values_to = "date", names_pattern = "(.*)_date") %>%
  mutate(activity = factor(activity, ordered = TRUE, 
                           levels = c("admn", "discharge")),
         admitted = ifelse(activity == "admn", 1, -1)) %>%
  group_by(patient_id) %>%
  arrange(date, sort_key, activity, .by_group = TRUE) %>% 
  # running count of cases open for the patient after each event
  mutate(admitted = cumsum(admitted)) %>%
  ungroup()
  
 > df_2
# A tibble: 44 x 8
   case_id patient_id pack_id hosp_id sort_key activity  date       admitted
    <int>      <int>   <int>   <int>    <int> <ord>     <date>        <dbl>
 1      1          1      12       1        1 admn      2020-01-01        1
 2      2          1      62       1        2 admn      2020-01-03        2
 3      3          1      59       2        3 admn      2020-01-04        3
 4      4          1      68       2        4 admn      2020-01-05        4
 5      3          1      59       2        3 discharge 2020-01-10        3
 6      1          1      12       1        1 discharge 2020-01-14        2
 7      4          1      68       2        4 discharge 2020-01-14        1
 8      2          1      62       1        2 discharge 2020-01-15        0
 9      5          1      77       1        5 admn      2020-01-16        1
10      6          1      86       1        6 admn      2020-01-18        2
# ... with 34 more rows

With df_2 %>% filter(admitted > 1 & activity == "admn") I can pick out all the inconsistent records at once.

However, I want to add a list column that tracks the open records wherever a new record/case_id has been opened without closing a previous one: the hosp_id is appended whenever activity == 'admn' and removed from the existing entries whenever activity == 'discharge'. So my desired output for df_2 is something like:

# A tibble: 44 x 9
   case_id patient_id pack_id hosp_id sort_key activity  date       admitted    open_records
    <int>      <int>   <int>   <int>    <int> <ord>     <date>        <dbl>     <list>
 1      1          1      12       1        1 admn      2020-01-01        1     1
 2      2          1      62       1        2 admn      2020-01-03        2     1, 1
 3      3          1      59       2        3 admn      2020-01-04        3     1, 1, 2
 4      4          1      68       2        4 admn      2020-01-05        4     1, 1, 2, 2
 5      3          1      59       2        3 discharge 2020-01-10        3     1, 1, 2
 6      1          1      12       1        1 discharge 2020-01-14        2     1, 2
 7      4          1      68       2        4 discharge 2020-01-14        1     1
 8      2          1      62       1        2 discharge 2020-01-15        0     <NULL>
 9      5          1      77       1        5 admn      2020-01-16        1     1
10      6          1      86       1        6 admn      2020-01-18        2     1, 1
# ... with 34 more rows

NOTE: I am aware that a list column will not be displayed in a tibble/data.frame the way I have shown it above; that is for explanation only. However, if there is a way to print it like that, I would like to know.
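
On the display point, one possible approach (a sketch on invented toy data, not something from the original post) is to keep the list column and derive a character column from it with toString(). The names toy and open_shown are made up for this illustration.

```r
library(tibble)

# Toy data, invented for illustration only
toy <- tibble(
  case_id      = 1:3,
  open_records = list(1L, c(1L, 1L), c(1L, 1L, 2L))
)

# Collapse each list element into one comma-separated string for display;
# the list column itself stays available for further computation
toy$open_shown <- vapply(toy$open_records, toString, character(1))

print(toy)
```

The list column still prints as <int [n]> placeholders, while open_shown carries the readable form next to it.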

MOREOVER: If there is a better strategy for storing the hospital ids in a column than generating a list column, I would also like to know about it.

Recommended answer

Here is a tidyverse solution that builds open_records as a comma-separated character column:

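library(dplyr)
library(purrr)

df_2 %>%
  group_by(patient_id) %>%
  # walk through each patient's events, carrying the open hosp_ids as a "1,1,2," string
  mutate(open_records = accumulate(2:n(), .init = paste0(hosp_id[1], ","), 
                                   ~ if(activity[.y] == "admn") {
                                     paste0(.x, hosp_id[.y], ",")   # admission: append
                                   } else {
                                     # discharge: drop the first "id," match; assumes no
                                     # hosp_id is a suffix of another (true for ids 1-8 here)
                                     sub(paste0(hosp_id[.y], ","), "", .x)
                                   }),
         open_records = sub(",$", "", open_records))  # strip the trailing comma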

# A tibble: 44 x 9
# Groups:   patient_id [8]
   case_id patient_id pack_id hosp_id sort_key activity  date       admitted open_records
     <int>      <int>   <int>   <int>    <int> <ord>     <date>        <dbl> <chr>       
 1       1          1      12       1        1 admn      2020-01-01        1 "1"         
 2       2          1      62       1        2 admn      2020-01-03        2 "1,1"       
 3       3          1      59       2        3 admn      2020-01-04        3 "1,1,2"     
 4       4          1      68       2        4 admn      2020-01-05        4 "1,1,2,2"   
 5       3          1      59       2        3 discharge 2020-01-10        3 "1,1,2"     
 6       1          1      12       1        1 discharge 2020-01-14        2 "1,2"       
 7       4          1      68       2        4 discharge 2020-01-14        1 "1"         
 8       2          1      62       1        2 discharge 2020-01-15        0 ""          
 9       5          1      77       1        5 admn      2020-01-16        1 "1"         
10       6          1      86       1        6 admn      2020-01-18        2 "1,1"       
# ... with 34 more rows
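
If an actual list column is preferred, as the question originally asked, a variant along the following lines might work. This is only a sketch under assumptions: the helper update_open and the small events tibble are hypothetical, with values copied from the first eight rows of the output above, and purrr::accumulate2() is used in place of the string manipulation. Working on integer vectors also sidesteps the substring matching that sub() relies on.

```r
library(dplyr)
library(purrr)

# Small stand-in for df_2 (patient 1, first eight events shown above)
events <- tibble(
  patient_id = 1L,
  hosp_id    = c(1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L),
  activity   = c("admn", "admn", "admn", "admn",
                 "discharge", "discharge", "discharge", "discharge")
)

# Hypothetical helper: append the id on admission,
# drop its first occurrence on discharge
update_open <- function(open, id, act) {
  if (act == "admn") return(c(open, id))
  hit <- match(id, open)            # position of the first matching open id, or NA
  if (is.na(hit)) open else open[-hit]
}

result <- events %>%
  group_by(patient_id) %>%
  # .init is the empty set of open records; [-1] drops that initial state
  mutate(open_records = accumulate2(hosp_id, activity, update_open,
                                    .init = integer(0))[-1]) %>%
  ungroup()

# Inspect the accumulated ids row by row
sapply(result$open_records, toString)
```

Each element of open_records is an integer vector, so the number of open cases is simply lengths(result$open_records), and an empty vector marks the point where everything has been closed.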
