R - return boolean if any strings in a vector appear in any of several columns

Problem description

I have a large data frame, each row of which refers to an admission to hospital. Each admission is accompanied by up to 20 diagnosis codes in columns 5 to 24.

Col1   Col2   Col3   Col4   Diag_1  Diag_2  Diag_3 ... Diag_20
data   data   data   data   J123    F456    H789       E468
data   data   data   data   T452    NA      NA         NA

Separately, I have a vector (risk_codes) of length 136, all strings. These strings are risk codes that can be similar to the truncated diagnosis codes (e.g. J12 would be ok, F4 would be ok, H798 would not).

I wish to add a column to the data frame that returns 1 if any of the risk codes are similar to any of the diagnosis codes. I don't need to know how many, just that at least one is.

So far, I've tried the following with the most success over other attempts:

for (i in seq_along(risk_codes)) {
    # newcol is reassigned on every iteration, so only the last code's result survives
    df$newcol <- apply(df, 1, function(x) sum(grepl(risk_codes[i], x[5:24])))
}

It works well for a single string and populates the column with 0 for no similar codes and 1 for a similar code, but then everything is overwritten when the second code is checked, and so on over the 136 elements of the risk_codes vector.

Any ideas, please? Running a loop over every risk_code in every column for every row would not be feasible.

The solution would look like this:

Col1   Col2   Col3   Col4   Diag_1  Diag_2  Diag_3 ... Diag_20   newcol
data   data   data   data   J123    F456    H789       E468      1
data   data   data   data   T452    NA      NA         NA        0

if my risk_codes contained J12, F4, T543, for example.

Recommended answer

We want to apply grepl with all the risk_codes at once, so that we get a single result per row. We can do that with sapply and any.
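To see what that combination does, here is a minimal sketch for a single row. The vector `row_codes` is a made-up stand-in for one admission's diagnosis codes, not part of the original question.

```r
risk_codes <- c("F456", "XXX")                  # test codes
row_codes  <- c("J123", "F456", "H789", "E468") # hypothetical diagnoses for one row

# One grepl call per risk code; sapply binds the results into a
# logical matrix with one row per diagnosis and one column per risk code
matches <- sapply(risk_codes, function(code) grepl(code, row_codes))

# any() collapses the whole matrix into a single TRUE/FALSE for the row
any(matches)
```

Because `any` only asks whether at least one match exists, it does not matter which code matched or how many did.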

So, we can drop the for loop and your code becomes like this:

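my_df <- read.table(text="Col1   Col2   Col3   Col4   Diag_1  Diag_2  Diag_3  Diag_20
data   data   data   data   J123    F456    H789       E468
data   data   data   data   T452    NA      NA         NA", header=TRUE)

risk_codes <- c("F456", "XXX") # test codes

# Pick the diagnosis columns by name; in the full data frame these are columns 5 to 24
diag_cols <- grep("^Diag_", names(my_df))

# For each row, test every risk code against every diagnosis code
my_df$newcol <- apply(my_df[diag_cols], 1, function(x)
                        any(sapply(risk_codes,
                                   function(code) grepl(code, x))))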

The result is a logical vector.

If you still want to use 1 and 0 instead of the TRUE/FALSE, you just need to finish with:

my_df$newcol <- ifelse(my_df$newcol, 1, 0)

The result will be:

> my_df
  Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 Diag_20 newcol
1 data data data data   J123   F456   H789    E468      1
2 data data data data   T452   <NA>   <NA>    <NA>      0
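
Two caveats may matter on the full data. First, grepl matches a pattern anywhere in the string, whereas the question describes risk codes as truncated diagnosis codes, which suggests a match at the start only. Second, a row-wise apply calls the function once per admission, which can be slow on a large data frame. The following is a sketch of one possible alternative, not the method above. It assumes prefix matching is wanted, that the codes contain no regex metacharacters, and that the diagnosis columns are named Diag_*. The names risk_pattern, diag_cols and hits are introduced here for illustration.

```r
my_df <- read.table(text = "Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 Diag_20
data data data data J123 F456 H789 E468
data data data data T452 NA NA NA", header = TRUE)

risk_codes <- c("J12", "F4", "T543")  # example codes from the question

# Anchor at the start of the string and join all codes into one regex:
# "^(J12|F4|T543)"
risk_pattern <- paste0("^(", paste(risk_codes, collapse = "|"), ")")

# Select the diagnosis columns by name rather than by position
diag_cols <- grep("^Diag_", names(my_df))

# One vectorised grepl call per column; NA entries yield FALSE
hits <- lapply(my_df[diag_cols], function(col) grepl(risk_pattern, col))

# Element-wise OR across the columns, then TRUE/FALSE -> 1/0
my_df$newcol <- as.integer(Reduce(`|`, hits))
```

With these example codes the first row gets 1 (J123 starts with J12) and the second gets 0 (T452 does not start with T543), which matches the output the question asks for.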
