R - return boolean if any strings in a vector appear in any of several columns

Problem description

I have a large data frame, each row of which refers to an admission to hospital. Each admission is accompanied by up to 20 diagnosis codes in columns 5 to 24.

Col1   Col2   Col3   Col4   Diag_1  Diag_2  Diag_3 ... Diag_20
data   data   data   data   J123    F456    H789       E468
data   data   data   data   T452    NA      NA         NA

Separately, I have a character vector (risk_codes) of length 136. These strings are risk codes that may match a truncated diagnosis code, i.e. the start of one (e.g. J12 would be ok, F4 would be ok, H798 would not).

I wish to add a column to the data frame that returns 1 if any of the risk codes are similar to any of the diagnosis codes. I don't need to know how many, just that at least one is.

So far, I've tried the following with the most success over other attempts:

for (i in 1:length(risk_codes)) {
    df$newcol <- apply(df,1,function(x) sum(grepl(risk_codes[i], x[c(5:24)])))
}

It works well for a single string and populates the column with 0 for no similar codes and 1 for a similar code, but then everything is overwritten when the second code is checked, and so on over the 136 elements of the risk_codes vector.

Any ideas, please? Running a loop over every risk_code in every column for every row would not be feasible.

The solution would look like this

Col1   Col2   Col3   Col4   Diag_1  Diag_2  Diag_3 ... Diag_20   newcol
data   data   data   data   J123    F456    H789       E468      1
data   data   data   data   T452    NA      NA         NA        0

if my risk_codes contained J12, F4, T543, for example.

Solution

We want to apply grepl with all the risk_codes at once, so that each row yields a single result. We can do that with sapply and any.

So, we can drop the for loop and your code becomes like this:

my_df <- read.table(text="Col1   Col2   Col3   Col4   Diag_1  Diag_2  Diag_3  Diag_20
data   data   data   data   J123    F456    H789       E468
data   data   data   data   T452    NA      NA         NA", header=TRUE)

risk_codes <- c("F456", "XXX") # test codes

# Diagnosis codes sit in columns 5 to 24; NA entries never match.
my_df$newcol <- apply(my_df, 1, function(x)
    any(sapply(risk_codes,
               function(code) grepl(code, x[5:24]))))

The result is a logical vector.

If you still want to use 1 and 0 instead of the TRUE/FALSE, you just need to finish with:

my_df$newcol <- ifelse(my_df$newcol, 1, 0)

The result will be:

> my_df
  Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 Diag_20 newcol
1 data data data data   J123   F456   H789    E468      1
2 data data data data   T452   <NA>   <NA>    <NA>      0
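
Note that grepl matches a code anywhere in the string, whereas the question describes risk codes as truncated (leading) parts of diagnosis codes. As a possible variation, the following sketch combines all risk codes into one pattern anchored at the start of the string and checks each diagnosis column once, rather than once per row and code. It is only an illustration under assumptions: the toy data frame has four diagnosis columns, the columns are selected by a hypothetical "Diag_" name prefix, and the risk codes are assumed to contain no regular-expression metacharacters.

```r
# Toy data mirroring the example above (only four diagnosis columns here).
my_df <- data.frame(
  Col1 = "data", Col2 = "data", Col3 = "data", Col4 = "data",
  Diag_1  = c("J123", "T452"),
  Diag_2  = c("F456", NA),
  Diag_3  = c("H789", NA),
  Diag_20 = c("E468", NA),
  stringsAsFactors = FALSE
)

risk_codes <- c("J12", "F4", "T543")  # example codes from the question

# One pattern for all codes; "^" anchors the match to the start of the code.
# Assumption: the codes contain no regex metacharacters.
pattern <- paste0("^(", paste(risk_codes, collapse = "|"), ")")

# Assumption: every diagnosis column name starts with "Diag_".
diag_cols <- grep("^Diag_", names(my_df))

# grepl() runs once per column and returns FALSE for NA entries;
# Reduce() with `|` ORs the per-column results together row by row.
any_hit <- Reduce(`|`, lapply(my_df[diag_cols],
                              function(col) grepl(pattern, col)))

my_df$newcol <- as.integer(any_hit)  # 1 if any diagnosis matched, else 0
```

With the question's example codes, the first row is flagged because J123 starts with J12, and the second row is not, since T452 does not start with T543.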
