R - return boolean if any strings in a vector appear in any of several columns
Problem description
I have a large data frame, each row of which refers to an admission to hospital. Each admission is accompanied by up to 20 diagnosis codes in columns 5 to 24.
Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 ... Diag_20
data data data data J123 F456 H789 E468
data data data data T452 NA NA NA
Separately, I have a vector (risk_codes) of length 136, all strings. These strings are risk codes that can be similar to the truncated diagnosis codes (e.g. J12 would be ok, F4 would be ok, H798 would not).
I wish to add a column to the data frame that returns 1 if any of the risk codes are similar to any of the diagnosis codes. I don't need to know how many, just that at least one is.
So far, I've tried the following with the most success over other attempts:
for (i in 1:length(risk_codes)) {
  df$newcol <- apply(df, 1, function(x) sum(grepl(risk_codes[i], x[c(5:24)])))
}
It works well for a single string and populates the column with 0 for no similar codes and 1 for a similar code, but then everything is overwritten when the second code is checked, and so on over the 136 elements of the risk_codes vector.
Any ideas, please? Running a loop over every risk_code in every column for every row would not be feasible.
The solution would look like this
Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 ... Diag_20 newcol
data data data data J123 F456 H789 E468 1
data data data data T452 NA NA NA 0
if my risk_codes contained J12, F4, T543, for example.
We want to apply grepl with all the risk_codes at once, so that we get one result per row. We can do that with sapply and any.
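To see what sapply and any contribute, here is a minimal sketch for a single row. The vector x is a hypothetical stand-in for one row's diagnosis codes, with values borrowed from the example table; it is not part of the original data.

```r
# One row's diagnosis codes (hypothetical values taken from the example table)
x <- c(Diag_1 = "J123", Diag_2 = "F456", Diag_3 = "H789", Diag_20 = "E468")
risk_codes <- c("F456", "XXX")  # test codes

# sapply() calls grepl() once per risk code. Each call returns one logical
# value per diagnosis, so the result is a matrix with one column per risk code.
m <- sapply(risk_codes, function(code) grepl(code, x))

# any() collapses the whole matrix into a single TRUE/FALSE for the row.
row_has_risk <- any(m)
```

Here m has four rows (one per diagnosis) and two columns (one per risk code), and row_has_risk is TRUE because "F456" matches the second diagnosis.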
So, we can drop the for loop and your code becomes like this:
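my_df <- read.table(text = "Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 Diag_20
data data data data J123 F456 H789 E468
data data data data T452 NA NA NA", header = TRUE)
risk_codes <- c("F456", "XXX") # test codes
# For each row, test every risk code against the diagnosis columns;
# any() is TRUE if at least one code matches at least one diagnosis.
# This toy data has only 8 columns, so positions 9 to 24 come back as NA,
# which grepl() treats as no match.
my_df$newcol <- apply(my_df, 1, function(x)
  any(sapply(risk_codes,
             function(codes) grepl(codes, x[c(5:24)]))))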
The result is a logical vector.
If you still want 1 and 0 instead of TRUE/FALSE, you just need to finish by overwriting the column:
my_df$newcol <- ifelse(my_df$newcol, 1, 0)
The result will be:
> my_df
Col1 Col2 Col3 Col4 Diag_1 Diag_2 Diag_3 Diag_20 newcol
1 data data data data J123 F456 H789 E468 1
2 data data data data T452 <NA> <NA> <NA> 0
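For a large data frame, a row-wise apply() combined with a per-code sapply() can be slow. One possible alternative, sketched here under assumptions rather than taken from the original answer, is to collapse all risk codes into a single regular expression and test each diagnosis column once. The sketch assumes that "similar" means the risk code is a prefix of the diagnosis code (hence the "^" anchor), that the codes contain no regex metacharacters, and that the diagnosis columns can be found by the name prefix "Diag_". The toy data frame is a cut-down version of the example.

```r
# Toy data mirroring the example; the real data would hold Diag_1..Diag_20.
my_df <- data.frame(
  Col1    = c("data", "data"),
  Diag_1  = c("J123", "T452"),
  Diag_2  = c("F456", NA),
  Diag_3  = c("H789", NA),
  Diag_20 = c("E468", NA),
  stringsAsFactors = FALSE
)
risk_codes <- c("J12", "F4", "T543")

# Build one alternation pattern, here "^(J12|F4|T543)".
# Assumption: a risk code matches when it is a prefix of the diagnosis code.
pattern <- paste0("^(", paste(risk_codes, collapse = "|"), ")")

# Assumption: diagnosis columns share the name prefix "Diag_".
diag_cols <- grep("^Diag_", names(my_df))

# grepl() is vectorised over a whole column and returns FALSE for NA,
# so missing diagnoses never count as matches.
hit_list <- lapply(my_df[diag_cols], function(col) grepl(pattern, col))

# Combine the per-column results with element-wise OR, then convert to 1/0.
my_df$newcol <- as.integer(Reduce(`|`, hit_list))
```

With these inputs the first row gets 1 (J123 starts with J12) and the second row gets 0 (T452 does not start with T543). If partial matches anywhere in the code are wanted instead, dropping the "^" anchor gives the same matching behaviour as the unanchored grepl() calls above.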