Partial string match two columns R
Problem description
I have been trying to partially match two column contents based on a list of regular expressions common to both columns:
dats<-data.frame(ID=c(1:3),species=c("dog","cat","rabbit"),
species.descriptor=c("all animal dog","all animal cat","rabbit exotic"),product=c(1,2,3),
product.authorise=c("all animal dog cat rabbit","cat horse pig","dog cat"))
In order to achieve this:
goal<-data.frame(ID=c(1:3),species=c("dog","cat","rabbit"),
species.descriptor=c("all animal dog","all animal cat","rabbit exotic"),
product=c(1,2,3),product.authorise=c("all animal dog cat rabbit","cat horse pig",
"dog cat"), authorised=c("TRUE","TRUE","FALSE"))
So to explain further, if 'dog' appears at any point in both columns, then this would be considered 'TRUE' in $match, and this would apply for any individual species descriptor. If no matches are found, then a return of either FALSE or NA would be fine.
So far I have gotten to this point:
library(stringr)
patts<-c("dog","cat","all animal")
reg.patts<-paste(patts,collapse="|")
dats$matched<-ifelse((str_extract(dats$species.descriptor,reg.patts) == str_extract(dats$product.authorise,reg.patts)),"TRUE","FALSE")
dats
ID species species.descriptor product product.authorise matched
1 dog all animal dog 1 all animal dog cat rabbit TRUE
2 cat all animal cat 2 cat horse pig FALSE
3 rabbit rabbit exotic 3 dog cat <NA>
As you can see, this correctly identifies the first and last rows as 'all animal' appears first in both strings, and there is no match at all in the last. However, it seems to struggle (as in the second row) when the reg exp doesn't appear first in the string. I have tried str_extract_all, but have only resulted in error messages so far. I was wondering if anyone can help, please?
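The second-row mismatch can be reproduced in base R. This is a minimal sketch that assumes regmatches() with regexpr() is an acceptable stand-in for str_extract: both return only the first match in each string, so the comparison depends on which pattern happens to occur first rather than on whether any pattern is shared.

```r
# Same alternation pattern as in the attempt above
patts <- c("dog", "cat", "all animal")
reg.patts <- paste(patts, collapse = "|")

# The two strings from row 2 of the example data
row2 <- c("all animal cat", "cat horse pig")

# regexpr() locates only the first match in each string,
# which mirrors the first-match behaviour of str_extract
first_hit <- regmatches(row2, regexpr(reg.patts, row2))
first_hit
# The first matches differ even though both strings contain "cat"
first_hit[1] == first_hit[2]
```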
Recommended answer
Here is a solution using dplyr for piping. The core component is using grepl for logical string matching of species in both species.descriptor and product.authorise.
library(dplyr)
dats %>%
  # work one row at a time, because grepl uses only a single pattern per call
  rowwise() %>%
  mutate(authorised =
    grepl(species, species.descriptor) &
    grepl(species, product.authorise)
  )
Source: local data frame [3 x 6]
Groups: <by row>
ID species species.descriptor product product.authorise authorised
(int) (fctr) (fctr) (dbl) (fctr) (lgl)
1 1 dog all animal dog 1 all animal dog cat rabbit TRUE
2 2 cat all animal cat 2 cat horse pig TRUE
3 3 rabbit rabbit exotic 3 dog cat FALSE
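The same row-wise check can be sketched in base R without dplyr. This version makes two assumptions that are not in the original answer: the text columns are stored as character (hence stringsAsFactors = FALSE), and species names should be matched literally (hence fixed = TRUE). The helper name in_both is hypothetical.

```r
# Rebuild the example data from the question, keeping text as character
dats <- data.frame(
  ID = 1:3,
  species = c("dog", "cat", "rabbit"),
  species.descriptor = c("all animal dog", "all animal cat", "rabbit exotic"),
  product = c(1, 2, 3),
  product.authorise = c("all animal dog cat rabbit", "cat horse pig", "dog cat"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when `pattern` occurs literally in both strings
in_both <- function(pattern, a, b) {
  grepl(pattern, a, fixed = TRUE) && grepl(pattern, b, fixed = TRUE)
}

# mapply() pairs each species with the two strings from the same row;
# unname() drops the names mapply() takes from the species vector
dats$authorised <- unname(
  mapply(in_both, dats$species, dats$species.descriptor, dats$product.authorise)
)
dats$authorised
```

With fixed = TRUE, a species name containing regex metacharacters (for example a dot) is compared as plain text instead of being interpreted as a pattern.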
If you really like stringr, you can use the str_detect function for more user-friendly syntax. Because str_detect is vectorised over both the string and the pattern, rowwise() is not needed here.
library(stringr)
dats %>%
  mutate(authorised =
    str_detect(species.descriptor, species) &
    str_detect(product.authorise, species)
  )
And if you don't like dplyr, you can add the column directly:
dats$authorised <-
  with(dats,
    str_detect(species.descriptor, species) &
    str_detect(product.authorise, species)
  )
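The solutions above match on the species column. If the aim is closer to the original attempt, where a row counts as matched when any pattern from a shared list appears in both columns, one possible sketch in base R follows. It assumes the patterns are literal strings (fixed = TRUE), and the names descriptor, authorise, per_pattern and matched are illustrative only.

```r
# Pattern list from the question
patts <- c("dog", "cat", "all animal")

# The two text columns from the example data
descriptor <- c("all animal dog", "all animal cat", "rabbit exotic")
authorise  <- c("all animal dog cat rabbit", "cat horse pig", "dog cat")

# For each pattern, flag the rows where it occurs in both columns
per_pattern <- lapply(patts, function(p) {
  grepl(p, descriptor, fixed = TRUE) & grepl(p, authorise, fixed = TRUE)
})

# A row is matched if at least one pattern is shared by both columns
matched <- Reduce(`|`, per_pattern)
matched
```

Unlike the str_extract comparison, this checks every pattern against every row, so the order in which patterns occur inside a string does not matter.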