Partial string match two columns R

Question

I have been trying to partially match two column contents based on a list of regular expressions common to both columns:

dats<-data.frame(ID=c(1:3),species=c("dog","cat","rabbit"),
species.descriptor=c("all animal dog","all animal cat","rabbit exotic"),product=c(1,2,3),
product.authorise=c("all animal dog cat rabbit","cat horse pig","dog cat"))

The goal is to arrive at this:

goal<-data.frame(ID=c(1:3),species=c("dog","cat","rabbit"),
            species.descriptor=c("all animal dog","all animal cat","rabbit exotic"),
            product=c(1,2,3),product.authorise=c("all animal dog cat rabbit","cat horse pig",
            "dog cat"), authorised=c("TRUE","TRUE","FALSE"))    

So to explain further, if 'dog' appears at any point in both columns, then this would be considered 'TRUE' in the result column (authorised in the goal above), and the same would apply for any individual species descriptor. If no matches are found, then a return of either FALSE or NA would be fine.

So far I have gotten to this point:

library(stringr)
patts<-c("dog","cat","all animal")
reg.patts<-paste(patts,collapse="|")
dats$matched<-ifelse((str_extract(dats$species.descriptor,reg.patts) == str_extract(dats$product.authorise,reg.patts)),"TRUE","FALSE")
dats
  ID species species.descriptor product         product.authorise matched
   1     dog     all animal dog       1 all animal dog cat rabbit    TRUE
   2     cat     all animal cat       2             cat horse pig   FALSE
   3  rabbit      rabbit exotic       3                   dog cat    <NA>

As you can see, this correctly identifies the first and last rows as 'all animal' appears first in both strings, and there is no match at all in the last. However, it seems to struggle (as in the second row) when the reg exp doesn't appear first in the string. I have tried str_extract_all, but have only resulted in error messages so far. I was wondering if anyone can help, please?

Recommended answer

Here is a solution using dplyr for piping. The core component is using grepl for logical string matching of species in both species.descriptor and product.authorise.

library(dplyr)
dats %>%
rowwise() %>%
mutate(authorised = 
           grepl(species, species.descriptor) & 
           grepl(species, product.authorise)
       )

Source: local data frame [3 x 6]
Groups: <by row>

     ID species species.descriptor product         product.authorise authorised
  (int)  (fctr)             (fctr)   (dbl)                    (fctr)      (lgl)
1     1     dog     all animal dog       1 all animal dog cat rabbit       TRUE
2     2     cat     all animal cat       2             cat horse pig       TRUE
3     3  rabbit      rabbit exotic       3                   dog cat      FALSE
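The rowwise() step matters because grepl uses only the first element of a longer pattern vector, so each row has to be matched against its own species value. As a rough base R sketch of the same idea, assuming character (not factor) columns; detect_pairwise is a hypothetical helper name, not part of any package:

```r
# Same data as in the question, with character columns made explicit.
dats <- data.frame(
  ID = 1:3,
  species = c("dog", "cat", "rabbit"),
  species.descriptor = c("all animal dog", "all animal cat", "rabbit exotic"),
  product = c(1, 2, 3),
  product.authorise = c("all animal dog cat rabbit", "cat horse pig", "dog cat"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where patterns[i] is found in texts[i].
# mapply walks both vectors in parallel, so grepl always sees one pattern.
detect_pairwise <- function(patterns, texts) {
  unname(mapply(function(p, x) grepl(p, x), patterns, texts))
}

# A row is authorised only if its species appears in both text columns.
dats$authorised <-
  detect_pairwise(dats$species, dats$species.descriptor) &
  detect_pairwise(dats$species, dats$product.authorise)

dats$authorised
```

This is only meant to illustrate the pairing of one pattern with one string per row; the dplyr version above does the same thing through rowwise().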

If you really like stringr, you can use the str_detect function for more user-friendly syntax.

library(stringr)
dats %>%
mutate(authorised = 
           str_detect(species.descriptor, species) & 
           str_detect(product.authorise, species)
       )
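One caveat that applies to both grepl and str_detect as used here: species is treated as a regular expression and matched as a substring, so 'cat' would also be found inside a longer word. If whole-word matching is wanted, something along these lines might work. It is a sketch that assumes species names contain no regex metacharacters; detect_word is a hypothetical helper and the 'caterpillar' string is made up for illustration, it is not in the question's data:

```r
# Hypothetical helper: TRUE where word[i] occurs in text[i] as a whole word.
# \\b is a word boundary; perl = TRUE selects PCRE regular expressions.
detect_word <- function(word, text) {
  unname(mapply(
    function(w, x) grepl(paste0("\\b", w, "\\b"), x, perl = TRUE),
    word, text
  ))
}

substring_hit <- grepl("cat", "caterpillar food")        # plain substring match
whole_word_miss <- detect_word("cat", "caterpillar food") # "cat" is not a separate word here
whole_word_hit <- detect_word("cat", "cat horse pig")     # "cat" is a separate word here

c(substring_hit, whole_word_miss, whole_word_hit)
```

For the three rows in the question the plain and whole-word versions agree, so this only matters for messier data.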

And if you don't like dplyr, you can add the column directly (this still uses str_detect from stringr):

dats$authorised <- 
    with(dats, 
         str_detect(species.descriptor, species) & 
             str_detect(product.authorise, species)
         )
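If you would rather keep the pattern list from the question, the reason the original attempt fails on the second row is that str_extract returns only the first match in each string. A possible fix is to collect every match in both columns and check whether the two sets overlap. The sketch below does this in base R with gregexpr and regmatches; extract_all is a hypothetical helper name, and note that this compares the columns against the pattern list rather than against the species column:

```r
patts <- c("dog", "cat", "all animal")
reg.patts <- paste(patts, collapse = "|")

descriptor <- c("all animal dog", "all animal cat", "rabbit exotic")
authorise <- c("all animal dog cat rabbit", "cat horse pig", "dog cat")

# Hypothetical helper: all matches of `pattern` in each element of `x`,
# returned as a list with one character vector per element.
extract_all <- function(x, pattern) regmatches(x, gregexpr(pattern, x))

# TRUE when at least one pattern is found in both strings of a row.
shared <- mapply(
  function(a, b) length(intersect(a, b)) > 0,
  extract_all(descriptor, reg.patts),
  extract_all(authorise, reg.patts)
)

shared
```

Rows with no match in one of the columns come out as FALSE rather than NA, which the question says is acceptable.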
