Negative lookbehind in R with multi-word separation


Problem description


I'm using R to do some string processing, and would like to identify the strings that contain a word with a certain root, provided that word is not preceded by a word with another given root.

Here is a simple toy example. Say I would like to identify the strings that have the word "cat/s" not preceded by "dog/s" anywhere in the string.

 tests = c(
   "dog cat",
   "dogs and cats",
   "dog and cat", 
   "dog and fluffy cats",
   "cats and dogs", 
   "cat and dog",  
   "fluffy cats and fluffy dogs")  

Using this pattern, I can pull the strings that do have dog before cat:

 pattern = "(dog(s|).*)(cat(s|))"
 grep(pattern, tests, perl = TRUE, value = TRUE)

[1] "dog cat"  "dogs and cats"   "dog and cat"   "dog and fluffy cats"

My negative lookbehind is having problems:

 neg_pattern = "(?<!dog(s|).*)(cat(s|))"
 grep(neg_pattern, tests, perl = TRUE, value = TRUE)

Error in grep(neg_pattern, tests, perl = TRUE, value = TRUE) :
  invalid regular expression

In addition: Warning message:
In grep(neg_pattern, tests, perl = TRUE, value = TRUE) :
  PCRE pattern compilation error
  'lookbehind assertion is not fixed length'
  at ')(cat(s|))'

I understand that .* is not fixed length, so how can I reject strings that have "dog" before "cat" separated by any number of other words?
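To make the fixed-length restriction concrete, here is a minimal sketch (the vector `x` and the result names are made up for illustration). A lookbehind whose content has one fixed length compiles, but it only checks the text immediately before "cat", whereas a lookbehind containing `.*` is rejected at compile time:

```r
x <- c("dog cat", "dog and cat")

# Fixed-length lookbehind: rejects "cat" only when "dog " comes immediately before it.
fixed_ok <- grepl("(?<!dog )cat", x, perl = TRUE)
print(fixed_ok)

# Variable-length lookbehind: PCRE refuses to compile it, so grepl() raises an error.
variable_fails <- tryCatch(
  suppressWarnings(grepl("(?<!dog.*)cat", x, perl = TRUE)),
  error = function(e) "compile error"
)
print(variable_fails)
```

The fixed-length version still lets "dog and cat" through, which is why a lookbehind alone does not cover the "any number of words in between" case.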

Solution

I hope that this can help:

tests = c(
  "dog cat",
  "dogs and cats",
  "dog and cat", 
  "dog and fluffy cats",
  "cats and dogs", 
  "cat and dog",  
  "fluffy cats and fluffy dogs"
)

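# remove strings that have cats after dogs
# (!grepl() is used instead of -grep(), which would return an empty vector if nothing matched)
tests = tests[!grepl(pattern = "dog(?:s|).*cat(?:s|)", x = tests, perl = TRUE)]

# select only strings that contain cats
tests = tests[grep(pattern = "cat(?:s|)", x = tests, perl = TRUE)]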

tests

[1] "cats and dogs"               "cat and dog"                
[3] "fluffy cats and fluffy dogs"

I'm not sure if you wanted to do this with one expression, but Regex can still be very useful when applied iteratively.
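If a single expression is preferred, one possible alternative (not part of the answer above, and only a sketch) is to swap the lookbehind for a negative lookahead anchored at the start of the string. Lookaheads have no fixed-length restriction in PCRE, so `.*` is allowed inside them. The names `single_pattern` and `result` are chosen here for illustration:

```r
tests <- c(
  "dog cat",
  "dogs and cats",
  "dog and cat",
  "dog and fluffy cats",
  "cats and dogs",
  "cat and dog",
  "fluffy cats and fluffy dogs"
)

# ^(?!.*dogs?.*cats?)  fail if "dog(s)" is followed anywhere later by "cat(s)"
# .*cats?              otherwise require "cat(s)" somewhere in the string
single_pattern <- "^(?!.*dogs?.*cats?).*cats?"

result <- grep(single_pattern, tests, perl = TRUE, value = TRUE)
print(result)
```

Like the two-step version, this rejects a string as soon as any "dog" appears before any "cat", and it matches on substrings, so a word such as "category" would also count as a "cat" match unless word boundaries (`\\b`) are added.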
