Negative lookbehind in R with multi-word separation
I'm using R to do some string processing, and would like to identify strings that contain a word with a certain root, provided that word is not preceded by another word with a different root.
Here is a simple toy example. Say I would like to identify the strings that have the word "cat/s" not preceded by "dog/s" anywhere in the string.
tests = c(
"dog cat",
"dogs and cats",
"dog and cat",
"dog and fluffy cats",
"cats and dogs",
"cat and dog",
"fluffy cats and fluffy dogs")
Using this pattern, I can pull the strings that do have dog before cat:
pattern = "(dog(s|).*)(cat(s|))"
grep(pattern, tests, perl = TRUE, value = TRUE)
[1] "dog cat" "dogs and cats" "dog and cat" "dog and fluffy cats"
My attempt at a negative lookbehind fails:
neg_pattern = "(?<!dog(s|).*)(cat(s|))"
grep(neg_pattern, tests, perl = TRUE, value = TRUE)
Error in grep(neg_pattern, tests, perl = TRUE, value = TRUE) :
  invalid regular expression
In addition: Warning message:
In grep(neg_pattern, tests, perl = TRUE, value = TRUE) :
  PCRE pattern compilation error
  'lookbehind assertion is not fixed length'
  at ')(cat(s|))'
I understand that .* is not fixed length, so how can I reject strings that have "dog" before "cat" separated by any number of other words?
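To make the fixed-length limitation concrete, here is a minimal sketch (the variable names `short_tests` and `adjacent_only` are illustrative, not from the original post). A lookbehind whose body has a fixed length does compile under PCRE, but it only inspects the characters immediately before the match, so it cannot see a "dog" that is several words away.

```r
short_tests <- c("dog cat", "dogs and cats", "cats and dogs")

# "(?<!dog )" has a fixed length of 4 characters, so PCRE accepts it.
# It only rejects a "cat" that is directly preceded by "dog ".
adjacent_only <- grep("(?<!dog )cat", short_tests, perl = TRUE, value = TRUE)
print(adjacent_only)
# "dogs and cats" is still returned, because the text right before
# "cats" is "and ", not "dog ".
```

This is why a lookbehind alone does not fit the "any number of words in between" requirement.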
I hope that this can help:
tests = c(
"dog cat",
"dogs and cats",
"dog and cat",
"dog and fluffy cats",
"cats and dogs",
"cat and dog",
"fluffy cats and fluffy dogs"
)
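# remove strings that have cats after dogs
# (logical indexing with grepl() is used because tests[-grep(...)] returns
# an empty vector when nothing matches; perl = TRUE because (?: ) is Perl syntax)
tests = tests[!grepl(pattern = "dog(?:s|).*cat(?:s|)", x = tests, perl = TRUE)]
# select only strings that contain cats
tests = tests[grepl(pattern = "cat(?:s|)", x = tests, perl = TRUE)]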
tests
[1] "cats and dogs" "cat and dog"
[3] "fluffy cats and fluffy dogs"
I'm not sure if you wanted to do this with one expression, but regular expressions can still be very useful when applied iteratively.
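If a single expression is preferred, one possible alternative (not part of the answer above) is a negative lookahead anchored at the start of the string. Unlike lookbehind, lookahead in PCRE may contain variable-length parts such as `.*`. The sketch below assumes the same seven test strings; the names `single_pattern` and `result` are illustrative.

```r
tests <- c(
  "dog cat",
  "dogs and cats",
  "dog and cat",
  "dog and fluffy cats",
  "cats and dogs",
  "cat and dog",
  "fluffy cats and fluffy dogs"
)

# ^(?!.*dogs?.*cats?)  from the start, reject the string if "dog"/"dogs"
#                      is followed anywhere later by "cat"/"cats"
# .*cats?              then require that the string contains "cat"/"cats"
single_pattern <- "^(?!.*dogs?.*cats?).*cats?"

result <- grep(single_pattern, tests, perl = TRUE, value = TRUE)
print(result)
```

Like the patterns in the question, this matches substrings, so words such as "catalog" or "dogma" would also count; wrapping the roots in `\\b` word boundaries is one way to tighten it if that matters.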