R regex compiler working differently for the given regex

Problem description

I was refining this answer and found that the regex given below does not behave in R the way its meaning suggests.

 +?on.*$

According to my understanding of regex, the above regex matches:

One or more spaces, matched lazily, followed by on, followed by anything (except a newline) up to the end.
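As a rough illustration of that reading, the sketch below shows which part of each line the pattern captures in a backtracking engine (JavaScript here). The variable names `lines`, `pattern`, and `matched` are made up for this example, and the sample strings are the two input lines from this question.

```javascript
// Sample lines copied from the question's input.
const lines = [
  "Posted by ondrej on 29 Feb 2020.",
  "Posted by ona'je on 29 Feb 2020.",
];

// " +?" : one or more spaces, lazily
// "on"  : the literal characters "on"
// ".*$" : any characters except a newline, up to the end of the line
const pattern = / +?on.*$/;

// For each line, record the text the pattern matches (or null if none).
const matched = lines.map((line) => {
  const m = line.match(pattern);
  return m ? m[0] : null;
});

console.log(matched);
```

In this engine the match starts at the first space that is followed by on and runs to the end of the line, so removing it leaves only "Posted by".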

Input:

Posted by ondrej on 29 Feb 2020.
Posted by ona'je on 29 Feb 2020.

Output (as I understand it, when the text matched by the above regex in the test string is replaced with ""):

Posted by
Posted by 

When I test it in Python (implementation here), JavaScript, and Java (implementation here), I get the result I expected.

const myString = "Posted by ondrej on 29 Feb 2020.\nPosted by ona'je on";

console.log(myString.replace( new RegExp(" +?on.*$","gm"),""));

On the other hand, when I implement the same regex in R (implementation here), I get this result:

Posted by ondrej
Posted by ona'je

which is unexpected.

Doubt

I thought that maybe the regex parser in R works differently (perhaps from right to left). I read the documentation on how regex works in R but found nothing that differs from other languages for the above regex. I may be missing something here. I am not well-versed in R, but as far as my regex knowledge goes, I believe the above regex should work in every standard regex engine the way it works in Java, JavaScript, and Python (and maybe PCRE too). My question is: why does the above regex work differently in R?

Recommended answer

It looks like the TRE regex engine (used by default in base R regex functions), based on the regex library initially written by Henry Spencer in 1986, matches the shortest match at the end of the string if the first pattern in the regular expression starts with a lazy quantifier and the expression ends with the $ anchor.

Compare these cases:

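# Data is assumed to hold the two input lines from the question
Data <- c("Posted by ondrej on 29 Feb 2020.", "Posted by ona'je on 29 Feb 2020.")

sub(" +?on.*$", "", Data)  # "Posted by ondrej" "Posted by ona'je"
sub(" +?on.*", "", Data)   # "Posted bydrej on 29 Feb 2020." "Posted bya'je on 29 Feb 2020."
sub(" +?on(.*)", "", Data) # as expected
sub(" +on.*", "", Data)    # as expected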

What is going on?

  • The first case is sub(" +?on.*$", "", Data). Here the first pattern sets the greediness of all the quantifiers on the same level in the regex. So the second quantifier, *, is set to lazy even without a ? after it, because the first space was quantified with +?, a lazy quantifier. It is a known TRE "bug", also present in some other regex engines based on Henry Spencer's regex library.

  • The second one, sub(" +?on.*", "", Data), matches the same way as if it were written " +?on.*?" (again, because the first pattern sets the greediness on that level to lazy). It therefore matches only one or more spaces and then on, since .*? matches nothing when it is at the end of the pattern.

  • The third one, sub(" +?on(.*)", "", Data), yields the expected results because the second quantified pattern, .*, is on another level (one level deeper, inside the group), and its greediness is not affected by the +? on the outer level. So (.*) matches greedily here.

  • The fourth one, sub(" +on.*", "", Data), yields the expected results because the first pattern is greedy, so the next quantified pattern is greedy as well.
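For contrast, here is a minimal sketch of how the same four patterns behave in a backtracking engine where each quantifier keeps its own greediness (JavaScript here, not R). The helper `subFirst` and the names `data`, `fourCases`, and `explicitLazy` are invented for this illustration, with `subFirst` only approximating R's sub() by replacing the first match in each string. It does not reproduce TRE's behaviour for the first case; it only shows that writing the lazy quantifier out by hand gives the output reported above for the second case.

```javascript
// Sample data mirroring the question's input (the name `data` is illustrative).
const data = [
  "Posted by ondrej on 29 Feb 2020.",
  "Posted by ona'je on 29 Feb 2020.",
];

// Replace only the first match in each string, similar to R's sub().
const subFirst = (pattern) => data.map((s) => s.replace(pattern, ""));

// Each quantifier keeps its own greediness in this engine, so all four
// patterns from the answer remove everything after "Posted by".
const fourCases = [/ +?on.*$/, / +?on.*/, / +?on(.*)/, / +on.*/].map(subFirst);

// Making the second quantifier lazy by hand, " +?on.*?", removes only the
// first " on", which is the output the answer reports for TRE's second case.
const explicitLazy = subFirst(/ +?on.*?/);

console.log(fourCases);
console.log(explicitLazy);
```

If the R code needs to behave like these engines, one option worth checking against the R documentation is passing perl = TRUE to sub(), which switches base R from TRE to PCRE.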
