Java regex alternation operator "|" behavior seems broken

Question

I'm trying to write a regex matcher for Roman numerals. In sed (which I think is considered 'standard' for regex?), if you have multiple options delimited by the alternation operator, it will match the longest. Namely, "I|II|III|IV" will match "IV" for "IV" and "III" for "III".

In Java, the same pattern matches "I" for "IV" and "I" for "III". It turns out Java chooses between alternation matches left to right; that is, because "I" appears before "III" in the regex, it matches. If I change the regex to "IV|III|II|I", the behavior is corrected, but this obviously isn't a solution in general.

Is there a way to make Java choose the longest match out of an alternation group, instead of choosing the 'first'?

A code sample for clarity:

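import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationDemo
{
    public static void main(String[] args)
    {
        // "six" is listed first, so it is tried first at every position.
        Pattern p = Pattern.compile("six|sixty");
        Matcher m = p.matcher("The year was nineteen sixty five.");
        if (m.find())
        {
            // Prints the matched text; the alternation stops at "six".
            System.out.println(m.group());
        }
        else
        {
            System.out.println("wtf?");
        }
    }
}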

Output: six

Recommended answer

No, it's behaving correctly. Java uses an NFA, or regex-directed, flavor, like Perl, .NET, JavaScript, etc., and unlike sed, grep, or awk. In such an engine the alternatives are tried left to right, and the alternation is expected to quit as soon as one of them matches, not hold out for the longest match.
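
A minimal sketch of that left-to-right behavior, using the two patterns from the question. The class name AlternationOrderDemo and the helper firstMatch are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrderDemo {
    // Hypothetical helper: returns the first match found in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Shortest alternative listed first: it wins as soon as it matches.
        System.out.println(firstMatch("I|II|III|IV", "III")); // I
        System.out.println(firstMatch("I|II|III|IV", "IV"));  // I

        // Longer alternatives listed first: they get the first chance to match.
        System.out.println(firstMatch("IV|III|II|I", "III")); // III
        System.out.println(firstMatch("IV|III|II|I", "IV"));  // IV
    }
}
```

Only the order of the alternatives differs between the two patterns, which is why reordering appears to "fix" the result.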

You can force it to continue by adding a condition after the alternation that can't be met until the whole token has been consumed. What that condition might be depends on the context; the simplest option would be an anchor ($) or a word boundary (\b).

"\\b(I|II|III|IV)\\b"

Edit: I should mention that, while grep, sed, awk and others traditionally use text-directed (or DFA) engines, you can also find versions of some of them that use NFA engines, or even hybrids of the two.
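
The anchor and word-boundary suggestions above can be checked with a short sketch like the following. The class name BoundaryAlternationDemo, the helper firstMatch, and the sample input "Chapter IV" are assumptions made for illustration; the patterns are the ones discussed in the question and answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryAlternationDemo {
    // Hypothetical helper: returns the first match found in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // After "I" matches, \b fails between "I" and "V", so the engine
        // backtracks and tries the remaining alternatives.
        System.out.println(firstMatch("\\b(I|II|III|IV)\\b", "Chapter IV")); // IV
        System.out.println(firstMatch("\\b(I|II|III|IV)\\b", "III"));        // III

        // The same idea applied to the sample from the question.
        System.out.println(firstMatch("\\b(six|sixty)\\b",
                "The year was nineteen sixty five."));                       // sixty

        // An end-of-input anchor works when the token ends the input.
        System.out.println(firstMatch("(I|II|III|IV)$", "III"));             // III
    }
}
```

Note that the alternation is still tried left to right; the trailing condition just rejects the short alternatives so that backtracking reaches the one that consumes the whole token. This relies on the token being delimited by non-word characters or by the end of the input.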
