Does the Python regular expression module use BRE or ERE?


Question

It appears that POSIX splits regular expression implementations into two kinds: Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE).

The Python re module reference does not seem to specify which one it uses.

Recommended answer

Apart from some similarity in syntax, the re module does not follow the POSIX standard for regular expressions.

A POSIX regular expression (which can be implemented with a DFA/NFA or even a backtracking engine) always finds the leftmost-longest match, while the re module is a backtracking engine that finds the leftmost "earliest" match ("earliest" according to the search order defined by the regular expression).

The difference in matching semantics can be observed when matching (Prefix|PrefixSuffix) against PrefixSuffix.

  • In a POSIX-compliant implementation of POSIX regex (not one that only borrows the syntax), the regex will match PrefixSuffix.
  • In contrast, the re engine (like many other backtracking regex engines) will match Prefix only, since Prefix is specified first in the alternation.
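
The alternation behaviour described above can be checked directly in Python. This is a minimal sketch using only the standard re module; the second pattern, with the alternatives swapped, is added here purely for illustration and is not part of the original answer.

```python
import re

text = "PrefixSuffix"

# The first alternative that lets the overall match succeed wins.
first = re.search(r"(Prefix|PrefixSuffix)", text).group(0)

# Swapping the alternatives (illustrative only) changes the result,
# which would not happen under POSIX leftmost-longest semantics.
swapped = re.search(r"(PrefixSuffix|Prefix)", text).group(0)

print(first)    # Prefix
print(swapped)  # PrefixSuffix
```

The fact that the order of the alternatives changes the match is what marks re as an ordered, backtracking engine rather than a leftmost-longest one.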

The difference can also be seen when matching (xxx|xxxxx)* against xxxxxxxxxx (a string of 10 x's):

  • On Cygwin, using Bash's =~ operator (POSIX ERE):

$ [[ "xxxxxxxxxx" =~ (xxx|xxxxx)* ]] && echo "${BASH_REMATCH[0]}"
xxxxxxxxxx

All 10 x's are matched.

  • In Python:

>>> import re
>>> re.search(r'(?:xxx|xxxxx)*', 'xxxxxxxxxx').group(0)
'xxxxxxxxx'

Only 9 x's are matched, since the engine picks the first alternative, xxx, in all 3 repetitions, and nothing forces it to backtrack and try the second alternative.
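
To make the contrast concrete, the sketch below emulates the leftmost-longest rule on top of re by brute force. The helper name leftmost_longest is hypothetical and not part of the re module; it relies on Pattern.fullmatch with pos and endpos, which forces the engine to backtrack until the whole candidate span is covered. It only reproduces the overall match, not the POSIX rules for submatches, and it is far too slow for real use.

```python
import re

def leftmost_longest(pattern, text):
    """Hypothetical helper: return the leftmost-longest match of pattern in text.

    Tries every start position from left to right and, for each one,
    every end position from longest to shortest.
    """
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        for end in range(len(text), start - 1, -1):
            # fullmatch must consume exactly text[start:end], so the
            # engine is forced to explore the later alternatives too.
            if compiled.fullmatch(text, start, end):
                return text[start:end]
    return None

text = "x" * 10
pattern = r"(?:xxx|xxxxx)*"

print(len(re.search(pattern, text).group(0)))   # 9
print(len(leftmost_longest(pattern, text)))     # 10
```

With the full span forced, the engine eventually settles on xxxxx twice, which is the match a POSIX-compliant implementation reports.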

Apart from the difference in matching semantics, POSIX regular expressions also define syntax for collating symbols, equivalence class expressions, and collation-based character ranges. These features greatly increase the expressive power of the regex.

Taking equivalence class expressions as an example, from the documentation:

An equivalence class expression shall represent the set of collating elements belonging to an equivalence class, as described in Collation Order. [...]. The class shall be expressed by enclosing any one of the collating elements in the equivalence class within bracket-equal ( "[=" and "=]" ) delimiters. For example, if 'a', 'à', and 'â' belong to the same equivalence class, then "[[=a=]b]", "[[=à=]b]", and "[[=â=]b]" are each equivalent to "[aàâb]". [...]

Since these features depend heavily on the locale settings, the same regex may behave differently in different locales. The collation order also depends on the locale data available on the system.
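
For comparison, here is a small sketch of what happens in Python, where equivalence classes are not available. It assumes the three letters from the quoted example and spells the set out by hand; it also shows that the POSIX spelling is accepted by re but read as something quite different. The warning filter is there because recent Python versions warn about "[[" inside a set.

```python
import re
import warnings

# re has no equivalence classes, so the set is spelled out by hand.
explicit = re.compile(r"[aàâb]")
matched = [ch for ch in "aàâbc" if explicit.fullmatch(ch)]

# The POSIX spelling is accepted but parsed differently: the set
# "[[=a=]" (the characters '[', '=' and 'a') followed by a literal "b]".
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # recent versions warn about "[[" in a set
    posix_style = re.compile("[[=a=]b]")

print(matched)                             # ['a', 'à', 'â', 'b']
print(bool(posix_style.fullmatch("à")))    # False
print(bool(posix_style.fullmatch("ab]")))  # True
```

Unlike the POSIX form, the hand-written set does not change with the locale.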

re borrows its syntax from Perl, but not all Perl regex features are implemented in re. Below are some regex features that are available in re but unavailable in POSIX regular expressions:

  • Greedy/lazy quantifiers, which specify the order in which a quantifier is expanded.

While people usually call * in POSIX greedy, in POSIX it actually only specifies the lower and upper bounds of the repetition. The so-called "greedy" behavior is a consequence of the leftmost-longest match rule.
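
As a short illustration of the greedy/lazy distinction in re, the sketch below runs the same quantifier in both forms. The sample string and patterns are made up for this example.

```python
import re

text = "<a><b>"

# Greedy: + tries the most repetitions first, then backs off as needed.
greedy = re.search(r"<.+>", text).group(0)

# Lazy: +? tries the fewest repetitions first, then extends as needed.
lazy = re.search(r"<.+?>", text).group(0)

print(greedy)  # <a><b>
print(lazy)    # <a>
```

Both patterns accept the same set of strings; the ? after the quantifier only changes the order in which the engine tries the repetition counts, and therefore which match it finds first.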
