Perl compatible regular expression (PCRE) in Python

Question

I have to parse some strings based on PCRE in Python, and I've no idea how to do that.

The strings I want to parse look like:

match mysql m/^.\0\0\0\n(4\.[-.\w]+)\0...\0/s p/MySQL/ i/$1/

In this example, I have to get these different items:

"m/^.\0\0\0\n(4\.[-.\w]+)\0...\0/s" ; "p/MySQL/" ; "i/$1/"

The only thing I've found relating to PCRE manipulation in Python is this module: http://pydoc.org/2.2.3/pcre.html (but it says it's a .so file ...)

Do you know if some Python module exists to parse this kind of string?

Solution

Be Especially Careful with non‐ASCII in Python

There are some really subtle issues with how Python deals with, or fails to deal with, non-ASCII in patterns and strings. Worse, these disparities vary substantially, not just according to which version of Python you are using, but also according to whether you have a "wide build".

In general, when you’re doing Unicode stuff, Python 3 with a wide build works best and Python 2 with a narrow build works worst, but all combinations are still a pretty far cry from how Perl regexes work vis‐à‐vis Unicode. If you’re looking for ᴘᴄʀᴇ patterns in Python, you may have to look a bit further afield than its old re module.

The vexing "wide-build" issues have finally been fixed once and for all — provided you use a sufficiently advanced release of Python. Here’s an excerpt from the v3.3 release notes:

Functionality

Changes introduced by PEP 393 are the following:

  • Python now always supports the full range of Unicode codepoints, including non-BMP ones (i.e. from U+0000 to U+10FFFF). The distinction between narrow and wide builds no longer exists and Python now behaves like a wide build, even under Windows.
  • With the death of narrow builds, the problems specific to narrow builds have also been fixed, for example:
    • len() now always returns 1 for non-BMP characters, so len('\U0010FFFF') == 1;
    • surrogate pairs are not recombined in string literals, so '\uDBFF\uDFFF' != '\U0010FFFF';
    • indexing or slicing non-BMP characters returns the expected value, so '\U0010FFFF'[0] now returns '\U0010FFFF' and not '\uDBFF';
    • all other functions in the standard library now correctly handle non-BMP codepoints.
  • The value of sys.maxunicode is now always 1114111 (0x10FFFF in hexadecimal). The PyUnicode_GetMax() function still returns either 0xFFFF or 0x10FFFF for backward compatibility, and it should not be used with the new Unicode API (see issue 13054).
  • The ./configure flag --with-wide-unicode has been removed.
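
The behaviours listed in that excerpt are easy to check for yourself. The following is only a minimal sketch, assuming Python 3.3 or later; the variable names are made up for illustration.

```python
import sys

# A single non-BMP code point, written with the 8-digit \U escape.
ch = '\U0010FFFF'

# On Python 3.3+ this is one character, not a UTF-16 surrogate pair.
length = len(ch)

# Indexing returns the whole character rather than a lone high surrogate.
first = ch[0]

# Surrogate escapes in a literal stay as two separate code points.
pair = '\uDBFF\uDFFF'

print(length, first == ch, pair != ch, len(pair), hex(sys.maxunicode))
```

On Python 3.3 or later this should print 1 True True 2 0x10ffff. On an old narrow build the first value would have been 2.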

The Future of Python Regexes

In contrast to what’s currently available in the standard Python distribution’s re library, Matthew Barnett’s regex module for both Python 2 and Python 3 alike is much, much better in pretty much all possible ways and will quite probably replace re eventually. Its particular relevance to your question is that his regex library is far more ᴘᴄʀᴇ (i.e. it’s much more Perl‐compatible) in every way than re now is, which will make porting Perl regexes to Python easier for you. Because it is a ground‐up rewrite (as in from‐scratch, not as in hamburger :), it was written with non-ASCII in mind, which re was not.

The regex library therefore much more closely follows the (current) recommendations of UTS#18: Unicode Regular Expressions in how it approaches things. It meets or exceeds the UTS#18 Level 1 requirements in most if not all regards, something you normally have to use the ICU regex library or Perl itself for — or if you are especially courageous, the new Java 7 update to its regexes, as that also conforms to the Level One requirements from UTS#18.

Beyond meeting those Level One requirements, which are all absolutely essential for basic Unicode support, but which are not met by Python’s current re library, the awesome regex library also meets the Level Two requirements for RL2.5 Named Characters (\N{...}), RL2.2 Extended Grapheme Clusters (\X), and the new RL2.7 on Full Properties from revision 14 of UTS#18.

Matthew’s regex module also does Unicode casefolding so that case insensitive matches work reliably on Unicode, which re does not.

The following is no longer true, because regex now supports full Unicode casefolding, like Perl and Ruby.

One super‐tiny difference is that for now, Perl’s case‐insensitive patterns use full string‐oriented casefolds while his regex module still uses simple single‐char‐oriented casefolds, but this is something he’s looking into. It’s actually a very hard problem, one which apart from Perl, only Ruby even attempts.

Under full casefolding, this means that (for example) "ß" now correctly matches "SS", "ss", "ſſ", "ſs" (etc.) when case-insensitive matching is selected. (This is admittedly more important in the Greek script than the Latin one.)
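
To see what full casefolding means in practice, here is a minimal sketch that uses only the standard library. It assumes Python 3.3 or later, and str.casefold() merely stands in for the folding step; it is not a claim about how the regex module implements matching internally.

```python
import re

# str.casefold() applies full Unicode case folding, which may change length.
words = ['ß', 'SS', 'ss', 'ſſ', 'ſs']
folded = {w.casefold() for w in words}
print(folded)

# The standard re module compares case-insensitively one character at a
# time, so a one-character pattern cannot match a two-character string.
simple = re.fullmatch('ß', 'SS', re.IGNORECASE)
print(simple)
```

All five spellings fold to the same string, which is why an engine with full casefolding can treat them as equal, while re with IGNORECASE finds no match here.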

See also the slides or doc source code from my third OSCON2011 talk entitled "Unicode Support Shootout: The Good, the Bad, and the (mostly) Ugly" for general issues in Unicode support across JavaScript, PHP, Go, Ruby, Python, Java, and Perl. If you can’t use either Perl regexes or possibly the ICU regex library (which doesn’t have named captures, alas!), then Matthew’s regex for Python is probably your best shot.


Nᴏᴛᴀ Bᴇɴᴇ s.ᴠ.ᴘ. (= s’il vous plaît, et même s’il ne vous plaît pas :) The following unsolicited noncommercial nonadvertisement was not actually put here by the author of the Python regex library. :)

Cool regex Features

The Python regex library has a cornucopia of superneat features, some of which are found in no other regex system anywhere. These make it very much worth checking out no matter whether you happen to be using it for its ᴘᴄʀᴇ‐ness or its stellar Unicode support.

A few of this module’s outstanding features of interest are:

  • Variable‐width lookbehind, a feature which is quite rare in regex engines and very frustrating not to have when you really want it. This may well be the most frequently requested feature in regexes.
  • Backwards searching so you don’t have to reverse your string yourself first.
  • Scoped ismx‐type options, so that (?i:foo) only casefolds for foo, not overall, or (?-i:foo) to turn it off just on foo. This is how Perl works (or can).
  • Fuzzy matching based on edit‐distance (which Udi Manber’s agrep and glimpse also have)
  • Implicit shortest‐to‐longest sorted named lists via \L<list> interpolation
  • Metacharacters that specifically match only the start or only the end of a word rather than either side (\m, \M)
  • Support for all Unicode line separators (Java can do this, as can Perl albeit somewhat begrudgingly with \R), per RL1.6.
  • Full set operations — union, intersection, difference, and symmetric difference — on bracketed character classes per RL1.3, which is much easier than getting at it in Perl.
  • Allows for repeated capture groups like (\w+\s+)+ where you can get all separate matches of the first group not just its last match. (I believe C# might also do this.)
  • A more straightforward way to get at overlapping matches than sneaky capture groups in lookaheads.
  • Start and end positions for all groups for later slicing/substring operations, much like Perl’s @+ and @- arrays.
  • The branch‐reset operator via (?|...|...|...|) to reset group numbering in each branch the way it works in Perl.
  • Can be configured to have your coffee waiting for you in the morning.
  • Support for the more sophisticated word boundaries from RL2.3.
  • Assumes Unicode strings by default, and fully supports RL1.2a so that \w, \b, \s, and such work on Unicode.
  • Supports \X for graphemes.
  • Supports the \G continuation point assertion.
  • Works correctly for 64‐bit builds (re only has 32‐bit indices).
  • Supports multithreading.
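
A few of the features above can be tried in a handful of lines. This is only a sketch, assuming the third-party regex package is installed (pip install regex); the function name regex_feature_demo and the sample strings are made up for illustration, and the demo is skipped when the package is missing.

```python
try:
    import regex  # third-party package: pip install regex
except ImportError:
    regex = None  # the demo below is skipped when regex is not installed


def regex_feature_demo():
    """Return a dict of small results, one per feature shown."""
    results = {}
    # Repeated capture groups: captures() keeps every match of group 1.
    m = regex.search(r'(\w+\s+)+', 'one two three ')
    results['captures'] = m.captures(1)
    # Overlapping matches without a lookahead trick.
    results['overlapped'] = regex.findall(r'..', 'abcde', overlapped=True)
    # Backwards searching with the (?r) flag.
    results['reverse'] = regex.findall(r'(?r)..', 'abcde')
    # Variable-width lookbehind.
    results['lookbehind'] = regex.findall(r'(?<=a+)b', 'ab aab aaab')
    # Fuzzy matching: allow at most one error of any kind.
    fuzzy = regex.fullmatch(r'(?:hello){e<=1}', 'hallo')
    results['fuzzy'] = fuzzy is not None
    return results


if regex is not None:
    for name, value in regex_feature_demo().items():
        print(name, value)
```

The standard re module rejects the variable-width lookbehind and the fuzzy quantifier at compile time, so these patterns are a quick way to tell which engine you are actually running.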

Ok, that’s enough hype. :)

Yet Another Fine Alternate Regex Engine

One final alternative that is worth looking at if you are a regex geek is the Python library bindings to Russ Cox’s awesome RE2 library. It also supports Unicode natively, including simple char‐based casefolding, and unlike re it notably provides for both the Unicode General Category and the Unicode Script character properties, which are the two key properties you most often need for the simpler kinds of Unicode processing.

Although RE2 misses out on a few Unicode features like \N{...} named character support found in ICU, Perl, and Python, it has extremely serious computational advantages that make it the regex engine of choice whenever you’re concerned with starvation‐based denial‐of‐service attacks through regexes in web queries and such. It manages this by forbidding backreferences, which cause a regex to stop being regular and risk super‐exponential explosions in time and space.

Library bindings for RE2 are available not just for C/C++ and Python, but also for Perl and most especially for Go, where it is slated to very shortly replace the standard regex library there.
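
Coming back to the line in the question: it is plain ASCII, so even the standard re module can split it into its items. What follows is a minimal sketch, not a full parser. It assumes the sample line reads as below with its backslash escapes intact, that "/" is the only delimiter and never occurs inside a field, and that fields are separated by whitespace. The helper names and the banner bytes are made up for illustration.

```python
import re

# The sample line, as a raw string so the backslash escapes survive.
LINE = r'match mysql m/^.\0\0\0\n(4\.[-.\w]+)\0...\0/s p/MySQL/ i/$1/'

# One field: a single-letter key, a /-delimited body, then optional flags.
FIELD = re.compile(r'(?P<key>[a-z])/(?P<body>[^/]*)/(?P<flags>[is]*)')

# Perl-style flag letters mapped to their re equivalents.
FLAG_MAP = {'i': re.IGNORECASE, 's': re.DOTALL}


def split_match_line(line):
    """Split 'match <service> <fields...>' into (service, list of fields)."""
    directive, service, rest = line.split(None, 2)
    if directive != 'match':
        raise ValueError('not a match line: %r' % line)
    fields = [(m.group('key'), m.group('body'), m.group('flags'))
              for m in FIELD.finditer(rest)]
    return service, fields


def compile_body(body, flags):
    """Compile an m/.../ body as a bytes pattern with the mapped flags."""
    re_flags = 0
    for letter in flags:
        re_flags |= FLAG_MAP[letter]
    return re.compile(body.encode('ascii'), re_flags)


service, fields = split_match_line(LINE)
for key, body, flags in fields:
    print(key, body, flags)

# Try the m/.../s pattern against a hypothetical server banner.
pattern = compile_body(fields[0][1], fields[0][2])
banner = b'(\x00\x00\x00\n4.1.22-log\x00\x01\x02\x03\x00'
version = pattern.match(banner).group(1)
print(version)
```

The pattern is compiled as bytes because a service banner arrives as raw bytes, and the s flag is mapped to re.DOTALL so that "." can also match a newline byte. Patterns that rely on Perl-only syntax would need the regex module or a real PCRE binding instead.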
