Perl compatible regular expression (PCRE) in Python

Question

I have to parse some strings based on PCRE in Python, and I've no idea how to do that.

The strings I want to parse look like:

match mysql m/^.\0\0\0\n(4\.[-.\w]+)\0...\0/s p/MySQL/ i/$1/

In this example, I have to get these different items:

"m/^.\0\0\0\n(4\.[-.\w]+)\0...\0/s" ; "p/MySQL/" ; "i/$1/"

The only thing I've found relating to PCRE manipulation in Python is this module: http://pydoc.org/2.2.3/pcre.html (but it says it's a .so file ...)

Do you know if some Python module exists to parse this kind of string?
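
For lines shaped like the example above, a minimal sketch with the standard re module might look like the following. It assumes every field is a single lowercase tag letter, a "/" delimiter, a body in which a backslash escapes the next character, a closing "/", and optional trailing flag letters. The name split_fields is made up for illustration, and other delimiters are not handled.

```python
import re

# Assumed field shape: tag letter, "/", body (backslash escapes the next
# character), "/", optional flag letters. The lookbehind requires the tag
# to start the line or follow whitespace.
FIELD = re.compile(r"(?<!\S)([a-z])/((?:[^/\\]|\\.)*)/([a-z]*)")

def split_fields(line):
    """Return each tag/body/flags field of the line as it appears in the text."""
    return [m.group(0) for m in FIELD.finditer(line)]

line = r"match mysql m/^.\0\0\0\n(4\.[-.\w]+)\0...\0/s p/MySQL/ i/$1/"
for field in split_fields(line):
    print(field)
```

Using m.group(1), m.group(2) and m.group(3) instead of m.group(0) would give the tag, the body and the flags separately.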

Recommended answer

Be Especially Careful with non‐ASCII in Python

There are some really subtle issues with how Python deals with, or fails to deal with, non-ASCII in patterns and strings. Worse, these disparities vary substantially according not just to which version of Python you are using, but also to whether you have a "wide build".

In general, when you’re doing Unicode stuff, Python 3 with a wide build works best and Python 2 with a narrow build works worst, but all combinations are still a pretty far cry from how Perl regexes work vis‐à‐vis Unicode. If you’re looking for ᴘᴄʀᴇ patterns in Python, you may have to look a bit further afield than its old re module.

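The vexing "wide build" issue has finally been fixed once and for all, provided you use a sufficiently advanced release of Python. Here's an excerpt from the v3.3 release notes: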

Functionality

Changes introduced by PEP 393 are the following:

  • Python now always supports the full range of Unicode codepoints, including non-BMP ones (i.e. from U+0000 to U+10FFFF). The distinction between narrow and wide builds no longer exists and Python now behaves like a wide build, even under Windows.
  • With the death of narrow builds, the problems specific to narrow builds have also been fixed, for example:
    • len() now always returns 1 for non-BMP characters, so len('\U0010FFFF') == 1;
    • surrogate pairs are not recombined in string literals, so '\uDBFF\uDFFF' != '\U0010FFFF';
    • indexing or slicing non-BMP characters returns the expected value, so '\U0010FFFF'[0] now returns '\U0010FFFF' and not '\uDBFF';
    • all other functions in the standard library now correctly handle non-BMP codepoints.
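
If you want to check those release-note claims on your own interpreter, here is a small sketch. It assumes Python 3.3 or later, and the function name pep393_checks is just something made up for this example.

```python
def pep393_checks():
    """Evaluate the PEP 393 examples from the release notes on this interpreter."""
    astral = "\U0010FFFF"          # a single non-BMP code point
    surrogates = "\uDBFF\uDFFF"    # two lone surrogates, left uncombined
    return {
        "len_is_one": len(astral) == 1,
        "index_returns_whole_char": astral[0] == astral,
        "surrogates_not_recombined": surrogates != astral,
        "surrogate_pair_length": len(surrogates),
    }

for name, value in pep393_checks().items():
    print(name, value)
```

On a narrow build of an older Python the first two checks would come out differently, which is exactly the class of problem the excerpt describes.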

The Future of Python Regexes

In contrast to what’s currently available in the standard Python distribution’s re library, Matthew Barnett’s regex module for both Python 2 and Python 3 alike is much, much better in pretty much all possible ways and will quite probably replace re eventually. Its particular relevance to your question is that his regex library is far more ᴘᴄʀᴇ (i.e. it’s much more Perl‐compatible) in every way than re now is, which will make porting Perl regexes to Python easier for you. Because it is a ground‐up rewrite (as in from‐scratch, not as in hamburger :), it was written with non-ASCII in mind, which re was not.

The regex library therefore much more closely follows the (current) recommendations of UTS#18: Unicode Regular Expressions in how it approaches things. It meets or exceeds the UTS#18 Level 1 requirements in most if not all regards, something you normally have to use the ICU regex library or Perl itself for; or, if you are especially courageous, the new Java 7 update to its regexes, as that also conforms to the Level One requirements from UTS#18.

Beyond meeting those Level One requirements, which are all absolutely essential for basic Unicode support, but which are not met by Python’s current re library, the awesome regex library also meets the Level Two requirements for RL2.5 Named Characters (\N{...}), RL2.2 Extended Grapheme Clusters (\X), and the new RL2.7 on Full Properties from revision 14 of UTS#18.

Matthew’s regex module also does Unicode casefolding so that case‐insensitive matches work reliably on Unicode, which re does not.

The following is no longer true, because regex now supports full Unicode casefolding, like Perl and Ruby.

One super‐tiny difference is that for now, Perl’s case‐insensitive patterns use full string‐oriented casefolds while his regex module still uses simple single‐char‐oriented casefolds, but this is something he’s looking into. It’s actually a very hard problem, one which apart from Perl, only Ruby even attempts.

Under full casefolding, this means that (for example) "ß" now correctly matches "SS", "ss", "ſſ", "ſs" (etc.) when case-insensitive matching is selected. (This is admittedly more important in the Greek script than the Latin one.)
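
To see what full casefolding buys you without installing anything, here is a rough sketch using only the standard library. It compares strings with str.casefold() (Python 3.3+), which applies full Unicode case folding, and contrasts that with re.IGNORECASE. This is a plain string comparison, not regex matching, and the helper name folds_equal is invented for the example.

```python
import re

def folds_equal(a, b):
    """Compare two strings under full Unicode case folding."""
    return a.casefold() == b.casefold()

# Each of these folds to "ss", just as "ß" does.
candidates = ("SS", "ss", "ſſ", "ſs")
print([folds_equal("ß", s) for s in candidates])

# The standard re module only folds one character to one character,
# so a single "ß" cannot match the two characters "SS".
print(re.fullmatch("ß", "SS", re.IGNORECASE))
```

The first line printed is all True values; the second is None, because re has no way to let one character match two.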

See also the slides or doc source code from my third OSCON2011 talk entitled "Unicode Support Shootout: The Good, the Bad, and the (mostly) Ugly" for general issues in Unicode support across JavaScript, PHP, Go, Ruby, Python, Java, and Perl. If you can’t use either Perl regexes or possibly the ICU regex library (which doesn’t have named captures, alas!), then Matthew’s regex for Python is probably your best shot.

Nᴏᴛᴀ Bᴇɴᴇ s.ᴠ.ᴘ. (= s’il vous plaît, et même s’il ne vous plaît pas :) The following unsolicited noncommercial nonadvertisement was not actually put here by the author of the Python regex library. :)

The Python regex library has a cornucopia of superneat features, some of which are found in no other regex system anywhere. These make it very much worth checking out no matter whether you happen to be using it for its ᴘᴄʀᴇ‐ness or its stellar Unicode support.

A few of this module’s outstanding features of interest are:

  • Variable‐width lookbehind, a feature which is quite rare in regex engines and very frustrating not to have when you really want it. This may well be the most frequently requested feature in regexes.
  • Backwards searching so you don’t have to reverse your string yourself first.
  • Scoped ismx‐type options, so that (?i:foo) only casefolds for foo, not overall, or (?-i:foo) to turn it off just on foo. This is how Perl works (or can).
  • Fuzzy matching based on edit‐distance (which Udi Manber’s agrep and glimpse also have).
  • Implicit shortest‐to‐longest sorted named lists via \L<list> interpolation.
  • Metacharacters that specifically match only the start or only the end of a word rather than either side (\m, \M).
  • Support for all Unicode line separators (Java can do this, as can Perl albeit somewhat begrudgingly with \R per RL1.6).
  • Full set operations (union, intersection, difference, and symmetric difference) on bracketed character classes per RL1.3, which is much easier than getting at it in Perl.
  • Allows for repeated capture groups like (\w+\s+)+ where you can get all separate matches of the first group, not just its last match. (I believe C# might also do this.)
  • A more straightforward way to get at overlapping matches than sneaky capture groups in lookaheads.
  • Start and end positions for all groups for later slicing/substring operations, much like Perl’s @+ and @- arrays.
  • The branch‐reset operator via (?|...|...|...|) to reset group numbering in each branch the way it works in Perl.
  • Can be configured to have your coffee waiting for you in the morning.
  • Support for the more sophisticated word boundaries from RL2.3.
  • Assumes Unicode strings by default, and fully supports RL1.2a so that \w, \b, \s, and such work on Unicode.
  • Supports \X for graphemes.
  • Supports the \G continuation point assertion.
  • Works correctly for 64‐bit builds (re only has 32‐bit indices).
  • Supports multithreading.
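
A few of the features listed above can be tried out with a short sketch like the one below. It needs the third-party regex module to be installed separately (it is not part of the standard library), so the code returns None if the import fails. The function name regex_feature_tour and the sample strings are invented for illustration; this shows only a handful of the features and is not a tour of the whole API.

```python
try:
    import regex  # third-party module by Matthew Barnett, installed separately
except ImportError:
    regex = None

def regex_feature_tour():
    """Return a dict of small results, or None when regex is not installed."""
    if regex is None:
        return None
    # Named list: the alternatives are passed in as a keyword argument.
    words = regex.compile(r"\L<words>", words=["cat", "dog"])
    return {
        # Variable-width lookbehind: \s* inside (?<=...) has no fixed width.
        "lookbehind": regex.search(r"(?<=\$\s*)\d+", "cost: $  42").group(),
        # Repeated capture group: captures() keeps every repetition.
        "captures": regex.match(r"(\w+\s+)+", "one two three ").captures(1),
        # Overlapping matches without lookahead tricks.
        "overlapped": regex.findall(r"..", "abcde", overlapped=True),
        # Fuzzy matching: allow at most one edit (here, the missing "u").
        "fuzzy": regex.fullmatch(r"(?:colour){e<=1}", "color") is not None,
        "named_list": words.search("a dog barks").group(),
    }

print(regex_feature_tour())
```

With the standard re module, the lookbehind pattern is rejected at compile time because its width is not fixed, and a match object only keeps the last repetition of a repeated group.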

Ok, that’s enough hype. :)

One final alternative that is worth looking at if you are a regex geek is the Python library bindings to Russ Cox’s awesome RE2 library. It also supports Unicode natively, including simple char‐based casefolding, and unlike re it notably provides for both the Unicode General Category and the Unicode Script character properties, which are the two key properties you most often need for the simpler kinds of Unicode processing.

Although RE2 misses out on a few Unicode features like \N{...} named character support found in ICU, Perl, and Python, it has extremely serious computational advantages that make it the regex engine of choice whenever you’re concerned with starvation‐based denial‐of‐service attacks through regexes in web queries and such. It manages this by forbidding backreferences, which cause a regex to stop being regular and risk super‐exponential explosions in time and space.
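
To make "backreference" concrete, here is a tiny sketch with the standard re module (not RE2). The pattern uses \1 to demand that the same word appear twice, which is something no truly regular expression can express; the names DOUBLED and is_doubled are just for this example.

```python
import re

# \1 must repeat exactly what group 1 captured, whatever that was.
DOUBLED = re.compile(r"(\w+) \1")

def is_doubled(text):
    """True when text is one word, a space, and the very same word again."""
    return DOUBLED.fullmatch(text) is not None

print(is_doubled("abc abc"))
print(is_doubled("abc abd"))
```

A backtracking engine such as re accepts this pattern, whereas an engine that forbids backreferences, as described above for RE2, has to reject it.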

Library bindings for RE2 are available not just for C/C++ and Python, but also for Perl and most especially for Go, where it is slated to very shortly replace the standard regex library there.
