执行 \G 锚定解析循环的 Python 方式是什么? [英] What is the Python way of doing a \G anchored parsing loop?

查看:37
本文介绍了执行 \G 锚定解析循环的 Python 方式是什么?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

接下来是我多年前写的一个perl函数.它是一个智能标记器,可以识别一些可能不应该粘在一起的东西.例如,给定左侧的输入,它将字符串划分为右侧所示:

<块引用>

abc123 ->abc|123abcABC ->abc|ABCABC123 ->ABC|123123abc ->123|美国广播公司123ABC ->123|ABCAbcDef ->Abc|Def(例如 CamelCase)ABCDef ->ABC|定义1stabc ->1st|abc(识别有效序数)1ndabc ->1|ndabc(但不是无效的序数)11thabc ->11th|abc(认识到 11th - 13th 不同于 1st - 3rd)11stabc ->11|stabc

我现在正在做一些机器学习实验,我想做一些使用这个分词器的实验.但首先,我需要将它从 Perl 移植到 Python.这段代码的关键是使用 \G 锚点的循环,我听说在 python 中不存在的东西.我尝试在 Google 上搜索如何在 Python 中完成此操作,但我不确定要搜索的内容,因此我无法找到答案.

你会如何用 Python 编写这个函数?

sub 标记化# 使用特殊规则将字符串分解为标记,# 其中标记是任何字符序列,无论是字母序列,# 数字序列,或非字母数字字符序列# 找到的令牌列表返回给调用者{我的 $value = shift;我的@list = ();我的$word;while ( $value ne '' && $value =~ m/\G # 从上一个停止的地方开始([^a-zA-Z0-9]*) # 捕获非字母数字字符,如果有的话([a-zA-Z0-9]*?) # 捕获直到标记边界的所有内容(?: # 识别令牌边界(?=[^a-zA-Z0-9]) # 下一个字符不是单词字符|(?=[A-Z][a-z]) # 后两个字符高低|(?<=[a-z])(?=[A-Z]) # 低后高|(?<=[a-zA-Z])(?=[0-9]) # 字母后跟数字#序数边界|(?<=^1(?i:st)) # 第一个|(?<=[^1][1](?i:st)) # 第一个但不是第 11 个|(?<=^2(?i:nd)) # 秒|(?<=[^1]2(?i:nd)) # 第二个但不是第 12 个|(?<=^3(?i:rd)) # 第三个|(?<=[^1]3(?i:rd)) # 第三个但不是第 13 个|(?<=1[123](?i:th)) # 11 - 13|(?<=[04-9](?i:th)) # 其他序数# 无序数字字母边界|(?<=^1)(?=[a-zA-Z])(?!(?i)st) # 数字字母但不是第一个|(?<=[^1]1)(?=[a-zA-Z])(?!(?i)st) # 数字字母但不是第 11 个|(?<=^2)(?=[a-zA-Z])(?!(?i)nd) # 数字字母但不是第一个|(?<=[^1]2)(?=[a-zA-Z])(?!(?i)nd) # 数字字母但不是第 12 个|(?<=^3)(?=[a-zA-Z])(?!(?i)rd) # 数字字母但不是第一个|(?<=[^1]3)(?=[a-zA-Z])(?!(?i)rd) # 数字字母但不是第 13 个|(?<=1[123])(?=[a-zA-Z])(?!(?i)th) # 数字字母,但不是第 11-13 个|(?<=[04-9])(?=[a-zA-Z])(?!(?i)th) # 数字字母但不是序数|(?=$) # 字符串结束)/xg ){推@list, $1 如果 $1 ne '';推@list, $2 if $2 ne '';}返回@list;}

我确实尝试使用 re.split() 和上述的变体.但是, split() 拒绝在零宽度匹配上拆分(如果人们真的知道自己在做什么,这种能力应该是可能的).

我确实想出了针对这个特定问题的解决方案,但不是针对我如何使用基于 \G 的解析"的一般问题——我有一些示例代码在使用 \G 和锚定的循环中执行正则表达式然后在正文中它使用另一个锚定在 \G 的匹配来查看继续解析的方式.所以我还在寻找答案.

也就是说,这是我将上述内容翻译成 Python 的最终工作代码:

导入重新IsA = lambda s: '[' + s + ']'IsNotA = lambda s: '[^' + s + ']'上 = IsA('A-Z')下 = IsA('a-z')字母 = IsA('a-zA-Z')数字 = IsA('0-9')字母数字 = IsA('a-zA-Z0-9')NotAlphaNumeric = IsNotA('a-zA-Z0-9')EndOfString = '$'或 = '|'ZeroOrMore = lambda s: s + '*'ZeroOrMoreNonGreedy = lambda s: s + '*?'OneOrMore = lambda s: s + '+'OneOrMoreNonGreedy = lambda s: s + '+?'StartsWith = lambda s: '^' + s捕获 = lambda s: '(' + s + ')'PreceededBy = lambda s: '(?<=' + s + ')'FollowedBy = lambda s: '(?=' + s + ')'NotFollowedBy = lambda s: '(?!' + s + ')'StopWhen = lambda s: sCaseInsensitive = lambda s: '(?i:' + s + ')'ST = '(?:st|ST)'ND = '(?:nd|ND)'RD = '(?:rd|RD)'TH = '(?:th|TH)'def OneOf( *args ):返回 '​​(?:' + '|'.join( args ) + ')'模式 = '(.+?)' + \一个(#ABC |!!!- 在空格或非字母数字边界处中断PreceededBy( AlphaNumeric ) + FollowedBy( NotAlphaNumeric ),PreceededBy( NotAlphaNumeric ) + FollowedBy( AlphaNumeric ),#ABC |Abc - 在看起来像单词或句子开头的地方打断其次是(上 + 下),# abc |ABC - 当小写字母后跟大写字母时中断PreceededBy(Lower)+FollowedBy(Upper),# abc |123 - 在单词和数字之间换行先行者(字母)+后继者(数字),# 第一 |橡木 - 识别字符串何时以序数开头PreceededBy( StartsWith( '1' + ST ) ),PreceededBy( StartsWith( '2' + ND ) ),PreceededBy( StartsWith( '3' + RD ) ),# 第一 |abc - 包含一个序数PreceededBy( IsNotA( '1' ) + '1' + ST),PreceededBy( IsNotA( '1' ) + '2' + ND ),PreceededBy( IsNotA( '1' ) + '3' + RD ),PreceededBy('1' + IsA('123') + TH),PreceededBy( IsA( '04-9' ) + TH ),# 1 |abcde - 识别何时以无序数字/字母边界开头或包含无序数字/字母边界PreceededBy(StartsWith('1'))+FollowedBy(Letter)+NotFollowedBy(ST),PreceededBy( StartsWith( '2' ) ) + FollowedBy( Letter ) + NotFollowedBy( ND ),PreceededBy( StartsWith( '3' ) ) + FollowedBy( Letter ) + NotFollowedBy( RD ),PreceededBy( IsNotA( '1' ) + '1' ) + FollowedBy( Letter ) + NotFollowedBy( ST),PreceededBy( IsNotA( '1' ) + '2' ) + FollowedBy( Letter ) + NotFollowedBy( ND ),PreceededBy( IsNotA( '1' ) + '3' ) + FollowedBy( Letter ) + NotFollowedBy( RD ),PreceededBy( '1' + IsA( '123' ) ) + FollowedBy( Letter ) + NotFollowedBy( TH ),PreceededBy( IsA( '04-9' ) ) + FollowedBy( Letter ) + NotFollowedBy( TH ),# abcde |$ - 字符串的结尾跟随者( EndOfString ))匹配器 = 重新编译(模式)def tokenize( s ):返回 matcher.findall( s )

解决方案

Emulate \G 在正则表达式的开头使用 re.RegexObject.match

您可以使用 re 模块在正则表达式的开头模拟 \G 的效果,方法是跟踪 re.RegexObject.match,强制匹配从 pos 中的指定位置开始.

def tokenize(w):指数 = 0m = matcher.match(w, index)o = []# 虽然 index != m.end() 检查零长度匹配,但更多的是# 防止意外的无限循环.# 不要指望一个可以匹配空字符串的正则表达式工作.# 见警告部分.而 m 和索引 != m.end():o.append(m.group(1))索引 = m.end()m = matcher.match(w, index)返回 o

警告

这个方法的一个警告是它不能很好地与匹配主匹配中的空字符串的正则表达式一起使用,因为 Python 没有任何设施来强制正则表达式重试匹配,同时防止零长度匹配.

例如,re.findall(r'(.??)', 'abc') 返回一个包含 4 个空字符串的数组 ['', '', '', ''],而在 PCRE 中,您可以找到 7 个匹配项 ['', 'a', '', 'b', '', 'c' ''] 其中第 2、第 4 和第 6 场比赛的开始索引分别与第 1、第 3 和第 5 场比赛的索引相同.PCRE 中的其他匹配项是通过使用防止空字符串匹配的标志在相同索引处重试来找到的.

我知道问题是关于 Perl,而不是 PCRE,但全局匹配行为应该是相同的.否则,原始代码将无法工作.

([^a-zA-Z0-9]*)([a-zA-Z0-9]*?)重写为(.+?),正如问题中所做的那样,避免了这个问题,尽管您可能想使用 re.S 标志.

关于正则表达式的其他评论

由于 Python 中不区分大小写的标志会影响整个模式,因此必须重写不区分大小写的子模式.我会将 (?i:st) 重写为 [sS][tT] 以保留原始含义,但使用 (?:st|ST) 如果它是您要求的一部分.

由于 Python 支持带有 re 的 自由间距模式.X 标志,您可以编写类似于您在 Perl 代码中所做的正则表达式:

matcher = re.compile(r'''(.+?)(?: # 识别令牌边界(?=[^a-zA-Z0-9]) # 下一个字符不是单词字符|(?=[A-Z][a-z]) # 后两个字符高低|(?<=[a-z])(?=[A-Z]) # 低后高|(?<=[a-zA-Z])(?=[0-9]) # 字母后跟数字#序数边界|(?<=^1[sS][tT]) # 第一个|(?<=[^1][1][sS][tT]) # 第一个但不是第 11 个|(?<=^2[nN][dD]) # 秒|(?<=[^1]2[nN][dD]) # 第二个但不是第 12 个|(?<=^3[rR][dD]) # 第三个|(?<=[^1]3[rR][dD]) # 第三个但不是第 13 个|(?<=1[123][tT][hH]) # 11 - 13|(?<=[04-9][tT][hH]) # 其他序数# 无序数字字母边界|(?<=^1)(?=[a-zA-Z])(?![sS][tT]) # 数字字母但不是第一个|(?<=[^1]1)(?=[a-zA-Z])(?![sS][tT]) # 数字字母但不是第 11 个|(?<=^2)(?=[a-zA-Z])(?![nN][dD]) # 数字字母但不是第一个|(?<=[^1]2)(?=[a-zA-Z])(?![nN][dD]) # 数字字母但不是第 12 个|(?<=^3)(?=[a-zA-Z])(?![rR][dD]) # 数字字母但不是第一个|(?<=[^1]3)(?=[a-zA-Z])(?![rR][dD]) # 数字字母但不是第 13 个|(?<=1[123])(?=[a-zA-Z])(?![tT][hH]) # 数字字母但不是第 11-13 个|(?<=[04-9])(?=[a-zA-Z])(?![tT][hH]) # 数字字母但不是序数|(?=$) # 字符串结束)''', re.X)

What follows is a perl function I wrote years ago. It is a smart tokenizer that recognizes some instances of things being stuck together that maybe shouldn't be. For example, given the input on the left, it divides the string as shown on the right:

abc123  -> abc|123
abcABC  -> abc|ABC
ABC123  -> ABC|123
123abc  -> 123|abc
123ABC  -> 123|ABC
AbcDef  -> Abc|Def    (e.g. CamelCase)
ABCDef  -> ABC|Def    
1stabc  -> 1st|abc    (recognize valid ordinals)
1ndabc  -> 1|ndabc    (but not invalid ordinals)
11thabc -> 11th|abc   (recognize that 11th - 13th are different than 1st - 3rd)
11stabc -> 11|stabc

I'm now doing some machine learning experimentation, and I'd like to do some experiments that use this tokenizer. But first, I will need to port it from Perl to Python. The key to this code is the loop that uses the \G anchor, something which I hear does not exist in python. I've tried googling for how this is done in Python, but I am not sure what exactly to search for, so I'm having trouble finding an answer.

How would you write this function in Python?

sub Tokenize
# Breaks a string into tokens using special rules,
# where a token is any sequence of characters, be they a sequence of letters, 
# a sequence of numbers, or a sequence of non-alpha-numeric characters
# the list of tokens found are returned to the caller
{
    my $value = shift;
    my @list = ();
    my $word;

    while ( $value ne '' && $value =~ m/
        \G                # start where previous left off
        ([^a-zA-Z0-9]*)   # capture non-alpha-numeric characters, if any
        ([a-zA-Z0-9]*?)   # capture everything up to a token boundary
        (?:               # identify the token boundary
            (?=[^a-zA-Z0-9])       # next character is not a word character 
        |   (?=[A-Z][a-z])         # Next two characters are upper lower
        |   (?<=[a-z])(?=[A-Z])    # lower followed by upper
        |   (?<=[a-zA-Z])(?=[0-9]) # letter followed by digit
                # ordinal boundaries
        |   (?<=^1(?i:st))         # first
        |   (?<=[^1][1](?i:st))    # first but not 11th
        |   (?<=^2(?i:nd))         # second
        |   (?<=[^1]2(?i:nd))      # second but not 12th
        |   (?<=^3(?i:rd))         # third
        |   (?<=[^1]3(?i:rd))      # third but not 13th
        |   (?<=1[123](?i:th))     # 11th - 13th
        |   (?<=[04-9](?i:th))     # other ordinals
                # non-ordinal digit-letter boundaries
        |   (?<=^1)(?=[a-zA-Z])(?!(?i)st)       # digit-letter but not first
        |   (?<=[^1]1)(?=[a-zA-Z])(?!(?i)st)    # digit-letter but not 11th
        |   (?<=^2)(?=[a-zA-Z])(?!(?i)nd)       # digit-letter but not first
        |   (?<=[^1]2)(?=[a-zA-Z])(?!(?i)nd)    # digit-letter but not 12th
        |   (?<=^3)(?=[a-zA-Z])(?!(?i)rd)       # digit-letter but not first
        |   (?<=[^1]3)(?=[a-zA-Z])(?!(?i)rd)    # digit-letter but not 13th
        |   (?<=1[123])(?=[a-zA-Z])(?!(?i)th)   # digit-letter but not 11th - 13th
        |   (?<=[04-9])(?=[a-zA-Z])(?!(?i)th)   # digit-letter but not ordinal
        |   (?=$)                               # end of string
        )
    /xg )
    {
        push @list, $1 if $1 ne '';
        push @list, $2 if $2 ne '';
    }
    return @list;
}

I did try using re.split() with a variation on the above. However, split() refuses to split on a zero-width match (an ability that should be possible if one really knows what one is doing).

I did come up with a solution to this specific problem, but not to the general problem of "how do I use \G based parsing" - I have some sample code that does regexes within loops that are anchored using \G and then in the body it uses another match anchored at \G to see which way to proceed with the parse. So I'm still looking for an answer.

That said, here is my final working code for translating the above to Python:

import re

IsA                 = lambda s: '['  + s + ']'
IsNotA              = lambda s: '[^' + s + ']'

Upper               = IsA( 'A-Z' )
Lower               = IsA( 'a-z' )
Letter              = IsA( 'a-zA-Z' )
Digit               = IsA( '0-9' )
AlphaNumeric        = IsA( 'a-zA-Z0-9' )
NotAlphaNumeric     = IsNotA( 'a-zA-Z0-9' ) 

EndOfString         = '$'
OR                  = '|'

ZeroOrMore          = lambda s: s + '*'
ZeroOrMoreNonGreedy = lambda s: s + '*?'
OneOrMore           = lambda s: s + '+'
OneOrMoreNonGreedy  = lambda s: s + '+?'

StartsWith          = lambda s: '^' + s
Capture             = lambda s: '('    + s + ')'
PreceededBy         = lambda s: '(?<=' + s + ')'
FollowedBy          = lambda s: '(?='  + s + ')'
NotFollowedBy       = lambda s: '(?!'  + s + ')'
StopWhen            = lambda s: s
CaseInsensitive     = lambda s: '(?i:' + s + ')'

ST                  = '(?:st|ST)'
ND                  = '(?:nd|ND)'
RD                  = '(?:rd|RD)'
TH                  = '(?:th|TH)'

def OneOf( *args ):
  return '(?:' + '|'.join( args ) + ')'

pattern = '(.+?)' + \
  OneOf( 
    # ABC | !!! - break at whitespace or non-alpha-numeric boundary
    PreceededBy( AlphaNumeric ) + FollowedBy( NotAlphaNumeric ),
    PreceededBy( NotAlphaNumeric ) + FollowedBy( AlphaNumeric ),

    # ABC | Abc - break at what looks like the start of a word or sentence
    FollowedBy( Upper + Lower ),

    # abc | ABC - break when a lower-case letter is followed by an upper case
    PreceededBy( Lower )  + FollowedBy( Upper ),

    # abc | 123 - break between words and digits
    PreceededBy( Letter ) + FollowedBy( Digit ),

    # 1st | oak - recognize when the string starts with an ordinal
    PreceededBy( StartsWith( '1' + ST ) ),
    PreceededBy( StartsWith( '2' + ND ) ),
    PreceededBy( StartsWith( '3' + RD ) ),

    # 1st | abc - contains an ordinal
    PreceededBy( IsNotA( '1' ) + '1' + ST ),
    PreceededBy( IsNotA( '1' ) + '2' + ND ),
    PreceededBy( IsNotA( '1' ) + '3' + RD ),
    PreceededBy( '1' + IsA( '123' )  + TH ),
    PreceededBy( IsA( '04-9' )       + TH ),

    # 1 | abcde - recognize when it starts with or contains a non-ordinal digit/letter boundary
    PreceededBy( StartsWith( '1' ) ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
    PreceededBy( StartsWith( '2' ) ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
    PreceededBy( StartsWith( '3' ) ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
    PreceededBy( IsNotA( '1' ) + '1' ) + FollowedBy( Letter ) + NotFollowedBy( ST ),
    PreceededBy( IsNotA( '1' ) + '2' ) + FollowedBy( Letter ) + NotFollowedBy( ND ),
    PreceededBy( IsNotA( '1' ) + '3' ) + FollowedBy( Letter ) + NotFollowedBy( RD ),
    PreceededBy( '1' + IsA( '123' ) )  + FollowedBy( Letter ) + NotFollowedBy( TH ),
    PreceededBy( IsA( '04-9' ) )       + FollowedBy( Letter ) + NotFollowedBy( TH ),

    # abcde | $ - end of the string
    FollowedBy( EndOfString )
  )

matcher = re.compile( pattern )

def tokenize( s ):
  return matcher.findall( s )

解决方案

Emulate \G at beginning of a regex with re.RegexObject.match

You can emulate the effect of \G at the beginning of a regex with re module by keeping track of and providing the starting position to re.RegexObject.match, which forces the match to begin at the specified position in pos.

def tokenize(w):
    index = 0
    m = matcher.match(w, index)
    o = []
    # Although index != m.end() check zero-length match, it's more of
    # a guard against accidental infinite loop.
    # Don't expect a regex which can match empty string to work.
    # See Caveat section.
    while m and index != m.end():
        o.append(m.group(1))
        index = m.end()
        m = matcher.match(w, index)
    return o

Caveat

A caveat to this method is that it doesn't play well with regex which matches empty string in the main match, since Python doesn't have any facility to force the regex to retry the match while preventing zero-length match.

As an example, re.findall(r'(.??)', 'abc') returns an array of 4 empty strings ['', '', '', ''], whereas in PCRE, you can find 7 matches ['', 'a', '', 'b', '', 'c' ''] where the 2nd, 4th, and 6th matches start at the same indices as the 1st, 3rd and 5th matches respectively. The additional matches in PCRE are found by retrying at the same indices with a flag which prevents empty string match.

I know the question is about Perl, not PCRE, but the global matching behavior should be the same. Otherwise, the original code couldn't have worked.

Rewriting ([^a-zA-Z0-9]*)([a-zA-Z0-9]*?) to (.+?), as done in the question, avoids this issue, though you might want to use re.S flag.

Other comments on the regex

Since case-insensitive flag in Python affects the whole pattern, the case insensitive sub-patterns have to be rewritten. I would rewrite (?i:st) as [sS][tT] to preserve the original meaning, but go with (?:st|ST) if it's part of your requirement.

Since Python supports the free-spacing mode with re.X flag, you can write your regex similar to what you did in Perl code:

matcher = re.compile(r'''
    (.+?)
    (?:               # identify the token boundary
        (?=[^a-zA-Z0-9])       # next character is not a word character 
    |   (?=[A-Z][a-z])         # Next two characters are upper lower
    |   (?<=[a-z])(?=[A-Z])    # lower followed by upper
    |   (?<=[a-zA-Z])(?=[0-9]) # letter followed by digit
            # ordinal boundaries
    |   (?<=^1[sS][tT])         # first
    |   (?<=[^1][1][sS][tT])    # first but not 11th
    |   (?<=^2[nN][dD])         # second
    |   (?<=[^1]2[nN][dD])      # second but not 12th
    |   (?<=^3[rR][dD])         # third
    |   (?<=[^1]3[rR][dD])      # third but not 13th
    |   (?<=1[123][tT][hH])     # 11th - 13th
    |   (?<=[04-9][tT][hH])     # other ordinals
            # non-ordinal digit-letter boundaries
    |   (?<=^1)(?=[a-zA-Z])(?![sS][tT])       # digit-letter but not first
    |   (?<=[^1]1)(?=[a-zA-Z])(?![sS][tT])    # digit-letter but not 11th
    |   (?<=^2)(?=[a-zA-Z])(?![nN][dD])       # digit-letter but not first
    |   (?<=[^1]2)(?=[a-zA-Z])(?![nN][dD])    # digit-letter but not 12th
    |   (?<=^3)(?=[a-zA-Z])(?![rR][dD])       # digit-letter but not first
    |   (?<=[^1]3)(?=[a-zA-Z])(?![rR][dD])    # digit-letter but not 13th
    |   (?<=1[123])(?=[a-zA-Z])(?![tT][hH])   # digit-letter but not 11th - 13th
    |   (?<=[04-9])(?=[a-zA-Z])(?![tT][hH])   # digit-letter but not ordinal
    |   (?=$)                               # end of string
    )
''', re.X)

这篇关于执行 \G 锚定解析循环的 Python 方式是什么?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆