python:正则表达式匹配字符串的数字范围 [英] python: number range to regex matching string

查看:130
本文介绍了python:正则表达式匹配字符串的数字范围的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这可能是一个已经解决的问题,但我想不通.我有两个较大的整数,我们称它们为 start_numberend_number(它们代表一个连续的电话号码块).其他数字(表示为字符串)将输入到我的系统中,我需要使用正则表达式将其与范围正则表达式"进行匹配,以查看数字字符串是否落在 start_number 之间或之间>end_number.

This may be a problem that has already been solved, but I can't figure it out. I have two largish integers, lets call them start_number and end_number (they represent a contiguous block of telephone numbers). Other numbers (represented as strings) will be input into my system and I need to use regex to match this against the 'range regex' to see if the number string falls either on or between start_number and end_number.

例如:

  • start_number = 99519000
  • end_number = 99519099

因此

  • expression = "^995190[0-9][0-9]$"

以便我最终可以匹配以下示例数字(一次一个到达我的系统,并且可以随时到达):

so that I can eventually match the following example numbers (which arrive at my system one at a time, and could arrive at any time):

  • "99519000" <- MATCH
  • "99519055" <- MATCH
  • "99519099" <- MATCH
  • "99519100" <- 不匹配
  • "99512210" <- 不匹配
  • "41234123" <- 不匹配
  • "99519000" <- MATCH
  • "99519055" <- MATCH
  • "99519099" <- MATCH
  • "99519100" <- NOT MATCH
  • "99512210" <- NOT MATCH
  • "41234123" <- NOT MATCH

在给定任何合理的start_numberend_number 的情况下,我如何使用python 创建正则表达式字符串模式expression">?我有几个开始/结束编号块",我必须为其创建正则表达式模式,我只需要一种以编程方式制作这些模式的方法.

How can I use python to create the regex string pattern "expression" given any reasonable start_number and end_number? I have several start/end number 'blocks' I have to create regex patterns for, I just need a way to make these patterns programatically.

可以公平地假设:

  • Start_number 总是小于 end_number
  • 始终为正整数.
  • 就我而言,start_numberend_number 将始终具有相同的长度"(即,当表示为字符串),如果它能让生活更轻松.
  • Start_number will always be less than end_number
  • Always will be a positive integer.
  • In my case, start_number and end_number will always be the same 'length' (i.e. always will have the same number of base 10 'characters' when represented as a string), if it'll make life easier.

为清晰起见

推荐答案

[假设你需要这个,因为它是一些需要正则表达式的奇怪的 3rd 方系统]

[assuming you need this because it's some weird 3rd party system that requires regexp]

新方法

我越想弗雷德里克的评论,我就越同意.regexp 引擎应该能够将其编译为紧凑的 DFA,即使输入字符串很长.在许多情况下,以下是一个明智的解决方案:

the more i think about Frederik's comment, the more i agree. the regexp engine should be able to compile this down to a compact DFA, even if the input string is long. for many cases, the following is a sensible solution:

import re

def regexp(lo, hi):
    fmt = '%%0%dd' % len(str(hi))
    return re.compile('(%s)' % '|'.join(fmt % i for i in range(lo, hi+1)))

(它适用于以下测试中的所有数字范围,包括 99519000 - 99519099.粗略的粗略计算表明 9 位数字大约是 1GB 内存的限制.如果大多数数字那个尺寸是匹配的;如果只有几个匹配,你可以更大.).

(it works fine with all the numerical ranges in the tests below, including 99519000 - 99519099. a rough back-of-the-envelope calculation suggests that 9 digit numbers are about the limit with 1GB of memory. that's if most numbers that size are matched; if only a few are matched you can go much larger.).

旧方法

[再次更新以提供更短的结果 - 除了合并偶尔的 \d\d 之外,它与手工生成的效果一样好]

[updated again to give even shorter results - apart from coalescing the occasional \d\d it's about as good as hand-generated]

假设所有数字的长度相同(即必要时在左侧填充零),这是有效的:

assuming all numbers are the same length (ie you zero pad on the left if necessary), this works:

import re

def alt(*args):
    '''format regexp alternatives'''
    if len(args) == 1: return args[0]
    else: return '(%s)' % '|'.join(args)

def replace(s, c): 
     '''replace all characters in a string with a different character'''
    return ''.join(map(lambda x: c, s))

def repeat(s, n):
    '''format a regexp repeat'''
    if n == 0: return ''
    elif n == 1: return s
    else: return '%s{%d}' % (s, n)

def digits(lo, hi): 
    '''format a regexp digit range'''
    if lo == 0 and hi == 9: return r'\d'
    elif lo == hi: return str(lo)
    else: return '[%d-%d]' % (lo, hi)

def trace(f):
    '''for debugging'''
    def wrapped(lo, hi):
        result = f(lo, hi)
        print(lo, hi, result)
        return result
    return wrapped

#@trace  # uncomment to get calls traced to stdout (explains recursion when bug hunting)
def regexp(lo, hi):
    '''generate a regexp that matches integers from lo to hi only.
       assumes that inputs are zero-padded to the length of hi (like phone numbers).
       you probably want to surround with ^ and $ before using.'''

    assert lo <= hi
    assert lo >= 0

    slo, shi = str(lo), str(hi)
    # zero-pad to same length
    while len(slo) < len(shi): slo = '0' + slo
    # first digits and length
    l, h, n = int(slo[0]), int(shi[0]), len(slo)

    if l == h:
        # extract common prefix
        common = ''
        while slo and slo[0] == shi[0]:
            common += slo[0]
            slo, shi = slo[1:], shi[1:]
        if slo: return common + regexp(int(slo), int(shi))
        else: return common

    else:
        # the core of the routine.
        # split into 'complete blocks' like 200-599 and 'edge cases' like 123-199
        # and handle each separately.

        # are these complete blocks?
        xlo = slo[1:] == replace(slo[1:], '0')
        xhi = shi[1:] == replace(shi[1:], '9')

        # edges of possible complete blocks
        mlo = int(slo[0] + replace(slo[1:], '9'))
        mhi = int(shi[0] + replace(shi[1:], '0'))

        if xlo:
            if xhi:
                # complete block on both sides
                # this is where single digits are finally handled, too.
                return digits(l, h) + repeat('\d', n-1)
            else:
                # complete block to mhi, plus extra on hi side
                prefix = '' if l or h-1 else '0'
                return alt(prefix + regexp(lo, mhi-1), regexp(mhi, hi))
        else:
            prefix = '' if l else '0'
            if xhi:
                # complete block on hi side plus extra on lo
                return alt(prefix + regexp(lo, mlo), regexp(mlo+1, hi))
            else:
                # neither side complete, so add extra on both sides
                # (and maybe a complete block in the middle, if room)
                if mlo + 1 == mhi:
                    return alt(prefix + regexp(lo, mlo), regexp(mhi, hi))
                else:
                    return alt(prefix + regexp(lo, mlo), regexp(mlo+1, mhi-1), regexp(mhi, hi))


# test a bunch of different ranges
for (lo, hi) in [(0, 0), (0, 1), (0, 2), (0, 9), (0, 10), (0, 11), (0, 101),
                 (1, 1), (1, 2), (1, 9), (1, 10), (1, 11), (1, 101),
                 (0, 123), (111, 123), (123, 222), (123, 333), (123, 444),
                 (0, 321), (111, 321), (222, 321), (321, 333), (321, 444),
                 (123, 321), (111, 121), (121, 222), (1234, 4321), (0, 999),
                 (99519000, 99519099)]:
    fmt = '%%0%dd' % len(str(hi))
    rx = regexp(lo, hi)
    print('%4s - %-4s  %s' % (fmt % lo, fmt % hi, rx))
    m = re.compile('^%s$' % rx)
    for i in range(0, 1+int(replace(str(hi), '9'))):
        if m.match(fmt % i):
            assert lo <= i <= hi, i
        else:
            assert i < lo or i > hi, i

函数 regexp(lo, hi) 构建一个匹配 lohi 之间的值的正则表达式(零填充到最大长度).您可能需要在之前放置一个 ^ 和一个 $ 之后(如在测试代码中)以强制匹配为整个字符串.

the function regexp(lo, hi) builds a regexp that matches values between lo and hi (zero padded to the maximum length). you probably need to put a ^ before and a $ after (as in the test code) to force the match to be the whole string.

该算法实际上非常简单——它递归地将事物划分为公共前缀和完整块".一个完整的块类似于 200-599 并且可以可靠地匹配(在这种情况下使用 [2-5]\d{2}).

the algorithm is actually quite simple - it recursively divides things into common prefixes and "complete blocks". a complete block is something like 200-599 and can be matched reliably (in this case with [2-5]\d{2}).

所以 123-599 被分成 123-199 和 200-599.后半部分是一个完整的块,前半部分有1和23-99的公共前缀,递归处理为23-29(公共前缀)和30-99(完整块)(我们最终终止,因为参数每次调用都比初始输入短).

so 123-599 is split into 123-199 and 200-599. the last half is a complete block, the first half has a common prefix of 1 and 23-99, which is recursively handled as 23-29 (common prefix) and 30-99 (complete block) (and we terminate eventually, because arguments to each call are shorter than the initial input).

唯一令人讨厌的细节是 prefix,这是必需的,因为 regexp() 的参数是整数,所以当被调用以生成 00 的正则表达式时-09 它实际上生成了 0-9 的正则表达式,没有前导 0.

the only nasty detail is the prefix, which is needed because the arguments to regexp() are integers, so when called to generate, say, the regexp for 00-09 it actually generates the regexp for 0-9, without the leading 0.

输出是一堆测试用例,显示范围和正则表达式:

the output is a bunch of test cases, showing the range and regexp:

   0 - 0     0
   0 - 1     [0-1]
   0 - 2     [0-2]
   0 - 9     \d
  00 - 10    (0\d|10)
  00 - 11    (0\d|1[0-1])
 000 - 101   (0\d\d|10[0-1])
   1 - 1     1
   1 - 2     [1-2]
   1 - 9     [1-9]
  01 - 10    (0[1-9]|10)
  01 - 11    (0[1-9]|1[0-1])
 001 - 101   (0(0[1-9]|[1-9]\d)|10[0-1])
 000 - 123   (0\d\d|1([0-1]\d|2[0-3]))
 111 - 123   1(1[1-9]|2[0-3])
 123 - 222   (1(2[3-9]|[3-9]\d)|2([0-1]\d|2[0-2]))
 123 - 333   (1(2[3-9]|[3-9]\d)|2\d\d|3([0-2]\d|3[0-3]))
 123 - 444   (1(2[3-9]|[3-9]\d)|[2-3]\d{2}|4([0-3]\d|4[0-4]))
 000 - 321   ([0-2]\d{2}|3([0-1]\d|2[0-1]))
 111 - 321   (1(1[1-9]|[2-9]\d)|2\d\d|3([0-1]\d|2[0-1]))
 222 - 321   (2(2[2-9]|[3-9]\d)|3([0-1]\d|2[0-1]))
 321 - 333   3(2[1-9]|3[0-3])
 321 - 444   (3(2[1-9]|[3-9]\d)|4([0-3]\d|4[0-4]))
 123 - 321   (1(2[3-9]|[3-9]\d)|2\d\d|3([0-1]\d|2[0-1]))
 111 - 121   1(1[1-9]|2[0-1])
 121 - 222   (1(2[1-9]|[3-9]\d)|2([0-1]\d|2[0-2]))
1234 - 4321  (1(2(3[4-9]|[4-9]\d)|[3-9]\d{2})|[2-3]\d{3}|4([0-2]\d{2}|3([0-1]\d|2[0-1])))
 000 - 999   \d\d{2}
99519000 - 99519099  995190\d\d

因为最后一个测试循环超过 99999999 个数字,所以运行需要一段时间.

it takes a while to run as the last test loops over 99999999 numbers.

表达式应该足够紧凑以避免任何缓冲区限制(我猜最坏情况下内存大小与最大数字位数的平方成正比).

the expressions should be compact enough to avoid any buffer limits (i would guess that memory size worst case is proportional to the square of the number of digits in the largest number).

ps 我使用的是 python 3,但我认为它在这里没有太大区别.

ps i am using python 3, but i don't think it makes much difference here.

这篇关于python:正则表达式匹配字符串的数字范围的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆