Is there a way to really pickle compiled regular expressions in python?

Problem description

I have a python console application that contains 300+ regular expressions. The set of regular expressions is fixed for each release. When users run the app, the entire set of regular expressions will be applied anywhere from once (a very short job) to thousands of times (a long job).

I would like to speed up the shorter jobs by compiling the regular expressions up front, pickling the compiled regular expressions to a file, and then loading that file when the application is run.

The Python re module is efficient and the regex compilation overhead is quite acceptable for long jobs. For short jobs, however, it is a large proportion of the overall run-time. Some users will want to run many small jobs to fit into their existing workflows. Compiling the regular expressions takes about 80ms. A short job might take 20ms-100ms excluding regular expression compilation. So for short jobs, the overhead can be 100% or more. This is with Python 2.7 under both Windows and Linux.
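
For reference, a minimal sketch of how that compilation cost can be measured; the generated patterns below are placeholders standing in for the real 300+ expressions, which are not shown in the question:

import re, time

# Hypothetical stand-in patterns; the real application has 300+ fixed expressions.
patterns = [r"token%d\s+\w+\d*" % i for i in range(300)]

start = time.time()
compiled = [re.compile(p, re.DOTALL) for p in patterns]
print "compiled %d patterns in %.1f ms" % (len(compiled), (time.time() - start) * 1000)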

The regular expressions must be applied with the DOTALL flag, so they need to be compiled prior to use. A large compilation cache clearly doesn't help in this instance. As some have pointed out, the default method of serialising a compiled regular expression doesn't actually do much.
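
One way to see this for yourself, using nothing beyond the standard library, is to disassemble the pickle of a compiled pattern (a small illustrative check, not part of the original question):

import re, pickle, pickletools

rx = re.compile(r"a*b+c*", re.DOTALL)
data = pickle.dumps(rx)

# The disassembly shows a single call to re._compile with the pattern string
# and the flag value -- none of the generated opcodes are stored, so loading
# the pickle simply repeats the compilation work.
pickletools.dis(data)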

The re and sre modules compile the patterns into a little custom language with its own opcodes and some auxiliary data structures (e.g., for charsets used in an expression). The pickle function in re.py takes the easy way out. It is:

def _pickle(p):
    return _compile, (p.pattern, p.flags)

copy_reg.pickle(_pattern_type, _pickle, _compile)
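
The "little custom language" mentioned above can be inspected with the same internal helpers that the answer below relies on. A rough sketch, valid for Python 2.x only, using a toy pattern chosen here purely for illustration:

import re, sre_parse, sre_compile

parsed = sre_parse.parse(r"a*b+", re.DOTALL)    # parse tree of (opcode, argument) tuples
code = sre_compile._code(parsed, re.DOTALL)     # flat list of integers understood by _sre

print list(parsed)
print code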

I think that a good solution to the problem would be an update to the definition of _pickle in re.py that actually pickled the compiled pattern object. Unfortunately, this goes beyond my python skills. I bet, however, that someone here knows how to do it.

I realise that I am not the first person to ask this question - but perhaps you can be the first person to give an accurate and useful response to it!

Your suggestions would be greatly appreciated.

Recommended answer

OK, this isn't pretty, but it might be what you want. I looked at the sre_compile.py module from Python 2.6, and ripped out a bit of it, chopped it in half, and used the two pieces to pickle and unpickle compiled regexes:

import re, sre_compile, sre_parse, _sre
import cPickle as pickle

# the first half of sre_compile.compile    
def raw_compile(p, flags=0):
    # internal: convert pattern list to internal format

    if sre_compile.isstring(p):
        pattern = p
        p = sre_parse.parse(p, flags)
    else:
        pattern = None

    code = sre_compile._code(p, flags)

    return p, code

# the second half of sre_compile.compile
def build_compiled(pattern, p, flags, code):
    # print code

    # XXX: <fl> get rid of this limitation!
    if p.pattern.groups > 100:
        raise AssertionError(
            "sorry, but this version only supports 100 named groups"
            )

    # map in either direction
    groupindex = p.pattern.groupdict
    indexgroup = [None] * p.pattern.groups
    for k, i in groupindex.items():
        indexgroup[i] = k

    return _sre.compile(
        pattern, flags | p.pattern.flags, code,
        p.pattern.groups-1,
        groupindex, indexgroup
        )

def pickle_regexes(regexes):
    # compile each pattern only as far as the picklable intermediate pieces
    # (parse tree + opcode list), then pickle that list
    picklable = []
    for r in regexes:
        p, code = raw_compile(r, re.DOTALL)
        picklable.append((r, p, code))
    return pickle.dumps(picklable)

def unpickle_regexes(pkl):
    # rebuild real pattern objects from the pickled pieces via _sre.compile
    regexes = []
    for r, p, code in pickle.loads(pkl):
        regexes.append(build_compiled(r, p, re.DOTALL, code))
    return regexes

regexes = [
    r"^$",
    r"a*b+c*d+e*f+",
    ]

pkl = pickle_regexes(regexes)
print pkl
print unpickle_regexes(pkl)

I don't really know if this works, or if it speeds things up. I know it prints a list of regexes when I try it. It might also be very specific to version 2.6; I don't know.
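
For completeness, one way the helpers above could be wired into the release/start-up workflow described in the question. The file name "regexes.pkl" and the PATTERNS list are hypothetical placeholders, not part of the answer above:

PATTERNS = [r"^$", r"a*b+c*d+e*f+"]   # the real application has 300+ fixed patterns

def build_cache(path="regexes.pkl"):
    # run once per release, before shipping, to write the pickled regexes
    with open(path, "wb") as f:
        f.write(pickle_regexes(PATTERNS))

def load_cache(path="regexes.pkl"):
    # run at application start-up; fall back to plain compilation if the
    # cache file is missing or unreadable
    try:
        with open(path, "rb") as f:
            return unpickle_regexes(f.read())
    except (IOError, pickle.UnpicklingError):
        return [re.compile(p, re.DOTALL) for p in PATTERNS]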
