Python re module becomes 20 times slower when looping on more than 100 different regexes

Problem description

My problem is about parsing log files and removing variable parts on each line in order to group them. For instance:

s = re.sub(r'(?i)User [_0-9A-z]+ is ', r"User .. is ", s)
s = re.sub(r'(?i)Message rejected because : (.*?) \(.+\)', r'Message rejected because : \1 (...)', s)

I have about 120+ matching rules like the above.

I have found no performance issues while searching successively on 100 different regexes. But a huge slowdown occurs when applying 101 regexes.

The exact same behavior happens when replacing my rules with

for a in range(100):
    s = re.sub(r'(?i)caught here'+str(a)+':.+', r'( ... )', s)

It got 20 times slower when using range(101) instead.

# range(100)
% ./dashlog.py file.bz2
== Took  2.1 seconds.  ==

# range(101)
% ./dashlog.py file.bz2
== Took  47.6 seconds.  ==
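
For reference, a minimal self-contained version of the loop above, timed with timeit (this is not the original dashlog.py; the input line is made up, and absolute numbers will vary by machine):

import re
import timeit

# Apply N distinct substitution patterns to a made-up line and time the loop.
line = "User alice is caught here42: something happened " * 10

def run(n):
    s = line
    for a in range(n):
        s = re.sub(r'(?i)caught here' + str(a) + r':.+', r'( ... )', s)
    return s

for n in (100, 101):
    seconds = timeit.timeit(lambda: run(n), number=50)
    print("%3d regexes: %.2f s" % (n, seconds))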

Why is such a thing happening? And is there any known workaround?

(Happens on Python 2.6.6/2.7.2 on Linux/Windows.)

Solution

Python keeps an internal cache for compiled regular expressions. Whenever you use one of the top-level functions that takes a regular expression, Python first compiles that expression, and the result of that compilation is cached.

Guess how many items the cache can hold?

>>> import re
>>> re._MAXCACHE
100

The moment you exceed the cache size, Python 2 clears all cached expressions and starts with a clean cache. Python 3 increased the limit to 512 but still does a full clear.
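
To watch the flush happen, one can poke at the private re._cache dictionary (illustration only; _cache and _MAXCACHE are CPython internals of the versions discussed here, not a public API):

import re

re.purge()  # start from an empty cache

# Fill the cache with exactly _MAXCACHE distinct patterns (100 on Python 2.x).
for i in range(re._MAXCACHE):
    re.search('pattern%d' % i, 'some text')
print(len(re._cache))   # equals re._MAXCACHE: the cache is full

# One more distinct pattern pushes it over the limit and wipes everything.
re.search('one pattern too many', 'some text')
print(len(re._cache))   # back to 1 on Python 2.x (and early Python 3)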

The work-around is for you to cache the compilation yourself:

compiled_expression = re.compile(r'(?i)User [_0-9A-z]+ is ')

compiled_expression.sub(r"User .. is ", s)

You could use functools.partial() to bundle the sub() call together with the replacement expression:

from functools import partial

compiled_expression = re.compile(r'(?i)User [_0-9A-z]+ is ')
ready_to_use_sub = partial(compiled_expression.sub, r"User .. is ")

then later on use ready_to_use_sub(s) to use the compiled regular expression pattern together with a specific replacement pattern.
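
Applied to the 120+ rules from the question, one possible arrangement (a sketch; the rule list below is shortened and purely illustrative) is to compile the whole table once at start-up and loop over the ready-made substitution callables:

import re
from functools import partial

# Each entry is (pattern, replacement); every pattern is compiled exactly
# once, so the 100-entry cache of the top-level re functions never matters.
RULES = [
    (r'(?i)User [_0-9A-z]+ is ', r'User .. is '),
    (r'(?i)Message rejected because : (.*?) \(.+\)',
     r'Message rejected because : \1 (...)'),
    # ... the remaining ~120 rules go here ...
]

SUBS = [partial(re.compile(pattern).sub, replacement)
        for pattern, replacement in RULES]

def normalize(line):
    """Apply every precompiled rule to one log line."""
    for sub in SUBS:
        line = sub(line)
    return line

Each log line then only pays for the matching itself, never for recompilation.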
