Why are uncompiled, repeatedly used regexes so much slower in Python 3?


Problem Description


When answering this question (and having read this answer to a similar question), I thought that I knew how Python caches regexes.

But then I thought I'd test it, comparing two scenarios:

  1. a single compilation of a simple regex, then 10 applications of that compiled regex.
  2. 10 applications of an uncompiled regex (where I would have expected slightly worse performance because the regex would have to be compiled once, then cached, and then looked up in the cache 9 times).

However, the results were staggering (in Python 3.3):

>>> import timeit
>>> timeit.timeit(setup="import re",
... stmt='r=re.compile(r"\w+")\nfor i in range(10):\n r.search("  jkdhf  ")')
18.547793477671938
>>> timeit.timeit(setup="import re",
... stmt='for i in range(10):\n re.search(r"\w+","  jkdhf  ")')
106.47892003890324

That's over 5.7 times slower! In Python 2.7, there is still an increase by a factor of 2.5, which is also more than I would have expected.

Has caching of regexes changed between Python 2 and 3? The docs don't seem to suggest that.
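
(For anyone re-running this today, here is the same comparison written as a standalone script, equivalent to the interactive session above; the absolute numbers will of course differ by machine and Python version.)

import timeit

# One compile, then ten searches with the compiled pattern,
# per timeit iteration (matches the first session above).
compiled = timeit.timeit(
    setup="import re",
    stmt='r = re.compile(r"\\w+")\nfor i in range(10):\n    r.search("  jkdhf  ")',
)

# Ten searches through re.search, which must go through the
# pattern cache on every call (matches the second session above).
uncompiled = timeit.timeit(
    setup="import re",
    stmt='for i in range(10):\n    re.search(r"\\w+", "  jkdhf  ")',
)

print("compiled:  ", compiled)
print("uncompiled:", uncompiled)
print("ratio:     ", uncompiled / compiled)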

Solution

The code has changed.

In Python 2.7, the cache is a simple dictionary; if more than _MAXCACHE items are stored in it, the whole cache is cleared before a new item is stored. A cache lookup costs no more than building a simple key and testing the dictionary; see the 2.7 implementation of _compile().
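
In outline, a cache hit costs little more than a dict lookup. The following is a simplified sketch of that 2.7 logic, not the exact source (_MAXCACHE is 100 there):

import sre_compile  # the compiler module re delegates to

_cache = {}
_MAXCACHE = 100  # the 2.7 limit

def _compile(pattern, flags=0):
    # Sketch: build a simple key and test the dict; a hit returns
    # immediately, with no locking or bookkeeping.
    key = (type(pattern), pattern, flags)
    try:
        return _cache[key]
    except KeyError:
        pass
    p = sre_compile.compile(pattern, flags)
    if len(_cache) >= _MAXCACHE:
        _cache.clear()  # wholesale eviction once the limit is hit
    _cache[key] = p
    return p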

In Python 3.x, the cache has been replaced by the @functools.lru_cache(maxsize=500, typed=True) decorator. This decorator does much more work: it takes a thread lock, adjusts the LRU queue, and maintains cache statistics (accessible via re._compile.cache_info()). See the 3.3.0 implementations of _compile() and of functools.lru_cache().
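
The wrapping itself amounts to this (the decorator and its parameters are as stated above; the function body here is a simplification):

import functools
import sre_compile

@functools.lru_cache(maxsize=500, typed=True)
def _compile(pattern, flags=0):
    # On a hit, lru_cache returns the cached pattern without running
    # this body, but only after taking its lock and updating its
    # LRU bookkeeping; that overhead is paid on every re.search call.
    return sre_compile.compile(pattern, flags)

_compile(r"\w+")
print(_compile.cache_info())  # e.g. CacheInfo(hits=0, misses=1, maxsize=500, currsize=1)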

Others have noticed the same slowdown and filed issue 16389 in the Python bug tracker. I'd expect 3.4 to be a lot faster again: either the lru_cache implementation will be improved, or the re module will move back to a custom cache.

Update: With revision 4b4dddd670d0 (hg) / 0f606a6 (git), the cache change has been reverted to the simple version found in 3.1. Python versions 3.2.4 and 3.3.1 include that revision.

Since then, in Python 3.7, the pattern cache was updated to a custom FIFO cache implementation based on a regular dict (relying on insertion order; unlike an LRU, it does not take into account how recently items already in the cache were used when deciding what to evict).
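
A condensed sketch of that 3.7+ strategy, close in spirit to the actual re._compile (_MAXCACHE is 512 in recent CPython, but treat the details here as illustrative):

import sre_compile

_cache = {}
_MAXCACHE = 512  # limit in recent CPython; illustrative here

def _compile(pattern, flags=0):
    # A plain dict lookup again, as in 2.7; a hit does not move the
    # entry, which is what makes this FIFO rather than LRU.
    key = (type(pattern), pattern, flags)
    try:
        return _cache[key]
    except KeyError:
        pass
    p = sre_compile.compile(pattern, flags)
    if len(_cache) >= _MAXCACHE:
        # Dicts preserve insertion order in 3.7+, so the first key
        # is the oldest entry: evict just that one, not everything.
        del _cache[next(iter(_cache))]
    _cache[key] = p
    return p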
