Why are uncompiled, repeatedly used regexes so much slower in Python 3?


Question

When answering this question (and having read this answer to a similar question), I thought that I knew how Python caches regexes.

But then I thought I'd test it, comparing two scenarios:


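  1. A single compilation of one regex, followed by 10 applications of that compiled regex.

  2. 10 applications of an uncompiled regex (where I expected slightly worse performance, because the regex would have to be compiled once, cached, and then looked up in the cache 9 times).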

However, the results were staggering (in Python 3.3):

>>> import timeit
>>> timeit.timeit(setup="import re", 
... stmt='r=re.compile(r"\w+")\nfor i in range(10):\n r.search("  jkdhf  ")')
18.547793477671938
>>> timeit.timeit(setup="import re", 
... stmt='for i in range(10):\n re.search(r"\w+","  jkdhf  ")')
106.47892003890324

That's over 5.7 times slower! In Python 2.7 the slowdown is still a factor of 2.5, which is also more than I would have expected.
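To repeat the comparison on another interpreter, the two scenarios can be wrapped in functions and timed with a smaller iteration count. This is a minimal sketch rather than the original measurement: the names `precompiled`, `uncompiled` and `compare`, and the default `number=10000`, are assumptions made for illustration, and the absolute timings will differ by machine and Python version.

```python
import re
import timeit

TEXT = "  jkdhf  "
PATTERN = r"\w+"


def precompiled():
    # Scenario 1: compile once, then apply the compiled regex 10 times.
    r = re.compile(PATTERN)
    for _ in range(10):
        r.search(TEXT)


def uncompiled():
    # Scenario 2: call the module-level function 10 times,
    # relying on the re module's internal pattern cache.
    for _ in range(10):
        re.search(PATTERN, TEXT)


def compare(number=10000):
    """Return (seconds for scenario 1, seconds for scenario 2)."""
    t_pre = timeit.timeit(precompiled, number=number)
    t_un = timeit.timeit(uncompiled, number=number)
    return t_pre, t_un


if __name__ == "__main__":
    t_pre, t_un = compare()
    print("precompiled: %.4fs  uncompiled: %.4fs  ratio: %.2f"
          % (t_pre, t_un, t_un / t_pre))
```

The ratio printed at the end is the figure to compare against the 5.7 and 2.5 reported above.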

Has the caching of regexes changed between Python 2 and 3? The docs don't seem to suggest that.

Recommended answer

The code has changed.

In Python 2.7, the cache is a simple dictionary; if it already holds _MAXCACHE items, the whole cache is cleared before a new item is stored. A cache lookup only takes building a simple key and testing the dictionary; see the 2.7 implementation of _compile().

In Python 3.x, the cache has been replaced by the @functools.lru_cache(maxsize=500, typed=True) decorator. This decorator does much more work: it takes a thread lock, adjusts the LRU queue of the cache, and maintains cache statistics (accessible via re._compile.cache_info()). See the 3.3.0 implementation of _compile() and of functools.lru_cache().
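The bookkeeping that lru_cache performs can be observed on any decorated function. Because re._compile only carries cache_info() on the releases that used the decorator, the sketch below applies the same decorator arguments to a hypothetical wrapper, `compile_cached`; the wrapper and `demo` are assumptions for illustration and are not part of the `re` module.

```python
import functools
import re


@functools.lru_cache(maxsize=500, typed=True)
def compile_cached(pattern, flags=0):
    """Hypothetical stand-in for a decorator-cached compile function."""
    return re.compile(pattern, flags)


def demo():
    """Apply the same pattern 10 times and return the cache statistics."""
    compile_cached.cache_clear()
    for _ in range(10):
        compile_cached(r"\w+").search("  jkdhf  ")
    # Named tuple with the fields hits, misses, maxsize and currsize.
    return compile_cached.cache_info()


if __name__ == "__main__":
    print(demo())
```

The first call is a miss and the remaining nine are hits, matching the "compiled once, looked up 9 times" expectation in the question; the extra cost comes from the work done on each of those hits.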

Others have noticed the same slowdown, and filed issue 16389 in the Python bugtracker. I'd expect 3.4 to be a lot faster again; either the lru_cache implementation is improved or the re module will move to a custom cache again.

Update: With revision 4b4dddd670d0 (hg) / 0f606a6 (git), the cache change has been reverted to the simple version found in 3.1. Python versions 3.2.4 and 3.3.1 include that revision.

Since then, in Python 3.7, the pattern cache was changed to a custom FIFO cache built on a regular dict. It relies on insertion order and, unlike an LRU cache, does not take into account how recently items already in the cache were used when evicting.
