Why are uncompiled, repeatedly used regexes so much slower in Python 3?
Question
When answering this question (and having read this answer to a similar question), I thought that I knew how Python caches regexes.
But then I thought I'd test it, comparing two scenarios:
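- A single compilation of one regex, followed by 10 applications of that compiled regex.
- 10 applications of an uncompiled regex (where I would have expected slightly worse performance, because the regex has to be compiled once, then cached, and then looked up in the cache 9 times).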
However, the results were staggering (in Python 3.3):
>>> import timeit
>>> timeit.timeit(setup="import re",
... stmt='r=re.compile(r"\w+")\nfor i in range(10):\n r.search(" jkdhf ")')
18.547793477671938
>>> timeit.timeit(setup="import re",
... stmt='for i in range(10):\n re.search(r"\w+"," jkdhf ")')
106.47892003890324
That's over 5.7 times slower! In Python 2.7, there is still an increase by a factor of 2.5, which is also more than I would have expected.
Has caching of regexes changed between Python 2 and 3? The docs don't seem to suggest that.
Recommended answer
The code has changed.
In Python 2.7, the cache is a simple dictionary; if more than _MAXCACHE items are stored in it, the whole cache is cleared before a new item is stored. A cache lookup only requires building a simple key and testing the dictionary; see the 2.7 implementation of _compile().
In Python 3.x, the cache has been replaced by the @functools.lru_cache(maxsize=500, typed=True) decorator. This decorator does much more work per call: it takes a thread lock, adjusts the cache's LRU queue, and maintains cache statistics (accessible via re._compile.cache_info()). See the 3.3.0 implementation of _compile() and of functools.lru_cache().
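To see what the decorator-based approach looks like from the outside, here is a small sketch that wraps re.compile in functools.lru_cache with the same arguments quoted above. The wrapper name compile_with_lru_cache is hypothetical; it stands in for the module-private re._compile and is not the library's own code.

```python
import functools
import re


@functools.lru_cache(maxsize=500, typed=True)
def compile_with_lru_cache(pattern, flags=0):
    """Hypothetical stand-in for a regex compile function behind lru_cache."""
    return re.compile(pattern, flags)


# Ten applications of the same uncompiled pattern, as in the question.
for _ in range(10):
    compile_with_lru_cache(r"\w+").search(" jkdhf ")

# The decorator keeps statistics for every call it handles.
info = compile_with_lru_cache.cache_info()
print(info.hits, info.misses, info.currsize)
```

Each of the nine hits still passes through the decorator's wrapper, which is where the extra per-call work described in the answer takes place.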
Others have noticed the same slowdown, and filed issue 16389 in the Python bugtracker. I'd expect 3.4 to be a lot faster again; either the lru_cache implementation is improved or the re module will move to a custom cache again.
Update: With revision 4b4dddd670d0 (hg) / 0f606a6 (git) the cache change has been reverted back to the simple version found in 3.1. Python versions 3.2.4 and 3.3.1 include that revision.
Since then, in Python 3.7 the pattern cache was updated to a custom FIFO cache implementation based on a regular dict. It relies on insertion order and, unlike an LRU cache, does not take into account how recently items already in the cache were used when evicting.