Why are uncompiled, repeatedly used regexes so much slower in Python 3?

Views: 98 · Published: 2020/9/28 4:49:40 · Tags: python, regex, caching

Question

When answering this question (and having read this answer to a similar question), I thought that I knew how Python caches regexes.

But then I thought I'd test it, comparing two scenarios:

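A single compilation of a single regex, followed by 10 applications of that compiled regex.

10 applications of an uncompiled regex (where I would have expected slightly worse performance, because the regex would have to be compiled once, cached, and then looked up in the cache 9 times).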

However, the results were staggering (in Python 3.3):

>>> import timeit
>>> timeit.timeit(setup="import re", 
... stmt='r=re.compile(r"\w+")\nfor i in range(10):\n r.search("  jkdhf  ")')
18.547793477671938
>>> timeit.timeit(setup="import re", 
... stmt='for i in range(10):\n re.search(r"\w+","  jkdhf  ")')
106.47892003890324

That's over 5.7 times slower! In Python 2.7, there is still an increase by a factor of 2.5, which is also more than I would have expected.
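For anyone who wants to rerun the comparison without waiting through timeit's default of one million iterations, here is a minimal script version of the same two scenarios. The function names and the reduced iteration count `N` are my own choices for illustration, and the absolute numbers (and the ratio) will differ by Python version and machine.

```python
import re
import timeit

TEXT = "  jkdhf  "
N = 10_000  # far fewer runs than timeit's default, just to keep this quick


def precompiled():
    # Scenario 1: compile once, then apply the compiled pattern 10 times.
    r = re.compile(r"\w+")
    for _ in range(10):
        r.search(TEXT)


def uncompiled():
    # Scenario 2: call the module-level function 10 times, which goes
    # through the re module's internal pattern cache on every call.
    for _ in range(10):
        re.search(r"\w+", TEXT)


t_pre = timeit.timeit(precompiled, number=N)
t_raw = timeit.timeit(uncompiled, number=N)
print(f"precompiled: {t_pre:.3f}s  uncompiled: {t_raw:.3f}s  "
      f"ratio: {t_raw / t_pre:.2f}")
```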

Has caching of regexes changed between Python 2 and 3? The docs don't seem to suggest that.

Answer

The code has changed.

In Python 2.7, the cache is a simple dictionary; if more than _MAXCACHE items are stored in it, the whole cache is cleared before storing a new item. A cache lookup only takes building a simple key and testing the dictionary; see the 2.7 implementation of _compile().

In Python 3.x, the cache has been replaced by the @functools.lru_cache(maxsize=500, typed=True) decorator. This decorator does much more work and includes a thread-lock, adjusting the cache LRU queue and maintaining the cache statistics (accessible via re._compile.cache_info()). See the 3.3.0 implementation of _compile() and of functools.lru_cache().
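As a small illustration of the decorator-based approach, the sketch below wraps a hypothetical helper, `lru_compile`, with the same decorator arguments and reads back the statistics it maintains. It deliberately avoids touching `re` internals, since those differ between Python versions.

```python
import functools
import re


@functools.lru_cache(maxsize=500, typed=True)
def lru_compile(pattern, flags=0):
    # Every call, hit or miss, goes through the lru_cache wrapper, which
    # updates the recency order and the hit/miss counters.
    return re.compile(pattern, flags)


first = lru_compile(r"\w+")    # miss: the pattern is compiled and stored
second = lru_compile(r"\w+")   # hit: the stored object is returned

info = lru_compile.cache_info()
print(info.hits, info.misses, info.currsize)  # 1 1 1
```

That per-call bookkeeping is the extra work that a plain dictionary lookup does not have to do.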

Others have noticed the same slowdown, and filed issue 16389 in the Python bugtracker. I'd expect 3.4 to be a lot faster again; either the lru_cache implementation is improved or the re module will move to a custom cache again.

Update: With revision 4b4dddd670d0 (hg) / 0f606a6 (git) the cache change has been reverted back to the simple version found in 3.1. Python versions 3.2.4 and 3.3.1 include that revision.

Since then, in Python 3.7 the pattern cache was updated to a custom FIFO cache implementation based on a regular dict (relying on insertion order; unlike an LRU cache, it does not take into account how recently items already in the cache were used when evicting).
