Compiling Regular Expressions in Python


Question

I'm working through Doug Hellman's "The Python Standard Library by Example" and came across this:

"1.3.2 Compiling Expressions re includes module-level functions for working with regular expressions as text strings, but it is more efficient to compile the expressions a program uses frequently."

I couldn't follow his explanation for why this is the case. He says that the "module-level functions maintain a cache of compiled expressions" and that since the "size of the cache" is limited, "using compiled expressions directly avoids the cache lookup overhead."

I'd greatly appreciate it if someone could please explain or direct me to an explanation that I could better understand for why it is more efficient to compile the regular expressions a program uses frequently, and how this process actually works.

Answer

Hm. This is strange. My knowledge so far (gained, among other sources, from this question) suggested my initial answer:

Python caches the last 100 regexes that you used, so even if you don't compile them explicitly, they don't have to be recompiled at every use.

However, there are two drawbacks: When the limit of 100 regexes is reached, the entire cache is nuked, so if you use 101 different regexes in a row, each one will be recompiled every time. Well, that's rather unlikely, but still.
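This overflow behaviour can be observed through CPython's cache internals. Note that `re._cache`, `re._MAXCACHE`, and the eviction strategy are private implementation details that differ between versions (recent CPython evicts one old entry at a time rather than clearing everything, and the limit is larger than 100); only `re.purge()` is a public API:

```python
import re

re.purge()  # start from an empty pattern cache

# Compile more distinct patterns than the cache can hold.
for i in range(re._MAXCACHE + 10):
    re.search("x{%d}" % i, "")

# The cache never grows past its limit; older entries were evicted
# (wholesale in old CPython versions, one at a time in recent ones).
print(len(re._cache) <= re._MAXCACHE)  # True
```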

Second, in order to find out if a regex has been compiled already, the interpreter needs to look up the regex in the cache every time, which takes a little extra time (though not much, since dictionary lookups are very fast).

So, if you explicitly compile your regexes, you avoid this extra lookup step.
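A minimal sketch of the two styles (using the same throwaway pattern and test string as the timings below):

```python
import re

# Module-level function: the pattern string is looked up in re's
# internal cache on every call (and compiled on a cache miss).
m1 = re.search(r"\w+", "  jkdhf  ")

# Explicit compilation: the compiled pattern object is kept in a
# variable, so later calls skip the cache lookup entirely.
pattern = re.compile(r"\w+")
m2 = pattern.search("  jkdhf  ")

print(m1.group(), m2.group())  # jkdhf jkdhf
```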

I just did some testing (Python 3.3):

>>> import timeit
>>> timeit.timeit(setup="import re", stmt='''r=re.compile(r"\w+")\nfor i in range(10):\n r.search("  jkdhf  ")''')
18.547793477671938
>>> timeit.timeit(setup="import re", stmt='''for i in range(10):\n re.search(r"\w+","  jkdhf  ")''')
106.47892003890324

So it would appear that no caching is being done. Perhaps that's a quirk of the special conditions under which timeit.timeit() runs?
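One way to take compilation cost out of the timed loop entirely is to compile in `timeit`'s setup, so both statements time only the search itself. A sketch (absolute numbers vary by machine, so none are claimed here):

```python
import timeit

# Pattern compiled once in setup; the loop pays only for .search().
t_compiled = timeit.timeit(
    setup='import re; r = re.compile(r"\\w+")',
    stmt='r.search("  jkdhf  ")',
    number=100000,
)

# Module-level call: every iteration repeats the cache lookup.
t_module = timeit.timeit(
    setup='import re',
    stmt='re.search(r"\\w+", "  jkdhf  ")',
    number=100000,
)

print(t_compiled, t_module)
```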

On the other hand, in Python 2.7, the difference is not as noticeable:

>>> import timeit
>>> timeit.timeit(setup="import re", stmt='''r=re.compile(r"\w+")\nfor i in range(10):\n r.search("  jkdhf  ")''')
7.248294908492429
>>> timeit.timeit(setup="import re", stmt='''for i in range(10):\n re.search(r"\w+","  jkdhf  ")''')
18.26713670282241
