Compiling Regular Expressions in Python


Question

I'm working through Doug Hellman's "The Python Standard Library by Example" and came across this:

"1.3.2 Compiling Expressions re includes module-level functions for working with regular expressions as text strings, but it is more efficient to compile the expressions a program uses frequently."

I couldn't follow his explanation for why this is the case. He says that the "module-level functions maintain a cache of compiled expressions" and that since the "size of the cache" is limited, "using compiled expressions directly avoids the cache lookup overhead."

I'd greatly appreciate it if someone could please explain or direct me to an explanation that I could better understand for why it is more efficient to compile the regular expressions a program uses frequently, and how this process actually works.

Answer

Hm. This is strange. My knowledge so far (gained, among other sources, from this question) suggested my initial answer:

Python caches the last 100 regexes that you used, so even if you don't compile them explicitly, they don't have to be recompiled at every use.

However, there are two drawbacks: When the limit of 100 regexes is reached, the entire cache is nuked, so if you use 101 different regexes in a row, each one will be recompiled every time. Well, that's rather unlikely, but still.
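This overflow behaviour can be observed through CPython's cache internals. Note that `re._cache`, `re._MAXCACHE`, and the eviction strategy are private implementation details that differ between versions (recent CPython evicts one old entry at a time rather than clearing everything, and the limit is larger than 100); only `re.purge()` is a public API:

```python
import re

re.purge()  # start from an empty pattern cache

# Compile more distinct patterns than the cache can hold.
for i in range(re._MAXCACHE + 10):
    re.search("x{%d}" % i, "")

# The cache never grows past its limit; older entries were evicted
# (wholesale in old CPython versions, one at a time in recent ones).
print(len(re._cache) <= re._MAXCACHE)  # True
```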

Second, in order to find out if a regex has been compiled already, the interpreter needs to look up the regex in the cache every time, which takes a little extra time (though not much, since dictionary lookups are very fast).

So, if you explicitly compile your regexes, you avoid this extra lookup step.
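A minimal sketch of the two styles (using the same throwaway pattern and test string as the timings below):

```python
import re

# Module-level function: the pattern string is looked up in re's
# internal cache on every call (and compiled on a cache miss).
m1 = re.search(r"\w+", "  jkdhf  ")

# Explicit compilation: the compiled pattern object is kept in a
# variable, so later calls skip the cache lookup entirely.
pattern = re.compile(r"\w+")
m2 = pattern.search("  jkdhf  ")

print(m1.group(), m2.group())  # jkdhf jkdhf
```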

I just did some testing (Python 3.3):

>>> import timeit
>>> timeit.timeit(setup="import re", stmt='''r=re.compile(r"\w+")\nfor i in range(10):\n r.search("  jkdhf  ")''')
18.547793477671938
>>> timeit.timeit(setup="import re", stmt='''for i in range(10):\n re.search(r"\w+","  jkdhf  ")''')
106.47892003890324

So it would appear that no caching is being done. Perhaps that's a quirk of the special conditions under which timeit.timeit() runs?
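One way to take compilation cost out of the timed loop entirely is to compile in `timeit`'s setup, so both statements time only the search itself. A sketch (absolute numbers vary by machine, so none are claimed here):

```python
import timeit

# Pattern compiled once in setup; the loop pays only for .search().
t_compiled = timeit.timeit(
    setup='import re; r = re.compile(r"\\w+")',
    stmt='r.search("  jkdhf  ")',
    number=100000,
)

# Module-level call: every iteration repeats the cache lookup.
t_module = timeit.timeit(
    setup='import re',
    stmt='re.search(r"\\w+", "  jkdhf  ")',
    number=100000,
)

print(t_compiled, t_module)
```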

On the other hand, in Python 2.7, the difference is not as noticeable:

>>> import timeit
>>> timeit.timeit(setup="import re", stmt='''r=re.compile(r"\w+")\nfor i in range(10):\n r.search("  jkdhf  ")''')
7.248294908492429
>>> timeit.timeit(setup="import re", stmt='''for i in range(10):\n re.search(r"\w+","  jkdhf  ")''')
18.26713670282241
