“等效"的时间完全不同永远不匹配的正则表达式? [英] Drastically different timing on "equivalent" never-matching regex?

查看:60
本文介绍了“等效"的时间完全不同永远不匹配的正则表达式?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我最近为问题计时了一堆正则表达式永远不会被任何东西匹配的正则表达式"(我的回答在这里,有关详细信息,请参阅).

I recently timed a bunch of regexes for the question "A Regex that will never be matched by anything" (my answer here, see for more information).

然而,在我的测试之后,我注意到正则表达式 'a^''x^' 花费了截然不同的时间来检查,尽管它们应该是相同的.(我什至只是偶然地切换了角色.)这些时间如下.

However, after my testing I noticed that the regex 'a^' and 'x^' took drastically different amounts of time to check, although they should be identical. (It was only by chance that I even switched the character.) These timings are below.

In [1]: import re

In [2]: with open('/tmp/longfile.txt') as f:
   ...:     longfile = f.read()
   ...:     

In [3]: len(re.findall('\n',longfile))
Out[3]: 275000

In [4]: len(longfile)
Out[4]: 24733175

...

In [45]: %timeit re.search('x^',longfile)
6.89 ms ± 31.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [46]: %timeit re.search('a^',longfile)
37.2 ms ± 739 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [47]: %timeit re.search(' ^',longfile)
49.8 ms ± 844 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

在线测试(仅前 50 行)显示相同的行为(1441880 步和约 710 毫秒与仅 40858 步和约 113 毫秒):https://regex101.com/r/AwaHmK/1

Online testing (with just the first 50 lines) shows the same behavior (1441880 steps and ~710ms vs only 40858 steps and ~113ms): https://regex101.com/r/AwaHmK/1

Python 在这里做什么使 'a^''x^' 花费的时间长这么多?

What is Python doing here that makes 'a^' take so much longer than 'x^'?

只是为了看看 timeit 或 IPython 内部是否发生了什么,我自己编写了一个简单的计时函数,然后一切都检查了:

Just to see if there was something going on inside timeit or IPython, I wrote a simple timing function myself, and everything checks out:

In [57]: import time

In [59]: import numpy as np

In [62]: def timing(regex,N=7,n=100):
    ...:     tN = []
    ...:     for i in range(N):
    ...:         t0 = time.time()
    ...:         for j in range(n):
    ...:             re.search(regex,longfile)
    ...:         t1 = time.time()
    ...:         tN.append((t1-t0)/n)
    ...:     return np.mean(tN)*1000, np.std(tN)*1000
    ...: 

In [63]: timing('a^')
Out[63]: (37.414282049451558, 0.33898056279589844)

In [64]: timing('x^')
Out[64]: (7.2061508042471756, 0.22062989840321218)

我还在 IPython 之外的标准 3.5.2 shell 中复制了我的结果.因此,奇怪之处不限于 IPython 或 timeit.

I also replicated my results outside of IPython, in a standard 3.5.2 shell. So the oddity is not constrained to either IPython or timeit.

推荐答案

如链接问题中所述,此正则表达式扫描整个文本.

这里看到的时间差异只是因为a"是英文文本中的一个常见字母,并且您使用了可读数据.因此,如果您研究了正则表达式引擎的工作原理,您就会明白:使用模式 a^ 会导致更多的延迟,因为在第一个a"上找到了临时匹配,然后会被拒绝.引擎有两个读取头",它们都从左向右移动 - 一个在字符串中移动,另一个在正则表达式模式中移动.将 a^ 模式与您选择的人类可读数据结合使用,正则表达式引擎必须做更多的工作.由于x"在语料库中是一个不常见的字母,使用模式 x^ 浪费更少的时间 - 文本中的更多位置可以立即被拒绝.

The timing discrepancies seen here are just because "a" is such a common letter in English text, and you used readable data. So, if you study how regex engines work, you will understand: using the pattern a^ causes many more delays due to finding the tentative matches on the first "a", which then get rejected later. The engine has two "reading heads" which both move from left to right - one moves in the string, the other moves in the regex pattern. Using the a^ pattern in combination with your choice of human-readable data, the regex engine has to do more work. Since "x" is an uncommon letter in the corpus, using the pattern x^ wastes less time - more positions in the text can be rejected immediately.

  • 如果你用另一个常见的字母开始模式,比如e",它也会很慢(使用e^甚至比a^还要慢,因为e"在英语中出现的频率更高).
  • 如果您使用随机 ascii 字节作为语料库,而不是真实文本,a^x^ 模式的表现将类似.
  • If you start the pattern with another common letter, such as "e", it will be just as slow (using e^ will be even slower than a^, because "e" is more frequently appearing in English).
  • If you use random ascii bytes for the corpus, instead of real text, both a^ and x^ patterns will perform similarly.

总而言之,考虑到更广泛的正则表达式上下文时,这两个等效的"从不匹配的正则表达式模式 a^x^ 实际上并不是那么等效引擎的内部工作原理和所选的测试数据.

In summary, these two "equivalent" never-matching regex patterns a^ and x^ are actually not so equivalent when taking into account the wider context of the regex engine's inner workings and the chosen test data.

这篇关于“等效"的时间完全不同永远不匹配的正则表达式?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆