Python, fastest way to iterate over regular expressions but stop on first match
Question
I have a function that returns True if a string matches at least one regular expression in a list and False otherwise. The function is called often enough that performance is an issue.
When running it through cProfile, the function is spending about 65% of its time doing matches and 35% of its time iterating over the list.
I would think there would be a way to use map() or something but I can't think of a way to have it stop iterating after it finds a match.
Is there a way to make the function faster while still having it return upon finding the first match?
def matches_pattern(str, patterns):
    for pattern in patterns:
        if pattern.match(str):
            return True
    return False
Recommended answer
The first thing that comes to mind is pushing the loop to the C side by using a generator expression:
def matches_pattern(s, patterns):
    return any(p.match(s) for p in patterns)
Probably you don't even need a separate function for that.
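To check that any() with a generator expression really does stop at the first match, here is a minimal sketch. The CountingPattern wrapper and the sample patterns are made up for illustration; they are not part of the original question.

```python
import re


class CountingPattern:
    """Hypothetical wrapper that counts how often match() is called."""

    def __init__(self, pattern):
        self._regex = re.compile(pattern)
        self.calls = 0

    def match(self, s):
        self.calls += 1
        return self._regex.match(s)


def matches_pattern(s, patterns):
    # any() consumes the generator lazily and returns at the first truthy result.
    return any(p.match(s) for p in patterns)


patterns = [CountingPattern(p) for p in ("abc", "12", "xyz")]
result = matches_pattern("123", patterns)
calls = [p.calls for p in patterns]
print(result, calls)  # True [1, 1, 0]
```

The third pattern is never tried, because the second one already matched.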
Another thing you should try out is to build a single, composite regex using the | alternation operator, so that the engine has a chance to optimize it for you. You can also create the regex dynamically from a list of string patterns, if this is necessary:
import re

def matches_pattern(s, patterns):
    return re.match('|'.join('(?:%s)' % p for p in patterns), s)
Of course you need to have your regexes in string form for that to work. Just profile both of these and check which one is faster :)
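If the patterns only exist as compiled objects, their source strings can be recovered through the pattern attribute. A sketch, assuming every pattern was compiled without flags and contains no numbered backreferences (joining would silently change both); build_combined is a hypothetical helper name:

```python
import re


def build_combined(patterns):
    """Join patterns (strings or compiled regexes) into one compiled alternation."""
    sources = [p if isinstance(p, str) else p.pattern for p in patterns]
    # Wrap each alternative in a non-capturing group so that | cannot
    # leak across pattern boundaries.
    return re.compile("|".join("(?:%s)" % src for src in sources))


compiled_patterns = [re.compile(p) for p in (".*abc", "123.*", "foo.*bar")]
combined = build_combined(compiled_patterns)  # build once, reuse many times

hits = [bool(combined.match(s)) for s in ("asdabc", "123awd2", "xxxx")]
print(hits)  # [True, True, False]
```

Building the combined regex once and keeping it around is what makes this approach pay off; rebuilding it on every call repeats the join work each time.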
You might also want to have a look at a general tip for debugging regular expressions in Python. This can also help to find opportunities to optimize.
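The tip referred to is not reproduced here. One standard tool for this kind of inspection, which may or may not be the one meant, is the re.DEBUG flag: compiling with it prints the parsed structure of the pattern. A small sketch that captures the dump as a string:

```python
import contextlib
import io
import re

# re.DEBUG makes re.compile() print debug information about the
# parsed pattern to stdout; it is captured here for inspection.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    re.compile("ab.*", re.DEBUG)

dump = buffer.getvalue()
print(dump)
```

The exact layout of the dump differs between Python versions, so treat it as a reading aid rather than something to parse.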
UPDATE: I was curious and wrote a little benchmark:
import timeit

setup = """
import re

patterns = [".*abc", "123.*", "ab.*", "foo.*bar", "11010.*", "1[^o]*"]*10
strings = ["asdabc", "123awd2", "abasdae23", "fooasdabar", "111", "11010100101", "xxxx", "eeeeee", "dddddddddddddd", "ffffff"]*10
compiled_patterns = list(map(re.compile, patterns))

def matches_pattern(str, patterns):
    for pattern in patterns:
        if pattern.match(str):
            return True
    return False

def test0():
    for s in strings:
        matches_pattern(s, compiled_patterns)

def test1():
    for s in strings:
        any(p.match(s) for p in compiled_patterns)

def test2():
    for s in strings:
        re.match('|'.join('(?:%s)' % p for p in patterns), s)

def test3():
    r = re.compile('|'.join('(?:%s)' % p for p in patterns))
    for s in strings:
        r.match(s)
"""

print(timeit.timeit("test0()", setup=setup, number=1000))
print(timeit.timeit("test1()", setup=setup, number=1000))
print(timeit.timeit("test2()", setup=setup, number=1000))
print(timeit.timeit("test3()", setup=setup, number=1000))
Output on my machine:
1.4120500087738037
1.662621021270752
4.729579925537109
0.1489570140838623
So any doesn't seem to be faster than your original approach; here it was slightly slower. Building up the regex dynamically on every call is also slow. But if you can build the combined regex once upfront and reuse it, performance improves considerably: in this benchmark it took about 0.15 s versus 1.41 s for the original loop, roughly nine times faster. You can also adapt this benchmark to test some other options :)
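When adapting the benchmark, it is worth confirming that the candidates agree before comparing their speed. A sketch reusing the benchmark's patterns and strings; loop_match and combined_match are illustrative names, and passing callables to timeit.timeit replaces the setup string:

```python
import re
import timeit

patterns = [".*abc", "123.*", "ab.*", "foo.*bar", "11010.*", "1[^o]*"]
strings = ["asdabc", "123awd2", "abasdae23", "fooasdabar", "111",
           "11010100101", "xxxx", "eeeeee", "dddddddddddddd", "ffffff"]

compiled_patterns = [re.compile(p) for p in patterns]
combined = re.compile("|".join("(?:%s)" % p for p in patterns))


def loop_match(s):
    return any(p.match(s) for p in compiled_patterns)


def combined_match(s):
    return combined.match(s) is not None


# Both approaches must give the same answer for every test string.
loop_results = [loop_match(s) for s in strings]
combined_results = [combined_match(s) for s in strings]
assert loop_results == combined_results

# timeit accepts a zero-argument callable as well as a source string.
t_loop = timeit.timeit(lambda: [loop_match(s) for s in strings], number=1000)
t_combined = timeit.timeit(lambda: [combined_match(s) for s in strings], number=1000)
print("loop: %.4fs, combined: %.4fs" % (t_loop, t_combined))
```

Timings depend on the machine, the Python version, and the patterns involved, so no expected numbers are given here.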