Best practices for regex performance vs. sheer iteration
Question
I was wondering whether there are any general guidelines for when to use a regex versus "string".contains("anotherString") and/or other String API calls.
While the decision for .contains() above is trivial (why bother with a regex if you can do it in a single call?), real life brings more complex choices. For example, is it better to make two .contains() calls or to use a single regex?
My rule of thumb has been to always use a regex unless it can be replaced with a single API call. This keeps the code from bloating, but it is probably not so good for readability, especially if the regex tends to get big.
Another often-overlooked argument is performance. How do I know how many iterations (as in "Big O") a given regex requires? Would it be faster than sheer iteration? Somehow everybody assumes that once a regex looks shorter than 5 if statements, it must be quicker. But is that always the case? This is especially relevant if the regex cannot be pre-compiled in advance.
Recommended answer
The answer (as usual) is that it depends.
In your particular case, I guess the alternative would be to do the regex "this|that" and then do a find. This particular construct really pokes at regex's weaknesses. The "OR" in this case doesn't really know what the sub-patterns are trying to do and so can't easily optimize. It ends up doing the equivalent of (in pseudo code):
for( i = 0; i < stringLength; i++ ) {
    if( stringAt pos i starts with "this" )
        found!
    if( stringAt pos i starts with "that" )
        found!
}
There almost isn't a slower way to do it. In this case, two contains() calls will be much faster.
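To make the comparison concrete, here is a minimal Java sketch of the two approaches side by side. The class and method names (ContainsVsRegex, viaContains, viaRegex) are made up for illustration, and nothing here measures speed; it only shows that both forms answer the same question, so you can benchmark them on your own inputs.

```java
import java.util.regex.Pattern;

class ContainsVsRegex {
    // Compiled once and reused, so compilation cost is not paid on every call.
    private static final Pattern THIS_OR_THAT = Pattern.compile("this|that");

    // Two plain substring searches, one per literal.
    static boolean viaContains(String text) {
        return text.contains("this") || text.contains("that");
    }

    // One regex search using an alternation.
    static boolean viaRegex(String text) {
        return THIS_OR_THAT.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = {"say this", "or that", "neither"};
        for (String s : samples) {
            System.out.println(s + " -> " + viaContains(s) + " / " + viaRegex(s));
        }
    }
}
```

Both methods should agree on every input; which one is faster for your data is something to measure rather than assume.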
On the other hand, a full match on ".*this.*|.*that.*" may optimize better.
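A full match would look roughly like the sketch below. FullMatchSketch and fullMatch are hypothetical names, and the DOTALL flag is my own addition: without it, "." does not cross line breaks, so a multi-line input would fail the full match even though contains() would succeed.

```java
import java.util.regex.Pattern;

class FullMatchSketch {
    // DOTALL is an assumption added here so that '.' also matches line terminators.
    private static final Pattern FULL =
            Pattern.compile(".*this.*|.*that.*", Pattern.DOTALL);

    // matches() requires the pattern to cover the entire input, unlike find().
    static boolean fullMatch(String text) {
        return FULL.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("xx this yy"));
        System.out.println(fullMatch("line one\nthat"));
        System.out.println(fullMatch("nothing here"));
    }
}
```

Whether this form actually runs faster than find() on "this|that" depends on the regex engine and the input, so treat it as something to test.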
To me, regex should be used when the code to do otherwise is complicated or unwieldy. So if you want to find one of two or three strings in a target string then just use contains. But if you wanted to find words starting with 'A' or 'B' and ending in 'g'-'m'... then use regex.
And then you won't be so worried about a few cycles here and there.
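For the "starts with 'A' or 'B', ends in 'g'-'m'" case, a regex stays short where hand-written loops would not. The sketch below is one possible reading of that requirement: WordFinder and findWords are invented names, a "word" is assumed to be a run of \w characters, and the match is case-sensitive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordFinder {
    // \b = word boundary, [AB] = first letter, \w* = middle, [g-m] = last letter.
    private static final Pattern WORD = Pattern.compile("\\b[AB]\\w*[g-m]\\b");

    // Collects every word in the text that fits the pattern, in order.
    static List<String> findWords(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findWords("Along the Bank an Apple fell; Bell rang, bang"));
        // prints [Along, Bank, Bell]
    }
}
```

Doing the same with contains() and manual index arithmetic is possible, but it is exactly the kind of unwieldy code the regex replaces.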