String manipulation vs Regexps
Question
We are often told that Regexps are slow and should be avoided whenever possible.
However, taking into account the overhead of doing the string manipulation oneself (not talking about algorithm mistakes, which is a different matter), especially in PHP or Perl (maybe Java), where is the limit? In which cases can we consider string manipulation to be the better alternative? Which regexps are particularly CPU greedy?
For instance, for the following, in C++, Java, PHP or Perl, what would you recommend?
The regexps would probably be faster:
- s/abc/def/g or a ... while ((i = index("abc", $x)) >= 0) ... $y .= substr() ... based solution?
- s/(\d)+/N/g or a scanning algorithm
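To make the two alternatives above concrete, here is a minimal sketch of each pair. It is written in Python rather than Perl purely for illustration; the function names are hypothetical, and `\d` is restricted to ASCII digits (an assumption) so that the regexp and the hand-written scan agree.

```python
import re


def replace_with_regex(text: str) -> str:
    # Rough equivalent of s/abc/def/g
    return re.sub(r"abc", "def", text)


def replace_by_hand(text: str, old: str = "abc", new: str = "def") -> str:
    # index()/substr()-style loop: find each occurrence, copy the text
    # before it, append the replacement, and continue after the match.
    pieces = []
    start = 0
    while True:
        i = text.find(old, start)
        if i < 0:
            break
        pieces.append(text[start:i])
        pieces.append(new)
        start = i + len(old)
    pieces.append(text[start:])
    return "".join(pieces)


def digits_with_regex(text: str) -> str:
    # Rough equivalent of s/(\d)+/N/g, limited to ASCII digits (assumption)
    return re.sub(r"\d+", "N", text, flags=re.ASCII)


def digits_by_scanning(text: str) -> str:
    # Single pass: emit one "N" per run of ASCII digits, copy everything else.
    out = []
    in_run = False
    for ch in text:
        if "0" <= ch <= "9":
            if not in_run:
                out.append("N")
                in_run = True
        else:
            out.append(ch)
            in_run = False
    return "".join(out)
```

Both versions of each pair produce the same output on plain inputs, so they can be timed against each other (for example with `timeit`) on representative data before deciding which one to keep.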
But what about
- an email validation regexp?
- s/((0|\w)+?[xy]*[^xy]){2,7}/u/g
Wouldn't a handmade, specific algorithm be faster (though longer to write)?
Edit
The point of the question is to determine which kinds of regexp would be better rewritten, for a given problem, as specific string manipulation.
Edit 2
A common implementation is the Perl regexp engine. In Perl, for instance (this requires knowing how regexps are implemented), what kind of regexp is to be avoided because the implementation will make the processing lengthy and inefficient? It may not be a complex regexp...
Edit, July 2011 (based on comments)
I'm not saying all regexps are slow. Some particular regexp patterns are known to be slow, because of the processing they require and because of how they are implemented.
In recent Perl / PHP implementations, for instance, what is known to be rather slow and should be avoided?
The answer is expected from people who have already done their own research (profiler...) and who are able to provide general guidelines about what is recommended and what is to be avoided.
Recommended answer
A nice feature of manipulating text with regular expressions is that patterns are high-level and declarative. This leaves the implementation considerable room for optimization such as factoring out the longest common prefix or using Boyer-Moore for static strings. Concise notation makes for quicker reading by experts. I understand immediately what
if (s/^(.)//) {
...
}
is doing, and substr($_, 0, 1) = "" looks noisy in comparison.
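As a rough illustration of that comparison, the sketch below writes both forms in Python rather than Perl (the function names are hypothetical). It also shows one place where the two are not strictly interchangeable: without a dot-matches-newline flag, `.` does not match a newline, so the pattern leaves a string that starts with a newline untouched, while the slice removes the first character unconditionally.

```python
import re

# Counterpart of the pattern in s/^(.)//
FIRST_CHAR = re.compile(r"^(.)")


def drop_first_with_regex(s: str):
    # Returns (remaining_string, removed_char), or (s, None) if nothing matched.
    m = FIRST_CHAR.match(s)
    if m is None:
        return s, None
    return s[m.end():], m.group(1)


def drop_first_by_slicing(s: str):
    # Hand-written counterpart: slice off the first character if there is one.
    if not s:
        return s, None
    return s[1:], s[0]
```

For ordinary input such as "hello", both return ("ello", "h"); they differ only on a leading newline.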
Rather than the lower bound, the important consideration for regular expressions is the upper bound. It's a powerful tool, so people believe it's capable of correctly extracting tokens from XML, email addresses, or C++ programs and don't realize that an even more powerful tool such as a parser is necessary.
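A small sketch of that upper bound, using nested parentheses as a stand-in for nested markup. It is in Python for illustration, and the names are hypothetical. A plain pattern without recursion extensions only sees the innermost group and stays silent on unbalanced input, while a short stack-based scanner (a very small parser) handles both.

```python
import re

# Matches a parenthesised group that contains no further parentheses.
INNERMOST = re.compile(r"\(([^()]*)\)")


def groups_with_regex(text: str):
    # Finds only innermost groups; outer groups and imbalance go unnoticed.
    return INNERMOST.findall(text)


def groups_with_scanner(text: str):
    # Stack-based scan: returns the content of every balanced group,
    # innermost first, and raises ValueError on unbalanced input.
    stack = []
    groups = []
    for pos, ch in enumerate(text):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            start = stack.pop()
            groups.append(text[start + 1:pos])
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return groups
```

On "f(a(b)c)" the pattern reports only "b", whereas the scanner reports "b" and "a(b)c"; on "(a" the pattern returns an empty list and the scanner raises an error.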