Does the Java regex library optimize for any characters .*?

Question

I have a wrapper class for matching regular expressions. Obviously, you compile a regular expression into a Pattern like this.

Pattern pattern = Pattern.compile(regex);

But suppose I used a .* to specify any number of characters. So it's basically a wildcard.

Pattern pattern = Pattern.compile(".*");

Does the pattern optimize to always return true and not really calculate anything? Or should I have my wrapper implement that optimization? I am doing this because I could easily process hundreds of thousands of regex operations in a process. If a regex parameter is null I coalesce it to a .*
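The wrapper-side shortcut the question asks about could look roughly like the sketch below. The class name `WildcardAwareMatcher` and its methods are hypothetical, not part of `java.util.regex`; the sketch only assumes that a null regex means "match anything".

```java
import java.util.regex.Pattern;

/** Hypothetical wrapper: a null regex means "match anything", so no Pattern is built. */
final class WildcardAwareMatcher {
    // null when the caller supplied no regex; in that case every input matches.
    private final Pattern pattern;

    WildcardAwareMatcher(String regex) {
        // Compile once up front; skip compilation entirely for the "no filter" case.
        this.pattern = (regex == null) ? null : Pattern.compile(regex);
    }

    boolean matches(String input) {
        if (pattern == null) {
            return true; // short-circuit instead of running ".*"
        }
        return pattern.matcher(input).matches();
    }
}
```

One behavioral difference to keep in mind: without the `DOTALL` flag, `.` does not match line terminators, so `".*"` with `matches()` rejects multi-line input, while this shortcut accepts everything.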

Recommended answer

In your case, I could just use a possessive quantifier to avoid any backtracking:

.*+
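As a small illustration of what the possessive form changes, here is a sketch (the class and constant names are made up for this example). On its own, `.*+` accepts the same single-line inputs as `.*`; the difference shows up when something must match after it, because a possessive quantifier never gives characters back.

```java
import java.util.regex.Pattern;

final class PossessiveDemo {
    static final Pattern GREEDY = Pattern.compile(".*");
    static final Pattern POSSESSIVE = Pattern.compile(".*+");
    // .*+ swallows the whole input and never backtracks, so "abc" can no longer match.
    static final Pattern POSSESSIVE_THEN_LITERAL = Pattern.compile(".*+abc");
    static final Pattern GREEDY_THEN_LITERAL = Pattern.compile(".*abc");

    /** True if the pattern matches the entire input. */
    static boolean fullMatch(Pattern p, String input) {
        return p.matcher(input).matches();
    }
}
```

So the possessive quantifier is safe for a bare wildcard, but it should not simply be swapped in wherever `.*` is followed by more pattern.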

The Java pattern-matching engine has several optimizations at its disposal and can apply them automatically.

This is what Cristian Mocanu wrote about a case similar to .* in his Optimizing regular expressions in Java:


Java regex engine was not able to optimize the expression .*abc.*. I expected it would search for abc in the input string and report a failure very quickly, but it didn't. On the same input string, using String.indexOf("abc") was three times faster than my improved regular expression. It seems that the engine can optimize this expression only when the known string is right at its beginning or at a predetermined position inside it. For example, if I re-write the expression as .{100}abc.* the engine will match it more than ten times faster. Why? Because now the mandatory string abc is at a known position inside the string (there should be exactly one hundred characters before it).
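The two approaches compared in the quote can be placed side by side as in the sketch below. The class and method names are illustrative only, and no timings are claimed here; the point is just that a plain substring check gives the same answer as `.*abc.*` for single-line input.

```java
import java.util.regex.Pattern;

final class ContainsCheck {
    // Compiled once; equivalent to "input contains abc" only when the
    // input has no line terminators (no DOTALL flag is set).
    private static final Pattern CONTAINS_ABC = Pattern.compile(".*abc.*");

    static boolean viaRegex(String input) {
        return CONTAINS_ABC.matcher(input).matches();
    }

    static boolean viaIndexOf(String input) {
        // Plain substring search, no regex engine involved.
        return input.indexOf("abc") >= 0;
    }
}
```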

Some tips on optimizing Java regular expressions from the same source:


  • If the regular expression contains a string that must be present in the input string (or else the whole expression won't match), the engine can sometimes search for that string first and report a failure if it doesn't find it, without checking the entire regular expression.
  • Another very useful way to automatically optimize a regular expression is to have the engine check the length of the input string against the expected length according to the regular expression. For example, the expression \d{100} is internally optimized such that if the input string is not 100 characters in length, the engine will report a failure without evaluating the entire regular expression.
  • Don't hide mandatory strings inside groupings or alternations because the engine won't be able to recognize them. When possible, it is also helpful to specify the lengths of the input strings that you want to match.
  • If you will use a regular expression more than once in your program, be sure to compile the pattern using Pattern.compile() instead of the more direct Pattern.matches().
  • Also remember that you can re-use the Matcher object for different input strings by calling the method reset().
  • Beware of alternation. Regular expressions like (X|Y|Z) have a reputation for being slow, so watch out for them. First of all, the order of alternation counts, so place the more common options in the front so they can be matched faster. Also, try to extract common patterns; for example, instead of (abcd|abef) use ab(cd|ef).
  • Whenever you use a negated character class to match everything up to a given character, use a possessive quantifier: instead of [^a]*a use [^a]*+a.
  • Non-matching strings may cause your code to freeze more often than those that contain a match. Remember to always test your regular expressions using non-matching strings first!
  • Beware of the known bug #5050507 (the regex Pattern class throws a StackOverflowError). If you encounter this error, try to rewrite the regular expression or split it into several sub-expressions and run them separately. The latter technique can sometimes even increase performance.
  • Instead of lazy dot matching, use a tempered greedy token (e.g. (?:(?!something).)*) or the unrolling-the-loop technique (got downvoted for it today, no idea why).
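A few of the tips above (compile once, reuse the Matcher via reset(), possessive quantifier after a negated character class) can be combined in one short sketch. The `RegexTips` class and the key=value pattern are invented for illustration and are not from the quoted article.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexTips {
    // Compiled once and reused, instead of calling Pattern.matches() per input.
    // [^=]*+ is possessive: it takes everything up to the first '=' and never backtracks.
    private static final Pattern KEY_VALUE = Pattern.compile("[^=]*+=.*");

    /** Counts the inputs that look like "key=value", reusing a single Matcher. */
    static int countKeyValueLines(List<String> inputs) {
        Matcher m = KEY_VALUE.matcher("");
        int count = 0;
        for (String s : inputs) {
            // reset(CharSequence) points the existing Matcher at the new input.
            if (m.reset(s).matches()) {
                count++;
            }
        }
        return count;
    }
}
```

A Matcher is not thread-safe, so a reused instance like this one should stay local to one thread; the compiled Pattern can be shared freely.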


Unfortunately you can't rely on the engine to optimize your regular expressions all the time. In the above example, the regular expression is actually matched pretty fast, but in many cases the expression is too complex and the input string too large for the engine to optimize.

