我可以进一步提高这个正则表达式的性能吗? [英] Can I improve performance of this regular expression further

查看:126
本文介绍了我可以进一步提高这个正则表达式的性能吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试从线程转储文件中获取线程名称。
线程名称通常包含在每个线程转储的第一行的双引号中。
它可能看起来如下所示:

 THREAD1守护程序prio = 10 tid = 0x00007ff6a8007000 nid = 0xd4b6 runnable [0x00007ff7f8aa0000] 

或者大到如下:

 [STANDBY] ExecuteThread:队列中的'43':'weblogic.kernel.Default(自调整)'守护进程中的守护进程prio = 10 tid = 0x00007ff71803a000 nid = 0xd3e7。 wait()[0x00007ff7f8ae1000] 

我写的正则表达式很简单: (。*)。它将双引号内的所有内容作为一组捕获。然而,它会导致严重的回溯,因此需要执行大量步骤,如




  • 懒惰的




    • 占有量词解决方案,带有否定的字符类


      I am trying to fetch thread names from the thread dumps file. The thread names are usually contained within "double quotes" in the first line of each thread dump. It may look as simple as follows:

      "THREAD1" daemon prio=10 tid=0x00007ff6a8007000 nid=0xd4b6 runnable [0x00007ff7f8aa0000]
      

      Or as big as follows:

      "[STANDBY] ExecuteThread: '43' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=10 tid=0x00007ff71803a000 nid=0xd3e7 in Object.wait() [0x00007ff7f8ae1000]
      

      The regular expression I wrote is simple one: "(.*)". It captures everything inside double quotes as a group. However it causes heavy backtracking thus requiring a lot of steps, as can be seen here. Verbally we can explain this regex as "capture anything that is enclosed inside double quotes as a group"

      So I came up with another regex which performs the same: "([^\"])". Verbally we can describe this regex as "capture any number of non-double quote characters that are enclosed inside double quotes". I did not found any fast regex than this. It does not perform any backtracking and hence it requires minimum steps as can be seen here.

      I told this above to my colleague. He came up with yet another one: "(.*?)". I didnt get how it works. It performs considerable less backtracking than the first one but is a bit slower than the second one as can be seen here. However

      • I don't get why the backtracking stops early.
      • I understand ? is a quantifier which means once or not at all. However I dont understand how once or not at all is getting used here.
      • In fact I am not able to guess how can we describe this regex verbally.

      My colleague tried explaining me but I am still not able to understand it completely. Can anyone explain?

      解决方案

      Brief explanation and a solution

      The "(.*)" regex involves a lot of backtracking because it finds the first " and then grabs the whole string and backtracks looking for the " that is closest to the end of string. Since you have a quoted substring closer to the start, there's more backtracking than with "(.*?)" as this lazy quantifier *? makes the regex engine look for the closest " after the first " found.

      The negated character class solution "([^"]*)" is the best from the 3 because it does not have to grab everything, just all characters other than ". However, to stop any backtracking and make the expression ultimately efficient, you can use possessive quantifiers.

      If you need to match strings like " + no quotes here + ", use

      "([^"]*+)"
      

      or even you do not need to match the trailing quote in this situation:

      "([^"]*+)
      

      See regex demo

      In fact I am not able to guess how can we describe this regex verbally.

      The latter "([^"]*+) regex can be described as

      • " - find the first " symbol from the left of the string
      • ([^"]*+) - match and capture into Group 1 zero or more symbols other than ", as many as possible, and once the engine finds a double quote, the match is returned immediately, without backtracking.

      Quantifiers

      More information on quantifiers from Rexegg.com:

      A* Zero or more As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile)
      A*? Zero or more As, as few as needed to allow the overall pattern to match (lazy)
      A*+ Zero or more As, as many as possible (greedy), not giving up characters if the engine tries to backtrack (possessive)

      As you see, ? is not a separate quantifier, it is a part of another quantifier.

      I advise to read more about why Lazy Quantifiers are Expensive and that Negated Class Solution is really safe and fast to deal with your input string (where you just match a quote followed by non-quotes and then a final quote).

      Difference between .*?, .* and [^"]*+ quantifiers

      • Greedy "(.*)" solution works like this: checks each symbol from left to right looking for ", and once found grabs the whole string up to the end and checks each symbol if it is equal to ". Thus, in your input string, it backtracks 160 times.

      Since the next " is not far, the number of backtrack steps is much fewer than with greedy matching.

      • possessive quantifier solution with a negated character class "([^"]*+)" works like this: the engine finds the leftmost ", and then grabs all characters that are not " up to the first ". The negated character class [^"]*+ greedily matches zero or more characters that are not a double quote. Therefore, we are guaranteed that the dot-star will never jump over the first encountered ". This is a more direct and efficient way of matching between some delimiters. Note that in this solution, we can fully trust the * that quantifies the [^"]. Even though it is greedy, there is no risk that [^"] will match too much as it is mutually exclusive with the ". This is the contrast principle from the regex style guide [see source].

      Note that the possessive quantifier does not let the regex engine backtrack into the subexpression, once matched, the symbols between " become one hard block that cannot be "re-sorted" due to some "inconveniences" met by the regex engine, and it will be unable to shift any characters from and into this block of text.

      For the current expression, it does not make a big difference though.

      这篇关于我可以进一步提高这个正则表达式的性能吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆