在java中模式匹配的Stackoverflow [英] Stackoverflow in pattern matching in java

查看:132
本文介绍了在java中模式匹配的Stackoverflow的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我尝试根据双引号之间没有的空格分割一行。

I tried to split a line based on spaces not enclosed between double quotes.

我的正则表达式是

(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+

我的代码

Pattern regex          = Pattern.compile("(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+");
Matcher regexMatcher   = regex.matcher(line);
List<String> rule      = new ArrayList<String>();

while(regexMatcher.find())
    rule.add(regexMatcher.group());

输入失败的输入。

SecRule REQUEST_COOKIES |!REQUEST_COOKIES:/ __ utm / | REQUEST_COOKIES_NAMES | ARGS_NAMES | ARGS | XML:/ *(?i:\b( ?:( ?: S(:T(:d(:???DEV(_pop | _samp))| R(:???_ TO_DATE | CMP))| U(:?​​??b(:STR(:荷兰国际集团(_index))|(?:?DAT | TIM)E)| M)| E(:C(:?_ TO_TIME | OND)| ssion_user)| YS(?:tem_user |日期)|公顷(1 | 2 ?)| oundex | CHEMA | IG N |节奏| QRT)|我?(?:S(空| _(free_lock | ipv4_compat | ipv4_mapped |的IPv4 | IPv6的| NOT_NULL |没有|空| used_lock))| N(? :ET6 _(航标| ntoa)| S?(?:ERT | TR)|?terval)| F(?NULL))| U(:N(:压缩(?:??ed_length)|?ix_timestamp |十六进制) | TC_(日期|时间|时间戳)| p(?:datexml |每)| UID(_short)|的情况下| SER)| L(:α,α(:CA(:???升(时间戳)| TE) | G(2 | 10)|?ad_file | WER)| AST(_day | _insert_id)|?E(:( ?:作为?| F)T | NGTH)|的情况下|修剪|垫| N)| T(? :IME(邮票| stampadd | stampdiff |的diff | _format | _to_sec)| O_(BASE64 |天|秒| N炭?)| R?(?:uncate | IM)|的)|米(?:一个(?:柯(?:_集|日期)| ster_pos_wait | X)| I(:( ?: crosecon)d | N(?:??UTE))| O(?:第n个(名)| d)| D5)? | R(?:ΔE(:p(:花边|吃)| lease_lock |节)| O(?:?w_count | UND)|一个(:体类| ND)|飞行|修剪|垫)| F( ?:我(?:?场(_in_set)| nd_ in_set时)| rom_(的base64 |天| unixtime)| O(?:und_rows | RMAT)| loor)| A(:ES _(?:DE | EN)隐窝| S(?:?CII(STR)|?中) | DD(?:DAT |添)E |(?:共| b)S | tan2 | VG)| p?(:O(:??sition | W(ER)?)| eriod_(添加|差异)| rocedure_analyse | assword | I)| b(:I(:?吨_(?:长度|计数| x或|和)| N(_to_num)?)| enchmark)| E(X(:p(? ?:ort_set)|道(值))| NC(?:rypt | ODE)| LT)| v(?:一个(:R(:???_(?: SAM | PO)p | iance)|梅毒)|版为)| G(:R(:oup_conca | eates)T | et_(格式|锁))| O(:( ?: ld_passwo)RD | CT(et_length))|我们(???? :EK(天| ofyear)| ight_string)| N?ame_const | | ullif)|(rawton)十六进制(toraw)|曲((:O(T_IN | W?):????ARTER | OTE)| (PG _)睡眠|一年(周)| d计算|的XMLType |小时)\W * \(| \b(:( ?: S(:??????elect\b({ 1100} \b?(?::(长度|?计数|顶部)\b {1100} \bfrom |。?from\b {1100} \bwhere)|。?。?* \b (:d(?:?ump\b * \bfrom | ata_type)|(?:_到(?: numbe | CHA)|研究所)R))| p _(?:SQLEXEC | sp_replwritetovarbin |则sp_help | addextendedproc | IS_SRVROLEMEMBER |准备| sp_password的|执行:| makewebtask | oacreate)| QL _(?:LONGVARCHAR |变体))| XP _(?:REG((SQL):??重( ?:movemultistring |广告)|删除(?:值|键)|枚举(:值|键)S | addmultistring |写)|结束| xp_servicecontrol | xp_ntsec_enumdomains | xp_terminate_process | E(:??xecresultset | numdsn)| availablemedia | loginconfig | CMDSHELL |文件列表| dirtree | makecab | ntsec)| U(:nion\b {} 1100 \bselect | TL _(?:文件| HTTP?。?))| d(?:?b(:a_users | ms_java)| elete\b\W * \bfrom)| group\b * \bby\b {} 1100 \bhaving |开(?:??行集| OWA_UTIL |查询)|负荷?\b\W * \bdata\b * \binfile |(?:3 N VARCHA | tbcreato)R | AUTONOMOUS_TRANSACTION)\b | I(?:?N(:to\b\\ \\W * \b?(?:转储|出来)?文件| sert\b\W * \binto | ner\b\W * \bjoin)\b |(:·F ?(?:??\b\W * \(\W * \bbenchmark | null\b)| snull\b)\W * \()| print\b\ ?W * \ \ @ @ | cas​​t\b\W * \()| C(:( ?: UR(:出租_(?:时间(:????盖章)|日期|用户)|(:DAT | TIM)E)| H(?:?AR(:( ?: ACTER)_长度|集)| R)| IEL(:?????荷兰国际集团)| AST | R32)\W * \(| O(:( ?: N(v(:ERT(:??????_ TZ))| C在?(?:_ WS)| nection_id)|(?:??mpres)S | ercibility | alesce | T)\W * \(| llation\W * \(一))| d(:( ?一(:T(:E(:?????(_(添加|格式|子))|差异)| ABASE)| Y(名称| ofmonth | ofweek | ofyear))|?E(:( ?:S_(DE | EN)cryp | FAUL)T | grees |代码)| UMP)\W * \(| bms_\w + \.\b)|(:; \W *? \b(?:关机|降)| \ \ @ @版本)\b | \butl_inaddr\b | \bsys_context\b |'(?:?S(:qloledb |一)| msdasql | dbo)'))阶段:2,rev:'2',ver:'OWASP_CRS / 2.2.9',成熟度:'9',准确度:'8',capture,t:none,t:urlDecodeUni ,ctl:auditLogParts = + E,block,msg:'SQL Injection Attack',id:'950001',标签:'OWASP_CRS / WEB_ATTACK / SQL_INJECTION',标签:'WASCTC / WASC-19',标签:'OWASP_TOP_10 / A1 ',tag:'OWASP_AppSensor / CIE1',标签:'PCI / 6.5.2',logdata:'匹配数据:%{TX.0}在%{MATCHED_VAR_NAME}内找到:%{MATCHED_VAR}',严重性:'2' ,SETVAR: 'tx.msg =%{rule.msg}',SETVAR:tx.sql_injection_score = +%{tx.critical_anomaly_score},SETVAR:tx.anomaly_score = +%{tx.critical_anomal y_score},setvar:tx。%{rule.id} -OWASP_CRS / WEB_ATTACK / SQL_INJECTION - %{matched_var_name} =%{tx.0}

当我在java中使用它时,某些行成功分离,但是某些行导致错误

When i used this in java, some lines are separated successfully, but some lines are causing errors

Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4235)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)

样本输入:


world \很漂亮但我看不到它

The "world \"is beautiful" but i "cannot see" it

预期输出:

The
"world \" beautiful"
but
i
"cannot see"
it


推荐答案

为什么会发生StackOverflowError?



Pattern 类的参考实现中(它带有Oracle的JRE,OpenJDK和许多其他JVM),贪婪和懒惰q当重复模式不重要时,使用递归 1 实现uantifiers。因此,当输入字符串足够长时,您将遇到 StackOverflowError

Why does StackOverflowError occur?

In reference implementation of Pattern class (which comes with Oracle's JRE, OpenJDK, and a number of other JVMs), greedy and lazy quantifiers are implemented with recursion1 when the repeated pattern is non-trivial. Therefore, you will run into StackOverflowError when the input string is long enough.

1 递归是一种快速但不可扩展的解决方案,允许在模式中进行回溯。更好的实现使用数据结构来存储回溯点
(它基本上将递归解决方案转换为带有堆栈的迭代解决方案)。

1 Recursion is a quick but not scalable solution to allow backtracking in the pattern. Better implementation uses a data structure to store backtracking points (which basically converts recursive solution to iterative solution with stack).

以下正则表达式应该有效:

The following regex should work:

"(?:\"(?:[^\"\\\\]++|\\\\.)*+\"|[^ \"]++)++"

好吧,正则表达式令人困惑的是2层转义:在Java字符串文字中转义并在正则表达式语法中转义。

Well, the regex is quite confusing with 2 layers of escaping: escaping in Java string literal and escaping in regex syntax.

打印字符串时的原始正则表达式。我的解释将基于原始正则表达式。

The raw regex when you print the string out. My explanation will be based on the raw regex.

(?:"(?:[^"\\]++|\\.)*+"|[^ "]++)++



解释



由于您只关心整个正则表达式匹配的内容,因此所有捕获组(模式)都已转为进入非捕获组(?: pattern)以提高效率。

Explanation

Since you only care about what the whole regex matches, all the capturing groups (pattern) has been turned into non-capturing group (?:pattern) for efficiency.

第一个替代 (?:[^\\] ++ | \\。)* +匹配带引号的字符串。

The first alternative "(?:[^"\\]++|\\.)*+" matches a quoted string.

第二个替代 [^] ++ 匹配不包含空格的字符序列和双引号

The second alternative [^ "]++ matches a sequence of character that doesn't contain space and double quote ".

(?:
   "             # Double quote
   (?:
      [^"\\]++   # A sequence of characters that are not " and \
      |          # OR
      \\.        # Escape sequence: \ followed by any character (except line terminators)
   )*+           # Match 0 or more of the sequences above (allows empty string)
   "             # Double quote
   |
   [^ "]++
)++

由于正则表达式是为了不需要回溯,所有量词都被占有。因为 Pattern 类用循环实现占有量词,而不是像贪婪/懒惰量词一样递归, StackOverflowError 不会发生。

Since the regex is written so that there is no needs for backtracking, all quantifiers are made possessive. Since Pattern class implements possessive quantifier with a loop, instead of recursion as the case with greedy/lazy quantifiers, StackOverflowError will not occur.

我通过编写正则表达式来消除回溯的需要,使其与正确的匹配第一次尝试使用字符串:

I remove the need for backtracking by writing the regex so that it matches the correct string on first try:


  1. 由于 [^\\] 排除 \ ,我们无法从转义序列中窃取 \ ,或者窃取一个并搞乱结束报价,我们可以安全地向前推进而无需回溯。这解释了占有量词在这里 [^\\ \\\] ++ 。这里不需要指定量词,但我这样做是为了减少分支的工作。

  1. Since [^"\\] excludes the \, there is no way we can "steal" a \ from an escaping sequence, or "steal" a " and mess up the closing quote, we can safely advance ahead without backtracking. This explains the possessive quantifier here [^"\\]++. There is no need to assign a quantifier here, but I do this to reduce the work on the branching.

因为 [^ \\] ++ \\。无法窃取并搞砸了收盘价,我们可以在没有回溯的情况下向前推进。这解释了占有量词这里(?:[^\\] ++ | \\。)* +

Since both [^"\\]++ and \\. can't "steal" a " and mess up the closing quote, we can advance ahead without backtracking. This explains the possessive quantifier here (?:[^"\\]++|\\.)*+

[^] 无法启动带引号的字符串,也无法匹配空格(分隔符)。这就是我们可以使用占有量词的原因。

[^ "] can't start a quoted string, and it also can't match a space (delimiter). This is why we can use possessive quantifier on it.

因为(?:[^\\] ++ | \\。)* + [^] ++ 不能搞砸对方的比赛,我们可以使外部量词占有率最高。

Since "(?:[^"\\]++|\\.)*+" and [^ "]++ can't mess up the match of each other, we can make the outer most quantifier possessive.

首次尝试时正确匹配的正则表达式示例回溯后得到正确的结果 ^([bcd] +:[ab] +)+ $ 对输入如 b:ab:a 。第一次迭代将匹配 b:ab ,这会导致第二次迭代失败,然后它回溯并重试,第一次迭代为 b:a 然后成功匹配整个字符串。

Example of a regex that doesn't match things correctly on first try and only get the correct result after backtracking would be ^([bcd]+:[ab]+)+$ against inputs such as b:ab:a. The first iteration will match b:ab, which cause the 2nd iteration to fail, then it backtracks and retry with the first iteration being b:a and then successfully match the whole string.

这篇关于在java中模式匹配的Stackoverflow的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆