在java中模式匹配的Stackoverflow [英] Stackoverflow in pattern matching in java
问题描述
我尝试根据双引号之间没有的空格分割一行。
I tried to split a line based on spaces not enclosed between double quotes.
我的正则表达式是
(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+
我的代码
Pattern regex = Pattern.compile("(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+");
Matcher regexMatcher = regex.matcher(line);
List<String> rule = new ArrayList<String>();
while(regexMatcher.find())
rule.add(regexMatcher.group());
输入失败的输入。
SecRule REQUEST_COOKIES |!REQUEST_COOKIES:/ __ utm / | REQUEST_COOKIES_NAMES | ARGS_NAMES | ARGS | XML:/ *(?i:\b( ?:( ?: S(:T(:d(:???DEV(_pop | _samp))| R(:???_ TO_DATE | CMP))| U(:???b(:STR(:荷兰国际集团(_index))|(?:?DAT | TIM)E)| M)| E(:C(:?_ TO_TIME | OND)| ssion_user)| YS(?:tem_user |日期)|公顷(1 | 2 ?)| oundex | CHEMA | IG N |节奏| QRT)|我?(?:S(空| _(free_lock | ipv4_compat | ipv4_mapped |的IPv4 | IPv6的| NOT_NULL |没有|空| used_lock))| N(? :ET6 _(航标| ntoa)| S?(?:ERT | TR)|?terval)| F(?NULL))| U(:N(:压缩(?:??ed_length)|?ix_timestamp |十六进制) | TC_(日期|时间|时间戳)| p(?:datexml |每)| UID(_short)|的情况下| SER)| L(:α,α(:CA(:???升(时间戳)| TE) | G(2 | 10)|?ad_file | WER)| AST(_day | _insert_id)|?E(:( ?:作为?| F)T | NGTH)|的情况下|修剪|垫| N)| T(? :IME(邮票| stampadd | stampdiff |的diff | _format | _to_sec)| O_(BASE64 |天|秒| N炭?)| R?(?:uncate | IM)|的)|米(?:一个(?:柯(?:_集|日期)| ster_pos_wait | X)| I(:( ?: crosecon)d | N(?:??UTE))| O(?:第n个(名)| d)| D5)? | R(?:ΔE(:p(:花边|吃)| lease_lock |节)| O(?:?w_count | UND)|一个(:体类| ND)|飞行|修剪|垫)| F( ?:我(?:?场(_in_set)| nd_ in_set时)| rom_(的base64 |天| unixtime)| O(?:und_rows | RMAT)| loor)| A(:ES _(?:DE | EN)隐窝| S(?:?CII(STR)|?中) | DD(?:DAT |添)E |(?:共| b)S | tan2 | VG)| p?(:O(:??sition | W(ER)?)| eriod_(添加|差异)| rocedure_analyse | assword | I)| b(:I(:?吨_(?:长度|计数| x或|和)| N(_to_num)?)| enchmark)| E(X(:p(? ?:ort_set)|道(值))| NC(?:rypt | ODE)| LT)| v(?:一个(:R(:???_(?: SAM | PO)p | iance)|梅毒)|版为)| G(:R(:oup_conca | eates)T | et_(格式|锁))| O(:( ?: ld_passwo)RD | CT(et_length))|我们(???? :EK(天| ofyear)| ight_string)| N?ame_const | | ullif)|(rawton)十六进制(toraw)|曲((:O(T_IN | W?):????ARTER | OTE)| (PG _)睡眠|一年(周)| d计算|的XMLType |小时)\W * \(| \b(:( ?: S(:??????elect\b({ 1100} \b?(?::(长度|?计数|顶部)\b {1100} \bfrom |。?from\b {1100} \bwhere)|。?。?* \b (:d(?:?ump\b * \bfrom | ata_type)|(?:_到(?: numbe | CHA)|研究所)R))| p _(?:SQLEXEC | sp_replwritetovarbin |则sp_help | addextendedproc | IS_SRVROLEMEMBER |准备| sp_password的|执行:| makewebtask | oacreate)| QL _(?:LONGVARCHAR |变体))| XP _(?:REG((SQL):??重( ?:movemultistring |广告)|删除(?:值|键)|枚举(:值|键)S | addmultistring |写)|结束| xp_servicecontrol | xp_ntsec_enumdomains | xp_terminate_process | E(:??xecresultset | numdsn)| availablemedia | loginconfig | CMDSHELL |文件列表| dirtree | makecab | ntsec)| U(:nion\b {} 1100 \bselect | TL _(?:文件| HTTP?。?))| d(?:?b(:a_users | ms_java)| elete\b\W * \bfrom)| group\b * \bby\b {} 1100 \bhaving |开(?:??行集| OWA_UTIL |查询)|负荷?\b\W * \bdata\b * \binfile |(?:3 N VARCHA | tbcreato)R | AUTONOMOUS_TRANSACTION)\b | I(?:?N(:to\b\\ \\W * \b?(?:转储|出来)?文件| sert\b\W * \binto | ner\b\W * \bjoin)\b |(:·F ?(?:??\b\W * \(\W * \bbenchmark | null\b)| snull\b)\W * \()| print\b\ ?W * \ \ @ @ | cast\b\W * \()| C(:( ?: UR(:出租_(?:时间(:????盖章)|日期|用户)|(:DAT | TIM)E)| H(?:?AR(:( ?: ACTER)_长度|集)| R)| IEL(:?????荷兰国际集团)| AST | R32)\W * \(| O(:( ?: N(v(:ERT(:??????_ TZ))| C在?(?:_ WS)| nection_id)|(?:??mpres)S | ercibility | alesce | T)\W * \(| llation\W * \(一))| d(:( ?一(:T(:E(:?????(_(添加|格式|子))|差异)| ABASE)| Y(名称| ofmonth | ofweek | ofyear))|?E(:( ?:S_(DE | EN)cryp | FAUL)T | grees |代码)| UMP)\W * \(| bms_\w + \.\b)|(:; \W *? \b(?:关机|降)| \ \ @ @版本)\b | \butl_inaddr\b | \bsys_context\b |'(?:?S(:qloledb |一)| msdasql | dbo)'))阶段:2,rev:'2',ver:'OWASP_CRS / 2.2.9',成熟度:'9',准确度:'8',capture,t:none,t:urlDecodeUni ,ctl:auditLogParts = + E,block,msg:'SQL Injection Attack',id:'950001',标签:'OWASP_CRS / WEB_ATTACK / SQL_INJECTION',标签:'WASCTC / WASC-19',标签:'OWASP_TOP_10 / A1 ',tag:'OWASP_AppSensor / CIE1',标签:'PCI / 6.5.2',logdata:'匹配数据:%{TX.0}在%{MATCHED_VAR_NAME}内找到:%{MATCHED_VAR}',严重性:'2' ,SETVAR: 'tx.msg =%{rule.msg}',SETVAR:tx.sql_injection_score = +%{tx.critical_anomaly_score},SETVAR:tx.anomaly_score = +%{tx.critical_anomal y_score},setvar:tx。%{rule.id} -OWASP_CRS / WEB_ATTACK / SQL_INJECTION - %{matched_var_name} =%{tx.0}
当我在java中使用它时,某些行成功分离,但是某些行导致错误
When i used this in java, some lines are separated successfully, but some lines are causing errors
Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4235)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
样本输入:
world \很漂亮但我看不到它
The "world \"is beautiful" but i "cannot see" it
预期输出:
The
"world \" beautiful"
but
i
"cannot see"
it
推荐答案
为什么会发生StackOverflowError?
在 Pattern
类的参考实现中(它带有Oracle的JRE,OpenJDK和许多其他JVM),贪婪和懒惰q当重复模式不重要时,使用递归 1 实现uantifiers。因此,当输入字符串足够长时,您将遇到 StackOverflowError
。
Why does StackOverflowError occur?
In reference implementation of Pattern
class (which comes with Oracle's JRE, OpenJDK, and a number of other JVMs), greedy and lazy quantifiers are implemented with recursion1 when the repeated pattern is non-trivial. Therefore, you will run into StackOverflowError
when the input string is long enough.
1 递归是一种快速但不可扩展的解决方案,允许在模式中进行回溯。更好的实现使用数据结构来存储回溯点
(它基本上将递归解决方案转换为带有堆栈的迭代解决方案)。
1 Recursion is a quick but not scalable solution to allow backtracking in the pattern. Better implementation uses a data structure to store backtracking points (which basically converts recursive solution to iterative solution with stack).
以下正则表达式应该有效:
The following regex should work:
"(?:\"(?:[^\"\\\\]++|\\\\.)*+\"|[^ \"]++)++"
好吧,正则表达式令人困惑的是2层转义:在Java字符串文字中转义并在正则表达式语法中转义。
Well, the regex is quite confusing with 2 layers of escaping: escaping in Java string literal and escaping in regex syntax.
打印字符串时的原始正则表达式。我的解释将基于原始正则表达式。
The raw regex when you print the string out. My explanation will be based on the raw regex.
(?:"(?:[^"\\]++|\\.)*+"|[^ "]++)++
解释
由于您只关心整个正则表达式匹配的内容,因此所有捕获组(模式)
都已转为进入非捕获组(?: pattern)
以提高效率。
Explanation
Since you only care about what the whole regex matches, all the capturing groups (pattern)
has been turned into non-capturing group (?:pattern)
for efficiency.
第一个替代 (?:[^\\] ++ | \\。)* +
匹配带引号的字符串。
The first alternative "(?:[^"\\]++|\\.)*+"
matches a quoted string.
第二个替代 [^] ++
匹配不包含空格的字符序列和双引号。
The second alternative [^ "]++
matches a sequence of character that doesn't contain space and double quote "
.
(?:
" # Double quote
(?:
[^"\\]++ # A sequence of characters that are not " and \
| # OR
\\. # Escape sequence: \ followed by any character (except line terminators)
)*+ # Match 0 or more of the sequences above (allows empty string)
" # Double quote
|
[^ "]++
)++
由于正则表达式是为了不需要回溯,所有量词都被占有。因为 Pattern
类用循环实现占有量词,而不是像贪婪/懒惰量词一样递归, StackOverflowError
不会发生。
Since the regex is written so that there is no needs for backtracking, all quantifiers are made possessive. Since Pattern
class implements possessive quantifier with a loop, instead of recursion as the case with greedy/lazy quantifiers, StackOverflowError
will not occur.
我通过编写正则表达式来消除回溯的需要,使其与正确的匹配第一次尝试使用字符串:
I remove the need for backtracking by writing the regex so that it matches the correct string on first try:
-
由于
[^\\]
排除\
,我们无法从转义序列中窃取\
,或者窃取一个并搞乱结束报价,我们可以安全地向前推进而无需回溯。这解释了占有量词在这里
[^\\ \\\] ++
。这里不需要指定量词,但我这样做是为了减少分支的工作。
Since
[^"\\]
excludes the\
, there is no way we can "steal" a\
from an escaping sequence, or "steal" a"
and mess up the closing quote, we can safely advance ahead without backtracking. This explains the possessive quantifier here[^"\\]++
. There is no need to assign a quantifier here, but I do this to reduce the work on the branching.
因为 [^ \\] ++
和 \\。
无法窃取并搞砸了收盘价,我们可以在没有回溯的情况下向前推进。这解释了占有量词这里
(?:[^\\] ++ | \\。)* +
Since both [^"\\]++
and \\.
can't "steal" a "
and mess up the closing quote, we can advance ahead without backtracking. This explains the possessive quantifier here (?:[^"\\]++|\\.)*+
[^]
无法启动带引号的字符串,也无法匹配空格(分隔符)。这就是我们可以使用占有量词的原因。
[^ "]
can't start a quoted string, and it also can't match a space (delimiter). This is why we can use possessive quantifier on it.
因为(?:[^\\] ++ | \\。)* +
和 [^] ++
不能搞砸对方的比赛,我们可以使外部量词占有率最高。
Since "(?:[^"\\]++|\\.)*+"
and [^ "]++
can't mess up the match of each other, we can make the outer most quantifier possessive.
首次尝试时正确匹配的正则表达式示例回溯后得到正确的结果 ^([bcd] +:[ab] +)+ $
对输入如 b:ab:a
。第一次迭代将匹配 b:ab
,这会导致第二次迭代失败,然后它回溯并重试,第一次迭代为 b:a
然后成功匹配整个字符串。
Example of a regex that doesn't match things correctly on first try and only get the correct result after backtracking would be ^([bcd]+:[ab]+)+$
against inputs such as b:ab:a
. The first iteration will match b:ab
, which cause the 2nd iteration to fail, then it backtracks and retry with the first iteration being b:a
and then successfully match the whole string.
这篇关于在java中模式匹配的Stackoverflow的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!