在java中模式匹配的Stackoverflow [英] Stackoverflow in pattern matching in java

查看：132 发布时间：2019/1/2 10:52:15 java regex regex-greedy

本文介绍了在java中模式匹配的Stackoverflow的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我尝试根据双引号之间没有的空格分割一行。

I tried to split a line based on spaces not enclosed between double quotes.

我的正则表达式是

(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+

我的代码

Pattern regex          = Pattern.compile("(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+");
Matcher regexMatcher   = regex.matcher(line);
List<String> rule      = new ArrayList<String>();

while(regexMatcher.find())
    rule.add(regexMatcher.group());

输入失败的输入。

SecRule REQUEST_COOKIES |！REQUEST_COOKIES：/ __ utm / | REQUEST_COOKIES_NAMES | ARGS_NAMES | ARGS | XML：/ *（？i：\b（？:( ?: S（：T（：d（：？？？DEV（_pop | _samp））| R（：？？？_ TO_DATE | CMP））| U（：？？？b（：STR（：荷兰国际集团（_index））|（？：？DAT | TIM）E）| M）| E（：C（：？_ TO_TIME | OND）| ssion_user）| YS（？：tem_user |日期）|公顷（1 | 2 ？）| oundex | CHEMA | IG N |节奏| QRT）|我？（？：S（空| _（free_lock | ipv4_compat | ipv4_mapped |的IPv4 | IPv6的| NOT_NULL |没有|空| used_lock））| N（？：ET6 _（航标| ntoa）| S？（？：ERT | TR）|？terval）| F（？NULL））| U（：N（：压缩（？：？？ed_length）|？ix_timestamp |十六进制） | TC_（日期|时间|时间戳）| p（？：datexml |每）| UID（_short）|的情况下| SER）| L（：α，α（：CA（：？？？升（时间戳）| TE） | G（2 | 10）|？ad_file | WER）| AST（_day | _insert_id）|？E（:( ?:作为？| F）T | NGTH）|的情况下|修剪|垫| N）| T（？：IME（邮票| stampadd | stampdiff |的diff | _format | _to_sec）| O_（BASE64 |天|秒| N炭？）| R？（？：uncate | IM）|的）|米（？：一个（?:柯（？：_集|日期）| ster_pos_wait | X）| I（:( ?: crosecon）d | N（？：？？UTE））| O（？：第n个（名）| d）| D5）？ | R（？：ΔE（：p（：花边|吃）| lease_lock |节）| O（？：？w_count | UND）|一个（：体类| ND）|飞行|修剪|垫）| F（？：我（？：？场（_in_set）| nd_ in_set时）| rom_（的base64 |天| unixtime）| O（？：und_rows | RMAT）| loor）| A（：ES _（？：DE | EN）隐窝| S（？：？CII（STR）|？中） | DD（？：DAT |添）E |（？：共| b）S | tan2 | VG）| p？（：O（：？？sition | W（ER）？）| eriod_（添加|差异）| rocedure_analyse | assword | I）| b（：I（：？吨_（?:长度|计数| x或|和）| N（_to_num）？）| enchmark）| E（X（：p（？？：ort_set）|道（值））| NC（？：rypt | ODE）| LT）| v（？：一个（：R（：？？？_（?: SAM | PO）p | iance）|梅毒）|版为）| G（：R（：oup_conca | eates）T | et_（格式|锁））| O（:( ?: ld_passwo）RD | CT（et_length））|我们（？？？？：EK（天| ofyear）| ight_string）| N？ame_const | | ullif）|（rawton）十六进制（toraw）|曲（（：O（T_IN | W？）：？？？？ARTER | OTE）| （PG _）睡眠|一年（周）| d计算|的XMLType |小时）\W * \（| \b（:( ?: S（：？？？？？？elect\b（{ 1100} \b？（？：:(长度|？计数|顶部）\b {1100} \bfrom |。？from\b {1100} \bwhere）|。？。？* \b （：d（？：？ump\b * \bfrom | ata_type）|（？：_到（?: numbe | CHA）|研究所）R））| p _（？：SQLEXEC | sp_replwritetovarbin |则sp_help | addextendedproc | IS_SRVROLEMEMBER |准备| sp_password的|执行：| makewebtask | oacreate）| QL _（？：LONGVARCHAR |变体））| XP _（？：REG（（SQL）：？？重（？：movemultistring |广告）|删除（？：值|键）|枚举（：值|键）S | addmultistring |写）|结束| xp_servicecontrol | xp_ntsec_enumdomains | xp_terminate_process | E（：？？xecresultset | numdsn）| availablemedia | loginconfig | CMDSHELL |文件列表| dirtree | makecab | ntsec）| U（：nion\b {} 1100 \bselect | TL _（？：文件| HTTP？。？））| d（？：？b（：a_users | ms_java）| elete\b\W * \bfrom）| group\b * \bby\b {} 1100 \bhaving |开（？：？？行集| OWA_UTIL |查询）|负荷？\b\W * \bdata\b * \binfile |（？：3 N VARCHA | tbcreato）R | AUTONOMOUS_TRANSACTION）\b | I（？：？N（：to\b\\ \\W * \b？（？：转储|出来）？文件| sert\b\W * \binto | ner\b\W * \bjoin）\b |（：·F ？（？：？？\b\W * \（\W * \bbenchmark | null\b）| snull\b）\W * \（）| print\b\ ？W * \ \ @ @ | cast\b\W * \（）| C（:( ?: UR（：出租_（？：时间（：？？？？盖章）|日期|用户）|（：DAT | TIM）E）| H（？：？AR（:( ?: ACTER）_长度|集）| R）| IEL（：？？？？？荷兰国际集团）| AST | R32）\W * \（| O（:( ?: N（v（：ERT（：？？？？？？_ TZ））| C在？（？：_ WS）| nection_id）|（？：？？mpres）S | ercibility | alesce | T）\W * \（| llation\W * \（一））| d（:( ？一（：T（：E（：？？？？？（_（添加|格式|子））|差异）| ABASE）| Y（名称| ofmonth | ofweek | ofyear））|？E（:( ？：S_（DE | EN）cryp | FAUL）T | grees |代码）| UMP）\W * \（| bms_\w + \.\b）|（：; \W *？ \b（？：关机|降）| \ \ @ @版本）\b | \butl_inaddr\b | \bsys_context\b |'（？：？S（：qloledb |一）| msdasql | dbo）'））阶段：2，rev：'2'，ver：'OWASP_CRS / 2.2.9'，成熟度：'9'，准确度：'8'，capture，t：none，t：urlDecodeUni ，ctl：auditLogParts = + E，block，msg：'SQL Injection Attack'，id：'950001'，标签：'OWASP_CRS / WEB_ATTACK / SQL_INJECTION'，标签：'WASCTC / WASC-19'，标签：'OWASP_TOP_10 / A1 '，tag：'OWASP_AppSensor / CIE1'，标签：'PCI / 6.5.2'，logdata：'匹配数据：％{TX.0}在％{MATCHED_VAR_NAME}内找到：％{MATCHED_VAR}'，严重性：'2' ，SETVAR： 'tx.msg =％{rule.msg}'，SETVAR：tx.sql_injection_score = +％{tx.critical_anomaly_score}，SETVAR：tx.anomaly_score = +％{tx.critical_anomal y_score}，setvar：tx。％{rule.id} -OWASP_CRS / WEB_ATTACK / SQL_INJECTION - ％{matched_var_name} =％{tx.0}

当我在java中使用它时，某些行成功分离，但是某些行导致错误

When i used this in java, some lines are separated successfully, but some lines are causing errors

Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4235)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)

样本输入：

world \很漂亮但我看不到它

The "world \"is beautiful" but i "cannot see" it

预期输出：

The
"world \" beautiful"
but
i
"cannot see"
it

为什么会发生StackOverflowError？

在 Pattern 类的参考实现中（它带有Oracle的JRE，OpenJDK和许多其他JVM），贪婪和懒惰q当重复模式不重要时，使用递归¹实现uantifiers。因此，当输入字符串足够长时，您将遇到 StackOverflowError 。

Why does StackOverflowError occur?

In reference implementation of Pattern class (which comes with Oracle's JRE, OpenJDK, and a number of other JVMs), greedy and lazy quantifiers are implemented with recursion¹ when the repeated pattern is non-trivial. Therefore, you will run into StackOverflowError when the input string is long enough.

^{¹递归是一种快速但不可扩展的解决方案，允许在模式中进行回溯。更好的实现使用数据结构来存储回溯点
（它基本上将递归解决方案转换为带有堆栈的迭代解决方案）。}

^{¹ Recursion is a quick but not scalable solution to allow backtracking in the pattern. Better implementation uses a data structure to store backtracking points
(which basically converts recursive solution to iterative solution with stack).}

以下正则表达式应该有效：

The following regex should work:

"(?:\"(?:[^\"\\\\]++|\\\\.)*+\"|[^ \"]++)++"

好吧，正则表达式令人困惑的是2层转义：在Java字符串文字中转义并在正则表达式语法中转义。

Well, the regex is quite confusing with 2 layers of escaping: escaping in Java string literal and escaping in regex syntax.

打印字符串时的原始正则表达式。我的解释将基于原始正则表达式。

The raw regex when you print the string out. My explanation will be based on the raw regex.

(?:"(?:[^"\\]++|\\.)*+"|[^ "]++)++

解释

由于您只关心整个正则表达式匹配的内容，因此所有捕获组（模式）都已转为进入非捕获组（?: pattern）以提高效率。

Explanation

Since you only care about what the whole regex matches, all the capturing groups (pattern) has been turned into non-capturing group (?:pattern) for efficiency.

第一个替代 （？：[^\\] ++ | \\。）* +匹配带引号的字符串。

The first alternative "(?:[^"\\]++|\\.)*+" matches a quoted string.

第二个替代 [^] ++ 匹配不包含空格的字符序列和双引号。

The second alternative [^ "]++ matches a sequence of character that doesn't contain space and double quote ".

(?:
   "             # Double quote
   (?:
      [^"\\]++   # A sequence of characters that are not " and \
      |          # OR
      \\.        # Escape sequence: \ followed by any character (except line terminators)
   )*+           # Match 0 or more of the sequences above (allows empty string)
   "             # Double quote
   |
   [^ "]++
)++

由于正则表达式是为了不需要回溯，所有量词都被占有。因为 Pattern 类用循环实现占有量词，而不是像贪婪/懒惰量词一样递归， StackOverflowError 不会发生。

Since the regex is written so that there is no needs for backtracking, all quantifiers are made possessive. Since Pattern class implements possessive quantifier with a loop, instead of recursion as the case with greedy/lazy quantifiers, StackOverflowError will not occur.

我通过编写正则表达式来消除回溯的需要，使其与正确的匹配第一次尝试使用字符串：

I remove the need for backtracking by writing the regex so that it matches the correct string on first try:

由于 [^\\] 排除 \ ，我们无法从转义序列中窃取 \ ，或者窃取一个并搞乱结束报价，我们可以安全地向前推进而无需回溯。这解释了占有量词在这里 [^\\ \\\] ++ 。这里不需要指定量词，但我这样做是为了减少分支的工作。

Since [^"\\] excludes the \, there is no way we can "steal" a \ from an escaping sequence, or "steal" a " and mess up the closing quote, we can safely advance ahead without backtracking. This explains the possessive quantifier here [^"\\]++. There is no need to assign a quantifier here, but I do this to reduce the work on the branching.

因为 [^ \\] ++ 和 \\。无法窃取并搞砸了收盘价，我们可以在没有回溯的情况下向前推进。这解释了占有量词这里（？：[^\\] ++ | \\。）* +

Since both [^"\\]++ and \\. can't "steal" a " and mess up the closing quote, we can advance ahead without backtracking. This explains the possessive quantifier here (?:[^"\\]++|\\.)*+

[^] 无法启动带引号的字符串，也无法匹配空格（分隔符）。这就是我们可以使用占有量词的原因。

[^ "] can't start a quoted string, and it also can't match a space (delimiter). This is why we can use possessive quantifier on it.

因为（？：[^\\] ++ | \\。）* +和 [^] ++ 不能搞砸对方的比赛，我们可以使外部量词占有率最高。

Since "(?:[^"\\]++|\\.)*+" and [^ "]++ can't mess up the match of each other, we can make the outer most quantifier possessive.

首次尝试时正确匹配的正则表达式示例回溯后得到正确的结果 ^（[bcd] +：[ab] +）+ $ 对输入如 b：ab：a 。第一次迭代将匹配 b：ab ，这会导致第二次迭代失败，然后它回溯并重试，第一次迭代为 b：a 然后成功匹配整个字符串。

Example of a regex that doesn't match things correctly on first try and only get the correct result after backtracking would be ^([bcd]+:[ab]+)+$ against inputs such as b:ab:a. The first iteration will match b:ab, which cause the 2nd iteration to fail, then it backtracks and retry with the first iteration being b:a and then successfully match the whole string.

这篇关于在java中模式匹配的Stackoverflow的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

在java中模式匹配的Stackoverflow [英] Stackoverflow in pattern matching in java

问题描述

推荐答案

为什么会发生StackOverflowError？

Why does StackOverflowError occur?

解释

Explanation

相关文章

Java开发最新文章

热门教程

热门工具

登录关闭

在java中模式匹配的Stackoverflow [英] Stackoverflow in pattern matching in java

问题描述

推荐答案

为什么会发生StackOverflowError？

Why does StackOverflowError occur?

解释

Explanation

相关文章

Java开发最新文章

热门教程

热门工具

登录 关闭

登录关闭