Need to prevent PHP regex segfault

Problem description

Why does the following segfault, and how can I prevent it?

<?php

$str = ' <fieldset> <label for="go-to">Go to: </label>  ' 
       . str_repeat(' ', 10000) 
       . '<input type="submit" value="Go" /> </fieldset> </form>';

preg_match_all("@
</?(?![bisa]\b)(?!em\b)[^>]*> # starting tag, must not be one of several inline tags
(?:[^<]|</?(?:(?:[bisau]|em|strong|sup)\b)[^>]*>)* #allow text and some inline tags
[\?\!\.]+
@ix", $str, $matches);

?>

I believe it's causing a .... wait for it .... stack overflow.

The above is a simplified version of the pattern that demonstrates the problem. A more complete version:

@
</?(?![bisa]\b)(?!em\b)[^>]*> # starting tag, must not be one of several inline tags
(?:[^<]|</?(?:(?:[bisau]|em|strong|sup)\b)[^>]*>)* # continue, allow text content and some inline tags

# normal sentence ending
[\?\!\.]+ # valid ending characters -- note ellipses allowed
(?<!\b[ap]m\.)(?<!\b[ap]\.m\.)(?<!digg this\!)(?<!Stumble This\!) # disallow some  false positives that we don't care about
\s*
(?:&apos;|&\#0*34;|'|&lsquo;)?\s* # closing single quotes, in the unusual case like "he said: 'go away'".
(?:"|&quot;|&\#0*34;|&\#x0*22;|&rdquo;|&\#0*8221;|&\#x0*201D;|''|``|\xe2\x80\x9d|&\#0*148;|&\#x0*94;|\x94|\))?\s* # followed by any kind of close-quote char
(?=\<) # should be followed by a tag.
@ix

The purpose is to find HTML blocks that end at what looks like a valid English sentence ending. I have found this method very good at telling 'content' text (like an article body) apart from 'layout' text (such as navigational elements). However, it sometimes blows up when there is a vast amount of whitespace between tags.
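
As a small illustration of that intent, the simplified pattern can be run against a short, well-behaved input. The sample HTML and the variable names below are invented for this sketch; only the pattern comes from the question.

```php
<?php
// Simplified pattern from the question (x = extended, i = case-insensitive).
$pattern = '@
</?(?![bisa]\b)(?!em\b)[^>]*>   # starting tag, must not be one of several inline tags
(?:[^<]|</?(?:(?:[bisau]|em|strong|sup)\b)[^>]*>)*   # text and some inline tags
[\?\!\.]+
@ix';

// Hypothetical sample: one "content" block followed by one "layout" block.
$html = '<p>This is an <b>article</b> body.</p><ul><li>Home</li><li>About</li></ul>';

$count = preg_match_all($pattern, $html, $matches);

// Only the paragraph ends in sentence punctuation, so only it is reported.
print_r($matches[0]);
```

The navigation list is skipped because none of its elements contain ?, ! or . before the next tag.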

Recommended answer

The first thing I would try is making all the quantifiers possessive and all the groups atomic:

"@</?+(?![bisa]\b)(?!em\b)[^>]*+>
(?>[^<]++|</?+(?>(?>[bisau]|em|strong|sup)\b)[^>]*+>)*+
[?!.]+
@ix"

I think Jeremy's right: it's not backtracking per se that's killing you, it's all the state info the regex engine has to save to make backtracking possible. The regex seems to be constructed in such a way that if it ever has to backtrack, it's going to fail anyway. So use possessive quantifiers and atomic groups and don't bother saving all that useless info.
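
To check that the rewritten pattern survives the input from the question, something like the following could be used. This is a sketch, not a guarantee: whether the original pattern crashes, fails with an error, or succeeds depends on the PHP and PCRE versions in use. The helper name match_all_checked is made up for this example; preg_match_all and preg_last_error are standard PHP functions.

```php
<?php
// Hypothetical helper: run preg_match_all and turn a PCRE failure into an exception.
function match_all_checked(string $pattern, string $subject): array
{
    $count = preg_match_all($pattern, $subject, $matches);
    if ($count === false) {
        // preg_last_error() returns one of the PREG_*_ERROR constants.
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    return $matches[0];
}

// Input from the question: 10000 spaces between two tags.
$str = ' <fieldset> <label for="go-to">Go to: </label>  '
     . str_repeat(' ', 10000)
     . '<input type="submit" value="Go" /> </fieldset> </form>';

// Possessive/atomic version of the simplified pattern.
$possessive = '@</?+(?![bisa]\b)(?!em\b)[^>]*+>
(?>[^<]++|</?+(?>(?>[bisau]|em|strong|sup)\b)[^>]*+>)*+
[?!.]+
@ix';

$found = match_all_checked($possessive, $str);
echo count($found), " match(es), error code ", preg_last_error(), "\n";
```

The long run of spaces is consumed by a single [^<]++ rather than by 10000 separate trips through the group. The input contains no ?, ! or ., so no matches are expected; the point is only that the call returns normally.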

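With everything possessive, the group on the second line swallows ?, ! and . as ordinary text and never gives them back, so the final [?!.]+ can no longer match. To allow for the sentence-ending punctuation, you could add another alternative to the second line: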

(?>[^<?!.]++|(?![?!.\s]++<)[?!.]++|</?+(?>(?>[bisau]|em|strong|sup)\b)[^>]*+>)*+

The addition matches one or more of said characters, unless they're the last non-whitespace characters in the element.
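
Putting the pieces together, the simplified pattern with the extra alternative could be tried on a small sample. The sample HTML is invented, and the lookahead is written here as (?![?!.\s]++<), meaning "only punctuation and whitespace remain before the next tag". That is this sketch's reading of the description above, not something confirmed by the original answer.

```php
<?php
// Possessive pattern with an extra alternative for mid-element punctuation.
$pattern = '@</?+(?![bisa]\b)(?!em\b)[^>]*+>
(?>
    [^<?!.]++                       # ordinary text
  | (?![?!.\s]++<)[?!.]++           # punctuation that is not last in the element
  | </?+(?>(?>[bisau]|em|strong|sup)\b)[^>]*+>   # allowed inline tags
)*+
[?!.]+
@ix';

// Hypothetical sample: a paragraph with two sentences, then a navigation list.
$html = '<p>Wait. Is this content?  </p><ul><li>Home</li></ul>';

preg_match_all($pattern, $html, $matches);

// The first "." is absorbed by the group; the final "?" is left for [?!.]+.
print_r($matches[0]);
```

Because nothing here backtracks, the group has to stop before the last punctuation run on its own, which is what the lookahead is for.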
