PHP regex: is there anything wrong with this code?

Question

preg_replace_callback('#<(code|pre)([^>]*)>(((?!</?\1).)*|(?R))*</\1>#si', 'self::replaceit', $text);

?

I'm trying to replace text between code/pre tags and it does what I want, but sometimes it breaks the page.

I tested it with a few text samples, and some of them, which contain lots of characters such as &amp; and &lt;, make the browser stop displaying the page with a "connection closed by remote server" message.

Recommended answer

I'd like to help. I've seen this problem before!

Your regex looks logically fine, but when applied to a large-ish subject string it is likely causing a lot of recursive backtracking, which overflows the stack in the PCRE engine. This overflow results in a segmentation fault and crashes the executable running PCRE (either Apache or PHP) without warning. (The symptom is the "connection closed by remote server" message.) The unhandled crash is due to PHP's poor choice of default for the pcre.recursion_limit parameter (it defaults to 100,000, which is too high). First, let's see if this is, in fact, part of the problem.

Add the following code to your script:

// Place this at the top of the script
ini_set("pcre.recursion_limit", "524"); // 256KB stack. Win32 Apache

$re = '#<(code|pre)([^>]*)>(((?!</?\1).)*|(?R))*</\1>#si';
$text = preg_replace_callback($re, 'self::replaceit', $text);
// Check the return value for NULL which indicates a PCRE error.
if ($text === null) exit("PCRE Error! Subject too large or complex.");

With this in place, you should no longer get the "connection closed" message but rather the PCRE error exit message. Note that the above setting of 524 is for a Win32 Apache httpd.exe (which has a 256KB stack). If you are running on a *nix server, you can raise this value to 16777. The reasoning behind these numbers is that pcre.recursion_limit should be set to the executable's stack size divided by 500. Win32 executables typically have a 256KB stack, and *nix executables are typically built with an 8MB stack. Philip Hazel (author of the excellent PCRE engine) has addressed this problem in detail. See the pcrestack man page.
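The arithmetic behind 524 and 16777 can be sketched as follows. The helper name recursionLimitForStack() is hypothetical (it is not part of PHP or PCRE); it only applies the "stack size divided by 500" rule of thumb quoted above, and the choice of the *nix value at the end is an assumption about where the script runs.

```php
<?php
// Hypothetical helper: derive a pcre.recursion_limit value from a stack size
// using the "stack size divided by 500" rule of thumb from the answer.
function recursionLimitForStack(int $stackBytes): int
{
    return intdiv($stackBytes, 500);
}

$win32Limit = recursionLimitForStack(256 * 1024);       // 256KB stack
$nixLimit   = recursionLimitForStack(8 * 1024 * 1024);  // 8MB stack

// Assumption: the script runs on a *nix server built with an 8MB stack.
ini_set('pcre.recursion_limit', (string) $nixLimit);
```

If the actual stack size of your server build is known, pass that number instead of the typical values used here.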

Once you have done this, report back and I'll help with the next phase...
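When reporting back, it may help to say which PCRE limit was actually hit. The following is a minimal sketch, assuming a hypothetical wrapper named replaceOrExplain() (not from the original answer); it relies only on preg_replace_callback(), preg_last_error() and the standard PREG_* error constants.

```php
<?php
// Hypothetical wrapper: run preg_replace_callback() and, on failure,
// raise an exception naming the PCRE error instead of returning null.
function replaceOrExplain(string $pattern, callable $callback, string $subject): string
{
    $result = preg_replace_callback($pattern, $callback, $subject);
    if ($result !== null) {
        return $result;
    }

    // Map the most relevant error codes to readable descriptions.
    $names = [
        PREG_INTERNAL_ERROR        => 'internal error',
        PREG_BACKTRACK_LIMIT_ERROR => 'pcre.backtrack_limit exhausted',
        PREG_RECURSION_LIMIT_ERROR => 'pcre.recursion_limit exhausted',
        PREG_BAD_UTF8_ERROR        => 'malformed UTF-8 in subject',
    ];
    $code = preg_last_error();
    throw new RuntimeException('PCRE error: ' . ($names[$code] ?? "code $code"));
}

// Usage with a trivial pattern and callback (illustrative only).
$out = replaceOrExplain('/\d+/', function (array $m): string {
    return '#';
}, 'a1b22');
```

Which error code is reported for a given subject depends on the PHP and PCRE versions in use, so treat the message as a diagnostic hint rather than a guarantee.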

(Note that it is NOT the (?R) expression causing the problem. More later.)

The regex can be significantly improved (with regard to both solving this issue and improving its speed) by implementing Jeffrey Friedl's "Unrolling-the-Loop" efficiency technique. This will dramatically reduce the amount of backtracking required and will likely solve your problem. Here is an improved (and thoroughly commented) version of your regex.

$re = '% # Match an outermost PRE or CODE element.
    (               # $1: PRE/CODE element open tag
      <(code|pre)   # $2: Open tag name
      [^>]*+>       # Remainder of opening tag.
    )               # End $1: PRE/CODE element open tag.
    (               # $3: PRE/CODE element contents.
      (?:           # Group for contents alternatives
        (?R)        # Either a nested PRE or CODE element
      |             # Or non- <CODE, </CODE, <PRE or </PRE stuff.
        [^<]*+      # Begin: {normal* (special normal*)*} construct
        (?:         # See: "Mastering Regular Expressions".
          <         # {special} Match a <, but only if it is
          (?!/?\2)  # not the start of a nested or closing tag.
          [^<]*+    # match more {normal*}
        )*+         # Finish "Unrolling the loop"
      )*+           # Zero or more contents alternatives.
    )               # End $3: PRE/CODE element contents.
    (</\2>)         # $4: PRE/CODE element close tag
    %ix';

Note, however, that this regex differs from the original in that it uses four capture groups: $1 contains the whole element start tag, $2 contains the element tag name (which is used as a back reference), $3 contains the element contents, and $4 contains the element end tag.
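To illustrate how a callback might consume these four groups, here is a minimal sketch. The pattern is a compacted form of the commented regex above (same groups, no x-mode comments), and the closure $escapeContents is a hypothetical stand-in for the asker's replaceit() method, whose real behavior is not shown in the question.

```php
<?php
// Compact form of the commented regex: $1 open tag, $2 tag name,
// $3 contents, $4 close tag.
$re = '%(<(code|pre)[^>]*+>)((?:(?R)|[^<]*+(?:<(?!/?\2)[^<]*+)*+)*+)(</\2>)%i';

// Hypothetical callback: escape the element contents and leave the
// opening and closing tags untouched.
$escapeContents = function (array $m): string {
    return $m[1] . htmlspecialchars($m[3]) . $m[4];
};

$text = 'a <pre class="x">1 < 2 <b>y</b></pre> z';
$out  = preg_replace_callback($re, $escapeContents, $text);

// Check for null, which indicates a PCRE error.
if ($out === null) {
    exit('PCRE Error! Subject too large or complex.');
}
```

Because the regex matches only the outermost PRE or CODE element, any nested element arrives inside $m[3] and is handled as part of the contents.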
