PHP - Why am I being warned that my regular expression is too large?


Question

I would like to use a regular expression to validate user input. I want to allow any combination of letters, numbers, spaces, commas, apostrophes, periods, exclamation marks, and question marks, but I also want to limit the input to 4000 characters. I have come up with the following regular expression to achieve this: /^([a-z]|[0-9]| |,|'|\.|!|\?){1,4000}$/i.

However, when I attempt to use this regular expression to test a subject in PHP with preg_match(), I am given a warning: PHP Warning: preg_match(): Compilation failed: regular expression is too large at offset 37, and the subject fails to be tested.

I find this strange because if I use an infinite quantifier, the test passes just fine (I demonstrate this situation below).

Why is limiting the repetition to 4000 a problem, when infinite repetition is not?

regex-test.php:

<?php

$infinite = "/^([a-z]|[0-9]| |,|'|\.|!|\?)*$/i";        // Allows infinite repetition
$fourk    = "/^([a-z]|[0-9]| |,|'|\.|!|\?){1,4000}$/i"; // Limits repetition to 4000

$string   = "I like apples.";

if ( preg_match($infinite, $string) ){

    echo "Passed infinite repetition. \n";
}

if ( preg_match($fourk, $string) ){

    echo "Passed maximum repetition of 4000. \n";
}

?>

Output:

Passed infinite repetition 
PHP Warning:  preg_match(): Compilation failed: regular expression is too large at offset 37 in regex-test.php on line 16

Recommended answer

The error is due to PCRE's LINK_SIZE: internal offset values are 2 bytes wide by default, which limits the compiled pattern size to about 64K. This is expected behavior, explained below. It is not caused by a limit on repetition as such, nor by how the pattern is interpreted when compiled.

As Alan Moore pointed out in his answer, all characters should be in the same character class. I'm more drastic, so allow me to say that pattern is so wrong it makes me cringe.
-No offense, most of us tried that once too. It's just an attempt to underline that in no way such constructs should be used.

There are 3 common pitfalls in (x|y|z){1,4000}:

  1. Capturing subpatterns should only be used when needed (to store a specific part of the matched text, in order to extract that value or to use it in a backreference). For all other use cases, stick to non-capturing groups or atomic groups. They perform better and save memory.
  2. Capturing subpatterns should not be repeated because the last repetition overwrites the captured text.
    -OK, it could be used only in very particular cases.
  3. Alternation (with the |s) adds backtracking states. It's good practice to reduce them as much as you can. In this case, the regex /^[ !',.0-9?A-Z]{1,4000}$/i would match exactly the same input, avoiding the error and performing better.
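
As a minimal sketch of the character-class version from point 3, the snippet below validates input with a single class and the same {1,4000} bound. The helper name isValidInput is hypothetical and chosen only for illustration. The class is written in lower case, which is equivalent under the /i flag.

```php
<?php

// Hypothetical helper: true when $input is 1 to 4000 characters long and
// contains only letters, digits, spaces, and the punctuation , ' . ! ?
function isValidInput(string $input): bool
{
    // One character class replaces the capturing group with alternation.
    $pattern = '/^[ !\',.0-9?a-z]{1,4000}$/i';

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $input) === 1;
}

var_dump(isValidInput("I like apples."));        // true
var_dump(isValidInput("Semi;colon"));            // false: ';' is not allowed
var_dump(isValidInput(str_repeat('a', 4001)));   // false: too long
```

Comparing the result against 1 with === keeps an error (false) from being mistaken for a plain non-match (0).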


LINK_SIZE

From the pcrebuild man page (section "Handling very large patterns"):

Within a compiled pattern, offset values are used to point from one part to another (for example, from an opening parenthesis to an alternation metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values are used for these offsets, leading to a maximum size for a compiled pattern of around 64K.

That means the compiled pattern stores an offset value for every subpattern in the alternation, for every repetition of the group. In this case the offsets leave no memory for the rest of the compiled pattern.

This is more clearly expressed in a comment in pcre_internal.h from the PHP dist:

PCRE keeps offsets in its compiled code as 2-byte quantities (always stored in big-endian order) by default. These are used, for example, to link from the start of a subpattern to its alternatives and its end. The use of 2 bytes per offset limits the size of the compiled regex to around 64K, which is big enough for almost everybody.


Using pcretest, I get the following information:

PCRE version 8.37 2015-04-28

/^([a-z]|[0-9]| |,|'|\.|!|\?){1,575}$/i
Failed: regular expression is too large at offset 36

/^([a-z]|[0-9]| |,|'|\.|!|\?){1,574}$/i
Memory allocation (code space): 65432
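
As a rough estimate, the pcretest figures work out to about 114 bytes of compiled code per repetition (65432 / 574). If the size grows roughly linearly, 4000 repetitions would need on the order of 450 KB, far beyond the 64K limit. The same threshold can be probed from PHP itself. The sketch below binary-searches for the largest repetition count that still compiles. The function names compiles, questionPattern and maxRepetition are hypothetical, and the search assumes the compiled size only grows with the count. The result depends on the PCRE version and on the LINK_SIZE that PHP was built with, so it may not be 574.

```php
<?php

// True if PCRE can compile $pattern. On a compilation failure preg_match()
// returns false and emits a warning, which is silenced here with @.
function compiles(string $pattern): bool
{
    return @preg_match($pattern, '') !== false;
}

// The pattern from the question with a configurable upper bound.
function questionPattern(int $n): string
{
    return sprintf('/^([a-z]|[0-9]| |,|\'|\.|!|\?){1,%d}$/i', $n);
}

// Largest N in [1, $max] for which {1,N} still compiles.
function maxRepetition(int $max): int
{
    if (compiles(questionPattern($max))) {
        return $max; // no limit was hit within the searched range
    }

    $lo = 1;     // assumed to compile
    $hi = $max;  // known to fail
    while ($hi - $lo > 1) {
        $mid = intdiv($lo + $hi, 2);
        if (compiles(questionPattern($mid))) {
            $lo = $mid;
        } else {
            $hi = $mid;
        }
    }

    return $lo;
}

echo maxRepetition(4000), "\n";
```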
