PHP - Why am I being warned that my regular expression is too large?
Question
I would like to use a regular expression to validate user input. I want to allow any combination of letters, numbers, spaces, commas, apostrophes, periods, exclamation marks, and question marks, but I also want to limit the input to 4000 characters. I have come up with the following regular expression to achieve this: /^([a-z]|[0-9]| |,|'|\.|!|\?){1,4000}$/i
However, when I attempt to use this regular expression to test a subject in PHP with preg_match(), I am given a warning: PHP Warning: preg_match(): Compilation failed: regular expression is too large at offset 37
and the subject fails to be tested.
I find this strange because if I use an infinite quantifier, the test passes just fine (I demonstrate this situation below).
Why is limiting the repetition to 4000 a problem, while infinite repetition is not?
regex-test.php:
<?php
$infinite = "/^([a-z]|[0-9]| |,|'|\.|!|\?)*$/i"; // Allows infinite repetition
$fourk = "/^([a-z]|[0-9]| |,|'|\.|!|\?){1,4000}$/i"; // Limits repetition to 4000
$string = "I like apples.";
if ( preg_match($infinite, $string) ){
    echo "Passed infinite repetition. \n";
}
if ( preg_match($fourk, $string) ){
    echo "Passed maximum repetition of 4000. \n";
}
?>
Output:
Passed infinite repetition
PHP Warning: preg_match(): Compilation failed: regular expression is too large at offset 37 in regex-test.php on line 16
Recommended answer
The error is due to PCRE's LINK_SIZE, with offset values limiting the compiled pattern size to 64K. This is expected behavior, explained below, and it's not because of a limit on repetition nor because of how the pattern is interpreted when compiled.
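Since a pattern that fails to compile makes preg_match() return false rather than 0, an `if` around it silently skips the branch. The following is a minimal sketch of one way to tell "no match" apart from "pattern error"; the helper name matchOrFail is hypothetical and not part of PHP. Whether the 4000-repetition pattern actually fails depends on the PCRE build your PHP uses, so the sketch handles both outcomes.

```php
<?php
/**
 * Hypothetical helper (not a PHP built-in): returns true on a match,
 * false on no match, and throws if preg_match() itself reports failure.
 */
function matchOrFail(string $pattern, string $subject): bool
{
    // Forget any earlier error so error_get_last() reflects this call only.
    error_clear_last();

    // @ hides the warning; the failure is surfaced through the exception below.
    $result = @preg_match($pattern, $subject);

    if ($result === false) {
        $error = error_get_last();
        throw new RuntimeException($error['message'] ?? 'preg_match() failed');
    }

    return $result === 1;
}

try {
    $ok = matchOrFail("/^([a-z]|[0-9]| |,|'|\.|!|\?){1,4000}$/i", "I like apples.");
    echo $ok ? "Matched.\n" : "Did not match.\n";
} catch (RuntimeException $e) {
    // On builds where the compiled pattern exceeds the link-size limit,
    // the "regular expression is too large" message ends up here.
    echo "Pattern error: ", $e->getMessage(), "\n";
}
```

With this shape, a too-large pattern is reported explicitly instead of being treated as a non-matching input.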
As Alan Moore pointed out in his answer, all characters should be in the same character class. I'm more drastic, so allow me to say that the pattern is so wrong it makes me cringe. (No offense, most of us tried that once too. It's just an attempt to underline that such constructs should never be used.)
There are 3 common pitfalls in (x|y|z){1,4000}:
- Capturing subpatterns should only be used when needed (to store a specific part of the matched text, in order to extract that value or to use it in a backreference). For all other use cases, stick to non-capturing groups or atomic groups. They perform better and save memory.
- Capturing subpatterns should not be repeated, because the last repetition overwrites the captured text. (OK, it could be used, but only in very particular cases.)
- Alternation (with the |s) adds backtracking states. It's good practice to reduce them as much as you can. In this case, the regex /^[ !',.0-9?A-Z]{1,4000}$/i would match exactly the same input, not only avoiding the error but also performing better.
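As a rough illustration of the third point, here is a minimal sketch that validates input with the single character class instead of the repeated capturing alternation. The function name isValidInput and the sample strings are assumptions made for the example.

```php
<?php
// One character class, no capturing group, no alternation.
$pattern = "/^[ !',.0-9?a-z]{1,4000}$/i";

// Hypothetical helper: true only when the whole input consists of
// 1 to 4000 allowed characters.
function isValidInput(string $input, string $pattern): bool
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $input) === 1;
}

var_dump(isValidInput("I like apples.", $pattern));         // allowed characters only
var_dump(isValidInput("No <tags> allowed", $pattern));      // contains < and >
var_dump(isValidInput(str_repeat("a", 4001), $pattern));    // one character too long
```

Comparing against 1 with `===` keeps an error result (false) from being mistaken for a plain non-match elsewhere in the code.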
LINK_SIZE
From the "Handling very large patterns" section of the pcrebuild man page:
Within a compiled pattern, offset values are used to point from one part to another (for example, from an opening parenthesis to an alternation metacharacter). By default, in the 8-bit and 16-bit libraries, two-byte values are used for these offsets, leading to a maximum size for a compiled pattern of around 64K.
That means the compiled pattern stores an offset value for every subpattern in the alternation, for every repetition of the group. In this case the offsets leave no memory for the rest of the compiled pattern.
This is more clearly expressed in a comment in pcre_internal.h from the PHP dist:
PCRE keeps offsets in its compiled code as 2-byte quantities (always stored in big-endian order) by default. These are used, for example, to link from the start of a subpattern to its alternatives and its end. The use of 2 bytes per offset limits the size of the compiled regex to around 64K, which is big enough for almost everybody.
Using pcretest, I get the following information:
PCRE version 8.37 2015-04-28
/^([a-z]|[0-9]| |,|'|\.|!|\?){1,575}$/i
Failed: regular expression is too large at offset 36
/^([a-z]|[0-9]| |,|'|\.|!|\?){1,574}$/i
Memory allocation (code space): 65432
- There's a Windows version you can download from RexEgg.com.
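The pcretest figures allow a back-of-the-envelope estimate of why {1,4000} cannot fit. The sketch below only reuses the two numbers reported above and assumes that the compiled size grows roughly linearly with the repetition count, which is a simplification rather than a statement about PCRE internals.

```php
<?php
// Figures copied from the pcretest run above (PCRE 8.37, default 2-byte links).
$measuredBytes = 65432;       // code space reported for {1,574}
$measuredRepetitions = 574;   // largest repetition count that still compiled
$limitBytes = 64 * 1024;      // the "around 64K" ceiling with LINK_SIZE=2

// Assumption: compiled size grows roughly linearly with the repetition count.
$bytesPerRepetition = $measuredBytes / $measuredRepetitions;
$estimatedFor4000 = $bytesPerRepetition * 4000;

printf("About %.0f bytes per repetition\n", $bytesPerRepetition);
printf(
    "Estimated size for {1,4000}: about %.0f KB (limit: %d KB)\n",
    $estimatedFor4000 / 1024,
    $limitBytes / 1024
);
```

Under that assumption the group costs a little over a hundred bytes per repetition, so 4000 repetitions would need several times the available 64K.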
Regarding other size limitations in PCRE, you can check this post of mine.
If we had a real reason to use a huge pattern, and the pattern could not be simplified any further by any means, the link size could be increased. However, you can only achieve this by recompiling PHP yourself (so your code won't be portable from then on). It should be the last resort, when there's no other choice.
Also commented in pcre_internal.h:
The macros are controlled by the value of LINK_SIZE. This defaults to 2 in the config.h file, but can be overridden by using -D on the command line. This is automated on Unix systems via the "configure" command.
PCRE link size can be configured to 3 or 4:
./configure -DLINK_SIZE=4
But keep in mind that longer offsets require additional data, and it will slow down all calls to preg_* functions.
In case of compiling PHP on your own, see Installation on Unix systems or Build your own PHP on Windows.