PCRE in PHP: concrete example of the usage and utility of the "S" (extra analysis of pattern) modifier?


Question

The PHP manual says the following about PCRE's "S" (extra analysis of pattern) modifier at http://php.net/manual/en/reference.pcre.pattern.modifiers.php:

S

When a pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. If this modifier is set, then this extra analysis is performed. At present, studying a pattern is useful only for non-anchored patterns that do not have a single fixed starting character.

So it is meant for patterns that are used several times and that have neither anchors (such as ^ or $) nor a fixed starting character sequence, unlike, for example, '/^abc/'.

But there are no specific details on where to apply this modifier or how it actually works.

Does it apply only to the PHP thread running the current script, so that the "cached" analysis of the pattern is lost once the script finishes? Or does the engine store the analysis in a global cache that is then made available to several PHP threads using PCRE with a pattern marked with this modifier?

Also, from the PCRE introduction: http://php.net/manual/en/intro.pcre.php

Note: This extension maintains a global per-thread cache of compiled regular expressions (up to 4096)

If the "S" modifier is used per-thread only, how does it differ from the PCRE cache of compiled regexps? I guess additional information is stored, similar to what MySQL does when it indexes the rows in a table (though in the case of PCRE this additional information is kept in memory).

Last but not least, has anyone had a real use case for this modifier, and did you notice an improvement?

Thanks for the attention.

Solution

The PHP docs quote only a small part of the PCRE docs. Here are some more details (emphasis mine) from PCRE 8.36:

If a compiled pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. The function pcre_study() takes a pointer to a compiled pattern as its first argument. If studying the pattern produces additional information that will help speed up matching, pcre_study() returns a pointer to a pcre_extra block, in which the study_data field points to the results of the study.

...

Studying a pattern does two things: first, a lower bound for the length of subject string that is needed to match the pattern is computed. This does not mean that there are any strings of that length that match, but it does guarantee that no shorter strings match. The value is used to avoid wasting time by trying to match strings that are shorter than the lower bound. You can find out the value in a calling program via the pcre_fullinfo() function.

Studying a pattern is also useful for non-anchored patterns that do not have a single fixed starting character. A bitmap of possible starting bytes is created. This speeds up finding a position in the subject at which to start matching. (In 16-bit mode, the bitmap is used for 16-bit values less than 256. In 32-bit mode, the bitmap is used for 32-bit values less than 256.)
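The two optimizations described above can be illustrated with a toy model. This is only a sketch, not how PCRE is implemented: the pattern is reduced to a list of non-empty literal alternatives, and the helper names (studyMinLength, studyStartBytes, studiedSearch) are hypothetical.

```php
<?php
// Toy model of what "studying" precomputes; NOT PCRE's actual code.
// Assumption: the pattern is a list of non-empty literal alternatives,
// e.g. /cat|dog|bird/ is modelled as ['cat', 'dog', 'bird'].

/** Lower bound on the subject length: the length of the shortest alternative. */
function studyMinLength(array $alternatives): int
{
    return min(array_map('strlen', $alternatives));
}

/** Set of bytes that can begin a match (stands in for PCRE's start-byte bitmap). */
function studyStartBytes(array $alternatives): array
{
    $map = [];
    foreach ($alternatives as $alt) {
        $map[$alt[0]] = true;
    }
    return $map;
}

/** Unanchored search that uses both pieces of study data to skip work. */
function studiedSearch(array $alternatives, string $subject): int
{
    $minLen = studyMinLength($alternatives);
    $startBytes = studyStartBytes($alternatives);
    $len = strlen($subject);

    if ($len < $minLen) {
        return -1; // subject shorter than the lower bound: nothing to try
    }
    for ($i = 0; $i <= $len - $minLen; $i++) {
        if (!isset($startBytes[$subject[$i]])) {
            continue; // this byte cannot start a match, skip the position
        }
        foreach ($alternatives as $alt) {
            if (substr_compare($subject, $alt, $i, strlen($alt)) === 0) {
                return $i; // offset of the first match
            }
        }
    }
    return -1;
}
```

In this model, a pattern such as /^abc/ would gain nothing from the start-byte set, because there is only one position and one starting character to consider, which is consistent with the manual's remark about anchored patterns and fixed starting characters.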

Please note that in the later PCRE version (v10.00, also called PCRE2), the library has undergone a massive refactoring and API redesign. One of the consequences is that studying is always performed in PCRE 10.00 and above. I don't know when PHP will make use of PCRE2, but it will happen sooner or later because PCRE 8.x won't get any new features from now on.

Here's a quote from the PCRE2 release announcement:

Explicit "studying" of compiled patterns has been abolished - it now always happens automatically. JIT compiling is done by calling a new function, pcre2_jit_compile() after a successful return from pcre2_compile().


As for your second question:

If the "S" modifier is used per-thread only, how does it differ from the PCRE cache of compiled regexps?

There's no cache in PCRE itself, but PHP maintains a cache of regexes to avoid recompiling the same pattern over and over again, for instance in case you use a preg_ function inside a loop.
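As a minimal sketch of the loop case, assuming a hypothetical pattern and input lines: the same pattern string, including the S modifier, is passed to preg_match() on every iteration, so PHP's cache can serve the compiled form after the first call. Whether the modifier makes a measurable difference depends on the PCRE version PHP is built against; with PCRE2, where studying is automatic, no difference should be expected.

```php
<?php
// Hypothetical data. The pattern is unanchored and has no single fixed
// starting character, which is the case the manual says studying helps.
$pattern = '/cat|dog|bird/S';
$lines = ['a small bird', 'nothing here', 'hot dog stand'];

$hits = [];
foreach ($lines as $line) {
    // The pattern string is identical on every iteration, so PHP can reuse
    // its cached compiled form instead of recompiling it each time.
    if (preg_match($pattern, $line, $m, PREG_OFFSET_CAPTURE) === 1) {
        $hits[] = [$m[0][0], $m[0][1]]; // matched text and its byte offset
    }
}
```

To judge the effect on a real workload, time the same loop with and without the S modifier on the PHP build in question rather than relying on the documentation alone.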
