Consequences of Inserting Positive Lookbehind into Arbitrary Regex to Simulate Byte Offset

Question

What would be the consequences of inserting a positive lookbehind for n bytes, (?<=\C{n}), at the beginning of an arbitrary regular expression, particularly when it is used for replacement operations?

At least within PHP, the regex match functions, preg_match and preg_match_all, allow for matching to begin after a given byte offset. There is no corresponding feature in any of the other PCRE PHP functions - you can specify a limit to the number of replacements done by preg_replace for instance, but not that those replacements' matches must occur after n-bytes.

There would obviously be some (let's call them trivial) consequences for performance and readability, but would there be any non-trivial impacts, like matches becoming non-matches (other than those that start before the n-byte offset) or replacements becoming malformed?

Some examples:

/some expression/ becomes /(?<=\C{4})some expression/ for a 4-byte offset

/(this) has (groups)/i becomes /(?<=\C{2})(this) has (groups)/i for a 2-byte offset

As far as I can tell, and from the limited tests that I've run, adding in this lookbehind effectively simulates this offset parameter and doesn't mess with any other lookbehinds, substitutions, or other control patterns; but I'm also not an expert on Regex.

I'm trying to determine whether there are any likely consequences to building replace/filter function extensions by inserting the n-byte lookbehind into patterns. It should operate just as the match functions' offset parameter does, so simply running the expression against substr( $subject, $offset ) won't work, for the same reasons it doesn't for preg_match (most notably, it cuts off any lookbehinds, and ^ then incorrectly matches the start of the substring rather than the start of the original string).
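The idea in the question can be sketched as a small wrapper. This is only an illustration under stated assumptions: preg_replace_from is a hypothetical helper name (it is not a PHP function), the pattern is assumed to use the same character as opening and closing delimiter, and the pattern is assumed to run in non-UTF mode (no u flag).

```php
<?php
// Hypothetical helper, not part of PHP: emulate an $offset argument for
// preg_replace by prefixing the pattern with a fixed-length lookbehind.
// Assumptions: same opening/closing delimiter (e.g. "/"), no "u" flag.
function preg_replace_from(string $pattern, string $replacement, string $subject, int $offset)
{
    if ($offset <= 0) {
        return preg_replace($pattern, $replacement, $subject);
    }
    $delim = $pattern[0];
    $end   = strrpos($pattern, $delim);          // position of closing delimiter
    $body  = substr($pattern, 1, $end - 1);      // the regex itself
    $flags = substr($pattern, $end + 1);         // trailing modifiers, e.g. "i"

    // The body is wrapped in a non-capturing group so that the lookbehind
    // applies to every top-level alternative, not only the first one.
    $prefixed = $delim . '(?<=\C{' . $offset . '})(?:' . $body . ')' . $delim . $flags;

    return preg_replace($prefixed, $replacement, $subject);
}

// Only matches that start at byte 4 or later are replaced.
echo preg_replace_from('/foo/', 'X', 'foo bar foo bar', 4), "\n";

// Unlike the substr() approach, ^ still refers to the real start of the subject.
echo preg_replace('/^bar/', 'X', substr('foo bar', 4)), "\n";   // substring: "bar" is replaced
echo preg_replace_from('/^bar/', 'X', 'foo bar', 4), "\n";      // original: nothing is replaced
```

The non-capturing wrapper is a design choice of this sketch rather than something stated in the question: without it, a pattern such as /a|b/ would become /(?<=\C{4})a|b/, where the lookbehind constrains only the first alternative. A non-capturing group does not change the numbering of existing capture groups, so replacement references such as $1 keep their meaning.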

Solution

Short answer

In non-UTF mode, UTF-8 library

Assuming the PCRE library bundled with PHP is compiled as the 8-bit library (UTF-8), then in non-UTF mode

\C

is equivalent to

[\x00-\xff]

and

(?s:.)

Any of them can be used in a look-behind as a replacement for the offset parameter of the preg_match and preg_match_all functions.

In non-UTF mode, all of them match exactly 1 data unit, which is 1 byte in the 8-bit (UTF-8) PCRE library, and they match all 256 possible values.
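As a quick check of that equivalence, the sketch below compares the three constructs inside a look-behind against the native offset parameter of preg_match. The subject string and variable names are made up for illustration; the subject deliberately contains bytes above 0x7F.

```php
<?php
// "é" is the two bytes C3 A9 in UTF-8; the pattern has no "u" flag,
// so every construct below counts bytes, not characters.
$subject = "\xC3\xA9 cat cat";

// Reference: native offset parameter, start matching at byte 4.
preg_match('/cat/', $subject, $m, PREG_OFFSET_CAPTURE, 4);
$expected = $m[0][1];

// The three single-unit constructs from the answer, each repeated 4 times
// inside a look-behind.
$units = ['\C', '[\x00-\xff]', '(?s:.)'];
$found = [];
foreach ($units as $unit) {
    preg_match('/(?<=' . $unit . '{4})cat/', $subject, $m, PREG_OFFSET_CAPTURE);
    $found[$unit] = $m[0][1];
}

echo "offset parameter: $expected\n";
foreach ($found as $unit => $pos) {
    echo "$unit: $pos\n";
}
```

In this subject the first "cat" starts at byte 3 and the second at byte 7, so all four variants should skip the first occurrence and report the second.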

In UTF mode, UTF-8 library

UTF mode can be activated with the u flag in the pattern passed to a preg_* function, or by specifying one of the (*UTF), (*UTF8), (*UTF16), (*UTF32) verbs at the beginning of the pattern.

In UTF mode, a character class [] and the dot metacharacter . match one code point that is within the valid Unicode range and is not a surrogate. Since one code point is encoded as 1 to 4 bytes in UTF-8, and because of the way UTF-8 encodes them, a character class cannot be used to match a single byte with a value in the range 0x80 to 0xFF.

While \C is specifically designed to match one data unit (which is one byte in UTF-8) regardless of whether UTF mode is on, it is not supported inside a look-behind in UTF mode.
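The difference between the two modes can be illustrated with a short sketch. The subject string and variable names are assumptions chosen for the example; the last call relies on the behaviour described above and in the PCRE documentation, and its warning is suppressed with @.

```php
<?php
// "éé cat": each "é" is 2 bytes, so "cat" starts at byte 5
// but is preceded by only 3 code points.
$subject = "\xC3\xA9\xC3\xA9 cat";

// Non-UTF mode: the look-behind counts bytes; 5 bytes precede "cat".
$bytesFive = preg_match('/(?<=(?s:.){5})cat/', $subject);

// UTF mode: the same look-behind counts code points; only 3 precede "cat".
$utfFive  = preg_match('/(?<=(?s:.){5})cat/u', $subject);
$utfThree = preg_match('/(?<=(?s:.){3})cat/u', $subject);

// \C inside a look-behind in UTF mode: per the PCRE documentation the
// pattern is expected to fail to compile, in which case preg_match
// returns false instead of 0 or 1.
$utfBackslashC = @preg_match('/(?<=\C{5})cat/u', $subject);

var_dump($bytesFive, $utfFive, $utfThree, $utfBackslashC);
```

So a byte-counting look-behind built from . or a character class only simulates a byte offset while the pattern stays in non-UTF mode; with the u flag it silently turns into a code-point count.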

UTF-16 and UTF-32 libraries

I don't know whether anyone actually compiles the 16-bit or 32-bit PCRE library, links it into PHP and makes it work. If anyone knows of such a build being widely used in the wild, please ping me. I have no clue how the string and the offset would be passed from PHP to the C API of PCRE in that case, and the results of the preg_* functions may differ depending on that.

More details

At the C API level of the PCRE library, you can only work with data units: 8-bit units for the 8-bit library, 16-bit units for the 16-bit library and 32-bit units for the 32-bit library.

For the 8-bit library (UTF-8), 1 data unit is 8 bits, or 1 byte, so there is little obstacle to specifying an offset in bytes, whether as a function parameter or as a regex construct.

Regex constructs

In non-UTF mode, a character class [], the dot . and \C each match exactly 1 data unit.

  • \C matches 1 data unit in both UTF mode and non-UTF mode. It can't be used in a look-behind in UTF mode, though.

    MATCHING A SINGLE DATA UNIT

    Outside a character class, the escape sequence \C matches any one data unit, whether or not a UTF mode is set.

  • . matches 1 data unit in non-UTF mode.

    General comments about UTF modes

    [...]

    1. The dot metacharacter matches one UTF character instead of a single data unit.

  • A character class matches 1 data unit in non-UTF mode. The documentation doesn't explicitly state this, but the wording implies it.

    SQUARE BRACKETS AND CHARACTER CLASSES

    [...]

    A character class matches a single character in the subject. In a UTF mode, the character may be more than one data unit long.

    The same conclusion can be reached by looking at the upper limit of the \x{hh...} syntax for specifying a character by hex code in non-UTF mode. In testing, the last clause, about surrogates, does not seem to apply in non-UTF mode.

    Characters that are specified using octal or hexadecimal numbers are limited to certain values, as follows:

     8-bit non-UTF mode    less than 0x100
     8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
     16-bit non-UTF mode   less than 0x10000
     16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
     32-bit non-UTF mode   less than 0x100000000
     32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
    

    Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called "surrogate" codepoints), and 0xffef.

Offset

All offsets supplied and returned are counted in data units:

The string to be matched by pcre_exec()

The subject string is passed to pcre_exec() as a pointer in subject, a length in length, and a starting offset in startoffset. The units for length and startoffset are bytes for the 8-bit library, 16-bit data items for the 16-bit library, and 32-bit data items for the 32-bit library.

How pcre_exec() returns captured substrings

[...]

When a match is successful, information about captured substrings is returned in pairs of integers, starting at the beginning of ovector, and continuing up to two-thirds of its length at the most. The first element of each pair is set to the offset of the first character in a substring, and the second is set to the offset of the first character after the end of a substring. These values are always data unit offsets, even in UTF mode.
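From the PHP side, this can be observed with PREG_OFFSET_CAPTURE. The following is a minimal sketch with a made-up subject string; it assumes the 8-bit library, where a data unit is a byte.

```php
<?php
// "éé cat": 3 code points (5 bytes) precede "cat".
$subject = "\xC3\xA9\xC3\xA9 cat";

// Even with the "u" flag, the reported offset is counted in bytes.
preg_match('/cat/u', $subject, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1];

// Count the code points in front of the match for comparison.
$charCount = preg_match_all('/./su', substr($subject, 0, $byteOffset));

// The offset parameter takes the same unit: passing the byte offset
// back in lets the match start exactly at "cat".
$viaOffset = preg_match('/cat/u', $subject, $unused, 0, $byteOffset);

printf("byte offset: %d, code points before match: %d\n", $byteOffset, $charCount);
```

This is why a byte-based look-behind lines up with the offset parameter in the first place: both sides of the API count data units, not characters.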
