Consequences of Inserting Positive Lookbehind into Arbitrary Regex to Simulate Byte Offset

Published: 2020/7/1 4:59:27. Tags: php, regex, regex-lookarounds

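Question

What would be the consequences of inserting a positive lookbehind for n bytes, (?<=\C{n}), at the beginning of an arbitrary regular expression, particularly when it is used for replacement operations?

At least within PHP, the regex match functions preg_match and preg_match_all allow matching to begin after a given byte offset. None of the other PCRE functions in PHP has a corresponding feature. For example, you can limit the number of replacements done by preg_replace, but you cannot require that the matches for those replacements occur after n bytes.

There would obviously be some (let's call them trivial) consequences for performance and readability, but would there be any non-trivial impacts, such as matches becoming non-matches (other than those that are not offset by n bytes) or replacements becoming malformed?

Some examples:

/some expression/ becomes /(?<=\C{4})some expression/ for a 4-byte offset

/(this) has (groups)/i becomes /(?<=\C{2})(this) has (groups)/i for a 2-byte offset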
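The two rewrites above can be tried with preg_replace. This is only a minimal sketch: the subject strings and replacement texts are invented for illustration, and it assumes non-UTF mode (no u flag) and a PCRE build in which \C is enabled.

```php
<?php
// Illustrative subject; it does not come from the original question.
$subject = 'some expression, then some expression';

// Without the lookbehind, both occurrences are replaced.
$plain = preg_replace('/some expression/', 'X', $subject);

// With a 4-byte lookbehind, a match can only start at byte offset 4 or later,
// so the occurrence at offset 0 is left alone.
$offset4 = preg_replace('/(?<=\C{4})some expression/', 'X', $subject);

// A lookbehind is not a capturing group, so $1 and $2 keep their meaning.
$groups = preg_replace(
    '/(?<=\C{2})(this) has (groups)/i',
    '$2 has $1',
    'This has groups; this has groups'
);

var_dump($plain);   // 'X, then X'
var_dump($offset4); // 'some expression, then X'
var_dump($groups);  // 'This has groups; groups has this'
```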

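As far as I can tell, and from the limited tests I have run, adding this lookbehind effectively simulates the offset parameter and does not interfere with any other lookbehinds, substitutions, or other control patterns; but I am not a regex expert either.

I am trying to determine whether there are any likely consequences of building replace/filter function extensions by inserting an n-byte lookbehind into patterns. It should operate just as the offset parameter of the match functions does, so simply running expressions against substr( $subject, $offset ) won't work, for the same reasons it doesn't for preg_match (most notably, it cuts off any lookbehinds, and ^ then incorrectly matches the start of the substring rather than the start of the original string).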
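The difference between substr() and a real offset can be seen in a short experiment. The sketch below uses an invented subject and variable names, and again assumes non-UTF mode with \C available.

```php
<?php
$subject = 'abcdef';
$offset  = 3;

// On the substring, ^ anchors at the start of 'def', so this matches.
$anchorOnSubstr = preg_match('/^def/', substr($subject, $offset));

// With the offset parameter, ^ still means the start of 'abcdef': no match.
$anchorWithOffset = preg_match('/^def/', $subject, $m, 0, $offset);

// A lookbehind can only see the bytes before the offset if the whole subject is passed.
$lookbehindOnSubstr   = preg_match('/(?<=c)def/', substr($subject, $offset));
$lookbehindWithOffset = preg_match('/(?<=c)def/', $subject, $m, 0, $offset);

// The injected 3-byte lookbehind gives the same answers as the offset parameter.
$anchorInjected     = preg_match('/(?<=\C{3})^def/', $subject);
$lookbehindInjected = preg_match('/(?<=\C{3})(?<=c)def/', $subject);

var_dump($anchorOnSubstr, $anchorWithOffset, $anchorInjected);             // 1, 0, 0
var_dump($lookbehindOnSubstr, $lookbehindWithOffset, $lookbehindInjected); // 0, 1, 1
```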

Answer

Short answer

In non-UTF mode, UTF-8 library

Assuming the PCRE library bundled with PHP is compiled as the 8-bit library (UTF-8), then in non-UTF mode

\C

is equivalent to

[\x00-\xff]

and

(?s:.)

Any of them can be used in a look-behind as a replacement for the offset parameter of the preg_match and preg_match_all functions.

In non-UTF mode, all of them match 1 data unit, which is 1 byte in the 8-bit (UTF-8) PCRE library, and they match all 256 possible values.
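As a rough check of this equivalence, the sketch below applies the three constructs as a 4-byte look-behind. The subjects and variable names are made up for the example; the second subject contains 0xFF, NUL, CR and LF to exercise the "all 256 values" claim.

```php
<?php
// Three ways to write "any 4 bytes" in non-UTF mode.
$patterns = [
    '/(?<=\C{4})foo/',
    '/(?<=[\x00-\xff]{4})foo/',
    '/(?<=(?s:.{4}))foo/',
];

$text   = 'foo foo foo';       // 'foo' at byte offsets 0, 4 and 8
$binary = "\xFF\x00\r\nfoo";   // 4 arbitrary bytes, then 'foo'

$textResults   = [];
$binaryResults = [];
foreach ($patterns as $pattern) {
    $textResults[]   = preg_replace($pattern, 'bar', $text);
    $binaryResults[] = preg_replace($pattern, 'bar', $binary);
}

var_dump($textResults);   // each entry is 'foo bar bar'
var_dump(array_map('bin2hex', $binaryResults));
```

The (?s:...) wrapper matters here: without the s modifier, the dot would not match the newline in the second subject.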

In UTF-mode, UTF-8 library

UTF mode can be activated by the u flag in the pattern passed to a preg_* function, or by specifying one of the (*UTF), (*UTF8), (*UTF16), (*UTF32) verbs at the beginning of the pattern.

In UTF mode, a character class [] and the dot metacharacter . match one code point that is within the valid range of Unicode characters and is not a surrogate. Since one code point can be encoded in 1 to 4 bytes in UTF-8, and due to the encoding scheme of UTF-8, it is not possible to use a character class to match a single byte with a value in the range 0x80 to 0xFF.

While \C is specifically designed to match one data unit (one byte in UTF-8) regardless of whether UTF mode is on, it is not supported inside a look-behind in UTF mode.
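The following sketch illustrates both points with an invented subject. It assumes that a pattern which fails to compile makes preg_match() return false with a warning (suppressed here with @).

```php
<?php
// "\xC3\xA9" is 'é' in UTF-8: one code point, two bytes.
// 'foo' therefore starts at byte offset 2 but at character offset 1.
$subject = "\xC3\xA9foo";

// \C inside a look-behind is rejected when the pattern is compiled in UTF mode.
$withC = @preg_match('/(?<=\C{2})foo/u', $subject);

// A dot-based look-behind compiles, but it counts code points, not bytes.
$twoCodePoints = preg_match('/(?<=(?s:.{2}))foo/u', $subject);
$oneCodePoint  = preg_match('/(?<=(?s:.{1}))foo/u', $subject);

// Without the u flag the subject is just bytes, and \C{2} works as a byte offset.
$twoBytes = preg_match('/(?<=\C{2})foo/', $subject);

var_dump($withC);         // false
var_dump($twoCodePoints); // 0: only one character precedes 'foo'
var_dump($oneCodePoint);  // 1
var_dump($twoBytes);      // 1
```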

UTF-16 and UTF-32 library

I don't know whether anyone actually compiles the 16-bit or 32-bit PCRE library, includes it in PHP, and makes it work. If anyone knows of such a build being widely used in the wild, please ping me. I have no clue how the string and the offset from PHP would be passed to the C API of PCRE in that case, and the results of the preg_* functions may differ depending on that.

More details

At the C API level of the PCRE library, you can only work with data units, which are 8-bit units for the 8-bit library, 16-bit units for the 16-bit library, and 32-bit units for the 32-bit library.

For the 8-bit library (UTF-8), 1 data unit is 8 bits, or 1 byte, so there is little obstacle to specifying an offset in bytes, whether as a function parameter or as a regex construct.

Regex constructs

In non-UTF mode, a character class [], the dot . and \C each match exactly 1 data unit.

\C matches 1 data unit in both UTF mode and non-UTF mode. It can't be used in a look-behind in UTF mode, though.

MATCHING A SINGLE DATA UNIT

Outside a character class, the escape sequence \C matches any one data unit, whether or not a UTF mode is set.

. matches 1 data unit in non-UTF mode.

General comments about UTF modes

[...]

1. The dot metacharacter matches one UTF character instead of a single data unit.

A character class matches 1 data unit in non-UTF mode. The documentation doesn't explicitly state this, but it's implied by the wording.

SQUARE BRACKETS AND CHARACTER CLASSES

[...]

A character class matches a single character in the subject. In a UTF mode, the character may be more than one data unit long.

The same conclusion can be reached by looking at the upper limit of the \x{hh...} syntax for specifying a character by hex code in non-UTF mode. Through testing, the last clause, about surrogates, doesn't seem to apply in non-UTF mode.

Characters that are specified using octal or hexadecimal numbers are limited to certain values, as follows:
```
 8-bit non-UTF mode    less than 0x100
 8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
 16-bit non-UTF mode   less than 0x10000
 16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
 32-bit non-UTF mode   less than 0x100000000
 32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
```
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called "surrogate" codepoints), and 0xffef.

Offset

All offsets supplied and returned are counted in data units:

The string to be matched by pcre_exec()

The subject string is passed to pcre_exec() as a pointer in subject, a length in length, and a starting offset in startoffset. The units for length and startoffset are bytes for the 8-bit library, 16-bit data items for the 16-bit library, and 32-bit data items for the 32-bit library.

How pcre_exec() returns captured substrings

[...]

When a match is successful, information about captured substrings is returned in pairs of integers, starting at the beginning of ovector, and continuing up to two-thirds of its length at the most. The first element of each pair is set to the offset of the first character in a substring, and the second is set to the offset of the first character after the end of a substring. These values are always data unit offsets, even in UTF mode.
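From PHP, the same behaviour can be observed through PREG_OFFSET_CAPTURE and the offset parameter of preg_match. This is a small sketch with an invented subject, assuming the 8-bit library, so that one data unit is one byte.

```php
<?php
// 'héllo' encoded as UTF-8: 5 characters, 6 bytes ('é' takes 2 bytes).
$subject = "h\xC3\xA9llo";
$byteLength = strlen($subject);

// Even with the u flag, the reported offset is a byte offset:
// the first 'l' is the third character but sits at byte 3.
preg_match('/l/u', $subject, $first, PREG_OFFSET_CAPTURE);
$firstOffset = $first[0][1];

// The starting offset passed in is also in bytes: starting at byte 4
// skips the first 'l' and finds the second one.
preg_match('/l/u', $subject, $second, PREG_OFFSET_CAPTURE, 4);
$secondOffset = $second[0][1];

var_dump($byteLength, $firstOffset, $secondOffset); // 6, 3, 4
```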
