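Consequences of Inserting Positive Lookbehind into Arbitrary Regex to Simulate Byte Offset
Problem description
What would be the consequences of inserting a positive lookbehind for n bytes, (?<=\C{n}), at the beginning of an arbitrary regular expression, particularly when it is used for replacement operations?
At least within PHP, the regex match functions preg_match and preg_match_all allow matching to begin after a given byte offset. None of the other PCRE functions in PHP has a corresponding feature: you can limit the number of replacements done by preg_replace, for instance, but you cannot require that the matches being replaced occur after n bytes.
There would obviously be some (let's call them trivial) consequences for performance and readability, but would there be any non-trivial impacts, such as matches becoming non-matches (other than those that are not offset by n bytes) or replacements becoming malformed?
Some examples:
/some expression/ becomes /(?<=\C{4})some expression/ for a 4-byte offset
/(this) has (groups)/i becomes /(?<=\C{2})(this) has (groups)/i for a 2-byte offset
As far as I can tell, and from the limited tests that I've run, adding this lookbehind effectively simulates the offset parameter and doesn't interfere with any other lookbehinds, substitutions, or other control patterns; but I'm not an expert on regex.
I'm trying to determine whether there are any likely consequences to building replace/filter function extensions by inserting the n-byte lookbehind into patterns. It should operate just as the match functions' offset parameter does, so simply running the expression against substr( $subject, $offset ) won't work, for the same reasons it doesn't for preg_match (most notably, it cuts off any lookbehinds, and ^ then incorrectly matches the start of the substring rather than the start of the original string).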
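The kind of extension described in the question could be sketched roughly as follows. This is only an illustration under stated assumptions: preg_replace_from_offset is a hypothetical name (not a PHP built-in), the pattern is assumed to use the same character as opening and closing delimiter, to run in non-UTF mode, and to have no leading verb such as (*UTF). The sketch also wraps the original expression in a non-capturing group, which goes slightly beyond the examples in the question: without it, a top-level alternation such as a|b would apply the lookbehind to the first branch only.

```php
<?php
// Hypothetical helper (not part of PHP): preg_replace() restricted to
// matches that start at or after byte $offset of the original subject.
// Assumptions: same-character delimiters (e.g. "/"), non-UTF mode,
// and no leading (*VERB) in the pattern.
function preg_replace_from_offset(string $pattern, string $replacement, string $subject, int $offset): ?string
{
    $delimiter = $pattern[0];
    $end = strrpos($pattern, $delimiter);      // position of the closing delimiter
    $body = substr($pattern, 1, $end - 1);     // expression between the delimiters
    $modifiers = substr($pattern, $end + 1);   // trailing modifiers, e.g. "i"

    // The non-capturing group keeps every alternative behind the lookbehind
    // and does not change the numbering of the capture groups.
    $prefixed = $delimiter . '(?<=\C{' . $offset . '})(?:' . $body . ')' . $delimiter . $modifiers;

    return preg_replace($prefixed, $replacement, $subject);
}

// Only the second occurrence has at least 4 bytes in front of it.
echo preg_replace_from_offset('/some expression/', 'X', 'some expression and some expression', 4), "\n";
```

Because the lookbehind is evaluated against the original subject, ^ and any other lookbehinds still see the real start of the string, which is the property the substr() approach loses. As far as I know, PCRE limits a counted repeat to 65535, so a sketch like this cannot express larger offsets with a single \C{n}.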
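Short answer
In non-UTF mode, UTF-8 library
Assuming the PCRE library bundled with PHP is compiled as the 8-bit library (UTF-8), then in non-UTF mode \C is equivalent to [\x00-\xff] and (?s:.)
Any of them can be used in a look-behind as a replacement for the offset field of the preg_match and preg_match_all functions.
In non-UTF mode, each of them matches 1 data unit, which is 1 byte in the 8-bit (UTF-8) PCRE library, and each matches all 256 possible values.
In UTF mode, UTF-8 library
UTF mode can be activated by the u flag in the pattern passed to a preg_* function, or by specifying one of the (*UTF), (*UTF8), (*UTF16), (*UTF32) verbs at the beginning of the pattern.
In UTF mode, a character class [] and the dot metacharacter . match one code point that is within the valid range of Unicode characters and is not a surrogate. Since one code point can be encoded as 1 to 4 bytes in UTF-8, and because of the UTF-8 encoding scheme, a character class cannot be used to match a single byte with a value in the range 0x80 to 0xFF.
While \C is specifically designed to match one data unit (one byte in UTF-8) regardless of whether UTF mode is on, it is not supported inside a look-behind in UTF mode.
UTF-16 and UTF-32 library
I don't know whether anyone actually compiles the 16-bit or 32-bit PCRE library, includes it in a PHP build and makes it work. If anyone knows of such a build being widely used in the wild, please ping me. I have no clue how the string and the offset are passed from PHP to the C API of PCRE in that case, and the result of the preg_* functions may differ depending on that.
More details
At the C API level of the PCRE library, you can only work with data units, which are 8-bit units for the 8-bit library, 16-bit units for the 16-bit library and 32-bit units for the 32-bit library.
For the 8-bit library (UTF-8), 1 data unit is 8 bits, or 1 byte, so there is not much of a barrier to specifying an offset in bytes, whether as a function parameter or as a regex construct.
Regex constructs
In non-UTF mode, a character class [], the dot . and \C each match exactly 1 data unit.
\C matches 1 data unit in both UTF mode and non-UTF mode. It can't be used in a look-behind in UTF mode, though.
    MATCHING A SINGLE DATA UNIT
    Outside a character class, the escape sequence \C matches any one data unit, whether or not a UTF mode is set.
The dot . matches 1 data unit in non-UTF mode.
    General comments about UTF modes [...]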
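The difference between the two modes can be checked with a small experiment. The following is a minimal sketch, assuming a PHP build with the usual 8-bit PCRE library; the variable names are mine. It counts how many matches each construct produces on a 6-byte, 5-code-point string, and then tries \C inside a look-behind with and without the u flag.

```php
<?php
// "héllo" written with explicit UTF-8 bytes: 6 bytes, 5 code points.
$subject = "h\xC3\xA9llo";

// Non-UTF mode: each construct consumes exactly one byte per match.
$byteCounts = [];
foreach (['\C', '[\x00-\xff]', '(?s:.)'] as $construct) {
    $byteCounts[$construct] = preg_match_all('/' . $construct . '/', $subject);
}

// UTF mode (u flag): the dot consumes one code point per match.
$codePointCount = preg_match_all('/(?s:.)/u', $subject);

// \C inside a look-behind: accepted without the u flag, but the pattern
// fails to compile with it (the warning is suppressed with @ here).
$nonUtfLookbehind = preg_match('/(?<=\C{3})llo/', $subject);
$utfLookbehind = @preg_match('/(?<=\C{3})llo/u', $subject);

var_dump($byteCounts, $codePointCount, $nonUtfLookbehind, $utfLookbehind);
```

On such a build, the three non-UTF counts should equal the byte length of the string, the UTF-mode count should equal the number of code points, and the last call should return false rather than a match count.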
A character class matches 1 data unit in non-UTF mode. The documentation doesn't state this explicitly, but its wording implies it.
    SQUARE BRACKETS AND CHARACTER CLASSES [...]
    A character class matches a single character in the subject. In a UTF mode, the character may be more than one data unit long.
The same conclusion can be reached by looking at the upper limit of the \x{hh...} syntax, which specifies a character by hex code, in non-UTF mode. Through testing, the last clause about surrogates doesn't seem to apply to non-UTF mode.
    Characters that are specified using octal or hexadecimal numbers are limited to certain values, as follows:
      8-bit non-UTF mode    less than 0x100
      8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
      16-bit non-UTF mode   less than 0x10000
      16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
      32-bit non-UTF mode   less than 0x100000000
      32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
    Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called "surrogate" codepoints), and 0xffef.
Offset
All offsets supplied and returned are in numbers of data units:
    The string to be matched by pcre_exec()
    The subject string is passed to pcre_exec() as a pointer in subject, a length in length, and a starting offset in startoffset. The units for length and startoffset are bytes for the 8-bit library, 16-bit data items for the 16-bit library, and 32-bit data items for the 32-bit library.
    How pcre_exec() returns captured substrings [...]
    When a match is successful, information about captured substrings is returned in pairs of integers, starting at the beginning of ovector, and continuing up to two-thirds of its length at the most. The first element of each pair is set to the offset of the first character in a substring, and the second is set to the offset of the first character after the end of a substring. These values are always data unit offsets, even in UTF mode.
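From the PHP side, the byte-based offsets can be observed directly. This is a small sketch rather than a definitive test, again assuming the 8-bit PCRE library. It reads the offset reported by PREG_OFFSET_CAPTURE in UTF mode, passes a byte offset to preg_match, and contrasts the offset parameter with substr() for a pattern anchored with ^.

```php
<?php
// "héllo" in UTF-8: the "é" occupies bytes 1 and 2, so "llo" starts at byte 3.
$subject = "h\xC3\xA9llo";

// Offsets reported by PREG_OFFSET_CAPTURE are byte offsets, even with the u flag.
preg_match('/llo/u', $subject, $captured, PREG_OFFSET_CAPTURE);
$reportedOffset = $captured[0][1];

// The offset argument is counted in bytes as well: byte 3 skips "hé".
preg_match('/\w+/u', $subject, $fromByteThree, 0, 3);
$matchedText = $fromByteThree[0];

// With the offset argument, ^ still refers to the start of the whole subject;
// with substr() it refers to the start of the cut-down string.
$withOffset = preg_match('/^llo/', $subject, $unused, 0, 3);
$withSubstr = preg_match('/^llo/', substr($subject, 3));

var_dump($reportedOffset, $matchedText, $withOffset, $withSubstr);
```

The last two calls show why running the expression against substr( $subject, $offset ) is not equivalent to a real offset: only the substr() variant lets ^ match at the cut point.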