PHP Regex delimiter

Question

For a long time, whenever I've needed a regular expression, I've standardized on the copyright symbol © as the delimiter. It isn't on the keyboard, so I could be sure it would never appear inside the expression itself, unlike ! @ # \ or / (which are sometimes all in use within a single regex).

Code:

$result=preg_match('©<.*?>©', '<something string>');

However, today I needed a regular expression with accented characters, which included this:

Code:

[a-zA-ZàáâäãåąćęèéêëìíîïłńòóôöõøùúûüÿýżźñçčšžÀÁÂÄÃÅĄĆĘÈÉÊËÌÍÎÏŁŃÒÓÔÖÕØÙÚÛÜŸÝŻŹÑßÇŒÆČŠŽ∂ð \,\.\'-]+

After including this new regex in the PHP file in my IDE (Eclipse PDT), I was prompted to save the PHP file as UTF-8 instead of the default cp1252.

After I saved and ran the PHP file, every regex used in a preg_match() or preg_replace() call generated a generic PHP warning (Warning: preg_match in file.php on line x) and the regex was not processed.

So, two questions:

1) Is there another symbol that would make a good delimiter and isn't typically found on a keyboard (`~!@#$%^&*()+=[]{};\':",./<>?|\), so that I can standardize on it and not have to check each and every regex to see whether that symbol is actually used somewhere in the expression?

2) Or, is there a way I can use the copyright symbol as the standard delimiter when the file format is UTF-8?

Recommended answer

One thing that needs correcting is that if your regular expression and/or input data is encoded in UTF-8 (which in this case it is, since it comes straight from inside a UTF-8 encoded file), you must use the u modifier for your regular expression.
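To make the effect of the u modifier concrete, here is a minimal sketch. The pattern `^.$` and the sample character are illustrative choices, not taken from the question; the character is written as byte escapes so the result does not depend on how the file is saved.

```php
<?php
// "é" (U+00E9) as UTF-8 byte escapes: two bytes, 0xC3 0xA9.
$subject = "\xC3\xA9";

// Without the u modifier, "." matches a single byte, so a two-byte
// character cannot satisfy ^.$ and preg_match() returns 0.
$withoutU = preg_match('/^.$/', $subject);

// With the u modifier, pattern and subject are treated as UTF-8 and "."
// matches the whole character, so preg_match() returns 1.
$withU = preg_match('/^.$/u', $subject);

var_dump($withoutU, $withU);
```

The same difference applies to character classes: without u, each accented letter inside [...] is treated as its separate bytes rather than as one character.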

Another issue is that the copyright character should not be used as a delimiter in UTF-8, because the PCRE functions assume that the first byte of your pattern encodes your delimiter (this could plausibly be called a bug in PHP).

When you attempt to use the copyright sign as a delimiter in UTF-8, what actually gets saved into the file is the byte sequence 0xC2 0xA9. preg_match looks at the first byte, 0xC2, and decides that it is an alphanumeric character, because in your current locale that byte corresponds to the character Â, Latin capital letter A with circumflex (see an extended ASCII table). Therefore a warning is generated and processing is immediately aborted.
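The byte sequence can be checked directly. The following is a small sketch rather than a reproduction of the original file: the copyright sign is spelled as byte escapes, and the `@` operator is used only to silence the warning so that the return value can be inspected.

```php
<?php
// U+00A9 (copyright sign) in UTF-8, spelled as byte escapes.
$copyright = "\xC2\xA9";

printf("bytes: %d\n", strlen($copyright));           // length in bytes
printf("hex: %s\n", bin2hex($copyright));            // the raw byte sequence
printf("first byte: 0x%02X\n", ord($copyright[0]));  // what PHP takes as the delimiter

// The pattern from the question as it exists in a UTF-8 file.
// preg_match() returns false when the pattern cannot be used.
$result = @preg_match($copyright . '<.*?>' . $copyright, '<something string>');
var_dump($result === false);
```

In cp1252 the same symbol is the single byte 0xA9, which is why the pattern worked before the file was converted.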

Given these facts, the ideal solution would be to choose an unusual delimiter from inside the ASCII character set, because that character would encode to the same byte sequence both in single-byte encodings and in UTF-8.

I would not consider printable ASCII characters unusual enough for this purpose, so a good choice would be one of the control characters (ASCII codes 1 to 31), avoiding the whitespace ones (codes 9 to 13), which PHP does not accept as delimiters. For example, STX (\x02) would fit the bill.

Together with the u regex modifier, this means you should write the regex like this:

$result = preg_match("\x02<.*?>\x02u", '<something string>');
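If the delimiter is to be standardized across a codebase, one option is to wrap it in a small helper. The sketch below is only an illustration: `stx_pattern` is a hypothetical function name, not part of PHP, and the shortened character class and sample name are assumptions standing in for the full class from the question.

```php
<?php
// Hypothetical helper (not part of PHP): wraps a regex body in the STX
// control character and appends the requested modifiers (u by default).
function stx_pattern(string $body, string $modifiers = 'u'): string
{
    $delimiter = "\x02";
    // The body must not itself contain the delimiter byte.
    if (strpos($body, $delimiter) !== false) {
        throw new InvalidArgumentException('Pattern body contains the delimiter byte');
    }
    return $delimiter . $body . $delimiter . $modifiers;
}

// The pattern from the question.
$tagFound = preg_match(stx_pattern('<.*?>'), '<something string>', $matches);

// A shortened accented character class, with the accented letters written
// as UTF-8 byte escapes: "\xC3\xA0" is à and "\xC3\xA9" is é.
$namePattern = stx_pattern("^[a-zA-Z\xC3\xA0\xC3\xA9 ,.'-]+\$");
$isName = preg_match($namePattern, "Ren\xC3\xA9 D'Arcy");

var_dump($tagFound, $matches[0], $isName);
```

Checking the body for the delimiter byte keeps the original goal intact: no regex ever has to be inspected by hand for a clash with the delimiter.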
