PHP preg_replace() fails when a non-UTF-8 character is detected


Question

PHP regular expressions fail when a non-UTF-8 character is found!

I need to process 40,000 database records and extract a width and a height value from a custom_size MySQL table field.

The field comes in all sorts of inconsistent formats.

The most reliable approach is to grab the numeric values on the left and right side of an x and strip all non-numeric characters from them.

The code below works well 99% of the time, until it hits a few records with non-UTF-8 characters.

31*32 and 35"x21" are two examples.

When these are processed, I get the following PHP warnings and the script halts:

Warning: preg_replace(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 1683977065 on line 21

Warning: preg_match(): Compilation failed: this version of PCRE is not compiled with PCRE_UTF8 support at offset 0 on line 24

Demo:

<?php

$strings = array(

    '12x12',
    '172.61 cm x 28.46 cm',
    '31"x21"',
    '1"x1"',
    '31*32',
    '35"x21"'
);


foreach($strings as $string){

    if($string != ''){

        $string = str_replace('”', '"', $string); // normalize curly quotes to straight quotes

        // Strip out all characters except for numbers, letter x, and decimal points
        $string = preg_replace( '/([^0-9x\.])/ui', '', strtolower( $string ) );

        // Find anything that fits the number X number format
        preg_match( '/([0-9]+(\.[0-9]+)?)x([0-9]+(\.[0-9]+)?)/ui', $string, $values ); 

        echo 'Original value: ' .$string.'<br>';
        echo 'Width: ' .$values[1].'<br>';
        echo 'Height: ' .$values[3].'<br><hr><br>';         

    }

}

Any ideas on how to work around this? I cannot rebuild the server software to add UTF-8 support.

I just found an answer with a PHP library for converting to UTF-8 that seems to help a lot: https://stackoverflow.com/a/3521396/143030

Recommended answer

By default, the PCRE regex engine reads a string one byte at a time. It therefore ignores byte sequences that make up a single character in a multibyte encoding such as UTF-8 and treats them as separate bytes (one byte, one character).

For example, the character U+201D (RIGHT DOUBLE QUOTATION MARK) uses three bytes in UTF-8:

$a = '”'; // U+201D

for ($i=0; $i < strlen($a); $i++) {
    echo dechex(ord($a[$i])), ' ';
}

Result:

e2 80 9d
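
The same byte-at-a-time behaviour can be observed from the regex side. The following is a minimal sketch, not part of the original answer; it writes U+201D as explicit byte escapes so the result does not depend on the encoding of the source file, and counts how many times a plain dot matches when no u modifier is used.

```php
<?php
$a = "\xE2\x80\x9D"; // U+201D written as its three UTF-8 bytes

// Without the u modifier, the dot consumes one byte per match.
$count = preg_match_all('/./', $a, $matches);

echo $count, PHP_EOL; // 3: one match per byte, not one per character
```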

To enable multibyte reading in the PCRE regex engine, you can either put one of these directives at the beginning of the pattern: (*UTF), (*UTF8), (*UTF16), (*UTF32), or use the u modifier. The u modifier switches on the available multibyte mode, but it also extends the meaning of the shorthand character classes such as \s, \d, \w... to Unicode. In other words, the u modifier is a shortcut for (*UTFx) plus (*UCP), the directive that changes the character classes.

But these features are only available if the PCRE module has been compiled with support for these encodings. (This is the case for most default PHP installations, but it is neither universal nor mandatory.)

That does not seem to be the case for you, since using the u modifier produces this explicit message:

this version of PCRE is not compiled with PCRE_UTF8 support

There is nothing you can do about that, short of replacing your PHP installation with one whose PCRE module is compiled with UTF-8 support.
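
If it helps to confirm the situation on a given server before touching the patterns, a quick probe along these lines may be useful. This is only a sketch: pcre_supports_utf8() is a hypothetical helper name, not a PHP built-in, and it assumes that a pattern that fails to compile makes preg_match() return false with a warning, as in the messages quoted in the question.

```php
<?php
// Hypothetical helper (not a PHP built-in): reports whether the PCRE
// library behind the preg_* functions accepts the u modifier.
function pcre_supports_utf8()
{
    // If UTF-8 support is missing, compiling a /u pattern fails:
    // preg_match() returns false and raises a warning, silenced by @.
    // "\xC3\xA9" is the two-byte UTF-8 encoding of U+00E9.
    return @preg_match('/^.$/u', "\xC3\xA9") === 1;
}

echo 'PCRE version: ', PCRE_VERSION, PHP_EOL;
echo 'UTF-8 support: ', pcre_supports_utf8() ? 'yes' : 'no', PHP_EOL;
```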

However, this is not really a problem in your case, because in your patterns the u modifier is useless even if your input is UTF-8 encoded.

The reason is that your two patterns contain only ASCII literal characters (characters in the 00-7F range), and characters beyond the ASCII range never use bytes from that range in their UTF-8 encoding:

Unicode  char   UTF8    Name
--------------------------------------------------------
U+007D     }       7d   RIGHT CURLY BRACKET
U+007E     ~       7e   TILDE
U+007F             7f   <control>
U+0080          c2 80   <control>
U+0081          c2 81   <control>
...
U+00BE     ¾    c2 be   VULGAR FRACTION THREE QUARTERS
U+00BF     ¿    c2 bf   INVERTED QUESTION MARK
U+00C0     À    c3 80   LATIN CAPITAL LETTER A WITH GRAVE
U+00C1     Á    c3 81   LATIN CAPITAL LETTER A WITH ACUTE
...
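
The property shown in the table can be checked directly. The sketch below is an illustration rather than part of the original answer: has_ascii_byte() is a hypothetical helper, and the samples are written as byte escapes for U+00BE, U+00C0 and U+201D.

```php
<?php
// Hypothetical helper: true if any byte of $bytes is in the ASCII range 00-7F.
function has_ascii_byte($bytes)
{
    for ($i = 0; $i < strlen($bytes); $i++) {
        if (ord($bytes[$i]) < 0x80) {
            return true;
        }
    }
    return false;
}

// UTF-8 encodings of U+00BE, U+00C0 and U+201D.
$samples = array("\xC2\xBE", "\xC3\x80", "\xE2\x80\x9D");

foreach ($samples as $s) {
    echo bin2hex($s), ': ', has_ascii_byte($s) ? 'contains an ASCII byte' : 'no ASCII byte', PHP_EOL;
}
```

Because none of these sequences contains a byte below 80, a byte-mode class such as [^0-9x.] treats each of their bytes as "not a digit, x or dot" and never mistakes part of a multibyte character for an ASCII one.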

So you can write:

$string = preg_replace( '/[^0-9x.]+/', '', strtolower( $string ) );

(There is no need for the i modifier, since your string is already lowercase. There is no need to escape the dot inside a character class, nor to use a capture group. Adding the + quantifier speeds up the replacement, since a run of consecutive characters is removed in one replacement instead of one character at a time.)

and:

if (preg_match('/([0-9]+(?:\.[0-9]+)?)x([0-9]+(?:\.[0-9]+)?)/', $string, $values)) {
    echo 'Original value: ', $string, '<br>';
    echo 'Width: ', $values[1], '<br>';
    echo 'Height: ', $values[2], '<br><hr><br>';
}
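
Put together, the two statements above could be wrapped as follows. This is a sketch under assumptions: parse_dimensions() and its array return shape are hypothetical names chosen for illustration, and the third sample writes the curly quote U+201D as byte escapes so that the example does not depend on the file encoding.

```php
<?php
// Hypothetical helper combining the two patterns above (no u modifier).
function parse_dimensions($raw)
{
    // Keep only digits, the letter x and dots; everything else is dropped
    // byte by byte, including the bytes of any multibyte character.
    $clean = preg_replace('/[^0-9x.]+/', '', strtolower($raw));

    // Look for "number x number", each with an optional decimal part.
    if (preg_match('/([0-9]+(?:\.[0-9]+)?)x([0-9]+(?:\.[0-9]+)?)/', $clean, $m)) {
        return array('width' => $m[1], 'height' => $m[2]);
    }

    return null; // no recognizable size, e.g. '31*32' has no x
}

$samples = array('12x12', '172.61 cm x 28.46 cm', "35\xE2\x80\x9Dx21\xE2\x80\x9D", '31*32');

foreach ($samples as $sample) {
    $size = parse_dimensions($sample);
    echo $size === null ? 'no match' : $size['width'] . ' x ' . $size['height'], PHP_EOL;
}
```

Returning null for records without a match lets the caller log or skip them instead of reading undefined array offsets, which is what the if around preg_match() above also guards against.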

However, dropping the u modifier can be dangerous with some patterns. For example, the following does not remove the first character as expected when that character is encoded with several bytes; it removes only the character's first byte:

$a = preg_replace('/^./', '', '”abc');

for ($i=0; $i < strlen($a); $i++) {
    echo ' ', dechex(ord($a[$i]));
}

This returns:

 80 9d 61 62 63
# �  �  a  b  c
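
When the u modifier is unavailable, one possible workaround for this particular case is to describe a whole UTF-8 sequence with byte ranges. The sketch below is not from the original answer and assumes the input is valid UTF-8: it matches either one ASCII byte or a lead byte followed by all of its continuation bytes (80-BF).

```php
<?php
// One complete UTF-8 sequence, described byte by byte: either an ASCII
// byte, or a lead byte (C0-FF) followed by its continuation bytes (80-BF).
$firstChar = '/^(?:[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]*)/';

$input = "\xE2\x80\x9Dabc"; // U+201D followed by "abc"
$a = preg_replace($firstChar, '', $input);

echo bin2hex($a), PHP_EOL; // 616263, the bytes of "abc"
```

This does not validate the input; on malformed data it may remove more or fewer bytes than one character.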
