Ensuring valid UTF-8 in PHP


Question

I'm using PHP to handle text from a variety of sources. I don't anticipate it will be anything other than UTF-8, ISO-8859-1, or perhaps WINDOWS-1252. If it's anything other than one of those, I just need to make sure the text gets turned into a valid UTF-8 string, even if characters are lost. Does the //TRANSLIT option of iconv solve this? For example, would this code ensure that a string is safe to insert into a UTF-8 encoded document (or database)?

function make_safe_for_utf8_use($string) {

    $encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");

    if ($encoding != 'UTF-8') {
        return iconv($encoding, 'UTF-8//TRANSLIT', $string);
    } else {
        return $string;
    }
}

Answer

UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don't have to worry about losing any characters when you convert a string from any other encoding to UTF-8.

Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where any byte is valid. It is not technically possible to distinguish between them. I would choose Windows-1252 as your default match for non-UTF-8 sequences, as the only bytes that decode differently are in the range 0x80-0x9F. These decode to various characters like smart quotes and the euro sign in Windows-1252, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.
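
To make the 0x80-0x9F difference concrete, here is a minimal sketch that decodes the same bytes under both encodings. It assumes the iconv extension is available; the variable names are only illustrative.

```php
<?php
// Byte 0x80 falls in the range where the two encodings disagree.
$byte = "\x80";

$fromCp1252 = iconv('CP1252', 'UTF-8', $byte);     // U+20AC EURO SIGN
$fromLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte); // U+0080, a C1 control character

echo bin2hex($fromCp1252), "\n"; // e282ac
echo bin2hex($fromLatin1), "\n"; // c280

// Outside 0x80-0x9F the two encodings agree, e.g. 0xE9 is "é" in both.
$eCp1252 = iconv('CP1252', 'UTF-8', "\xE9");
$eLatin1 = iconv('ISO-8859-1', 'UTF-8', "\xE9");

echo bin2hex($eCp1252), "\n"; // c3a9
echo bin2hex($eLatin1), "\n"; // c3a9
```

Either way the result is valid UTF-8. The choice of source encoding only decides which characters those bytes turn into.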

"Would this code ensure that a string is safe to insert into a UTF-8 encoded document?"

You would certainly want to set the optional 'strict' parameter to TRUE for this purpose. But I'm not sure this actually covers all invalid UTF-8 sequences. The function does not explicitly claim to check a byte sequence for UTF-8 validity. There have been known cases in the past where mb_detect_encoding guessed UTF-8 incorrectly, though I don't know if that can still happen in strict mode.
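
As a rough illustration, the sketch below passes TRUE as the third (strict) argument of mb_detect_encoding. It also shows mb_check_encoding, a different mbstring function that is not mentioned in the question but is documented as a validity check rather than a guess. The helper name is_valid_utf8 is hypothetical, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper: true only if $s is well-formed UTF-8.
// mb_check_encoding validates the bytes instead of guessing an encoding.
function is_valid_utf8(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true):  "é" as a proper 2-byte sequence
var_dump(is_valid_utf8("caf\xE9"));     // bool(false): lone single-byte "é"
var_dump(is_valid_utf8("\xC0\xAF"));    // bool(false): overlong encoding of "/"

// Strict detection: the third argument is the 'strict' flag discussed above.
$guess = mb_detect_encoding("caf\xE9", ['UTF-8', 'Windows-1252'], true);
var_dump($guess !== 'UTF-8');           // bool(true): the input is not valid UTF-8
```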

If you want to be sure, do it yourself using the W3-recommended regex:

if (preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$%xs', $string))
    return $string;
else
    return iconv('CP1252', 'UTF-8', $string);
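
The fragment above can be wrapped into a self-contained function. The following is only a sketch under a few assumptions, not the answer's exact method. The name to_utf8 is made up. The validity test is swapped for preg_match('//u', ...), which fails on malformed UTF-8 but, unlike the W3 pattern, does not reject ASCII control characters. An ISO-8859-1 fallback is added in case iconv refuses a byte that its CP1252 table leaves unassigned.

```php
<?php
// Hypothetical helper: return $s unchanged if it is already valid UTF-8,
// otherwise treat it as Windows-1252 and convert it.
function to_utf8(string $s): string
{
    // With the /u modifier, PCRE validates the whole subject as UTF-8
    // and preg_match returns false when it is malformed.
    if (preg_match('//u', $s) === 1) {
        return $s;
    }

    // Assume Windows-1252 for everything that is not valid UTF-8.
    $converted = @iconv('CP1252', 'UTF-8', $s);
    if ($converted !== false) {
        return $converted;
    }

    // Assumed fallback: ISO-8859-1 maps all 256 byte values, so this
    // conversion cannot fail on any input.
    return iconv('ISO-8859-1', 'UTF-8', $s);
}

echo bin2hex(to_utf8("caf\xC3\xA9")), "\n"; // 636166c3a9 (already UTF-8, unchanged)
echo bin2hex(to_utf8("caf\xE9")), "\n";     // 636166c3a9 (converted from Windows-1252)
echo bin2hex(to_utf8("\x93hi\x94")), "\n";  // e2809c6869e2809d (smart quotes)
```

Whichever branch runs, the return value is valid UTF-8, which is the guarantee the question asks for. Whether the converted characters are the ones the author intended still depends on the Windows-1252 guess being right.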
