Ensuring valid UTF-8 in PHP

Published: 2017/8/16 19:36:59. Tags: php, encoding, utf-8


Question

I'm using PHP to handle text from a variety of sources. I don't anticipate it will be anything other than UTF-8, ISO-8859-1, or perhaps WINDOWS-1252. If it's anything other than one of those, I just need to make sure the text gets turned into a valid UTF-8 string, even if characters are lost. Does the //TRANSLIT option of iconv solve this? For example, would this code ensure that a string is safe to insert into a UTF-8 encoded document (or database)?

function make_safe_for_utf8_use($string) {

    $encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");

    if ($encoding != 'UTF-8') {
        return iconv($encoding, 'UTF-8//TRANSLIT', $string);
    } else {
        return $string;
    }
}

Answer

UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don't have to worry about losing any characters when you convert a string from any other encoding to UTF-8.

Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where practically any byte is valid (Windows-1252 leaves only five bytes, 0x81, 0x8D, 0x8F, 0x90 and 0x9D, unassigned). It is not technically possible to distinguish between them. I would choose Windows-1252 as your default match for non-UTF-8 sequences, as the only bytes that decode differently are in the range 0x80-0x9F. These decode to various characters like smart quotes and the Euro sign in Windows-1252, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.
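
To make the 0x80-0x9F difference concrete, here is a minimal sketch that decodes the same three bytes both ways. The sample bytes and variable names are chosen purely for illustration, and it assumes the local iconv build recognises the name 'CP1252' (the same name used in the code further down).

```php
<?php
// Three bytes from the 0x80-0x9F range, where the two encodings disagree.
// In Windows-1252 they are the euro sign and the left/right curly double quotes.
$bytes = "\x80\x93\x94";

$fromCp1252 = iconv('CP1252', 'UTF-8', $bytes);
$fromLatin1 = iconv('ISO-8859-1', 'UTF-8', $bytes);

// Windows-1252 reading: U+20AC, U+201C, U+201D (printable characters).
echo bin2hex($fromCp1252), "\n"; // e282ace2809ce2809d

// ISO-8859-1 reading: U+0080, U+0093, U+0094 (invisible C1 control characters).
echo bin2hex($fromLatin1), "\n"; // c280c293c294
```

Both results are valid UTF-8; only the Windows-1252 reading produces characters a reader would actually see.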

"Would this code ensure that a string is safe to insert into a UTF-8 encoded document?"

You would certainly want to set the optional ‘strict’ parameter to TRUE for this purpose. But I'm not sure this actually covers all invalid UTF-8 sequences. The function does not claim to check a byte sequence for UTF-8 validity explicitly. There have been known cases where mb_detect_encoding would guess UTF-8 incorrectly before, though I don't know if that can still happen in strict mode.
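
As a rough illustration of strict mode, the sketch below passes TRUE as the third argument of mb_detect_encoding and offers UTF-8 as the only candidate, so the call acts as a yes/no check. The sample strings and variable names are assumptions made for the example, and it shows only the common case; it does not settle whether every invalid sequence is caught.

```php
<?php
$valid   = "caf\xC3\xA9"; // "café" encoded as UTF-8
$invalid = "caf\xE9";     // "café" encoded as ISO-8859-1 / Windows-1252

// Third argument true = strict mode. With UTF-8 as the only candidate,
// the function returns the string 'UTF-8' or false.
$strictValid   = mb_detect_encoding($valid, 'UTF-8', true);
$strictInvalid = mb_detect_encoding($invalid, 'UTF-8', true);

var_dump($strictValid);   // string(5) "UTF-8"
var_dump($strictInvalid); // bool(false)
```

Compare the result with `=== 'UTF-8'` rather than `!=`, since a failed detection returns false rather than an encoding name.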

If you want to be sure, do it yourself using the W3C-recommended regex:

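function make_safe_for_utf8_use($string) {
    // The pattern is anchored at both ends, so the whole string must consist
    // of well-formed UTF-8 sequences for it to match.
    if (preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs', $string)) {
        // Already valid UTF-8: return unchanged.
        return $string;
    } else {
        // Not valid UTF-8: treat the bytes as Windows-1252 and convert.
        return iconv('CP1252', 'UTF-8', $string);
    }
}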
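
For comparison, here is a sketch of the same validate-then-convert idea with mbstring's mb_check_encoding swapped in for the regex. This is a different technique from the one in the answer, and it is slightly more permissive: it accepts ASCII control characters such as NUL, which the regex rejects. The function name to_valid_utf8 is hypothetical, and the ISO-8859-1 fallback is an added assumption for the case where iconv rejects a byte that Windows-1252 leaves unassigned.

```php
<?php
// Hypothetical helper name; not part of the original answer.
function to_valid_utf8(string $string): string
{
    // mb_check_encoding() reports whether the bytes are well-formed UTF-8.
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }

    // Otherwise assume Windows-1252, as the answer suggests.
    // @ silences the notice iconv raises when it cannot convert a byte.
    $converted = @iconv('CP1252', 'UTF-8', $string);

    if ($converted === false) {
        // Assumption: fall back to ISO-8859-1, in which every byte is
        // defined, so this conversion cannot fail.
        $converted = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
    }

    return $converted;
}

// Valid UTF-8 passes through untouched.
echo bin2hex(to_valid_utf8("caf\xC3\xA9")), "\n"; // 636166c3a9

// The Windows-1252 byte 0xE9 is converted to the UTF-8 encoding of U+00E9.
echo bin2hex(to_valid_utf8("caf\xE9")), "\n";     // 636166c3a9
```

Either way the result is well-formed UTF-8, so it can be inserted into a UTF-8 document or database column without producing invalid byte sequences.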
