PHP cp1252 / windows-1252转换为UTF-8 [英] PHP cp1252/windows-1252 conversion to UTF-8

查看:2282
本文介绍了PHP cp1252 / windows-1252转换为UTF-8的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试将我们的数据库从latin1转换为UTF-8。不幸的是,我不能做一个大规模的单一切换,因为应用程序需要保持在线,我们有700GB的数据库转换。

I'm in the process of trying to convert our database from latin1 to UTF-8. Unfortunately I can't do a massive single switchover as the application needs to stay online and we have 700GB of database to convert.

所以我试图利用一点mysql hack的转换表到UTF-8但不是数据。我想要实时读取,转换和替换数据。 (如果你愿意,可以进行JIT转换)

So I'm trying to leverage a little mysql hack of converting tables to UTF-8 however not the data. I'd like the data to be read, converted, and replaced in real time. (A JIT conversion if you will)

我们的php应用程序目前使用所有的默认值,所以它使用latin1字符集连接到mysql,它丢弃UTF-8数据编码在latin1。当您使用latin1查看数据时,UTF-8字符按预期显示。当你使用UTF-8查看数据时,事情会混乱。

Our php app currently uses all of the defaults so it's connecting to mysql using the latin1 character set and it drops UTF-8 data encoded in latin1. When you view the data with latin1 the UTF-8 characters show up as expected. When you view the data with UTF-8 things get jumbled up.

因此,我建议强制将mysql字符集设置为UTF-8,然后进行时间转换如果必要,数据。现在,看到cp1252 / windows-1252是UTF-8的一个子集,它不是那么简单(我看到)来检测cp1252 / windows-1252编码。

So I propose forcing the mysql character set to UTF-8 and then doing a just in time conversion of the data if necessary. Now, seeing as cp1252/windows-1252 is a subset of UTF-8 it's not so straight forward (as far as I can see) to detect the cp1252/windows-1252 encoding.

我写了下面的代码,试图检测cp1252 / windows-1252编码,并根据需要进行转换。它还应该检测正确编码的UTF-8字符,不做任何操作。

I've written the following code that attempts to detect cp1252/windows-1252 encoding and convert as necessary. It should also detect properly encoded UTF-8 characters and do nothing.

$a = 'Card☃'; //cp1252 encoded
$a_test = '☃'.$a; //add known UTF8 character
$c = mb_convert_encoding($a_test, 'cp1252', 'UTF-8');
// attempt to detect known utf8 character after conversion
if (mb_strpos($c, '☃') === false) {
    // not found, original string was not cp1252 encoded, so print
    var_dump($a);
} else {
    // found, original string was cp1252 encoded, remove test character and print
    // This case runs
    $c = mb_strcut($c, 1);
    var_dump($c);
}

$a = 'COD☃'; //proper UTF8 encoded
$a_test = '☃'.$a; //add known UTF8 character
$c = mb_convert_encoding($a_test, 'cp1252', 'UTF-8');
// attempt to detect known utf8 character after conversion
if (mb_strpos($c, '☃') === false) {
    // not found, original string was not cp1252 encoded, so print
    // This case runs
    var_dump($a);
} else {
    // found, original string was cp1252 encoded, remove test character and print
    $c = mb_strcut($c, 1);
    var_dump($c);
}

运行此代码的输出是:

string 'Card☃' (length=7)
string 'COD☃' (length=6)

我理解,在从数据库中传出的所有字符串上运行此操作将会产生性能影响,但尚待衡量,但如果我可以执行JIT

I understand that running this on all strings coming out of the database will have a performance impact, yet to be measured, but if I can do a JIT conversion before switching everything completely it's worth it to me.

有没有人对如何优化这个有任何指针?

Does anyone have any pointers on how to optimize this?

推荐答案

首先,Windows-1252是 UTF-8的子集。你可以认为ASCII是UTF-8的一个子集,但通常更多是一个意识形态的辩论。

Firstly, Windows-1252 is not a subset of UTF-8. You could argue that ASCII is a subset of UTF-8, but that is usually more of an ideological debate.

其次,不可能处理带有CP1252和UTF-8字符(实际上对于CP1252它是一个字节,对于Unicode它是一个代码点)。或者你试图读它作为CP1252,并看到所有的Unicode字符作为单字节,或者你读它作为UTF-8,它删除任何无效的字节序列(或创建随机字符,如果CP1252字符匹配一个Unicode代码点) 。您使用 $ c = mb_strcut($ c,1); 删除测试字符,您要删除由mb_convert_encoding创建的问号,因为它不能将该Unicode字符转换为CP1252字符。

Secondly, it is impossible to handle strings with both CP1252 and UTF-8 "characters" in them (really for CP1252 it's a byte and for Unicode it's a code point). Either you try to read it as CP1252, and see all the Unicode characters as single bytes, or you read it as UTF-8 and it cuts out any invalid byte sequences (or creates random characters if the CP1252 characters match a Unicode code point). You are not removing the test character with $c = mb_strcut($c, 1);, you are removing a question mark created by mb_convert_encoding because it could not convert that Unicode character into a CP1252 character.

第三,您应该从不转换一个字符串,然后事后尝试确定编码。在您转换第二个测试字符串后,它?COD?。没有理由检查是否存在Unicode字符,因为将其转换为CP1252。不能有Unicode字符。作为程序员,你必须知道输出是什么。

Thirdly, you should never convert a String, and then after the fact try to determine the encoding. After you converted your second test string, it was ?COD?. There is no reason to check if a Unicode character exists in it, because you converted it to CP1252. There can't be Unicode characters in it. As the programmer, you have to know what the output is.

唯一的解决方案是检查字符串是否为CP1252,将有问题的字符转换为占位符,然后转换

The only solution is to check if the string is CP1252, convert the offending characters to placeholders, and then convert that string to Unicode:

function convert_cp1252_to_utf8($input, $default = '', $replace = array()) {
    if ($input === null || $input == '') {
        return $default;
    }

    // https://en.wikipedia.org/wiki/UTF-8
    // https://en.wikipedia.org/wiki/ISO/IEC_8859-1
    // https://en.wikipedia.org/wiki/Windows-1252
    // http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
    $encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
    if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
        /*
         * Use the search/replace arrays if a character needs to be replaced with
         * something other than its Unicode equivalent.
         */ 

        /*$replace = array(
            128 => "€",      // http://www.fileformat.info/info/unicode/char/20AC/index.htm EURO SIGN
            129 => "",              // UNDEFINED
            130 => "‚",      // http://www.fileformat.info/info/unicode/char/201A/index.htm SINGLE LOW-9 QUOTATION MARK
            131 => "ƒ",      // http://www.fileformat.info/info/unicode/char/0192/index.htm LATIN SMALL LETTER F WITH HOOK
            132 => "„",      // http://www.fileformat.info/info/unicode/char/201e/index.htm DOUBLE LOW-9 QUOTATION MARK
            133 => "…",      // http://www.fileformat.info/info/unicode/char/2026/index.htm HORIZONTAL ELLIPSIS
            134 => "†",      // http://www.fileformat.info/info/unicode/char/2020/index.htm DAGGER
            135 => "‡",      // http://www.fileformat.info/info/unicode/char/2021/index.htm DOUBLE DAGGER
            136 => "ˆ",      // http://www.fileformat.info/info/unicode/char/02c6/index.htm MODIFIER LETTER CIRCUMFLEX ACCENT
            137 => "‰",      // http://www.fileformat.info/info/unicode/char/2030/index.htm PER MILLE SIGN
            138 => "Š",      // http://www.fileformat.info/info/unicode/char/0160/index.htm LATIN CAPITAL LETTER S WITH CARON
            139 => "‹",      // http://www.fileformat.info/info/unicode/char/2039/index.htm SINGLE LEFT-POINTING ANGLE QUOTATION MARK
            140 => "Œ",      // http://www.fileformat.info/info/unicode/char/0152/index.htm LATIN CAPITAL LIGATURE OE
            141 => "",              // UNDEFINED
            142 => "Ž",      // http://www.fileformat.info/info/unicode/char/017d/index.htm LATIN CAPITAL LETTER Z WITH CARON 
            143 => "",              // UNDEFINED
            144 => "",              // UNDEFINED
            145 => "‘",      // http://www.fileformat.info/info/unicode/char/2018/index.htm LEFT SINGLE QUOTATION MARK 
            146 => "’",      // http://www.fileformat.info/info/unicode/char/2019/index.htm RIGHT SINGLE QUOTATION MARK
            147 => "“",      // http://www.fileformat.info/info/unicode/char/201c/index.htm LEFT DOUBLE QUOTATION MARK
            148 => "”",      // http://www.fileformat.info/info/unicode/char/201d/index.htm RIGHT DOUBLE QUOTATION MARK
            149 => "•",      // http://www.fileformat.info/info/unicode/char/2022/index.htm BULLET
            150 => "–",      // http://www.fileformat.info/info/unicode/char/2013/index.htm EN DASH
            151 => "—",      // http://www.fileformat.info/info/unicode/char/2014/index.htm EM DASH
            152 => "˜",      // http://www.fileformat.info/info/unicode/char/02DC/index.htm SMALL TILDE
            153 => "™",      // http://www.fileformat.info/info/unicode/char/2122/index.htm TRADE MARK SIGN
            154 => "š",      // http://www.fileformat.info/info/unicode/char/0161/index.htm LATIN SMALL LETTER S WITH CARON
            155 => "›",      // http://www.fileformat.info/info/unicode/char/203A/index.htm SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
            156 => "œ",      // http://www.fileformat.info/info/unicode/char/0153/index.htm LATIN SMALL LIGATURE OE
            157 => "",              // UNDEFINED
            158 => "ž",      // http://www.fileformat.info/info/unicode/char/017E/index.htm LATIN SMALL LETTER Z WITH CARON
            159 => "Ÿ",      // http://www.fileformat.info/info/unicode/char/0178/index.htm LATIN CAPITAL LETTER Y WITH DIAERESIS
        );*/

        if (count($replace) != 0) {
            $find = array();
            foreach (array_keys($replace) as $key) {
                $find[] = chr($key);
            }
            $input = str_replace($find, array_values($replace), $input);
        }
        /*
         * Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
         * and control characters, always convert from Windows-1252 to UTF-8.
         */
        $input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
        if (count($replace) != 0) {
            $input = html_entity_decode($input);
        }
    }
    return $input;
}


$ b < ISO-8859-1 和 CP1252 ,因为它们是如此相似。我发现这是艰难的方式后,几个小时玩这个功能,只有有这个答案救我。

The trick is that you have to check for both ISO-8859-1 and CP1252 because they are so similar. I found this out the hard way after hours of playing around with this function, only to have this answer save me. If you found this function helpful, go +1 that answer.

基本上,这个函数用代表Unicode字符的HTML实体替换所有那些坏的CP1252字节。然后我们将字符串从 ISO-8859-1 / CP1252 转换为 UTF- / code>,而我们的新的Unicode字符都不会被压缩,因为它们是简单的ASCII字符。最后,我们解码HTML实体,最后有一个100%的Unicode字符串。

Basically, this function replaces all those bad CP1252 bytes with HTML entities representing the Unicode characters. We then convert the string from ISO-8859-1/CP1252 to UTF-8, while none of our new Unicode characters are mangled because they are simple ASCII characters. Finally, we decode the HTML entities and finally have a 100% Unicode string.

这篇关于PHP cp1252 / windows-1252转换为UTF-8的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆