PHP cp1252/windows-1252 conversion to UTF-8


Problem description


I'm in the process of trying to convert our database from latin1 to UTF-8. Unfortunately I can't do a massive single switchover as the application needs to stay online and we have 700GB of database to convert.

So I'm trying to leverage a little MySQL hack: converting the tables to UTF-8 but not the data. I'd like the data to be read, converted, and replaced in real time (a JIT conversion, if you will).

Our PHP app currently uses all of the defaults, so it connects to MySQL using the latin1 character set and stores UTF-8 data as if it were latin1. When you view the data with latin1, the UTF-8 characters show up as expected. When you view the data with UTF-8, things get jumbled up.

So I propose forcing the MySQL character set to UTF-8 and then doing a just-in-time conversion of the data if necessary. Now, seeing as cp1252/windows-1252 is a subset of UTF-8, it's not so straightforward (as far as I can see) to detect the cp1252/windows-1252 encoding.

I've written the following code that attempts to detect cp1252/windows-1252 encoding and convert as necessary. It should also detect properly encoded UTF-8 characters and do nothing.

$a = 'Cardâ˜ƒ'; //cp1252 encoded
$a_test = '☃'.$a; //add known UTF8 character
$c = mb_convert_encoding($a_test, 'cp1252', 'UTF-8');
// attempt to detect known utf8 character after conversion
if (mb_strpos($c, '☃') === false) {
    // not found, original string was not cp1252 encoded, so print
    var_dump($a);
} else {
    // found, original string was cp1252 encoded, remove test character and print
    // This case runs
    $c = mb_strcut($c, 1);
    var_dump($c);
}

$a = 'COD☃'; //proper UTF8 encoded
$a_test = '☃'.$a; //add known UTF8 character
$c = mb_convert_encoding($a_test, 'cp1252', 'UTF-8');
// attempt to detect known utf8 character after conversion
if (mb_strpos($c, '☃') === false) {
    // not found, original string was not cp1252 encoded, so print
    // This case runs
    var_dump($a);
} else {
    // found, original string was cp1252 encoded, remove test character and print
    $c = mb_strcut($c, 1);
    var_dump($c);
}

The output of running this code is:

string 'Card☃' (length=7)
string 'COD☃' (length=6)

I understand that running this on all strings coming out of the database will have a performance impact, yet to be measured, but if I can do a JIT conversion before switching everything completely it's worth it to me.

Does anyone have any pointers on how to optimize this?

Solution

Firstly, Windows-1252 is not a subset of UTF-8. You could argue that ASCII is a subset of UTF-8, but that is usually more of an ideological debate.
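To make the first point concrete, here is a minimal sketch (the sample strings and variable names are illustrative, not from the question). It uses mb_check_encoding to show that a Windows-1252 string containing a byte above 0x7F is generally not valid UTF-8, while pure ASCII is valid in both:

```php
<?php
// "é" is the single byte 0xE9 in Windows-1252 but the two bytes 0xC3 0xA9 in UTF-8.
$cp1252 = "caf\xE9";
$utf8   = "caf\xC3\xA9";

var_dump(mb_check_encoding($cp1252, 'UTF-8')); // bool(false): a lone 0xE9 is not valid UTF-8
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding('cafe', 'UTF-8'));  // bool(true): ASCII bytes are valid UTF-8

// Converting the Windows-1252 bytes produces the UTF-8 form.
var_dump(mb_convert_encoding($cp1252, 'UTF-8', 'Windows-1252') === $utf8); // bool(true)
```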

Secondly, it is impossible to handle strings with both CP1252 and UTF-8 "characters" in them (really, for CP1252 it's a byte and for Unicode it's a code point). Either you read the string as CP1252 and see all the Unicode characters as single bytes, or you read it as UTF-8 and it cuts out any invalid byte sequences (or creates random characters if the CP1252 bytes happen to form a valid UTF-8 sequence). You are not removing the test character with $c = mb_strcut($c, 1); you are removing a question mark created by mb_convert_encoding because it could not convert that Unicode character into a CP1252 character.

Thirdly, you should never convert a string and then, after the fact, try to determine the encoding. After you converted your second test string, it was ?COD?. There is no reason to check if a Unicode character exists in it, because you converted it to CP1252. There can't be Unicode characters in it. As the programmer, you have to know what the output is.
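The behaviour described in the last two points can be reproduced with a short sketch. The strings are written with escape sequences so the result does not depend on the encoding of the source file, and the substitute character is set explicitly to "?" (the mbstring default) as an assumption about the environment:

```php
<?php
mb_substitute_character(0x3F); // make the "?" substitute explicit

// U+2603 SNOWMAN has no Windows-1252 equivalent, so each one is replaced.
$converted = mb_convert_encoding("\u{2603}COD\u{2603}", 'Windows-1252', 'UTF-8');
var_dump($converted); // string(5) "?COD?"

// The mojibake "â˜ƒ" (U+00E2, U+02DC, U+0192) maps back to the bytes E2 98 83,
// which happen to be the UTF-8 encoding of the snowman.
$mojibake = "Card\u{00E2}\u{02DC}\u{0192}";
$bytes = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');
var_dump($bytes === "Card\u{2603}"); // bool(true)
```

This is why the first test in the question appeared to work: the snowman found by mb_strpos came from the converted mojibake, and the byte removed by mb_strcut was the "?" that replaced the prepended test character.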

The only solution is to check if the string is CP1252, convert the offending characters to placeholders, and then convert that string to Unicode:

function convert_cp1252_to_utf8($input, $default = '', $replace = array()) {
    if ($input === null || $input == '') {
        return $default;
    }

    // https://en.wikipedia.org/wiki/UTF-8
    // https://en.wikipedia.org/wiki/ISO/IEC_8859-1
    // https://en.wikipedia.org/wiki/Windows-1252
    // http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
    $encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
    if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
        /*
         * Use the search/replace arrays if a character needs to be replaced with
         * something other than its Unicode equivalent.
         */ 

        /*$replace = array(
            128 => "&#8364;", // http://www.fileformat.info/info/unicode/char/20AC/index.htm EURO SIGN
            129 => "",        // UNDEFINED
            130 => "&#8218;", // http://www.fileformat.info/info/unicode/char/201A/index.htm SINGLE LOW-9 QUOTATION MARK
            131 => "&#402;",  // http://www.fileformat.info/info/unicode/char/0192/index.htm LATIN SMALL LETTER F WITH HOOK
            132 => "&#8222;", // http://www.fileformat.info/info/unicode/char/201e/index.htm DOUBLE LOW-9 QUOTATION MARK
            133 => "&#8230;", // http://www.fileformat.info/info/unicode/char/2026/index.htm HORIZONTAL ELLIPSIS
            134 => "&#8224;", // http://www.fileformat.info/info/unicode/char/2020/index.htm DAGGER
            135 => "&#8225;", // http://www.fileformat.info/info/unicode/char/2021/index.htm DOUBLE DAGGER
            136 => "&#710;",  // http://www.fileformat.info/info/unicode/char/02c6/index.htm MODIFIER LETTER CIRCUMFLEX ACCENT
            137 => "&#8240;", // http://www.fileformat.info/info/unicode/char/2030/index.htm PER MILLE SIGN
            138 => "&#352;",  // http://www.fileformat.info/info/unicode/char/0160/index.htm LATIN CAPITAL LETTER S WITH CARON
            139 => "&#8249;", // http://www.fileformat.info/info/unicode/char/2039/index.htm SINGLE LEFT-POINTING ANGLE QUOTATION MARK
            140 => "&#338;",  // http://www.fileformat.info/info/unicode/char/0152/index.htm LATIN CAPITAL LIGATURE OE
            141 => "",        // UNDEFINED
            142 => "&#381;",  // http://www.fileformat.info/info/unicode/char/017d/index.htm LATIN CAPITAL LETTER Z WITH CARON
            143 => "",        // UNDEFINED
            144 => "",        // UNDEFINED
            145 => "&#8216;", // http://www.fileformat.info/info/unicode/char/2018/index.htm LEFT SINGLE QUOTATION MARK
            146 => "&#8217;", // http://www.fileformat.info/info/unicode/char/2019/index.htm RIGHT SINGLE QUOTATION MARK
            147 => "&#8220;", // http://www.fileformat.info/info/unicode/char/201c/index.htm LEFT DOUBLE QUOTATION MARK
            148 => "&#8221;", // http://www.fileformat.info/info/unicode/char/201d/index.htm RIGHT DOUBLE QUOTATION MARK
            149 => "&#8226;", // http://www.fileformat.info/info/unicode/char/2022/index.htm BULLET
            150 => "&#8211;", // http://www.fileformat.info/info/unicode/char/2013/index.htm EN DASH
            151 => "&#8212;", // http://www.fileformat.info/info/unicode/char/2014/index.htm EM DASH
            152 => "&#732;",  // http://www.fileformat.info/info/unicode/char/02DC/index.htm SMALL TILDE
            153 => "&#8482;", // http://www.fileformat.info/info/unicode/char/2122/index.htm TRADE MARK SIGN
            154 => "&#353;",  // http://www.fileformat.info/info/unicode/char/0161/index.htm LATIN SMALL LETTER S WITH CARON
            155 => "&#8250;", // http://www.fileformat.info/info/unicode/char/203A/index.htm SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
            156 => "&#339;",  // http://www.fileformat.info/info/unicode/char/0153/index.htm LATIN SMALL LIGATURE OE
            157 => "",        // UNDEFINED
            158 => "&#382;",  // http://www.fileformat.info/info/unicode/char/017E/index.htm LATIN SMALL LETTER Z WITH CARON
            159 => "&#376;",  // http://www.fileformat.info/info/unicode/char/0178/index.htm LATIN CAPITAL LETTER Y WITH DIAERESIS
        );*/

        if (count($replace) != 0) {
            $find = array();
            foreach (array_keys($replace) as $key) {
                $find[] = chr($key);
            }
            $input = str_replace($find, array_values($replace), $input);
        }
        /*
         * Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
         * and control characters, always convert from Windows-1252 to UTF-8.
         */
        $input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
        if (count($replace) != 0) {
            $input = html_entity_decode($input);
        }
    }
    return $input;
}

The trick is that you have to check for both ISO-8859-1 and CP1252 because they are so similar. I found this out the hard way after hours of playing around with this function, only to have this answer save me. If you found this function helpful, go +1 that answer.

Basically, when a $replace array is supplied, this function replaces all those bad CP1252 bytes with HTML entities representing the Unicode characters. We then convert the string from ISO-8859-1/CP1252 to UTF-8, and none of our new Unicode characters are mangled because the entities are plain ASCII. Finally, we decode the HTML entities and end up with a 100% Unicode string.
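For a just-in-time conversion, one possible variation (not the function above) is to test whether a string is already valid UTF-8 and convert it only if it is not. The helper name to_utf8_if_needed is hypothetical. The sketch assumes every incoming string is either valid UTF-8 or raw Windows-1252 bytes, so it does not repair data that has already been double-encoded:

```php
<?php
/**
 * Hypothetical helper: return $input unchanged when it is already valid
 * UTF-8, otherwise treat it as Windows-1252 and convert it to UTF-8.
 */
function to_utf8_if_needed(string $input): string
{
    if ($input === '' || mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    return mb_convert_encoding($input, 'UTF-8', 'Windows-1252');
}

// Valid UTF-8 is left alone.
var_dump(to_utf8_if_needed("COD\u{2603}") === "COD\u{2603}");       // bool(true)
// The Windows-1252 byte 0xE9 becomes U+00E9.
var_dump(to_utf8_if_needed("caf\xE9") === "caf\u{00E9}");           // bool(true)
// Bytes 0x93 and 0x94 become the curly quotes U+201C and U+201D.
var_dump(to_utf8_if_needed("\x93hi\x94") === "\u{201C}hi\u{201D}"); // bool(true)
```

A short Windows-1252 string can occasionally also be valid UTF-8 by coincidence, so a check like this is a heuristic and should be verified against real data before it is relied on.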
