How to sort an array of UTF-8 strings?
Question
I currently have no clue how to sort an array that contains UTF-8 encoded strings in PHP. The array comes from an LDAP server, so sorting via a database (which would be no problem) is not an option. The following does not work on my Windows development machine (although I'd think that this should at least be a possible solution):
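$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
// Passing "0" returns the current LC_COLLATE locale without changing it.
$oldLocal=setlocale(LC_COLLATE, "0");
// Switch to the German locale with the Windows UTF-8 codepage (65001).
var_dump(setlocale(LC_COLLATE, 'German_Germany.65001'));
usort($array, 'strcoll');
// Restore the previous locale.
var_dump(setlocale(LC_COLLATE, $oldLocal));
var_dump($array);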
The output is:
string(20) "German_Germany.65001"
string(1) "C"
array(6) {
[0]=>
string(6) "Birnen"
[1]=>
string(9) "Ungetiere"
[2]=>
string(6) "Äpfel"
[3]=>
string(5) "Apfel"
[4]=>
string(9) "Ungetüme"
[5]=>
string(11) "Österreich"
}
This is complete nonsense. Using 1252 as the codepage for setlocale() gives a different output, but still a plainly wrong one:
string(19) "German_Germany.1252"
string(1) "C"
array(6) {
[0]=>
string(11) "Österreich"
[1]=>
string(6) "Äpfel"
[2]=>
string(5) "Apfel"
[3]=>
string(6) "Birnen"
[4]=>
string(9) "Ungetüme"
[5]=>
string(9) "Ungetiere"
}
Is there a way to sort an array of UTF-8 strings in a locale-aware manner?
I just noticed that this seems to be a problem with PHP on Windows, as the same snippet with de_DE.utf8 as the locale works on a Linux machine. Nevertheless, a solution for this Windows-specific problem would be nice...
Recommended answer
Ultimately, this problem cannot be solved in a simple way without recoding the strings (UTF-8 → Windows-1252 or ISO-8859-1), as suggested by ΤΖΩΤΖΙΟΥ, because of an obvious PHP bug discovered by Huppie. To summarize the problem, I created the following code snippet, which clearly demonstrates that the problem lies in the strcoll() function when the Windows UTF-8 codepage 65001 is used.
function traceStrColl($a, $b) {
$outValue=strcoll($a, $b);
echo "$a $b $outValue\r\n";
return $outValue;
}
// Check only the prefix of PHP_OS: stristr(PHP_OS, 'win') would also match "Darwin" (macOS).
$locale=(strtoupper(substr(PHP_OS, 0, 3)) === 'WIN') ? 'German_Germany.65001' : 'de_DE.utf8';
$string="ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜabcdefghijklmnopqrstuvwxyzäöüß";
$array=array();
for ($i=0; $i<mb_strlen($string, 'UTF-8'); $i++) {
$array[]=mb_substr($string, $i, 1, 'UTF-8');
}
$oldLocale=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, $locale));
usort($array, 'traceStrColl');
setlocale(LC_COLLATE, $oldLocale);
var_dump($array);
The result is:
string(20) "German_Germany.65001"
a B 2147483647
[...]
array(59) {
[0]=>
string(1) "c"
[1]=>
string(1) "B"
[2]=>
string(1) "s"
[3]=>
string(1) "C"
[4]=>
string(1) "k"
[5]=>
string(1) "D"
[6]=>
string(2) "ä"
[7]=>
string(1) "E"
[8]=>
string(1) "g"
[...]
The same snippet works on a Linux machine without any problems, producing the following output:
string(10) "de_DE.utf8"
a B -1
[...]
array(59) {
[0]=>
string(1) "a"
[1]=>
string(1) "A"
[2]=>
string(2) "ä"
[3]=>
string(2) "Ä"
[4]=>
string(1) "b"
[5]=>
string(1) "B"
[6]=>
string(1) "c"
[7]=>
string(1) "C"
[...]
The snippet also works when using Windows-1252 (ISO-8859-1) encoded strings (of course, the mb_* encodings and the locale must be changed then).
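The recoding workaround mentioned above could look roughly like the following. This is only a sketch under assumptions: the function names toCollationKey() and sortUtf8ViaRecoding() are made up for illustration, and the locale names other than German_Germany.1252 are guesses that depend on what is installed on the machine. The idea is to leave the UTF-8 strings untouched and hand strcoll() a Windows-1252 copy of each string instead.

```php
<?php
// Illustrative sketch of the recoding workaround; function names are hypothetical.

/**
 * Return a Windows-1252 copy of a UTF-8 string, used only as a sort key.
 * Characters that do not exist in Windows-1252 are replaced by mbstring's
 * substitute character, so this only suits text that fits that codepage.
 */
function toCollationKey(string $utf8): string
{
    return mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');
}

/**
 * Sort UTF-8 strings by comparing their single-byte copies with strcoll().
 * $locales is a list of candidate names for a single-byte locale; the first
 * one the system accepts is used. Throws if none can be set.
 */
function sortUtf8ViaRecoding(array $strings, array $locales): array
{
    // Recode each string once instead of on every comparison.
    $keys = [];
    foreach ($strings as $s) {
        $keys[$s] = toCollationKey($s);
    }

    $oldLocale = setlocale(LC_COLLATE, '0');      // "0" only queries the current locale
    if (setlocale(LC_COLLATE, $locales) === false) {
        throw new RuntimeException('None of the requested locales is available.');
    }

    usort($strings, function ($a, $b) use ($keys) {
        return strcoll($keys[$a], $keys[$b]);
    });

    setlocale(LC_COLLATE, $oldLocale);            // restore the previous locale
    return $strings;
}

// Usage: the Linux locale names below are assumptions and may not be installed.
$array = ['Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich'];
try {
    var_dump(sortUtf8ViaRecoding($array, ['German_Germany.1252', 'de_DE.ISO-8859-1', 'de_DE.iso88591']));
} catch (RuntimeException $e) {
    echo $e->getMessage(), "\n";
}
```

The returned array still holds the original UTF-8 strings; only the comparison sees the recoded bytes. The source file itself has to be saved as UTF-8 for the string literals to be recoded correctly.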
I filed a bug report on bugs.php.net: Bug #46165, "strcoll() does not work with UTF-8 strings on Windows". If you experience the same problem, you can give your feedback to the PHP team on the bug report page (two other, probably related, bugs have been classified as bogus; I don't think that this bug is bogus ;-).
Thanks to all.