Why does UTF-8 text sort in different order between OS X and Linux?

Question

I have a text file with lines of UTF-8 encoded text:

mac-os-x$ cat unsorted.txt
ウ
foo
チ
'foo'
津

In case it helps to reproduce the problem, here is a checksum and a dump of the exact bytes in the file, as well as how you could generate the file yourself (on Linux, use base64 -d instead of -D):

mac-os-x$ shasum unsorted.txt
a6d0b708d3e0cafb0c6e1af7450e9243da8cb078  unsorted.txt

mac-os-x$ perl -ne 'print join(" ", map { sprintf "%02x", ord } split //), "\n"' unsorted.txt
e3 82 a6 0a
66 6f 6f 0a
e3 83 81 0a
27 66 6f 6f 27 0a
e6 b4 a5 0a

mac-os-x$ echo 44KmCmZvbwrjg4EKJ2ZvbycK5rSlCg== | base64 -D > unsorted.txt

When I sort this input file on Mac OS X (whether with the GNU sort 5.93 that ships with Mac OS X Yosemite or with a Homebrew-installed GNU sort 8.23), I get this sorted result:

mac-os-x$ env -i LANG=en_US.utf-8 LC_ALL=en_US.utf-8 /usr/bin/sort unsorted.txt
'foo'
foo
ウ
チ
津

mac-os-x$ echo `sw_vers -productName` `sw_vers -productVersion`
Mac OS X 10.10.1

mac-os-x$ /usr/bin/sort --version | head -1
sort (GNU coreutils) 5.93

When I sort the same file, with the same locale settings, on Linux (I tested on both CentOS 5.5 and CentOS 6.5), I get a different result:

linux-centos-6.5$ env -i LANG=en_US.utf-8 LC_ALL=en_US.utf-8 /bin/sort unsorted.txt
ウ
チ
foo
'foo'
津

linux-centos-6.5$ cat /etc/redhat-release
CentOS release 6.5 (Final)

linux-centos-6.5$ /bin/sort --version | head -1
sort (GNU coreutils) 8.4

Note the different positions of the Japanese characters relative to the English lines, and the different sort order of the two lines that differ only by the single quotes.

To add another variant to the mix, I notice that on a very old FreeBSD 6 box I have, I get the same sort order as OS X:

freebsd-6.0$ env -i LANG=en_US.utf-8 LC_ALL=en_US.utf-8 /usr/bin/sort unsorted.txt
'foo'
foo
ウ
チ
津

freebsd-6.0$ uname -rs
FreeBSD 6.0-RELEASE

freebsd-6.0$ sort --version | head -1
sort (GNU coreutils) 5.3.0-20040812-FreeBSD

I expected the sort order to be the same in each case, given that all cases use GNU sort with the same locale settings. I tried explicitly setting LC_COLLATE separately, and tried using LC_COLLATE=C to force a sort by byte order, but that did not change any results.

Why does my example input file sort differently across OS X and Linux? And how could I force both systems to produce identically sorted text (I don't care which variant, as long as it is consistent between the two)?

Recommended answer

It seems that your Linux sort is not preserving proper UTF-8 order.

The Unicode code points (in hex) of the first character of each line in your unsorted.txt are:

ウ - 30A6

foo - 0066

チ - 30C1

'foo' - 0027

津 - 6D25
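The code points above can be checked against the byte dump in the question. The following is a minimal sketch, not part of the original answer: the helper name utf8_3byte_to_codepoint is hypothetical, and it assumes a well-formed three-byte UTF-8 sequence passed as three hex bytes.

```shell
# Hypothetical helper: decode one 3-byte UTF-8 sequence into its code point.
# A 3-byte sequence has the bit layout 1110xxxx 10xxxxxx 10xxxxxx, so the
# payload is the low 4 bits of byte 1 and the low 6 bits of bytes 2 and 3.
utf8_3byte_to_codepoint() {
    b1=$((0x$1))
    b2=$((0x$2))
    b3=$((0x$3))
    printf 'U+%04X\n' $(( ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F) ))
}

# The three multi-byte lines from the question's hex dump:
utf8_3byte_to_codepoint e3 82 a6   # first line
utf8_3byte_to_codepoint e3 83 81   # third line
utf8_3byte_to_codepoint e6 b4 a5   # fifth line
```

The single-byte lines need no decoding: 0x66 is "f" and 0x27 is the single quote.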

So the proper order according to the Unicode collation table (http://www.unicode.org/Public/UCA/latest/allkeys.txt) would be:

'foo' - line 487

foo - line 8966

ウ - line 20875

チ - line 21004

津 - not in the file

So, to answer your question: your Linux machine is providing the wrong collation tables to the sort function. Unfortunately, I can't tell what the reason for that might be.

PS: There's a similar question to yours here.

Edit

As @ninjalj noticed, glibc doesn't use the UCA but ISO 14651 instead. This bug report suggests migrating to the UCA. Unfortunately, it is still unresolved.
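Since the collation tables differ between C libraries, one possible way to get identical output on both systems is to bypass them and sort by raw byte value. The sketch below is an assumption-laden illustration, not something the original answer proposed: the function name sort_bytewise is hypothetical. Note that LC_ALL overrides LC_COLLATE and LANG, so setting LC_COLLATE=C while LC_ALL=en_US.utf-8 is still in the environment has no effect, which may be why that attempt in the question changed nothing.

```shell
# Hypothetical wrapper: sort strictly by byte value, ignoring the locale's
# collation tables. LC_ALL is set (not just LC_COLLATE) because LC_ALL
# takes precedence over every other LC_* variable and over LANG.
sort_bytewise() {
    LC_ALL=C sort "$@"
}

# Recreate the five sample lines from the question. Octal escapes keep the
# snippet independent of the terminal encoding.
sample=$(mktemp)
printf '\343\202\246\nfoo\n\343\203\201\n\047foo\047\n\346\264\245\n' > "$sample"
sort_bytewise "$sample"
rm -f "$sample"
```

In byte order the single quote (0x27) comes before "f" (0x66), and both come before the multi-byte sequences starting with 0xe3 and 0xe6, so the result has the same order as the OS X and FreeBSD output shown in the question. The trade-off is that byte order is not a linguistically meaningful ordering; it is only consistent across systems.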

Also, it could somehow be connected with the question about ls case insensitivity on Mac OS X. Some people even suggest that it has something to do with the HFS filesystem.
