How do locales work in Linux / POSIX and what transformations are applied?
Question
I'm working with huge files of (I hope) UTF-8 text. I can reproduce the problem on Ubuntu 13.10 (3.11.0-14-generic) and 12.04.
While investigating a bug I encountered strange behaviour:
$ export LC_ALL=en_US.UTF-8
$ sort part-r-00000 | uniq -d
ɥ ɨ ɞ ɧ 251
ɨ ɡ ɞ ɭ ɯ 291
ɢ ɫ ɬ ɜ 301
ɪ ɳ 475
ʈ ʂ 565
$ export LC_ALL=C
$ sort part-r-00000 | uniq -d
$ # no duplicates found
The duplicates also appear when running a custom C++ program that reads the file using std::stringstream: it fails due to duplicates when using the en_US.UTF-8 locale. C++ seems to be unaffected, at least for std::string and input/output.
Why are duplicates found when using a UTF-8 locale and no duplicates are found with the C locale?
What transformations does the locale apply to the text that cause this behaviour?
Here is a small example:
$ uniq -D duplicates.small.nfc
ɢ ɦ ɟ ɧ ɹ 224
ɬ ɨ ɜ ɪ ɟ 224
ɥ ɨ ɞ ɧ 251
ɯ ɭ ɱ ɪ 251
ɨ ɡ ɞ ɭ ɯ 291
ɬ ɨ ɢ ɦ ɟ 291
ɢ ɫ ɬ ɜ 301
ɧ ɤ ɭ ɪ 301
ɹ ɣ ɫ ɬ 301
ɪ ɳ 475
ͳ ͽ 475
ʈ ʂ 565
ˈ ϡ 565
Output of locale when the problem occurs:
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=de_DE.UTF-8
LC_TIME=de_DE.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=de_DE.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=de_DE.UTF-8
LC_NAME=de_DE.UTF-8
LC_ADDRESS=de_DE.UTF-8
LC_TELEPHONE=de_DE.UTF-8
LC_MEASUREMENT=de_DE.UTF-8
LC_IDENTIFICATION=de_DE.UTF-8
LC_ALL=
After normalizing with:
cat duplicates | uconv -f utf8 -t utf8 -x nfc > duplicates.nfc
I still get the same results.
The file is valid UTF-8 according to iconv:
$ iconv -f UTF-8 duplicates -o /dev/null
$ echo $?
0
It looks like something similar to this: http://xahlee.info/comp/unix_uniq_unicode_bug.html and https://lists.gnu.org/archive/html/bug-coreutils/2012-07/msg00072.html
It works on FreeBSD.
Recommended answer
I have boiled the problem down to an issue with the strcoll() function, which is not related to Unicode normalization. Recap: my minimal example that demonstrates the different behaviour of uniq depending on the current locale was:
$ echo -e "\xc9\xa2\n\xc9\xac" > test.txt
$ cat test.txt
ɢ
ɬ
$ LC_COLLATE=C uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 uniq -D test.txt
ɢ
ɬ
Obviously, if the locale is en_US.UTF-8, uniq treats ɢ and ɬ as duplicates, which shouldn't be the case. I then ran the same commands again with valgrind and investigated both call graphs with kcachegrind.
$ LC_COLLATE=C valgrind --tool=callgrind uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 valgrind --tool=callgrind uniq -D test.txt
$ kcachegrind callgrind.out.5754 &
$ kcachegrind callgrind.out.5763 &
The only difference was that the version with LC_COLLATE=en_US.UTF-8 called strcoll(), whereas the one with LC_COLLATE=C did not. So I came up with the following minimal example on strcoll():
#include <iostream>
#include <cstring>
#include <clocale>

int main()
{
    // Two distinct two-byte UTF-8 sequences: U+0262 and U+026C.
    const char* s1 = "\xc9\xa2";
    const char* s2 = "\xc9\xac";
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;

    // Locale-aware comparison (strcoll) next to plain byte comparison (strcmp).
    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
    std::cout << std::endl;

    // The same trailing bytes without the 0xC9 lead byte (not valid UTF-8).
    s1 = "\xa2";
    s2 = "\xac";
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;
    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
}
Output:
ɢ
ɬ
0
-1
-10
-1
�
�
0
-1
-10
-1
So, what's wrong here? Why does strcoll() return 0 (equal) for two different characters?