How can Perl and the Unix sort utility order Unicode strings in the same sequence?

Question

I am trying to get Perl and the GNU/Linux sort(1) program to agree on how to sort Unicode strings. I'm running sort with LANG=en_US.UTF-8. In the Perl program I have tried the following methods:

  • use Unicode::Collate with $Collator = Unicode::Collate->new();
  • use Unicode::Collate::Locale with $Collator = Unicode::Collate->new(locale => $ENV{'LANG'});
  • use locale

Each one of them failed with the following errors (from the Perl side):

  • Input is not sorted: [----,] came after [($1]
  • Input is not sorted: [...] came after [&]
  • Input is not sorted: [($1] came after [1]

The only method that worked for me involved setting LC_ALL=C for sort, and using 8-bit characters in Perl. However, in this way Unicode strings are not properly ordered.

Recommended answer

Using Unicode::Collate or Unicode::Collate::Locale makes no sense here. You're not trying to sort based on Unicode definitions; you're trying to sort based on your locale. That's what use locale; is for.

I don't know why you didn't get the desired order out of cmp under use locale;.

You could process the decompressed files: expand each uniq -c count back into that many copies of the line, then let sort and uniq -c rebuild the counts.

for q in file1.uniqc file2.uniqc ; do
   perl -ne's/^\s*(\d+) //; for $c (1..$1) { print }' "$q"
done | sort | uniq -c

It'll require more temporary storage, of course, but you'll get exactly the order you want.

I found a case where use locale; didn't cause Perl's sort/cmp to give the same result as the sort utility. Weird.

$ export LC_COLLATE=en_US.UTF-8

$ perl -Mlocale -e'print for sort { $a cmp $b } <>' data
(
($1
1

$ perl -MPOSIX=strcoll -e'print for sort { strcoll($a, $b) } <>' data
(
($1
1

$ sort data
(
1
($1

Truth be told, it's the sort utility that's weird.

In the comments, @ninjalj points out that the weirdness is probably due to characters with undefined weights. When comparing such characters, the ordering is undefined, so different engines could produce different results. Your best bet to recreate the exact order would be to use the sort utility through IPC::Run3, but it sounds like that's not guaranteed to always result in the same order.
