unix sort -n -t"," gives unexpected result
Question
Unix numeric sort gives strange results, even when I specify the delimiter.
$ cat example.csv # here's a small example
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035
$ cat example.csv | sort -n --field-separator=,
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035
For this example, sort gives the same result whether or not you specify the delimiter. I know that if I set LC_ALL=C, sort gives the expected behavior again. But I do not understand why the default environment settings, shown below, would make this happen.
$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
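For reference, the LC_ALL=C workaround mentioned above can be sketched as follows. The variable names data and c_sorted are made up for this illustration, and the data is the example.csv content from the question. In the C locale the comma is not part of a number, so -n compares only the leading integer; lines with equal numbers are then ordered by sort's last-resort byte comparison of the whole line.

```shell
# Example data from the question, held in a (hypothetical) variable.
data='58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035'

# Set LC_ALL=C for this one sort invocation only; the rest of the
# session keeps its usual locale.
c_sorted=$(printf '%s\n' "$data" | LC_ALL=C sort -n -t,)

printf '%s\n' "$c_sorted"
```

Note that the second field ends up ordered only through the byte-wise tie-break, which happens to agree with numeric order for this data; it is not a numeric sort on that field.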
I've read in many other questions (e.g. here, here, and here) how to avoid this behavior in sort, but still, this behavior is incredibly weird and unpredictable and has caused me a week of heartache. Can someone explain why sort with default environment settings on Mac OS X (10.8.5) would behave this way? In other words: what is sort doing (with the locale variables set to en_US.UTF-8) to get that result?
I am using
sort 5.93 November 2005
$ type sort
sort is /usr/bin/sort
UPDATE
I've discussed this on the gnu-coreutils list and now understand why sort, with the default en_US.UTF-8 locale settings, gave the output it did. In that locale the comma character "," is considered numeric (so that commas can serve as thousands, or e.g. hundreds, separators), and sort defaults to "being greedy" when it interprets a line, so it read the example numbers as approximately
581.491...
590.000...
590.001...
591.492...
600.001...
601.492...
1213.08...
1214.15...
1226.76...
1250.45...
This was not what I had intended, and chepner is right: to get the result I actually want, I need to tell sort to key on only the first field. By default, sort interprets more of the line as the key rather than just the first field.
This behavior of sort has been discussed in the gnu-coreutils FAQ, and is further specified in the POSIX description of sort.
As Eric Blake put it on the gnu-coreutils list, if the field separator is also a numeric character (which a comma is), then "Without -k to stop things, [the field-separator] serves as BOTH a separator AND a numeric character - you are sorting on numbers that span multiple fields."
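The reading described above can be imitated with a small sketch. This is an emulation, not what sort does internally: it simply deletes the commas with tr so that each line becomes the single number sort is said to have seen, then sorts those numbers in the C locale. The variable names data, as_read and resorted are made up for this illustration.

```shell
# Example data from the question, held in a (hypothetical) variable.
data='58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035'

# Drop the commas, as if each one were a grouping separator inside a number:
# "58,1.49270399401" becomes "581.49270399401", and so on.
as_read=$(printf '%s\n' "$data" | tr -d ',')

# Sort those single numbers numerically, with the locale pinned to C.
resorted=$(printf '%s\n' "$as_read" | LC_ALL=C sort -n)

# If the two lists are identical, the input was already in ascending order
# under this reading, which matches the "unsorted" output in the question.
if [ "$as_read" = "$resorted" ]; then
    echo "input is already in numeric order under this reading"
fi
```

Seen this way, the original output is consistent: 581.49... through 601.49... all come before 1213.08... and the other values that start with 12.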
Recommended answer
I'm not sure this is entirely correct, but it's close.
sort -n -t,
will try to sort numerically by the given key(s). With no -k option the key is the entire line, which here is effectively a tuple consisting of an integer and a float. Such a tuple cannot be sorted numerically as a single value.
If you explicitly specify which single keys to sort on with
sort -k1,1n -k2,2n -t,
it should work. Now you are explicitly telling sort to first sort on the first field (numerically), then on the second field (also numerically).
I suspect that -n is useful as a global option only if each line of the input consists of a single numerical value. Otherwise, you need to use the -n option in conjunction with the -k option to specify exactly which fields are numbers.
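As a quick check of the per-field approach, here is a minimal sketch that runs the command from this answer on the example data from the question. The variable names data and by_fields are made up for this illustration, and the sketch assumes a locale whose decimal point is "." (true for both en_US.UTF-8 and C).

```shell
# Example data from the question, held in a (hypothetical) variable.
data='58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035'

# -t, splits each line on commas.
# -k1,1n limits the first key to field 1 and compares it numerically.
# -k2,2n breaks ties using field 2 only, also numerically.
by_fields=$(printf '%s\n' "$data" | sort -t, -k1,1n -k2,2n)

printf '%s\n' "$by_fields"
```

Because each key stops at the end of its own field, the comma is never read as part of a number, so the four lines starting with 12 come first, followed by 58, 59 and 60.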