unix sort -n -t"," gives unexpected result


Question

Unix numeric sort gives strange results, even when I specify the delimiter.

$ cat example.csv  # here's a small example
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035

$ cat example.csv | sort -n --field-separator=,
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035

For this example, sort gives the same result whether or not you specify the delimiter. I know that if I set LC_ALL=C, sort gives the expected behavior again. But I do not understand why the default environment settings, shown below, would make this happen.

$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
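
For comparison, the LC_ALL=C workaround mentioned above can be reproduced with a minimal sketch like the following. The temporary file and the variable names (tmp, c_sorted) are illustrative assumptions, not part of the original session.

```shell
# Recreate the sample data from the question in a temporary file.
tmp=$(mktemp)
printf '%s\n' \
  58,1.49270399401 \
  59,0.000192136419373 \
  59,0.00182092924724 \
  59,1.49270399401 \
  60,0.00182092924724 \
  60,1.49270399401 \
  12,13.080339685 \
  12,14.1531049905 \
  12,26.7613447051 \
  12,50.4592437035 > "$tmp"

# In the C locale a comma is not part of a number, so the numeric
# value of each line stops at the first comma (58, 59, 60, 12).
c_sorted=$(LC_ALL=C sort -n -t, "$tmp")
printf '%s\n' "$c_sorted"

rm -f "$tmp"
```

Note that lines with equal numeric values are then ordered by a plain byte comparison of the whole line, not by the numeric value of the second field. That happens to look right for this data, but a line such as 12,9.5 would sort after 12,13.080339685.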

I've read in many other questions (e.g. here, here, and here) how to avoid this behavior in sort, but it still seems incredibly weird and unpredictable, and it has caused me a week of heartache. Can someone explain why sort with the default environment settings on Mac OS X (10.8.5) behaves this way? In other words: what is sort doing (with the locale variables set to en_US.UTF-8) to get that result?

I am using

 sort 5.93                        November 2005

 $ type sort
 sort is /usr/bin/sort

UPDATE

I've discussed this on the GNU coreutils list and now understand why sort, with the default English UTF-8 locale settings, gave the output it did. In the en_US.UTF-8 locale, the comma character "," is considered part of a number (so that commas can serve as thousands separators), and sort is "greedy" when it interprets a line, so it read the example numbers as approximately

581.492...
590.000...
590.001...
591.492...
600.001...
601.492...
1213.08...
1214.15...
1226.76...
1250.45...
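
One way to check this reading without depending on which locales are installed is to delete the commas by hand and sort the merged numbers in the C locale. This is only a sketch that mimics the thousands-separator interpretation; it is not what sort does internally, and the names tmp and merged are illustrative.

```shell
# Recreate the sample data from the question in a temporary file.
tmp=$(mktemp)
printf '%s\n' \
  58,1.49270399401 \
  59,0.000192136419373 \
  59,0.00182092924724 \
  59,1.49270399401 \
  60,0.00182092924724 \
  60,1.49270399401 \
  12,13.080339685 \
  12,14.1531049905 \
  12,26.7613447051 \
  12,50.4592437035 > "$tmp"

# Drop the commas, as if they were thousands separators, then sort
# the resulting numbers numerically in the C locale.
merged=$(tr -d , < "$tmp" | LC_ALL=C sort -n)
printf '%s\n' "$merged"

rm -f "$tmp"
```

The merged values come out in the same line order as the puzzling output in the question, with the 12,... lines last because they now read as numbers above 1200.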

Although this was not what I had intended, chepner is right that to get the result I actually want, I need to tell sort to key on only the first field. By default, sort interprets more of the line as the key rather than just the first field.

This behavior of sort is discussed in the GNU coreutils FAQ and is further specified in the POSIX description of sort.

So, as Eric Blake on the GNU coreutils list put it, if the field separator is also a numeric character (which a comma is), then "Without -k to stop things, [the field-separator] serves as BOTH a separator AND a numeric character - you are sorting on numbers that span multiple fields."

Recommended answer

I'm not sure this is entirely correct, but it's close.

sort -n -t, will try to sort numerically by the given key(s). With no -k option, the key is the whole line, which here amounts to a tuple consisting of an integer and a float. Such tuples cannot be sorted as a single number.

If you explicitly specify the individual keys to sort on with

sort -k1,1n -k2,2n -t,

it should work. Now you are explicitly telling sort to sort first on the first field (numerically), then on the second field (also numerically).
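
As a minimal sketch of that command applied to the sample data from the question (the temporary file and the variable names tmp and keyed are illustrative, and the locale is assumed to use "." as the decimal point):

```shell
# Recreate the sample data from the question in a temporary file.
tmp=$(mktemp)
printf '%s\n' \
  58,1.49270399401 \
  59,0.000192136419373 \
  59,0.00182092924724 \
  59,1.49270399401 \
  60,0.00182092924724 \
  60,1.49270399401 \
  12,13.080339685 \
  12,14.1531049905 \
  12,26.7613447051 \
  12,50.4592437035 > "$tmp"

# -k1,1n limits the first key to field 1 only; -k2,2n breaks ties
# numerically on field 2. The comma can no longer join the fields.
keyed=$(sort -k1,1n -k2,2n -t, "$tmp")
printf '%s\n' "$keyed"

rm -f "$tmp"
```

Because each key ends at its own field, the 12,... lines sort first, and lines sharing a first field are ordered by the numeric value of the second.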

I suspect that -n is useful as a global option only if each line of the input consists of a single numerical value. Otherwise, you need to combine the -n option with the -k option to specify exactly which fields are numbers.
