当未提供--stable选项时，sort -n是否可以按预期处理关系?如果可以，怎么办? [英] Does sort -n handle ties predictably when the --stable option is NOT provided? If it does, how?

查看：58 发布时间：2021/4/14 20:28:51 bash sorting language-lawyer lexicographic stable-sort

本文介绍了当未提供--stable选项时，sort -n是否可以按预期处理关系?如果可以，怎么办?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

在这里，两行中的 3 之后的空格看起来像打破了数字排序，并让字母排序开始，因此 11 < 2:

  $ echo -e'3 2 \ n3 11'|排序-n3 113 2

在 man sort 中，我阅读

  -s，--stable通过禁用最后查询比较来稳定排序

表示没有 -s 的最后一项比较已完成(在关系之间，因为 -s 不会影响非纽带.

所以问题是:最后一次比较是如何完成的?如果需要回答该问题，将欢迎对源代码进行引用.

Unix的答案通过实验得出结论，领带的排序是按字典顺序进行的.

标准/POSIX对此有何规定?

解决方案

问题: 最后的比较如何完成?

这在GNU coreutils的文档中很快得到了回答:

一对线的比较如下:sort根据相关的排序选项，按照命令行上指定的顺序比较每对字段(请参见-key )，直到找到差异或没有剩余字段.如果未指定键字段，则sort使用整行的默认键.最后，当所有键都比较相等时，作为最后的选择，sort比较整行，就好像没有指定除-reverse ( -r )以外的其他排序选项一样.-stable ( -s )选项禁用此最后一种比较，以便所有字段比较均等的行以其原始相对顺序保留.-unique ( -u )选项还禁用了最后一种比较.
除非另有说明，否则所有比较均使用 LC_COLLATE
指定的字符整理顺序
_{源:排序调用GNUCoreutils}

这意味着最终手段将根据LC_COLLATE的排序顺序进行排序，即按字典顺序(大部分)进行排序.

另一方面，POSIX添加了最后一个更严格的最终选择选项.

如果此整理序列没有所有字符的总排序(请参见XBD LC_COLLATE)，则应使用以下整理序列逐字节逐字节比较所有应整理的输入行:POSIX语言环境.
_{源:排序POSIX标准}

我不确定这是否以GNU排序实现，因为这不是必需条件.尽管如此，POSIX还是强烈推荐它(请参阅理论上的最后一段)

如果是OP，这是什么意思?

对键定义存在误解.假设您做类似的事情

  $ sort --option -k1,3文件

通常可以理解， sort 将首先使用-option 对字段1进行排序，然后对2进行排序，最后对3进行排序.这是不正确的.它将使用键将其定义为由字段1到3组成的子字符串.如果两行均等地排列，则 sort 将执行last-resort选项(请参见前面的内容)

  aa bb cc xxxxxxxx---------<<<Rule1:根据关键------------------<<<规则2:按字典顺序排序(不得已)

使用GNU排序，您可以看到用于排序的子字符串.这是通过-debug 选项完成的.在这里，您会看到3种简单情况之间的区别:

 #用全行按字典顺序排序#-------------------------------------------------------------------$ echo -e"ab c d \ nefg h i"|排序-调试排序:使用?en_GB.UTF-8?排序规则A B C D______efg h I_______#-------------------------------------------------------------------#使用字段1和2形成的子串按字典顺序排序#-------------------------------------------------------------------$ echo -e"ab c d \ nefg h i"|排序-k1,2-调试排序:使用?en_GB.UTF-8?排序规则排序:关键1中的空白处很重要；考虑也指定"b"A B C D__________efg h I____________#-------------------------------------------------------------------#按照字段1和字段2的字典顺序排序#-------------------------------------------------------------------$ echo -e"ab c d \ nefg h i"|排序-k1,1 -k2,2-调试排序:使用?en_GB.UTF-8?排序规则排序:关键1中的空白处很重要；考虑也指定"b"排序:关键2中的空白处很重要；考虑也指定"b"A B C D__________efg h I____________

当您执行数字排序(使用 -n 或 -g )时， sort 会尝试从中提取数字>键(1234abc导致1234)并使用该数字进行排序.

 #用全行数字排序#-------------------------------------------------------------------$ echo -e"3a 11a \ n3b 2b"；|排序-n-调试排序:使用?en_GB.UTF-8?排序规则3a 11a_#全行数字______#在字典上按实线显示(不得已)3b 2b_#全行数字_____#在字典上按实线显示(不得已)#-------------------------------------------------------------------#使用字段1和字段2进行数字排序#-------------------------------------------------------------------$ echo -e"3a 11a \ n3b 2b"；|排序-n -k1,1 -k2,2-调试排序:使用?en_GB.UTF-8?排序规则3b 2b_#字段1上的数字_#字段2上的数字_____#在字典上按实线显示(不得已)3a 11a_#字段1上的数字__#字段2上的数字______#在字典上按实线显示(不得已)

您在这两种情况下都注意到，即使第一个字段可以按字典顺序排序 3a<3b ，它被忽略了，因为我们只从键中选择数字.

Here it looks like the space after the 3 in both rows breaks the numerical sorting and lets the alphabetic sorting kick in, so that 11<2:

$ echo -e '3 2\n3 11' | sort -n
3 11
3 2

In man sort, I read

-s, --stable stabilize sort by disabling last-resort comparison

which implies that without -s a last-resort comparison is done (between ties, because -s does not affect non-ties).

So the question is: how is this last-resort comparison accomplished? A reference to the source code would be welcome, if necessary to answer the question.

This answer Unix deduces, from experimentation, that the sorting of ties is lexicographic.

Does the standard/POSIX say anything about this?
解决方案
Question: How is the last resort comparison done?

This is quickly answered in documentation of GNU coreutils:

A pair of lines is compared as follows: sort compares each pair of fields (see --key), in the order specified on the command line, according to the associated ordering options, until a difference is found or no fields are left. If no key fields are specified, sort uses a default key of the entire line. Finally, as a last resort when all keys compare equal, sort compares entire lines as if no ordering options other than --reverse (-r) were specified. The --stable (-s) option disables this last-resort comparison so that lines in which all fields compare equal are left in their original relative order. The --unique (-u) option also disables the last-resort comparison.

Unless otherwise specified, all comparisons use the character collating sequence specified by the LC_COLLATE

_{source: Sort Invocation GNU Coreutils}

This means that the final resort will sort according to the sorting order of LC_COLLATE, i.e. lexicographically (mostly).

POSIX, on the other hand adds a final ultra-last resort option which is stricter.

If this collating sequence does not have a total ordering of all characters (see XBD LC_COLLATE), any lines of input that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.

_{source: Sort POSIX standard}

I am not certain if this is implemented in GNU sort, since it is not a requirement. Nonetheless, POSIX strongly recommends it (See Rationale last paragraph)

What does this mean in case of the OP?

There is an uncomfortable misunderstanding of the key-definitions. Assume you do something like
$ sort --option -k1,3 file
It is often understood that sort will first sort on field 1, then 2 and finally 3 using --option. This is incorrect. It will use the key to be defined as the substring consisting of fields 1 till 3. And in case when two lines collate equally, sort will perform the last-resort option (see earlier)
aa bb cc xxxxxxxx --------- <<< rule1: according to the key ------------------ <<< rule2: lexicographical sort (last resort)
Using GNU sort, you can see which substring is used for the sort. This is done with the --debug option. Here you see the difference between 3 simple cases:
# Sort lexicographically with full line # ------------------------------------------------------------------- $ echo -e "ab c d\nefg h i" | sort --debug sort: using ?en_GB.UTF-8? sorting rules ab c d ______ efg h i _______ # ------------------------------------------------------------------- # Sort lexicographically with the substring formed by field 1 and 2 # ------------------------------------------------------------------- $ echo -e "ab c d\nefg h i" | sort -k1,2 --debug sort: using ?en_GB.UTF-8? sorting rules sort: leading blanks are significant in key 1; consider also specifying 'b' ab c d ____ ______ efg h i _____ _______ # ------------------------------------------------------------------- # Sort lexicographically with field 1 followed by field 2 # ------------------------------------------------------------------- $ echo -e "ab c d\nefg h i" | sort -k1,1 -k2,2 --debug sort: using ?en_GB.UTF-8? sorting rules sort: leading blanks are significant in key 1; consider also specifying 'b' sort: leading blanks are significant in key 2; consider also specifying 'b' ab c d __ __ ______ efg h i ___ __ _______
When you do a numeric sort (using -n or -g), sort will attempt to extract a number from the key (1234abc leads to 1234) and use that number for the sorting.
# Sort numerically with full line # ------------------------------------------------------------------- $ echo -e "3a 11a\n3b 2b" | sort -n --debug sort: using ?en_GB.UTF-8? sorting rules 3a 11a _ # numeric on full line ______ # lexicographically on full line (last resort) 3b 2b _ # numeric on full line _____ # lexicographically on full line (last resort) # ------------------------------------------------------------------- # Sort numerically with field 1 then field 2 # ------------------------------------------------------------------- $ echo -e "3a 11a\n3b 2b" | sort -n -k1,1 -k2,2 --debug sort: using ?en_GB.UTF-8? sorting rules 3b 2b _ # numeric on field 1 _ # numeric on field 2 _____ # lexicographically on full line (last resort) 3a 11a _ # numeric on field 1 __ # numeric on field 2 ______ # lexicographically on full line (last resort)
As you notice in these two cases, even though the first field can be ordered lexicographically 3a < 3b, it is ignored as we only pick the number from the key.

这篇关于当未提供--stable选项时，sort -n是否可以按预期处理关系?如果可以，怎么办?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

当未提供--stable选项时，sort -n是否可以按预期处理关系?如果可以，怎么办? [英] Does sort -n handle ties predictably when the --stable option is NOT provided? If it does, how?

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

当未提供--stable选项时，sort -n是否可以按预期处理关系?如果可以，怎么办? [英] Does sort -n handle ties predictably when the --stable option is NOT provided? If it does, how?

问题描述

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭