Bash string lexicographical comparisons inconsistency

Question

Bash manual section 6.4 describes [[ string1 < string2 ]] as

True if string1 sorts before string2 lexicographically in the current locale.

I am using a stock English-language Linux and expected my current locale to be ASCII, where period [.] is lexicographically less than [0-9A-Za-z]. However, take a look at these:

$ echo $BASH_VERSION
4.3.11(1)-release
$ [[ "." < "1" ]] && echo "yes"
yes
$ [[ "A" < "B" ]] && echo "yes"
yes
$ [[ ".A" < "1B" ]] && echo "yes"
$

The 1st and 2nd comparisons agree with the ASCII table, but why is the 3rd one false? What exactly is this lexicographical sort order?

Here is the output of locale:

$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
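
To read the output above: for collation, POSIX gives LC_ALL precedence over LC_COLLATE, which in turn takes precedence over LANG, and an empty value counts as unset. The following is a minimal sketch of that lookup; `effective_collate` is a hypothetical helper name, and the final fallback to C is an assumption (the actual default is implementation-defined).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the locale name that governs string collation.
# Precedence: LC_ALL, then LC_COLLATE, then LANG.
# The ':-' expansion treats an empty value the same as an unset one.
# Assumption: fall back to "C" when none of the three is set.
effective_collate() {
  printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

effective_collate
```

With the settings shown in the question (LC_ALL empty, LANG=en_US.UTF-8), this would print en_US.UTF-8, so comparisons are made with that locale's collation rules rather than plain ASCII order.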

Recommended answer

This doesn't have much to do with your shell. To perform a locale-dependent lexicographic comparison of .A and 1B, bash simply calls strcoll(".A", "1B") and interprets the return value, that's all.

    {
#if defined (HAVE_STRCOLL)
      if (shell_compatibility_level > 40 && flags & TEST_LOCALE)
        return ((op[0] == '>') ? (strcoll (arg1, arg2) > 0) : (strcoll (arg1, arg2) < 0));
      else
#endif
        return ((op[0] == '>') ? (strcmp (arg1, arg2) > 0) : (strcmp (arg1, arg2) < 0));
    }

(copied from test.c)

The excerpt above also reveals that, to force a byte-by-byte comparison without altering locale settings, one needs to change the shell compatibility level to 40 (which stands for 4.0, the last version of bash that behaves the way you expected by default).

$ shopt -s compat40
$ [[ .A < 1B ]] && echo yes
yes
$ 
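
If changing the compatibility level is not desirable, another option is to switch the locale to C for a single comparison only. This is a sketch rather than the answer's method: `bytewise_less` is a hypothetical helper, and it relies on strcoll behaving like a byte comparison in the C locale.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of bash): compare two strings by byte value.
# The function body is a subshell (parentheses instead of braces), so the
# LC_ALL assignment does not leak into the calling shell.
bytewise_less() (
  LC_ALL=C
  [[ "$1" < "$2" ]]
)

if bytewise_less ".A" "1B"; then
  echo "bytewise: .A sorts before 1B"
fi
```

The subshell costs a fork per comparison, which is fine for occasional checks but may be too slow inside a tight loop.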

Now, as to your question (The 1st and 2nd comparison agree with the ASCII table, but why the 3rd one false? What exactly is this lexicographical sort order?): it is apparently your locale's collation order. Under "What Collation is NOT", the UCA specification says:

Collation order is not preserved under concatenation or substring operations, in general.

For example, the fact that x is less than y does not mean that x + z is less than y + z, because characters may form contractions across the substring or concatenation boundaries. In summary:

x < y does not imply that xz < yz
x < y does not imply that zx < zy
xz < yz does not imply that x < y
zx < zy does not imply that x < y
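
One way to see a locale's collation order directly is to pass the strings through sort, which also compares according to the locale's collation rules. The sketch below assumes a hypothetical helper `sorted_in_locale`. Only the C locale call is shown, because results for en_US.UTF-8 depend on the C library's locale data and on that locale being installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sort the given strings under a chosen locale,
# printing one string per line in that locale's collation order.
sorted_in_locale() {
  local loc=$1
  shift
  printf '%s\n' "$@" | LC_ALL=$loc sort
}

# In the C locale the order follows byte values: '.' (0x2E) precedes '1' (0x31).
sorted_in_locale C 1B .A . 1
```

Running the same call with en_US.UTF-8 as the first argument on the affected system should show where .A and 1B land relative to each other under that locale.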

Which, I think, corroborates that this is not a bug but a feature.
