Weird behavior of BASH glob/regex ranges


Question

I'm seeing BASH bracket ranges (e.g. [A-Z]) behaving in an unexpected way.
Is there an explanation for this behavior, or is it a bug?

Let's say I have a variable from which I want to strip all uppercase letters:

$ var='ABCDabcd0123'
$ echo "${var//[A-Z]/}"

The result I get is:

a0123

If I do it with sed, I get the expected result:

$ echo "${var}" | sed 's/[A-Z]//g'
abcd0123

The same seems to be the case for the BASH built-in regex match:

$ [[ a =~ [A-Z] ]] ; echo $?
1
$ [[ b =~ [A-Z] ]] ; echo $?
0

If I check all lowercase letters from 'a' to 'z', only 'a' fails to match [A-Z]:

$ for l in {a..z}; do [[ $l =~ [A-Z] ]] || echo $l; done
a

I do not have case-insensitive matching enabled, and even if I did, it should not make the letter 'a' behave differently:

$ shopt -p nocasematch
shopt -u nocasematch

For reference, I'm using Cygwin, and I don't see this behavior on any other machine:

$ uname
CYGWIN_NT-6.3
$ bash --version | head -1
GNU bash, version 4.3.46(7)-release (x86_64-unknown-cygwin)
$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_ALL=

I've found the exact same issue reported here: https://bugs.launchpad.net/ubuntu/+source/bash/+bug/120687
So I guess it's a bug(?) in the "en_GB.UTF-8" collation rather than in BASH itself.
Setting LC_COLLATE=C does indeed solve this.

Recommended answer

It certainly has to do with your locale setting. An excerpt from the GNU bash man page, under Pattern Matching:

[..] in the default C locale, [a-dx-z] is equivalent to [abcdxyz]. Many locales sort characters in dictionary order, and in these locales [a-dx-z] is typically not equivalent to [abcdxyz]; it might be equivalent to [aBbCcDdxXyYz], for example. To obtain the traditional interpretation of ranges in bracket expressions, you can force the use of the C locale by setting the LC_COLLATE or LC_ALL environment variable to the value C, or enable the globasciiranges shell option. [..]
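As a rough illustration of the last option in the excerpt, the sketch below enables globasciiranges around a single substitution. It assumes bash 4.3 or later, where that option is available; strip_upper_glob is a hypothetical helper name, not something from the manual. As far as I understand, the option applies to shell pattern matching (globs, case, ${var//pattern/}), so the [[ =~ ]] regex operator may still follow the locale.

```shell
#!/usr/bin/env bash
# strip_upper_glob is a hypothetical helper used only for illustration.
# It removes A-Z from its argument with globasciiranges switched on,
# then puts the option back the way it found it.
strip_upper_glob() {
  local restore
  restore=$(shopt -p globasciiranges)   # prints "shopt -s ..." or "shopt -u ..."
  shopt -s globasciiranges              # ranges now compare as in the C locale
  printf '%s\n' "${1//[A-Z]/}"
  eval "$restore"                       # restore the previous state
}

strip_upper_glob 'ABCDabcd0123'
```

Saving and restoring the option keeps the change from leaking into the rest of the session, which matters if other code relies on locale-aware ranges.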

Use the POSIX character classes, [[:upper:]] in this case, or change your locale setting LC_ALL or LC_COLLATE to C as mentioned above.

LC_ALL=C var='ABCDabcd0123'
echo "${var//[A-Z]/}"
abcd0123
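Assigning LC_ALL=C on its own line, as above, changes the locale for the rest of the shell session. If that is not wanted, one possible approach is to confine the assignment to a subshell. The sketch below does this with function bodies written in parentheses; strip_upper and is_ascii_upper are hypothetical names chosen for the example.

```shell
#!/usr/bin/env bash
# Both helpers are hypothetical. The ( ... ) function body runs in a
# subshell, so LC_ALL=C does not affect the calling shell.
strip_upper() (
  LC_ALL=C
  printf '%s\n' "${1//[A-Z]/}"
)

# Succeeds only if the argument is exactly one ASCII uppercase letter.
is_ascii_upper() (
  LC_ALL=C
  [[ $1 =~ ^[A-Z]$ ]]
)

strip_upper 'ABCDabcd0123'
is_ascii_upper b || echo "b is not uppercase"
```

The subshell costs a fork per call, so for tight loops it may be preferable to set the locale once at the top of the script instead.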

Also, with this locale set, [A-Z] no longer matches any lowercase letter, so your negated uppercase test prints every letter from 'a' to 'z':

LC_ALL=C; for l in {a..z}; do [[ $l =~ [A-Z] ]] || echo $l; done

Also, under the above locale setting:

[[ a =~ [A-Z] ]] ; echo $?
1
[[ b =~ [A-Z] ]] ; echo $?
1

but the match succeeds for lowercase ranges:

[[ a =~ [a-z] ]] ; echo $?
0
[[ b =~ [a-z] ]] ; echo $?
0
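To compare locales side by side, the loop from the question can be wrapped in a small helper that takes a locale name and prints whichever lowercase letters the regex range [A-Z] matches under it. This is only a sketch: lower_in_AZ is a hypothetical name, and the result for any locale other than C depends on the collation data installed on the system.

```shell
#!/usr/bin/env bash
# lower_in_AZ LOCALE (hypothetical helper)
# Prints, on one line, every lowercase letter that [A-Z] matches when
# LC_ALL is set to LOCALE. Runs in a subshell so the caller's locale
# is left alone.
lower_in_AZ() (
  LC_ALL=$1
  for l in {a..z}; do
    [[ $l =~ [A-Z] ]] && printf '%s' "$l"
  done
  echo
)

lower_in_AZ C   # an empty line: no lowercase letter falls inside A-Z
```

Running it with the locale from the question, for example lower_in_AZ en_GB.UTF-8, shows which letters leak into the range on a given machine, provided that locale is installed.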


That said, all of this can be avoided by using the POSIX character classes, which work in a new shell without any locale changes:

echo "${var//[[:upper:]]/}"
abcd0123

for l in {a..z}; do [[ $l =~ [[:upper:]] ]] || echo $l; done
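The same classes also work in case patterns. As a small sketch of that, the hypothetical classify_char function below sorts a single character into upper, lower, digit, or other without relying on ranges at all.

```shell
#!/usr/bin/env bash
# classify_char is a hypothetical helper. It uses POSIX character
# classes, so it does not depend on how the locale orders A-Z or a-z.
classify_char() {
  case $1 in
    [[:upper:]]) echo upper ;;
    [[:lower:]]) echo lower ;;
    [[:digit:]]) echo digit ;;
    *)           echo other ;;
  esac
}

for c in A a 7 _; do
  printf '%s: %s\n' "$c" "$(classify_char "$c")"
done
```

For the four ASCII inputs in the loop this reports upper, lower, digit, and other respectively.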
