Regular expression [A-Za-z] seems to not include letter W and w

Question

For some reason, I don't know why, maybe something isn't quite right in my system or in my brain, the regular expression "[A-Z]" doesn't seem to recognise the letter "W" and "[a-z]" doesn't seem to recognise the letter "w". Example:

for x in A a B b C c D d E e F f G g H h I i J j K k L l M m N n O o P p Q q R r S s T t U u V v W w X x Y y Z z; do echo $x | egrep "[A-Za-z]"; done

My output is: A a B b C c D d E e F f G g H h I i J j K k L l M m N n O o P p Q q R r S s T t U u V v X x Y y Z z

As you can see, letters "W" and "w" are both missing. Am I the only one? What could possibly cause this? If it's a bug, where do I report it? This happens in bash and zsh and it happens in sed and egrep (and possibly more, I only tested those two), so the problem seems to be about regular expressions in general… :o So… what is going on??

  • Manjaro 17.1.12
  • XFCE 4.12
  • bash 4.4.23(1)-release (x86_64-unknown-linux-gnu)
  • zsh 5.5.1(x86_64-unknown-linux-gnu)
  • egrep 3.1
  • sed 4.5

Someone asked for my locale, so here it is.

$ locale        
LANG=sv_SE.utf8
LC_CTYPE="sv_SE.utf8"
LC_NUMERIC=sv_SE.UTF-8
LC_TIME=sv_SE.UTF-8
LC_COLLATE="sv_SE.utf8"
LC_MONETARY=sv_SE.UTF-8
LC_MESSAGES="sv_SE.utf8"
LC_PAPER=sv_SE.UTF-8
LC_NAME=sv_SE.UTF-8
LC_ADDRESS=sv_SE.UTF-8
LC_TELEPHONE=sv_SE.UTF-8
LC_MEASUREMENT=sv_SE.UTF-8
LC_IDENTIFICATION=sv_SE.UTF-8
LC_ALL=

If this is the problem, then I guess whatever decides what sv_SE.UTF-8 is, is wrong, because the letter "w" was added to the Swedish alphabet in 2006. Also, if the A-Z interval is dependent on the current locale, shouldn't [A-Ö] work for the whole Swedish alphabet when locale is set to Swedish? It doesn't, it gives an error message. However [[:alpha:]] seems to include all Swedish letters, so I guess I'm happy with that.

Recommended answer

Technically speaking, using range expressions such as [a-z] in a Posix regular expression (as with the grep utility) only has specified behaviour in the Posix (C) locale. That means that you really cannot reliably use range expressions in the sv_SE locale (or any other internationalised locale). You can, however, reliably use character classes, such as [[:lower:]], [[:alpha:]], [[:alnum:]], and so on, and that is normally what you should do.
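
A minimal sketch of the difference, assuming a POSIX shell and a POSIX grep: it feeds the 52 ASCII letters through grep twice, once with the character class [[:alpha:]] in whatever locale is active, and once with the range [A-Za-z] under an explicit LC_ALL=C. The variable names class_count and c_range_count are only illustrative.

```shell
# Put the 52 ASCII letters into the positional parameters.
set -- A a B b C c D d E e F f G g H h I i J j K k L l M m \
       N n O o P p Q q R r S s T t U u V v W w X x Y y Z z

# printf reuses its format for each argument, so this emits one letter per line.
# Character class, evaluated in the current locale: counts lines that match.
class_count=$(printf '%s\n' "$@" | grep -c '[[:alpha:]]')

# Range expression, evaluated in the C locale, where its behaviour is specified.
c_range_count=$(printf '%s\n' "$@" | LC_ALL=C grep -c '[A-Za-z]')

echo "[[:alpha:]] matched $class_count letters"
echo "[A-Za-z] under LC_ALL=C matched $c_range_count letters"
```

Setting LC_ALL=C only for the single grep call leaves the rest of the session in its normal locale. It is a workaround for ASCII-only input, not a fix for the locale definition.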

Having said that, I believe that what you are experiencing is indeed a bug in glibc introduced in v2.28, since previous versions of the sv_SE locale correctly placed w in lower-case ranges and W in upper-case ranges. I think the change does not match user expectations, since it will break regex range expressions which previously worked as expected despite having unspecified behaviour.

The problem was reported as a glibc bug about a month ago, and almost immediately closed for lack of documentation; yesterday, I requested that it be reopened. (Update: that bug was requalified as a duplicate of another bug whose eventual solution can only be a comprehensive solution to the underlying design issue. In other words, the glibc team understand that there is a problem but don't hold your breath for a solution.)

I've put a possible replacement sv_SE locale definition file in this repository, in case it proves to be useful to someone. Please don't install it unless you are experiencing problems with the locale definition from glibc.

My excessively long comment in the bug report linked above tries to lay out the problem, which is more a problem of definition than implementation. The essential problem is that it is very difficult (if not impossible) to define a single-character collation order which is completely consistent with a whole-string comparison order. Reading between the lines in the Posix rationale document, it seems clear that a lot of people banged their heads against this particular brick wall without ever managing to come up with a practical portable proposal with implementation consensus. ("As noted above, efforts were made to resolve the differences, but no solution has been found that would be specific enough to allow for portable software while not invalidating existing implementations.")

A well-intentioned cleanup of the various locale definition files resulted in a change to the character ordering in the Swedish locale. It did not alter the string sortation order, so that V and W continue to be sorted as before (that is, as though they were variant spellings of the same letter rather than different letters), and it did not alter the CTYPE definitions, so W and w continue to be letters (and thus match [[:alpha:]]) as they were before. But it did (accidentally, I believe) alter the character order. Before, W followed V and w followed v, so that W matched [U-X] and w matched [u-x]. The change placed both characters after thorn (þ), which means neither can match any range expression. (Regex range expressions are limited to single-byte codepoints.)
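
One way to check whether a given system is affected is to test each letter against the range separately. The sketch below assumes a POSIX shell; missing_letters is a hypothetical helper name, and its argument is the locale that grep should run under.

```shell
# Print every ASCII letter that the range [A-Za-z] fails to match
# when grep runs under the locale named in $1.
missing_letters() {
    for x in A a B b C c D d E e F f G g H h I i J j K k L l M m \
             N n O o P p Q q R r S s T t U u V v W w X x Y y Z z; do
        # grep -q prints nothing and only sets the exit status.
        printf '%s\n' "$x" | LC_ALL=$1 grep -q '[A-Za-z]' || printf '%s ' "$x"
    done
    echo
}

# In the C locale the range covers all 52 letters, so only a blank line appears.
missing_letters C
```

Calling it as missing_letters sv_SE.UTF-8 on a system like the one in the question should list the letters that dropped out of the range. If the named locale is not installed, glibc typically falls back to C behaviour, so an empty result is only meaningful when the locale actually exists (locale -a lists the installed ones).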

A previous question had been suggested as a duplicate of this question, but I removed the duplicate marker because that question focuses on the wisdom of using [a-z] and not on possible implementation errors, and also because it is about Perl regexes rather than Posix regexes. However, there is a lot of useful information in the answers.
