Should we consider using range [a-z] as a bug?


Question

In my locale (et_EE), [a-z] means:

abcdefghijklmnopqrsšz

So 6 ASCII characters (tuvwxy) and one letter from the Estonian alphabet (ž) are not included. I see a lot of modules that still use regexes like

/\A[0-9A-Z_a-z]+\z/

To me this seems like the wrong way to define the range of ASCII alphanumeric characters, and I think it should be replaced with:

/\A\p{PosixAlnum}+\z/

Is the first one still considered idiomatic? Or an accepted solution? Or a bug?

Or does the last one have some caveats?

Recommended answer

Back in the old Perl 3.0 days, everything was ASCII, and Perl reflected that. \w meant the same thing as [0-9A-Z_a-z]. And we liked it!

However, Perl is no longer bound to ASCII. I stopped using [a-z] a while ago because I got yelled at when programs I wrote didn't work with languages other than English. You can imagine my surprise, as an American, at discovering that there are at least several thousand people in this world who don't speak English.

Perl has better ways of handling [0-9A-Z_a-z] anyway. You can use the [[:alnum:]] set, or simply use \w, which should do the right thing. If you need only lowercase characters, you can use [[:lower:]] instead of [a-z] (which assumes an English-like alphabet). (Perl goes to some lengths to make [a-z] mean just the 26 characters a, b, c, ... z, even on EBCDIC platforms.)
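As a rough illustration of the difference, here is a minimal sketch that runs the three patterns against one word. The sample word (Estonian "šokolaad", written with an escape so the source file stays ASCII) and the variable names are mine, not from the original post, and no locale is in effect.

```perl
use strict;
use warnings;

# Hypothetical sample: "šokolaad", with š given as the code point U+0161.
my $word = "\x{161}okolaad";

# A plain code point range: only the 26 letters a..z, so š fails.
my $in_az_range = $word =~ /\A[a-z]+\z/       ? 1 : 0;

# POSIX class: on a character string this follows Unicode's idea of lowercase.
my $in_lower    = $word =~ /\A[[:lower:]]+\z/ ? 1 : 0;

# \w: Unicode word characters (letters, digits, underscore and friends).
my $in_word     = $word =~ /\A\w+\z/          ? 1 : 0;

print "$in_az_range $in_lower $in_word\n";    # prints "0 1 1"
```

The range rejects the word while both classes accept it, which is the behaviour the paragraph above describes.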

If you need to specify ASCII only, you can add the /a modifier. If you mean locale-specific, you should compile the regular expression within the lexical scope of a 'use locale'. (Avoid the /l modifier, as it applies only to the regular expression pattern and nothing else. For example, in 's/[[:lower:]]/\U$&/lg' the pattern is compiled using locale rules, but the \U is not. This should probably be considered a bug in Perl, but it is the way things currently work. The /l modifier is really only intended for internal bookkeeping and should not be typed in directly.) Actually, it is better to translate your locale data into Unicode on input, use Unicode internally, and translate it back on output. If your locale is one of the newer UTF-8 ones, a feature introduced in 5.16, 'use locale ":not_characters"', lets the other portions of your locale work seamlessly in Perl.
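The "decode on input, Unicode inside, encode on output" advice could look roughly like the sketch below. It assumes the input arrives as UTF-8 bytes; the byte string (Estonian "žürii") and the variable names are hypothetical. It uses the core Encode module, and the /a modifier needs Perl 5.14 or later.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: raw UTF-8 bytes for "žürii", as read from a file or socket.
my $bytes = "\xC5\xBE\xC3\xBCrii";

# Input boundary: turn bytes into a character string.
my $text = decode('UTF-8', $bytes);
my $char_count = length $text;                      # 5 characters, not 7 bytes

# Inside the program the regex engine sees characters, so \w is Unicode-aware...
my $is_word       = $text =~ /\A\w+\z/  ? 1 : 0;    # 1
# ...unless /a restricts \w to the ASCII range.
my $is_ascii_word = $text =~ /\A\w+\z/a ? 1 : 0;    # 0

# Output boundary: uppercase with Unicode rules, then encode back to bytes.
my $out = encode('UTF-8', uc $text);

print "$char_count $is_word $is_ascii_word\n";      # prints "5 1 0"
```

For a legacy single-byte locale, the same structure should apply with a different encoding name passed to decode and encode.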

$word =~ /^[[:alnum:]]+$/;   # $word contains only POSIX alphanumeric characters.
$word =~ /^[[:alnum:]]+$/a;  # $word contains only ASCII alphanumeric characters.
{ use locale;
  $word =~ /^[[:alnum:]]+$/; # $word contains only alphanumeric characters for your locale.
}

Now, is this a bug? If the program doesn't work as intended, it is a bug, plain and simple. If you really want the ASCII sequence [a-z], you should use [[:lower:]] with the /a modifier. If you want all possible lowercase characters, including those in other languages, simply use [[:lower:]].
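To make that choice concrete, here is a small sketch that filters one list both ways. The candidate strings are made up for illustration ("šokk" is written with an escape for š), and /a again assumes Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical candidates: ASCII lowercase, lowercase with a non-ASCII letter, uppercase.
my @candidates = ("abc", "\x{161}okk", "ABC");

# ASCII-only intent: [[:lower:]] under /a matches just a..z.
my @ascii_lower = grep { /\A[[:lower:]]+\z/a } @candidates;

# Any-language intent: [[:lower:]] without /a also accepts š.
my @any_lower   = grep { /\A[[:lower:]]+\z/ } @candidates;

printf "%d %d\n", scalar @ascii_lower, scalar @any_lower;   # prints "1 2"
```

Either way, the pattern states the intent explicitly, which is the point of the answer.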
