Is regex case insensitivity slower?

Question

Source

RegexOptions.IgnoreCase is more expensive than I would have thought (eg, should be barely measurable)

Assuming that this applies to PHP, Python, Perl, Ruby, etc. as well as C# (which is what I assume Jeff was using), how much of a slowdown is it, and will I incur a similar penalty with /[a-zA-Z]/ as I will with /[a-z]/i?

Answer

Yes, [A-Za-z] will be much faster than setting RegexOptions.IgnoreCase, largely because of Unicode strings. But it's also much more limiting: [A-Za-z] does not match accented international characters. It's literally the A-Za-z ASCII set and nothing more.
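If you want to check this on your own engine rather than take it on faith, here is a minimal sketch in Python (not C#, so the timings will not reproduce the RegexOptions.IgnoreCase numbers discussed above). The names `EXPLICIT`, `FLAGGED`, `full` and `seconds_for` are made up for this illustration. It shows that the two patterns agree on ASCII text, that the case-insensitive flag pulls in Unicode case rules (Python's `re` documents that `[a-z]` with `IGNORECASE` also matches a few non-ASCII letters such as the Kelvin sign), and how one might time both.

```python
import re
import timeit

# Explicit ASCII class versus the engine's case-insensitive flag.
EXPLICIT = re.compile(r"[A-Za-z]+")
FLAGGED = re.compile(r"[a-z]+", re.IGNORECASE)

KELVIN_SIGN = "\u212a"  # non-ASCII character whose lowercase form is plain 'k'


def full(pattern: re.Pattern, text: str) -> bool:
    """Return True when the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None


def seconds_for(pattern: re.Pattern, text: str, repeats: int = 200) -> float:
    """Time `repeats` scans of `text`; the result is machine-dependent."""
    return timeit.timeit(lambda: pattern.findall(text), number=repeats)


if __name__ == "__main__":
    # Both patterns accept plain ASCII letters of either case.
    print(full(EXPLICIT, "Hello"), full(FLAGGED, "Hello"))

    # Only the flagged pattern accepts the Kelvin sign, because the flag
    # applies Unicode case rules instead of comparing fixed ASCII ranges.
    print(full(EXPLICIT, KELVIN_SIGN), full(FLAGGED, KELVIN_SIGN))

    # Neither pattern accepts an accented letter such as "é".
    print(full(EXPLICIT, "é"), full(FLAGGED, "é"))

    # Rough timing on an arbitrary sample; measure on your own data.
    sample = "The Quick Brown Fox Jumps Over The Lazy Dog " * 1000
    print("explicit:", seconds_for(EXPLICIT, sample))
    print("flagged: ", seconds_for(FLAGGED, sample))
```

How large the gap is (or whether there is one at all) depends on the regex engine, its version and the input, so treat any single run as a hint rather than a measurement.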

I don't know if you saw Tim Bray's answer to my message, but it's a good one:

One of the trickiest issues in internationalized search is upper and lower case. This notion of case is limited to languages written in the Latin, Greek, and Cyrillic character sets. English-speakers naturally expect search to be case-insensitive if only because they’re lazy: if Nadia Jones wants to look herself up on Google she’ll probably just type in nadia jones and expect the system to take care of it.

So it’s fairly common for search systems to "normalize" words by converting them all to lower- or upper-case, both for indexing and queries.

The trouble is that the mapping between cases is not always as straightforward as it is in English. For example, the German lower-case character "ß" becomes "SS" when upper-cased, and good old capital "I" when down-cased in Turkish becomes the dotless "ı" (yes, they have "i", its upper-case version is "İ"). I have read (but not verified first-hand) that the rules for upcasing accented characters such as "é" are different in France and Québec. One of the results of all this is that software such as java.String.toLowerCase() tends to run astonishingly slow as it tries to work around all these corner-cases.

http://www.tbray.org/ongoing/When/200x/2003/10/11/SearchI18n
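The corner cases in that quote are easy to see from code. The following is a small Python sketch, not taken from the quoted article. Python's built-in string methods are locale-independent, so the Turkish behaviour is imitated with a helper, `turkish_lower`, which is hypothetical and written only for this example, as is `caseless_equal`.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using Unicode case folding."""
    return a.casefold() == b.casefold()


def turkish_lower(text: str) -> str:
    """Hypothetical helper: map the Turkish pairs I->ı and İ->i, then lowercase.

    Python's str.lower() ignores locale, so the two Turkish-specific
    mappings are applied by hand before the generic lowercase step.
    """
    return text.translate({ord("I"): "ı", ord("İ"): "i"}).lower()


if __name__ == "__main__":
    # One character becomes two when upper-cased.
    print("ß".upper())                    # SS

    # The default (non-Turkish) mapping sends "I" to a dotted "i".
    print("I".lower())                    # i

    # Lowercasing "İ" yields "i" plus a combining dot, two code points.
    print(len("İ".lower()))               # 2

    # With the Turkish pairs applied, "I" becomes the dotless "ı".
    print(turkish_lower("ISTANBUL"))      # ıstanbul
    print(turkish_lower("İSTANBUL"))      # istanbul

    # Case folding treats "ß" and "SS" as equal for comparison purposes.
    print(caseless_equal("Straße", "STRASSE"))  # True
```

A case-insensitive regex has to account for mappings like these for every character it compares, which is one plausible reason the flag costs more than a fixed ASCII class.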
