\d is less efficient than [0-9]


Problem description

I made a comment yesterday on an answer where someone had used [0123456789] in a regular expression rather than [0-9] or \d. I said it was probably more efficient to use a range or digit specifier than a character set.

I decided to test that today and found, to my surprise, that (in the C# regex engine at least) \d appears to be less efficient than either of the other two, which don't seem to differ much from each other. Here is my test output over 10000 random strings of 1000 random characters, with 5077 actually containing a digit:

Regular expression \d           took 00:00:00.2141226 result: 5077/10000
Regular expression [0-9]        took 00:00:00.1357972 result: 5077/10000  63.42 % of first
Regular expression [0123456789] took 00:00:00.1388997 result: 5077/10000  64.87 % of first

It's a surprise to me for two reasons:

  1. I would have thought the range would be implemented much more efficiently than the set.
  2. I can't understand why \d is worse than [0-9]. Is there more to \d than simply shorthand for [0-9]?

Here is the test code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Text.RegularExpressions;

namespace SO_RegexPerformance
{
    class Program
    {
        static void Main(string[] args)
        {
            var rand = new Random(1234);
            var strings = new List<string>();
            //10K random strings
            for (var i = 0; i < 10000; i++)
            {
                //Generate random string
                var sb = new StringBuilder();
                for (var c = 0; c < 1000; c++)
                {
                    //Add a-z randomly
                    sb.Append((char)('a' + rand.Next(26)));
                }
                //In roughly 50% of them, put a digit
                if (rand.Next(2) == 0)
                {
                    //Replace one character with a digit, 0-9
                    sb[rand.Next(sb.Length)] = (char)('0' + rand.Next(10));
                }
                strings.Add(sb.ToString());
            }

            var baseTime = testPerfomance(strings, @"\d");
            Console.WriteLine();
            var testTime = testPerfomance(strings, "[0-9]");
            Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
            testTime = testPerfomance(strings, "[0123456789]");
            Console.WriteLine("  {0:P2} of first", testTime.TotalMilliseconds / baseTime.TotalMilliseconds);
        }

        private static TimeSpan testPerfomance(List<string> strings, string regex)
        {
            var sw = new Stopwatch();

            int successes = 0;

            var rex = new Regex(regex);

            sw.Start();
            foreach (var str in strings)
            {
                if (rex.Match(str).Success)
                {
                    successes++;
                }
            }
            sw.Stop();

            Console.Write("Regex {0,-12} took {1} result: {2}/{3}", regex, sw.Elapsed, successes, strings.Count);

            return sw.Elapsed;
        }
    }
}

Solution

\d matches every Unicode decimal digit, while [0-9] is limited to those 10 ASCII characters, so \d has a much larger set to check. Persian digits, ۱۲۳۴۵۶۷۸۹, are one example of Unicode digits that \d matches but [0-9] does not.
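
A minimal sketch of that difference, assuming the default .NET regex options. The class `DigitMatchDemo` and its method names are hypothetical and were introduced only for illustration. The third helper uses `RegexOptions.ECMAScript`, under which .NET restricts \d to [0-9]; it may be worth trying when only ASCII digits are wanted, although this sketch does not measure its speed.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative and not from the original post.
public static class DigitMatchDemo
{
    // Default options: \d matches any character in Unicode category Nd.
    public static bool MatchesUnicodeDigit(string s) => Regex.IsMatch(s, @"\d");

    // Explicit ASCII range: only the ten characters '0' through '9'.
    public static bool MatchesAsciiRange(string s) => Regex.IsMatch(s, "[0-9]");

    // With RegexOptions.ECMAScript, \d is restricted to [0-9].
    public static bool MatchesEcmaDigit(string s) =>
        Regex.IsMatch(s, @"\d", RegexOptions.ECMAScript);

    public static void Main()
    {
        string persianSeven = "\u06F7"; // EXTENDED ARABIC-INDIC DIGIT SEVEN
        Console.WriteLine(MatchesUnicodeDigit(persianSeven)); // True
        Console.WriteLine(MatchesAsciiRange(persianSeven));   // False
        Console.WriteLine(MatchesEcmaDigit(persianSeven));    // False
        Console.WriteLine(MatchesEcmaDigit("7"));             // True
    }
}
```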

You can generate a list of all such characters using the following code:

var sb = new StringBuilder();
for(UInt16 i = 0; i < UInt16.MaxValue; i++)
{
    string str = Convert.ToChar(i).ToString();
    if (Regex.IsMatch(str, @"\d"))
        sb.Append(str);
}
Console.WriteLine(sb.ToString());

Which generates:

0123456789٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨੩੪੫੬੭੮੯૦૧૨૩૪૫૬૭૮૯୦୧୨୩୪୫୬୭୮୯௦௧௨௩௪௫௬௭௮௯౦౧౨౩౪౫౬౭౮౯೦೧೨೩೪೫೬೭೮೯൦൧൨൩൪൫൬൭൮൯๐๑๒๓๔๕๖๗๘๙໐໑໒໓໔໕໖໗໘໙༠༡༢༣༤༥༦༧༨༩၀၁၂၃၄၅၆၇၈၉႐႑႒႓႔႕႖႗႘႙០១២៣៤៥៦៧៨៩᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙０１２３４５６７８９
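
The raw character dump is hard to read, so here is a sketch of a variation that prints the matched characters as code point ranges instead. It uses the same \d test as the code above; the class `DigitRanges` and its `Find` method are hypothetical names introduced for illustration. The number of ranges depends on the Unicode data shipped with the runtime, so no total is assumed here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and not from the original answer.
public static class DigitRanges
{
    // Returns the runs of consecutive UTF-16 code units that \d matches.
    public static List<(int Start, int End)> Find()
    {
        var digit = new Regex(@"\d");
        var ranges = new List<(int Start, int End)>();
        int start = -1; // -1 means "not currently inside a run"
        for (int i = 0; i <= char.MaxValue; i++)
        {
            bool isDigit = digit.IsMatch(((char)i).ToString());
            if (isDigit && start < 0)
            {
                start = i; // a run begins
            }
            else if (!isDigit && start >= 0)
            {
                ranges.Add((start, i - 1)); // a run ends
                start = -1;
            }
        }
        if (start >= 0) ranges.Add((start, char.MaxValue));
        return ranges;
    }

    public static void Main()
    {
        foreach (var (s, e) in Find())
            Console.WriteLine($"U+{s:X4}..U+{e:X4}");
    }
}
```

The first range printed is U+0030..U+0039, the ASCII digits; every later range is a block that \d accepts and [0-9] does not.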
