TryParse without Actual Parsing, or Any Other Alternative for Checking Text Format with a Performance Benefit


Question

I am currently writing my own library, called TextCheckerExtension, which checks the format of a text before further processing (a short code snippet is shown below).

I know that what I am doing is quite similar to Parse or TryParse. The only difference is that my methods do not generate a parsed object; they simply check the string.

My questions are:

  1. Both Parse and TryParse generate a parsed object. When we only want to check the validity of a string input, does the overhead of generating the parsed object really affect performance (is there any example of this)? In other words, would a self-written checking method that generates no parsed object be noticeably faster?
  2. Is there any built-in alternative in C# for checking the validity of various string formats without generating a parsed object?
  3. Could Regex be an alternative?

Any input on this matter will be very much appreciated.

public static bool IsPureHex(string str) {
  return IsPureHex(str, int.MaxValue); //assuming very high value!
}

public static bool IsPureHex(string str, int maxNibble) {
  if (str.Length > maxNibble) //if the length is violated, it is considered failed
    return false;
  for (int i = 0; i < Math.Min(maxNibble, str.Length); i++)
    if (!((char.IsDigit(str, i)) || ((str[i] >= 'A') && (str[i] <= 'F')) || ((str[i] >= 'a') && (str[i] <= 'f'))))
      return false;
  return true;
}

public static bool IsHex(string str) {
  if (str.Length <= 2 || (str[0] != '0') || !((str[1] == 'x') || (str[1] == 'X'))) //Check input validity
    return false;
  for (int i = 2; i < str.Length; i++)
    if (!((char.IsDigit(str, i)) || ((str[i] >= 'A') && (str[i] <= 'F')) || ((str[i] >= 'a') && (str[i] <= 'f'))))
      return false;
  return true;
}

public static bool IsFloat(string str) { //another criterion for float, giving "f" in the last part?
  int dotCounter = 0;
  for (int i = 0; i < str.Length; i++) { //Check if it is float
    if (!(char.IsDigit(str, i)) && (str[i] != '.'))
      return false;
    else if (str[i] == '.')
      ++dotCounter; //Increase the dotCounter whenever dot is found
    if (dotCounter > 1) //If there is more than one dot for whatever reason, return error
      return false;
  }
  return dotCounter == 1 && str.Length > 1;
}

public static bool IsDigitsOnly(string str) {
  foreach (char c in str)
    if (c < '0' || c > '9')
      return false;      
  return str.Length >= 1; //there must be at least one character here to continue
}

public static bool IsInt(string str) { //is not designed to handle null input or empty string
  return str[0] == '-' && str.Length > 1 ? IsDigitsOnly(str.Substring(1)) : IsDigitsOnly(str);
}

Recommended answer

It does make a difference.

To my surprise, as I continued this project out of curiosity, I found that actually parsing a string versus simply checking whether it has a certain format makes a significant difference in time performance.

In the experiment below, a checker that does not parse took 33.77% to 58.26% less time than the built-in TryParse. In addition, I compared my extension with the VB.Net IsNumeric method (the Information class in the Microsoft.VisualBasic assembly).

Here are (1) the tested code, (2) the testing scenarios, (3) the testing code, and (4) the testing results (notes are added in each part where necessary):

Here is the tested code, my extension code named Extension.Checker.Text. So far I have only tested scenarios for generic integer and float/double (with or without a dot, perhaps better termed a fractional number). By generic integer I mean that the value range (such as -128 to 127 for an 8-bit signed integer) is not checked. The code only determines whether a text is an integer as a human would understand it, regardless of its range. The same goes for float/double.

Judging from this post, whose answer had 400+ upvotes at the time this answer was written, I believe it is safe to assume that the usual first attempt at testing whether a generic text is an integer is int.TryParse (even though its range is limited to about -2e9 to 2e9). Some other posts show the same trend. Another approach that appears in those posts is checking with Visual Basic IsNumeric, so I included that method in the benchmark too.

public static bool IsFloatOrDoubleByDot(string str) { //another criterion for float, giving "f" in the last part?
  if (string.IsNullOrWhiteSpace(str))
    return false;
  int dotCounter = 0;
  int digitCounter = 0; //guards against inputs such as "-" or "-." that contain no digit
  for (int i = str[0] == '-' ? 1 : 0; i < str.Length; i++) { //skip a leading minus sign, then scan
    if (str[i] == '.') {
      if (++dotCounter > 1) //more than one dot is not a valid number
        return false;
    } else if (char.IsDigit(str, i))
      ++digitCounter;
    else
      return false; //neither a digit nor a dot
  }
  return digitCounter > 0; //at most one dot and at least one digit
}

public static bool IsDigitsOnly(string str) {
  foreach (char c in str)
    if (c < '0' || c > '9')
      return false;
  return str.Length >= 1; //there must be at least one character here to continue
}

public static bool IsInt(string str) { //null, empty and whitespace-only input return false
  if (string.IsNullOrWhiteSpace(str))
    return false;
  return str[0] == '-' && str.Length > 1 ? IsDigitsOnly(str.Substring(1)) : IsDigitsOnly(str);
}





So far, I have tested four different scenarios:

  • integer text (within the range int.TryParse can parse)
  • float text containing a dot (at most 7 digits of precision, within the range float.TryParse can parse accurately)
  • double text containing a dot (at most 11 digits of precision, within the range double.TryParse can parse accurately)
  • integer text read as float/double text (within the range double.TryParse can parse)

For each scenario, I have four cases to test:

  • valid positive-value text
  • valid negative-value text
  • invalid positive-value text
  • invalid negative-value text

For each case, I measured the time needed to do the check with:

  • the suitable TryParse
  • the suitable Extension.Checker.Text method
  • Visual Basic IsNumeric
  • other type-specific tricks, such as string.All(char.IsDigit) for integers





To test the above scenarios, I use the following data:

string intpos = "1342517340";
string intneg = "-1342517340";
string intfalsepos = "134251734u";
string intfalseneg = "-134251734u";
string floatpos = "56.34251";
string floatneg = "-56.34251";
string floatfalsepos = "56.3425h";
string floatfalseneg = "-56.3425h";
string doublepos = "56.342515312";
string doubleneg = "-56.342515312";
string doublefalsepos = "56.34251531y";
string doublefalseneg = "-56.34251531y";
List<string> liststr = new List<string>() {
    intpos, intneg, intfalsepos, intfalseneg,
    floatpos, floatneg, floatfalsepos, floatfalseneg,
    doublepos, doubleneg, doublefalsepos, doublefalseneg
};
List<string> liststrcode = new List<string>() {
    "i+", "i-", "if+", "if-",
    "f+", "f-", "ff+", "ff-",
    "d+", "d-", "df+", "df-"
};
bool parsed = false; //to store checking result
int intval; //for int.TryParse result
float fval; //for float.TryParse result
double dval; //for double.TryParse result

Each text code is a short label for one test case. Examples:

  • if+ = integer, false (invalid), positive
  • f- = float, negative

I then use the following testing loop to measure the time performance of each method per case:

//time snap
for (int i = 0; i < 10000000; ++i) //for integer case
    parsed = int.TryParse(str, out intval); //built-in TryParse
//time snap
//Print the result
//time snap
for (int i = 0; i < 10000000; ++i)
    parsed = Extension.Checker.Text.IsInt(str); //extension Text checker
//time snap
//Print the result
//time snap
for (int i = 0; i < 10000000; ++i)
    parsed = Information.IsNumeric(str); //Microsoft.VisualBasic
//time snap
//Print the result
//time snap
for (int i = 0; i < 10000000; ++i)
    parsed = str[0] == '-' ? str.Substring(1).All(char.IsDigit) : str.All(char.IsDigit); //misc methods
//time snap
//Print the result
//Print the result difference
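
The `//time snap` placeholders above do not say how the time was taken. As a minimal sketch, assuming `System.Diagnostics.Stopwatch` is used for the snapshots (the class `TimingSketch` and the helper `TimeCheck` are hypothetical names, not part of the answer's code), one loop could be timed like this:

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Hypothetical helper: runs `check` on `text` `iterations` times
    // and returns the elapsed wall-clock time in milliseconds.
    public static long TimeCheck(Func<string, bool> check, string text,
                                 int iterations, out bool lastResult)
    {
        lastResult = false;
        Stopwatch watch = Stopwatch.StartNew();   // first "time snap"
        for (int i = 0; i < iterations; ++i)
            lastResult = check(text);
        watch.Stop();                             // second "time snap"
        return watch.ElapsedMilliseconds;
    }

    // Same digits-only test as the answer's IsDigitsOnly.
    public static bool IsDigitsOnly(string str)
    {
        foreach (char c in str)
            if (c < '0' || c > '9')
                return false;
        return str.Length >= 1;
    }

    public static void Main()
    {
        const int iterations = 10000000; // 10 million, as in the answer
        string str = "1342517340";
        bool result;

        long tryParseMs = TimeCheck(
            s => { int v; return int.TryParse(s, out v); },
            str, iterations, out result);
        Console.WriteLine("TryParse: {0} ms Result: {1}", tryParseMs, result);

        long checkerMs = TimeCheck(IsDigitsOnly, str, iterations, out result);
        Console.WriteLine("Checker: {0} ms Result: {1}", checkerMs, result);
    }
}
```

Calling each method through a delegate adds the same small overhead to every measurement, so absolute numbers from this sketch are not comparable with the log in this answer; only the relative ordering is meaningful.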

I ran 10 million iterations per test case per method on my laptop.

Note: the behavior of my Extension.Checker.Text is not completely equivalent to the built-in TryParse. For example, TryParse checks the numerical range of the string, and it may accept strings in other formats that my checker rejects. This is because the main purpose of Extension.Checker.Text is not to convert the given text into a certain C# data type, as the built-in TryParse does; that is the very point of it. The comparisons made here only compare, in terms of time performance, (1) the popular way of checking a text format with (2) an extension method we could write when we do not need the result of TryParse and only want to know whether a text has a certain format. The same goes for the comparison with VB IsNumeric.
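
To make the note concrete, here is a small sketch that runs the answer's IsInt next to int.TryParse on a few inputs where the two are expected to disagree. The class name `CheckerVersusTryParse` and the sample strings other than the first are assumptions chosen for illustration; they are not part of the benchmark.

```csharp
using System;

public static class CheckerVersusTryParse
{
    // Copies of the answer's checkers so the sketch is self-contained.
    public static bool IsDigitsOnly(string str)
    {
        foreach (char c in str)
            if (c < '0' || c > '9')
                return false;
        return str.Length >= 1;
    }

    public static bool IsInt(string str)
    {
        if (string.IsNullOrWhiteSpace(str))
            return false;
        return str[0] == '-' && str.Length > 1
            ? IsDigitsOnly(str.Substring(1))
            : IsDigitsOnly(str);
    }

    public static void Main()
    {
        string[] samples =
        {
            "1342517340",   // in int range: both checks should agree
            "99999999999",  // all digits but beyond int range: only TryParse checks the range
            " 42 ",         // surrounding spaces: allowed by TryParse's default number style
            "+5"            // leading plus sign: the checker only handles '-'
        };
        foreach (string s in samples)
        {
            int ignored;
            Console.WriteLine("\"{0}\": IsInt={1}, int.TryParse={2}",
                s, IsInt(s), int.TryParse(s, out ignored));
        }
    }
}
```

Whether such differences matter depends on the input you expect; the checker is stricter about layout and more lenient about range.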





I printed the parse/check result to ensure that my extension gives the same result as the built-in TryParse, VB.Net IsNumeric, and the other alternative tricks for the given cases. I also print the original text for easy reading and checking. From the time snapshots taken between the tests, I get the time performance and the time difference for each test case, which I also print. The time gain, however, is only computed against TryParse. Here is the complete result.

[2016-01-05 06:04:25.466 UTC] Integer:
[2016-01-05 06:04:26.999 UTC] TryParse i+:  1531 ms Result: True    Text: 1342517340
[2016-01-05 06:04:27.639 UTC] Extension i+:     639 ms  Result: True    Text: 1342517340
[2016-01-05 06:04:30.345 UTC] VB.IsNumeric i+:  2705 ms Result: True    Text: 1342517340
[2016-01-05 06:04:31.468 UTC] All is digit i+:  1124 ms Result: True    Text: 1342517340
[2016-01-05 06:04:31.469 UTC] Gain on TryParse i+:  892 ms  Percent: -58.26%
[2016-01-05 06:04:31.469 UTC] 
[2016-01-05 06:04:32.996 UTC] TryParse i-:  1527 ms Result: True    Text: -1342517340
[2016-01-05 06:04:33.846 UTC] Extension i-:     849 ms  Result: True    Text: -1342517340
[2016-01-05 06:04:36.413 UTC] VB.IsNumeric i-:  2566 ms Result: True    Text: -1342517340
[2016-01-05 06:04:37.693 UTC] All is digit i-:  1280 ms Result: True    Text: -1342517340
[2016-01-05 06:04:37.694 UTC] Gain on TryParse i-:  678 ms  Percent: -44.40%
[2016-01-05 06:04:37.694 UTC] 
[2016-01-05 06:04:39.058 UTC] TryParse if+:     1364 ms Result: False   Text: 134251734u
[2016-01-05 06:04:39.845 UTC] Extension if+:    786 ms  Result: False   Text: 134251734u
[2016-01-05 06:04:42.436 UTC] VB.IsNumeric if+:     2590 ms Result: False   Text: 134251734u
[2016-01-05 06:04:43.540 UTC] All is digit if+:     1103 ms Result: False   Text: 134251734u
[2016-01-05 06:04:43.540 UTC] Gain on TryParse if+:     578 ms  Percent: -42.38%
[2016-01-05 06:04:43.540 UTC] 
[2016-01-05 06:04:44.937 UTC] TryParse if-:     1397 ms Result: False   Text: -134251734u
[2016-01-05 06:04:45.745 UTC] Extension if-:    807 ms  Result: False   Text: -134251734u
[2016-01-05 06:04:48.275 UTC] VB.IsNumeric if-:     2530 ms Result: False   Text: -134251734u
[2016-01-05 06:04:49.541 UTC] All is digit if-:     1267 ms Result: False   Text: -134251734u
[2016-01-05 06:04:49.542 UTC] Gain on TryParse if-:     590 ms  Percent: -42.23%
[2016-01-05 06:04:49.542 UTC] 
[2016-01-05 06:04:49.542 UTC] Float by Dot:
[2016-01-05 06:04:51.136 UTC] TryParse f+:  1594 ms Result: True    Text: 56.34251
[2016-01-05 06:04:51.967 UTC] Extension f+:     830 ms  Result: True    Text: 56.34251
[2016-01-05 06:04:54.328 UTC] VB.IsNumeric f+:  2360 ms Result: True    Text: 56.34251
[2016-01-05 06:04:54.329 UTC] Time Gain f+:     764 ms  Percent: -47.93%
[2016-01-05 06:04:54.329 UTC] 
[2016-01-05 06:04:55.962 UTC] TryParse f-:  1634 ms Result: True    Text: -56.34251
[2016-01-05 06:04:56.790 UTC] Extension f-:     827 ms  Result: True    Text: -56.34251
[2016-01-05 06:04:59.102 UTC] VB.IsNumeric f-:  2313 ms Result: True    Text: -56.34251
[2016-01-05 06:04:59.103 UTC] Time Gain f-:     807 ms  Percent: -49.39%
[2016-01-05 06:04:59.103 UTC] 
[2016-01-05 06:05:00.623 UTC] TryParse ff+:     1519 ms Result: False   Text: 56.3425h
[2016-01-05 06:05:01.429 UTC] Extension ff+:    802 ms  Result: False   Text: 56.3425h
[2016-01-05 06:05:03.730 UTC] VB.IsNumeric ff+:     2301 ms Result: False   Text: 56.3425h
[2016-01-05 06:05:03.730 UTC] Time Gain ff+:    717 ms  Percent: -47.20%
[2016-01-05 06:05:03.731 UTC] 
[2016-01-05 06:05:05.312 UTC] TryParse ff-:     1581 ms Result: False   Text: -56.3425h
[2016-01-05 06:05:06.147 UTC] Extension ff-:    835 ms  Result: False   Text: -56.3425h
[2016-01-05 06:05:08.485 UTC] VB.IsNumeric ff-:     2337 ms Result: False   Text: -56.3425h
[2016-01-05 06:05:08.486 UTC] Time Gain ff-:    746 ms  Percent: -47.19%
[2016-01-05 06:05:08.486 UTC] 
[2016-01-05 06:05:08.487 UTC] Double by Dot:
[2016-01-05 06:05:10.341 UTC] TryParse d+:  1854 ms Result: True    Text: 56.342515312
[2016-01-05 06:05:11.492 UTC] Extension d+:     1151 ms Result: True    Text: 56.342515312
[2016-01-05 06:05:14.035 UTC] VB.IsNumeric d+:  2541 ms Result: True    Text: 56.342515312
[2016-01-05 06:05:14.035 UTC] Time Gain d+:     703 ms  Percent: -37.92%
[2016-01-05 06:05:14.036 UTC] 
[2016-01-05 06:05:15.916 UTC] TryParse d-:  1879 ms Result: True    Text: -56.342515312
[2016-01-05 06:05:17.051 UTC] Extension d-:     1133 ms Result: True    Text: -56.342515312
[2016-01-05 06:05:19.542 UTC] VB.IsNumeric d-:  2492 ms Result: True    Text: -56.342515312
[2016-01-05 06:05:19.543 UTC] Time Gain d-:     746 ms  Percent: -39.70%
[2016-01-05 06:05:19.543 UTC] 
[2016-01-05 06:05:21.210 UTC] TryParse df+:     1667 ms Result: False   Text: 56.34251531y
[2016-01-05 06:05:22.315 UTC] Extension df+:    1104 ms Result: False   Text: 56.34251531y
[2016-01-05 06:05:24.797 UTC] VB.IsNumeric df+:     2481 ms Result: False   Text: 56.34251531y
[2016-01-05 06:05:24.798 UTC] Time Gain df+:    563 ms  Percent: -33.77%
[2016-01-05 06:05:24.798 UTC] 
[2016-01-05 06:05:26.509 UTC] TryParse df-:     1711 ms Result: False   Text: -56.34251531y
[2016-01-05 06:05:27.596 UTC] Extension df-:    1086 ms Result: False   Text: -56.34251531y
[2016-01-05 06:05:30.039 UTC] VB.IsNumeric df-:     2442 ms Result: False   Text: -56.34251531y
[2016-01-05 06:05:30.040 UTC] Time Gain df-:    625 ms  Percent: -36.53%
[2016-01-05 06:05:30.041 UTC] 
[2016-01-05 06:05:30.041 UTC] Integer as Double by Dot:
[2016-01-05 06:05:31.794 UTC] TryParse (doubled) i+:    1752 ms Result: True    Text: 1342517340
[2016-01-05 06:05:32.904 UTC] Extension (doubled) i+:   1109 ms Result: True    Text: 1342517340
[2016-01-05 06:05:35.590 UTC] VB.IsNumeric (doubled) d+:    2684 ms Result: True    Text: 1342517340
[2016-01-05 06:05:35.590 UTC] Time Gain d+:     643 ms  Percent: -36.70%
[2016-01-05 06:05:35.591 UTC] 
[2016-01-05 06:05:37.390 UTC] TryParse (doubled) i-:    1799 ms Result: True    Text: -1342517340
[2016-01-05 06:05:38.515 UTC] Extension (doubled) i-:   1125 ms Result: True    Text: -1342517340
[2016-01-05 06:05:41.139 UTC] VB.IsNumeric (doubled) d-:    2623 ms Result: True    Text: -1342517340
[2016-01-05 06:05:41.139 UTC] Time Gain d-:     674 ms  Percent: -37.47%
[2016-01-05 06:05:41.140 UTC] 
[2016-01-05 06:05:42.840 UTC] TryParse (doubled) if+:   1700 ms Result: False   Text: 134251734u
[2016-01-05 06:05:43.933 UTC] Extension (doubled) if+:  1092 ms Result: False   Text: 134251734u
[2016-01-05 06:05:46.575 UTC] VB.IsNumeric (doubled) df+:   2642 ms Result: False   Text: 134251734u
[2016-01-05 06:05:46.576 UTC] Time Gain df+:    608 ms  Percent: -35.76%
[2016-01-05 06:05:46.577 UTC] 
[2016-01-05 06:05:48.328 UTC] TryParse (doubled) if-:   1750 ms Result: False   Text: -134251734u
[2016-01-05 06:05:49.434 UTC] Extension (doubled) if-:  1106 ms Result: False   Text: -134251734u
[2016-01-05 06:05:52.042 UTC] VB.IsNumeric (doubled) df-:   2607 ms Result: False   Text: -134251734u
[2016-01-05 06:05:52.042 UTC] Time Gain df-:    644 ms  Percent: -36.80%
[2016-01-05 06:05:52.043 UTC] 
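
The "Gain" lines in the log are consistent with the time saved divided by the TryParse time (the log prints it with a minus sign, as a reduction). The answer does not show the code that computes it, so the following is only a sketch of that arithmetic, with the hypothetical names `GainSketch` and `GainPercent`:

```csharp
using System;
using System.Globalization;

public static class GainSketch
{
    // Percentage of time saved by the checker relative to TryParse.
    public static double GainPercent(long tryParseMs, long checkerMs)
    {
        return 100.0 * (tryParseMs - checkerMs) / tryParseMs;
    }

    public static void Main()
    {
        // i+ case from the log: TryParse 1531 ms, Extension 639 ms.
        Console.WriteLine(GainPercent(1531, 639).ToString("F2", CultureInfo.InvariantCulture));  // 58.26
        // df+ case from the log: TryParse 1667 ms, Extension 1104 ms.
        Console.WriteLine(GainPercent(1667, 1104).ToString("F2", CultureInfo.InvariantCulture)); // 33.77
    }
}
```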

The conclusions I have drawn from the results so far:

  • The best performance gain from an extension method such as the one above occurs when the text is a valid positive integer. The time gain is as much as 58.26% for the given case. Perhaps this is due to the simplicity of valid positive integer text.
  • The worst performance gain from an extension method such as the one above occurs when the text is an invalid positive double. The time gain is only 33.77% for the given case.
  • For the integer and float/double (with or without a dot) text formats, if we only need to check whether a text has one of those formats without actually parsing it yet, building our own text checker can speed up the check compared to using the built-in TryParse. VB IsNumeric is slower than the rest in all cases (this also surprised me, because according to the benchmarking in this post, VB seems to be pretty fast, though not the best).

One possible use of this extension is the case where you receive a string and you know it can be of more than one format (say, integer or double), but you want to determine the actual text type first, without actually parsing it at the time of checking. In such a case, an extension method may speed up the process.
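
As an illustration of this use, the sketch below classifies a string as integer-like, fraction-like, or neither before any parsing happens. The enum `TextKind` and the method `Classify` are hypothetical names, and the dot checker is adapted from the answer's IsFloatOrDoubleByDot with a requirement that at least one digit is present.

```csharp
using System;

public static class TextKindSketch
{
    public enum TextKind { Integer, Fraction, Other }

    public static bool IsDigitsOnly(string str)
    {
        foreach (char c in str)
            if (c < '0' || c > '9')
                return false;
        return str.Length >= 1;
    }

    public static bool IsInt(string str)
    {
        if (string.IsNullOrWhiteSpace(str))
            return false;
        return str[0] == '-' && str.Length > 1
            ? IsDigitsOnly(str.Substring(1))
            : IsDigitsOnly(str);
    }

    // Adapted from the answer: optional leading '-', digits, at most one dot,
    // and at least one digit overall.
    public static bool IsFloatOrDoubleByDot(string str)
    {
        if (string.IsNullOrWhiteSpace(str))
            return false;
        int dots = 0, digits = 0;
        for (int i = str[0] == '-' ? 1 : 0; i < str.Length; i++)
        {
            if (str[i] == '.') { if (++dots > 1) return false; }
            else if (char.IsDigit(str, i)) ++digits;
            else return false;
        }
        return digits > 0;
    }

    // Cheapest test first: integer text also passes the dot checker,
    // so it has to be tried before the fraction test.
    public static TextKind Classify(string text)
    {
        if (IsInt(text)) return TextKind.Integer;
        if (IsFloatOrDoubleByDot(text)) return TextKind.Fraction;
        return TextKind.Other;
    }

    public static void Main()
    {
        foreach (string s in new[] { "1342517340", "-56.34251", "56.3425h" })
            Console.WriteLine("{0} -> {1}", s, Classify(s));
    }
}
```

The actual parse (for example with double.Parse) can then be deferred until the value is really needed.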

Another use is in computational linguistics, where you often want to know the type of a text without actually parsing it for use in computation.
