C# Regex Performance very slow
Problem description
I am very new to regex. I want to parse log files with the following regex:
(?<time>(.*?))[|](?<placeholder4>(.*?))[|](?<source>(.*?))[|](?<level>[1-3])[|](?<message>(.*?))[|][|][|](?<placeholder1>(.*?))[|][|](?<placeholder2>(.*?))[|](?<placeholder3>(.*))
A log line looks like this:
2001.07.13 09:40:20|1|SomeSection|3|====== Some log message::Type: test=sdfsdf|||.\SomeFile.cpp||60|-1
A log file with approx. 3000 lines takes very long to parse. Do you have any hints to speed this up? Thank you...
Update: I use regex because I work with different log files which do not all have the same structure, and I use it this way:
string[] fileContent = File.ReadAllLines(filePath);
Regex pattern = new Regex(LogFormat.GetLineRegex(logFileFormat));
foreach (var line in fileContent)
{
    // Split log line
    Match match = pattern.Match(line);
    string logDate = match.Groups["time"].Value.Trim();
    string logLevel = match.Groups["level"].Value.Trim();
    // And so on...
}
Solution:
Thank you for the help. I've tested it with the following results:
1.) Only added RegexOptions.Compiled:
From 00:01:10.9611143 to 00:00:38.8928387
2.) Used Thomas Ayoub's regex:
From 00:00:38.8928387 to 00:00:06.3839097
3.) Used Wiktor Stribiżew's regex:
From 00:00:06.3839097 to 00:00:03.2150095
Let me "convert" my comment into an answer since now I see what you can do about the regex performance.
As I have mentioned above, replace all .*? with [^|]*, and also all repeating [|][|][|] with [|]{3} (or similar, depending on the number of [|]). Also, do not use nested capturing groups, since that influences performance too!
var logFileFormat = @"(?<time>[^|]*)[|](?<placeholder4>[^|]*)[|](?<source>[^|]*)[|](?<level>[1-3])[|](?<message>[^|]*)[|]{3}(?<placeholder1>[^|]*)[|]{2}(?<placeholder2>[^|]*)[|](?<placeholder3>.*)";
Only the last .* can remain "wildcardish" since it will grab the rest of the line.
Here is a comparison of your and my regex patterns at RegexHero.
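If you want to check this locally, a minimal sketch along these lines may help. It runs both patterns against the sample log line from the question, confirms that the named groups come out the same, and times a few thousand matches with Stopwatch. The class name PatternComparison, the helper names Extract and Time, and the iteration count are made up for illustration; the timings will depend on your machine, and the gap is typically much larger on lines that do not match, where the lazy dots have to backtrack.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class PatternComparison
{
    // Pattern from the question: lazy dots inside nested capturing groups.
    public const string Original =
        @"(?<time>(.*?))[|](?<placeholder4>(.*?))[|](?<source>(.*?))[|](?<level>[1-3])[|](?<message>(.*?))[|][|][|](?<placeholder1>(.*?))[|][|](?<placeholder2>(.*?))[|](?<placeholder3>(.*))";

    // Pattern from the answer: negated character classes, no nested groups.
    public const string Rewritten =
        @"(?<time>[^|]*)[|](?<placeholder4>[^|]*)[|](?<source>[^|]*)[|](?<level>[1-3])[|](?<message>[^|]*)[|]{3}(?<placeholder1>[^|]*)[|]{2}(?<placeholder2>[^|]*)[|](?<placeholder3>.*)";

    // Sample log line from the question.
    public const string SampleLine =
        @"2001.07.13 09:40:20|1|SomeSection|3|====== Some log message::Type: test=sdfsdf|||.\SomeFile.cpp||60|-1";

    private static readonly string[] GroupNames =
    {
        "time", "placeholder4", "source", "level",
        "message", "placeholder1", "placeholder2", "placeholder3"
    };

    // Returns the named group values joined with " ; ",
    // or an empty string when the line does not match.
    public static string Extract(string pattern, string line)
    {
        Match match = Regex.Match(line, pattern);
        if (!match.Success) return string.Empty;

        var values = new string[GroupNames.Length];
        for (int i = 0; i < GroupNames.Length; i++)
            values[i] = match.Groups[GroupNames[i]].Value;
        return string.Join(" ; ", values);
    }

    // Times the given number of matches of one pattern against one line.
    public static TimeSpan Time(string pattern, string line, int iterations)
    {
        var regex = new Regex(pattern);
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            regex.Match(line);
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        // Both patterns should capture the same fields on the sample line.
        Console.WriteLine(Extract(Original, SampleLine) == Extract(Rewritten, SampleLine)); // True
        Console.WriteLine(Extract(Rewritten, SampleLine));

        Console.WriteLine("original:  " + Time(Original, SampleLine, 3000));
        Console.WriteLine("rewritten: " + Time(Rewritten, SampleLine, 3000));
    }
}
```

A single repeated line is not a realistic benchmark, so it is worth feeding the Time helper some lines from a real log file, including ones that do not fit the format.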
Then, use RegexOptions.Compiled:
Regex pattern = new Regex(LogFormat.GetLineRegex(logFileFormat), RegexOptions.Compiled);
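Since a compiled regex costs more to construct, it pays off when the same instance is built once and reused for every line. Here is a rough sketch of how that could look, assuming the pattern is known up front. The class CompiledLogParser and its Parse method are hypothetical, and the pattern is hard-coded instead of coming from your LogFormat.GetLineRegex, which I cannot see. It also checks match.Success, so lines that do not fit the format are skipped rather than producing empty values.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CompiledLogParser
{
    // Built once and reused for every line; RegexOptions.Compiled trades
    // a slower construction for faster matching.
    private static readonly Regex LinePattern = new Regex(
        @"(?<time>[^|]*)[|](?<placeholder4>[^|]*)[|](?<source>[^|]*)[|](?<level>[1-3])[|](?<message>[^|]*)[|]{3}(?<placeholder1>[^|]*)[|]{2}(?<placeholder2>[^|]*)[|](?<placeholder3>.*)",
        RegexOptions.Compiled);

    // Returns (time, level) pairs for every line that fits the format.
    public static List<KeyValuePair<string, string>> Parse(IEnumerable<string> lines)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (string line in lines)
        {
            Match match = LinePattern.Match(line);
            if (!match.Success)
                continue; // skip lines that do not fit the format

            string logDate = match.Groups["time"].Value.Trim();
            string logLevel = match.Groups["level"].Value.Trim();
            result.Add(new KeyValuePair<string, string>(logDate, logLevel));
        }
        return result;
    }

    public static void Main()
    {
        // In-memory stand-in for the log file; the second line does not match.
        var lines = new[]
        {
            @"2001.07.13 09:40:20|1|SomeSection|3|====== Some log message::Type: test=sdfsdf|||.\SomeFile.cpp||60|-1",
            "this line has no pipes"
        };

        foreach (KeyValuePair<string, string> entry in Parse(lines))
            Console.WriteLine(entry.Key + " level " + entry.Value); // 2001.07.13 09:40:20 level 3
    }
}
```

For a real file you could pass File.ReadLines(filePath) to Parse, which streams the lines instead of loading the whole file into an array first.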