Is there a fast way to parse through a large file with regex?
Problem: I have a very large file that I need to parse line by line, extracting 3 values from each line. Everything works, but parsing the whole file takes a long time. Is it possible to do this within seconds? It typically takes between 1 and 2 minutes.
Example file size is 148,208 KB.
I am using a regex to parse every line.
Here is my C# code:
private static void ReadTheLines(int max, Responder rp, string inputFile)
{
    List<int> rate = new List<int>();
    double counter = 1;
    try
    {
        using (var sr = new StreamReader(inputFile, Encoding.UTF8, true, 1024))
        {
            string line;
            Console.WriteLine("Reading....");
            while ((line = sr.ReadLine()) != null)
            {
                if (counter <= max)
                {
                    counter++;
                    rate = rp.GetRateLine(line);
                }
                else if (max == 0)
                {
                    counter++;
                    rate = rp.GetRateLine(line);
                }
            }
            rp.GetRate(rate);
            Console.ReadLine();
        }
    }
    catch (Exception e)
    {
        Console.WriteLine("The file could not be read:");
        Console.WriteLine(e.Message);
    }
}
Here is my regex:
public List<int> GetRateLine(string justALine)
{
    const string reg = @"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$";
    Match match = Regex.Match(justALine, reg, RegexOptions.IgnoreCase);

    // Here we check the Match instance.
    if (match.Success)
    {
        // Finally, we get the Group value and display it.
        string theRate = match.Groups[3].Value;
        Ratestorage.Add(Convert.ToInt32(theRate));
    }
    else
    {
        Ratestorage.Add(0);
    }
    return Ratestorage;
}
Here is an example line to parse; the file usually has around 200,000 lines:
10.10.10.10 - - [27/Nov/2002:16:46:20 -0500] "GET /solr/ HTTP/1.1" 200 4926 789
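To make the pattern concrete, the following minimal sketch applies the regex from the question to the sample line and returns its three captured groups. The class and method names (RateLineDemo, Capture) are hypothetical and only serve the illustration; the pattern itself is copied unchanged from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RateLineDemo
{
    // Same pattern as in the question, held in a single Regex instance.
    private static readonly Regex LinePattern = new Regex(
        @"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$",
        RegexOptions.IgnoreCase);

    // Returns the three captured groups, or null when the line does not match.
    public static string[] Capture(string line)
    {
        Match m = LinePattern.Match(line);
        if (!m.Success)
        {
            return null;
        }
        return new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
    }
}
```

For the sample line, group 1 is the timestamp without the UTC offset (27/Nov/2002:16:46:20), group 2 is 4926, and group 3 is 789, which is the value GetRateLine stores.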
Solution
Memory-mapped files (MMF) and the Task Parallel Library (TPL) can help:
- Create a persisted MMF with multiple random-access views. Each view corresponds to a particular part of the file.
- Define a parsing method that takes a parameter such as IEnumerable<string>, basically to abstract a set of unparsed lines.
- Create and start one TPL task per MMF view, with Parse(IEnumerable<string>) as the task action.
- Each worker task adds its parsed data to a shared queue of type BlockingCollection.
- Another task listens to the BlockingCollection (via GetConsumingEnumerable()) and processes the data already parsed by the worker tasks.
See the Pipelines pattern on MSDN.
Note that this solution requires .NET Framework 4 or later.
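The steps above could be sketched roughly as follows. This is an illustration under several assumptions, not the answerer's actual code: the names LogPipeline, SumRates, FindBoundaries and ReadLines are hypothetical, the consumer simply sums the third captured value as a stand-in for whatever GetRate does, and the views are aligned to newline boundaries so that no line is split between two workers (a detail the answer does not spell out).

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

public static class LogPipeline
{
    // Pattern taken from the question.
    private static readonly Regex LinePattern = new Regex(
        @"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$",
        RegexOptions.IgnoreCase);

    // Worker action: parses a set of unparsed lines and queues one value per line.
    private static void Parse(IEnumerable<string> lines, BlockingCollection<int> queue)
    {
        foreach (string line in lines)
        {
            Match m = LinePattern.Match(line);
            queue.Add(m.Success ? int.Parse(m.Groups[3].Value) : 0);
        }
    }

    // Lazily yields the lines contained in one view of the memory-mapped file.
    private static IEnumerable<string> ReadLines(MemoryMappedFile mmf, long start, long length)
    {
        using (var view = mmf.CreateViewStream(start, length, MemoryMappedFileAccess.Read))
        using (var reader = new StreamReader(view, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }

    // Splits the file into at most `parts` ranges, each ending just after a '\n'.
    private static List<long> FindBoundaries(MemoryMappedFile mmf, long fileLength, int parts)
    {
        var bounds = new List<long> { 0 };
        using (var acc = mmf.CreateViewAccessor(0, fileLength, MemoryMappedFileAccess.Read))
        {
            for (int i = 1; i < parts; i++)
            {
                long pos = Math.Max(fileLength * i / parts, bounds[bounds.Count - 1]);
                while (pos < fileLength && acc.ReadByte(pos) != (byte)'\n') pos++;
                pos = Math.Min(pos + 1, fileLength);
                if (pos > bounds[bounds.Count - 1] && pos < fileLength) bounds.Add(pos);
            }
        }
        bounds.Add(fileLength);
        return bounds;
    }

    public static long SumRates(string path, int workers)
    {
        long fileLength = new FileInfo(path).Length;
        if (fileLength == 0) return 0; // an empty file cannot be mapped

        var queue = new BlockingCollection<int>(10000); // bounded shared queue
        using (var mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        {
            List<long> bounds = FindBoundaries(mmf, fileLength, workers);

            // Consumer: processes values as soon as the workers have parsed them.
            Task<long> consumer = Task.Factory.StartNew(() =>
            {
                long total = 0;
                foreach (int rate in queue.GetConsumingEnumerable()) total += rate;
                return total;
            });

            // One worker task per view.
            Task[] producers = Enumerable.Range(0, bounds.Count - 1)
                .Select(i => Task.Factory.StartNew(() =>
                    Parse(ReadLines(mmf, bounds[i], bounds[i + 1] - bounds[i]), queue)))
                .ToArray();

            try { Task.WaitAll(producers); }
            finally { queue.CompleteAdding(); } // lets the consumer loop finish
            return consumer.Result;
        }
    }
}
```

A call such as LogPipeline.SumRates(inputFile, Environment.ProcessorCount) would start one worker per core. Values reach the consumer in no particular order across views, so this shape fits aggregations like a sum or average; if the original line order matters, each worker would need to tag its results with a position.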