Is there a fast way to parse through a large file with regex?


Problem description


Problem: I have a very large file that I need to parse line by line to get 3 values from each line. Everything works, but it takes a long time to parse through the whole file. Is it possible to do this within seconds? It typically takes between 1 and 2 minutes.

Example file size is 148,208 KB.

I am using regex to parse through every line:

Here is my C# code:

private static void ReadTheLines(int max, Responder rp, string inputFile)
{
    List<int> rate = new List<int>();
    double counter = 1;
    try
    {
        using (var sr = new StreamReader(inputFile, Encoding.UTF8, true, 1024))
        {
            string line;
            Console.WriteLine("Reading....");
            while ((line = sr.ReadLine()) != null)
            {
                if (counter <= max)
                {
                    counter++;
                    rate = rp.GetRateLine(line);
                }
                else if (max == 0)
                {
                    counter++;
                    rate = rp.GetRateLine(line);
                }
            }
            rp.GetRate(rate);
            Console.ReadLine();
        }
    }
    catch (Exception e)
    {
        Console.WriteLine("The file could not be read:");
        Console.WriteLine(e.Message);
    }
}

Here is my regex:

public List<int> GetRateLine(string justALine)
{
    const string reg = @"^\d{1,}.+\[(.*)\s[\-]\d{1,}].+GET.*HTTP.*\d{3}[\s](\d{1,})[\s](\d{1,})$";
    Match match = Regex.Match(justALine, reg,
                                RegexOptions.IgnoreCase);

    // Here we check the Match instance.
    if (match.Success)
    {
        // Finally, we get the Group value and display it.

        string theRate = match.Groups[3].Value;
        Ratestorage.Add(Convert.ToInt32(theRate));
    }
    else
    {
        Ratestorage.Add(0);
    }
    return Ratestorage;
}

Here is an example line to parse; a file usually has around 200,000 lines:

10.10.10.10 - - [27/Nov/2002:16:46:20 -0500] "GET /solr/ HTTP/1.1" 200 4926 789

Solution

Memory-mapped files (MMF) and the Task Parallel Library (TPL) can help here.

  1. Create a persisted MMF with multiple random-access views. Each view corresponds to a particular part of the file.
  2. Define a parsing method that takes a parameter such as IEnumerable<string>, which abstracts a set of lines that have not been parsed yet.
  3. Create and start one TPL task per MMF view, with Parse(IEnumerable<string>) as the task action.
  4. Each worker task adds its parsed data to a shared queue of type BlockingCollection.
  5. Another task listens to the BlockingCollection (via GetConsumingEnumerable()) and processes the data that the worker tasks have already parsed.
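
Step 1 could be sketched roughly as follows. This is only an illustration under some assumptions: the file is UTF-8 (so the byte 0x0A only ever marks a line break), and the names MmfChunks, SplitOnLineBoundaries and ReadLines are hypothetical helpers, not part of any library. The key point is that view boundaries must fall on line breaks, otherwise a line would be cut in half between two views.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Linq;
using System.Text;

static class MmfChunks
{
    // Splits the file into roughly equal byte ranges (start, length). Each range
    // is extended to end right after a '\n', so no line straddles two ranges.
    public static List<Tuple<long, long>> SplitOnLineBoundaries(
        MemoryMappedFile mmf, long fileLength, int parts)
    {
        var ranges = new List<Tuple<long, long>>();
        using (var probe = mmf.CreateViewAccessor(0, fileLength, MemoryMappedFileAccess.Read))
        {
            long start = 0;
            for (int i = 1; i <= parts && start < fileLength; i++)
            {
                long end = (i == parts) ? fileLength : Math.Max(start, fileLength * i / parts);
                // Move forward to just past the next newline (or to end of file).
                while (end < fileLength && probe.ReadByte(end) != (byte)'\n') end++;
                if (end < fileLength) end++;
                if (end > start) ranges.Add(Tuple.Create(start, end - start));
                start = end;
            }
        }
        return ranges;
    }

    // Lazily yields the lines of one byte range through its own view stream.
    public static IEnumerable<string> ReadLines(MemoryMappedFile mmf, long start, long length)
    {
        using (var view = mmf.CreateViewStream(start, length, MemoryMappedFileAccess.Read))
        using (var reader = new StreamReader(view, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }

    static void Main(string[] args)
    {
        string path = args[0]; // path to the log file
        long length = new FileInfo(path).Length;
        using (var mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        {
            foreach (var range in SplitOnLineBoundaries(mmf, length, Environment.ProcessorCount))
                Console.WriteLine("offset {0}, {1} bytes, {2} lines", range.Item1, range.Item2,
                    ReadLines(mmf, range.Item1, range.Item2).Count());
        }
    }
}
```

Each (start, length) pair can then be handed to its own task, with ReadLines supplying the IEnumerable<string> that the parsing method expects.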

See the Pipelines pattern on MSDN.

Note that this solution requires .NET Framework 4 or later.
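
Steps 2 to 5 might look something like the sketch below. It makes several assumptions: the chunks are passed in as plain in-memory sequences (in the design above, each would come from one MMF view), ParseRate is a hypothetical stand-in for the regex-based parser from the question that simply takes the last space-separated field, and the consumer just sums the rates because the question does not show what GetRate does. Results arrive in no particular order, so this only suits processing that does not depend on line order.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

static class LogPipeline
{
    // Hypothetical stand-in for the real per-line parser: takes the last
    // space-separated field as the rate, or 0 if it is not an integer.
    static int ParseRate(string line)
    {
        int rate;
        string last = line.Substring(line.LastIndexOf(' ') + 1);
        return int.TryParse(last, out rate) ? rate : 0;
    }

    // Worker action: parse one chunk of raw lines and push results to the queue.
    static void Parse(IEnumerable<string> lines, BlockingCollection<int> queue)
    {
        foreach (string line in lines)
            queue.Add(ParseRate(line));
    }

    // Runs one worker task per chunk plus one consumer task; returns the sum of rates.
    public static long Run(IEnumerable<IEnumerable<string>> chunks)
    {
        using (var queue = new BlockingCollection<int>(boundedCapacity: 10000))
        {
            // Consumer: drains the queue until CompleteAdding is called.
            Task<long> consumer = Task.Factory.StartNew(() =>
            {
                long total = 0;
                foreach (int rate in queue.GetConsumingEnumerable())
                    total += rate;
                return total;
            }, TaskCreationOptions.LongRunning);

            Task[] workers = chunks
                .Select(chunk => Task.Factory.StartNew(() => Parse(chunk, queue)))
                .ToArray();

            try { Task.WaitAll(workers); }
            finally { queue.CompleteAdding(); } // lets the consumer's loop finish
            return consumer.Result;
        }
    }

    static void Main()
    {
        var chunks = new[]
        {
            new[] { "a 200 4926 789", "b 200 10 11" },
            new[] { "c 404 0 5" }
        };
        Console.WriteLine(Run(chunks)); // prints 805
    }
}
```

The bounded capacity keeps the workers from racing far ahead of the consumer and filling memory; the value 10000 is arbitrary and would need tuning.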
