Read and process files in parallel C#

Problem description

I have very big files that I have to read and process. Can this be done in parallel using Threading?

Here is a bit of code that I've written. But it doesn't seem to run any faster than reading and processing the files one after the other.

String[] files = openFileDialog1.FileNames;

Parallel.ForEach(files, f =>
{
    readTraceFile(f);
});        

private void readTraceFile(String file)
{
    StreamReader reader = new StreamReader(file);
    String line;

    while ((line = reader.ReadLine()) != null)
    {
        String pattern = "\\s{4,}";

        foreach (String trace in Regex.Split(line, pattern))
        {
            if (trace != String.Empty)
            {
                String[] details = Regex.Split(trace, "\\s+");

                Instruction instruction = new Instruction(details[0],
                    int.Parse(details[1]),
                    int.Parse(details[2]));
                Console.WriteLine("computing...");
                instructions.Add(instruction);
            }
        }
    }
}

Solution

It looks like your application's performance is mostly limited by IO. However, you still have a bit of CPU-bound work in your code. These two bits of work are interdependent: your CPU-bound work cannot start until the IO has done its job, and the IO does not move on to the next work item until your CPU has finished with the previous one. They're both holding each other up. Therefore, it is possible (explained at the very bottom) that you will see an improvement in throughput if you perform your IO- and CPU-bound work in parallel, like so:

void ReadAndProcessFiles(string[] filePaths)
{
    // Our thread-safe collection used for the handover.
    var lines = new BlockingCollection<string>();

    // Build the pipeline.
    var stage1 = Task.Run(() =>
    {
        try
        {
            foreach (var filePath in filePaths)
            {
                using (var reader = new StreamReader(filePath))
                {
                    string line;

                    while ((line = reader.ReadLine()) != null)
                    {
                        // Hand over to stage 2 and continue reading.
                        lines.Add(line);
                    }
                }
            }
        }
        finally
        {
            lines.CompleteAdding();
        }
    });

    var stage2 = Task.Run(() =>
    {
        // Process lines on a ThreadPool thread
        // as soon as they become available.
        foreach (var line in lines.GetConsumingEnumerable())
        {
            String pattern = "\\s{4,}";

            foreach (String trace in Regex.Split(line, pattern))
            {
                if (trace != String.Empty)
                {
                    String[] details = Regex.Split(trace, "\\s+");

                    Instruction instruction = new Instruction(details[0],
                        int.Parse(details[1]),
                        int.Parse(details[2]));
                    Console.WriteLine("computing...");
                    instructions.Add(instruction);
                }
            }
        }
    });

    // Block until both tasks have completed.
    // This makes this method prone to deadlocking.
    // Consider using 'await Task.WhenAll' instead.
    Task.WaitAll(stage1, stage2);
}
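
Because the files are very big, it may be worth capping how many lines can sit in the handover queue, so that a fast reader cannot buffer most of a file in memory. The following is only a minimal sketch of that idea: `BoundedHandover`, `Run`, and the capacity value are hypothetical names chosen for illustration, and generated strings stand in for the file lines.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class BoundedHandover
{
    // Pushes itemCount items through a bounded queue and
    // returns how many of them the consumer received.
    public static int Run(int itemCount, int capacity)
    {
        // The constructor argument is the bounded capacity of the collection.
        using (var lines = new BlockingCollection<string>(capacity))
        {
            var producer = Task.Run(() =>
            {
                try
                {
                    for (int i = 0; i < itemCount; i++)
                    {
                        // Add blocks while the queue is full, which
                        // throttles the producer to the consumer's pace.
                        lines.Add("line " + i);
                    }
                }
                finally
                {
                    lines.CompleteAdding();
                }
            });

            int consumed = 0;
            var consumer = Task.Run(() =>
            {
                foreach (var line in lines.GetConsumingEnumerable())
                {
                    consumed++;
                }
            });

            Task.WaitAll(producer, consumer);
            return consumed;
        }
    }
}
```

In `ReadAndProcessFiles` the equivalent change would be passing a capacity to the `BlockingCollection<string>` constructor. A suitable value depends on line length and available memory, so it is best found by measuring.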

I highly doubt that it's your CPU work holding things up, but if it happens to be the case, you can also parallelise stage 2 like so:

    var stage2 = Task.Run(() =>
    {
        var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };

        Parallel.ForEach(lines.GetConsumingEnumerable(), parallelOptions, line =>
        {
            String pattern = "\\s{4,}";

            foreach (String trace in Regex.Split(line, pattern))
            {
                if (trace != String.Empty)
                {
                    String[] details = Regex.Split(trace, "\\s+");

                    Instruction instruction = new Instruction(details[0],
                        int.Parse(details[1]),
                        int.Parse(details[2]));
                    Console.WriteLine("computing...");
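                    // Several threads reach this call at once, so 'instructions'
                    // must be a thread-safe collection (or be guarded by a lock).
                    instructions.Add(instruction);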
                }
            }
        });
    });
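
One caveat with the parallel variant: several threads now add to `instructions` at the same time. The question does not show what type `instructions` is, so the sketch below assumes a `ConcurrentBag<Instruction>` and uses a hypothetical stand-in for the `Instruction` class. `ParallelStage`, `ParseLine`, and `ProcessInParallel` are illustrative names, not part of the original code.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

// Hypothetical stand-in for the Instruction type used in the question.
public sealed class Instruction
{
    public Instruction(string name, int first, int second)
    {
        Name = name;
        First = first;
        Second = second;
    }

    public string Name { get; }
    public int First { get; }
    public int Second { get; }
}

public static class ParallelStage
{
    // Regex instances are safe to share between threads.
    private static readonly Regex TraceSeparator = new Regex(@"\s{4,}");
    private static readonly Regex FieldSeparator = new Regex(@"\s+");

    // Splits one line into traces and parses each trace into an Instruction.
    public static IEnumerable<Instruction> ParseLine(string line)
    {
        foreach (string trace in TraceSeparator.Split(line))
        {
            if (trace.Length == 0) continue;

            string[] details = FieldSeparator.Split(trace);
            yield return new Instruction(details[0],
                int.Parse(details[1]),
                int.Parse(details[2]));
        }
    }

    // Consumes lines on several threads and collects the results
    // in a collection that tolerates concurrent Add calls.
    public static ConcurrentBag<Instruction> ProcessInParallel(BlockingCollection<string> lines)
    {
        var instructions = new ConcurrentBag<Instruction>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };

        Parallel.ForEach(lines.GetConsumingEnumerable(), options, line =>
        {
            foreach (var instruction in ParseLine(line))
            {
                instructions.Add(instruction);
            }
        });

        return instructions;
    }
}
```

Note that a `ConcurrentBag` does not preserve the order of the lines. If order matters, the single-consumer version of stage 2 is the simpler choice.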

Mind you, if your CPU work component is negligible in comparison to the IO component, you won't see much speed-up. The more even the workload is, the better the pipeline is going to perform in comparison with sequential processing.

Since we're talking about performance, note that I am not particularly thrilled about the number of blocking calls in the above code. If I were doing this in my own project, I would have gone the async/await route. I chose not to do so in this case because I wanted to keep things easy to understand and easy to integrate.
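
For reference, the async/await route could look roughly like the sketch below. It is not the code from the answer: it swaps `BlockingCollection` for `System.Threading.Channels` and uses `await foreach`, so it assumes C# 8 and .NET Core 3.0 or later. `AsyncPipeline`, `ReadAndProcessFilesAsync`, the `processLine` callback, and the capacity of 1024 are all hypothetical choices made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class AsyncPipeline
{
    // Reads every line of every file and hands each line to processLine,
    // overlapping the reading and the processing without blocking a thread.
    public static async Task ReadAndProcessFilesAsync(
        IEnumerable<string> filePaths, Action<string> processLine)
    {
        // Bounded so that the reader cannot run arbitrarily far ahead.
        var channel = Channel.CreateBounded<string>(1024);

        async Task ProduceAsync()
        {
            try
            {
                foreach (var filePath in filePaths)
                {
                    using (var reader = new StreamReader(filePath))
                    {
                        string line;
                        while ((line = await reader.ReadLineAsync()) != null)
                        {
                            // Waits asynchronously while the channel is full.
                            await channel.Writer.WriteAsync(line);
                        }
                    }
                }
            }
            finally
            {
                // Lets the consumer's loop end once the buffer is drained.
                channel.Writer.Complete();
            }
        }

        async Task ConsumeAsync()
        {
            await foreach (var line in channel.Reader.ReadAllAsync())
            {
                processLine(line);
            }
        }

        // Task.Run moves both stages onto thread pool threads, as in the
        // blocking version; nothing blocks while waiting for them.
        var stage1 = Task.Run(() => ProduceAsync());
        var stage2 = Task.Run(() => ConsumeAsync());

        await Task.WhenAll(stage1, stage2);
    }
}
```

A caller would pass the file names and a callback that holds the regex splitting and parsing from stage 2, for example `await AsyncPipeline.ReadAndProcessFilesAsync(files, ProcessLine);`, where `ProcessLine` is whatever method wraps that logic.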
