.NET's Multi-threading vs Multi-processing: Awful Parallel.ForEach Performance


Problem Description



I have coded a very simple "Word Count" program that reads a file and counts each word's occurrence in the file. Here is a part of the code:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

class Alaki
{
    private static List<string> input = new List<string>();

    private static void exec(int threadcount)
    {
        ParallelOptions options = new ParallelOptions();
        options.MaxDegreeOfParallelism = threadcount;
        Parallel.ForEach(Partitioner.Create(0, input.Count), options, (range) =>
        {
            var dic = new Dictionary<string, List<int>>();
            for (int i = range.Item1; i < range.Item2; i++)
            {
                // make some delay!
                //for (int x = 0; x < 400000; x++) ;

                var tokens = input[i].Split();
                foreach (var token in tokens)
                {
                    if (!dic.ContainsKey(token))
                        dic[token] = new List<int>();
                    dic[token].Add(1);
                }
            }
        });
    }

    public static void Main(String[] args)
    {
        StreamReader reader = new StreamReader(@"c:\txt-set\agg.txt");
        while (true)
        {
            var line = reader.ReadLine();
            if (line == null)
                break;
            input.Add(line);
        }

        DateTime t0 = DateTime.Now;
        exec(Environment.ProcessorCount);
        Console.WriteLine("Parallel:  " + (DateTime.Now - t0));
        t0 = DateTime.Now;
        exec(1);
        Console.WriteLine("Serial:  " + (DateTime.Now - t0));
    }
}

It is simple and straightforward. I use a dictionary to count each word's occurrences. The style is roughly based on the MapReduce programming model. As you can see, each task uses its own private dictionary. So there are NO shared variables; just a bunch of tasks that count words by themselves. Here is the output when the code is run on a quad-core i7 CPU:

Parallel: 00:00:01.6220927
Serial: 00:00:02.0471171

The speedup is only about 1.25, which is a tragedy! But when I add some delay while processing each line, I can reach speedups of about 4.

In the original parallel execution with no delay, CPU utilization hardly reaches 30%, so the speedup is not promising. But when I add some delay, CPU utilization reaches 97%.

Firstly, I thought the cause was the IO-bound nature of the program (though I think inserting into a dictionary is to some extent CPU-intensive), and that seemed logical because all of the threads read data over a shared memory bus. However, the surprising point is that when I run 4 instances of the serial program (with no delays) simultaneously, CPU utilization rises to about 100% and all four instances finish in about 2.3 seconds!

This means that when the code runs in a multi-processing configuration, it reaches a speedup of about 3.5, but when it runs in a multi-threading configuration, the speedup is only about 1.25.

What is your idea? Is there anything wrong with my code? I think there is no shared data at all, so the code should not experience any contention. Is there a flaw in .NET's runtime?

Thanks in advance.

Solution

Parallel.For doesn't divide the input into n pieces (where n is the MaxDegreeOfParallelism); instead it creates many small batches and makes sure that at most n are being processed concurrently. (This is so that if one batch takes a very long time to process, Parallel.For can still be running work on other threads. See Parallelism in .NET - Part 5, Partitioning of Work for more details.)
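To make this batching concrete, here is a small sketch (my own illustration, not from the question) that counts how many batches Parallel.ForEach actually processes. With the default Partitioner.Create(0, n) the body runs once per small sub-range, many more times than the degree of parallelism; the three-argument overload with an explicit range size lets you force exactly ceil(n / rangeSize) batches:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

class BatchCountDemo
{
    public static int CountBatches(OrderablePartitioner<Tuple<int, int>> partitioner)
    {
        int batches = 0;
        Parallel.ForEach(partitioner, range =>
        {
            // Each call to this body corresponds to one batch (sub-range),
            // so any per-batch setup (e.g. allocating a Dictionary) runs here.
            Interlocked.Increment(ref batches);
        });
        return batches;
    }

    static void Main()
    {
        const int n = 1_000_000;

        // Default partitioner: many small batches.
        int defaultBatches = CountBatches(Partitioner.Create(0, n));

        // Explicit range size: exactly n / 250_000 = 4 batches.
        int coarseBatches = CountBatches(Partitioner.Create(0, n, 250_000));

        Console.WriteLine($"default: {defaultBatches}, coarse: {coarseBatches}");
    }
}
```

With the coarse partitioner, each of the 4 batches allocates one dictionary, which is one way to reduce the allocation churn described below.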

Due to this design, your code is creating and throwing away dozens of Dictionary objects, hundreds of List objects, and thousands of String objects. This is putting enormous pressure on the garbage collector.
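One way to cut those allocations — a sketch of my own, not the answerer's exact rewrite — is the Parallel.ForEach overload with localInit/localFinally, which gives each worker thread a single dictionary for its entire lifetime (rather than one per batch), and an int counter per word instead of a List<int> of ones:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

class WordCount
{
    static readonly List<string> input = new List<string> { "a b a", "b c", "a" };

    public static ConcurrentDictionary<string, int> Count()
    {
        var totals = new ConcurrentDictionary<string, int>();

        Parallel.ForEach(
            Partitioner.Create(0, input.Count),
            // localInit: one dictionary per worker thread, not one per batch.
            () => new Dictionary<string, int>(),
            (range, state, dic) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                {
                    foreach (var token in input[i].Split())
                    {
                        // An int counter instead of a List<int> of 1s avoids
                        // allocating a List per distinct word per batch.
                        dic.TryGetValue(token, out int n);
                        dic[token] = n + 1;
                    }
                }
                return dic;
            },
            // localFinally: merge each thread's counts once, at the end.
            dic =>
            {
                foreach (var kv in dic)
                    totals.AddOrUpdate(kv.Key, kv.Value, (_, old) => old + kv.Value);
            });

        return totals;
    }

    static void Main()
    {
        var counts = Count();
        Console.WriteLine($"a={counts["a"]}, b={counts["b"]}, c={counts["c"]}");
        // prints "a=3, b=2, c=1" for the sample input above
    }
}
```

This keeps the dictionary count proportional to the number of threads instead of the number of batches, which is exactly the kind of "fewer temporary objects" rewrite the answer suggests.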

Running PerfMonitor on my computer reports that 43% of the total run time is spent in GC. If you rewrite your code to use fewer temporary objects, you should see the desired 4x speedup. Some excerpts from the PerfMonitor report follow:

Over 10% of the total CPU time was spent in the garbage collector. Most well tuned applications are in the 0-10% range. This is typically caused by an allocation pattern that allows objects to live just long enough to require an expensive Gen 2 collection.

This program had a peak GC heap allocation rate of over 10 MB/sec. This is quite high. It is not uncommon that this is simply a performance bug.

Edit: As per your comment, I will attempt to explain the timings you reported. On my computer, with PerfMonitor, I measured between 43% and 52% of time spent in GC. For simplicity, let's assume that 50% of the CPU time is work, and 50% is GC. Thus, if we make the work 4× faster (through multi-threading) but keep the amount of GC the same (this will happen because the number of batches being processed happened to be the same in the parallel and serial configurations), the best improvement we could get is 62.5% of the original time, or 1.6×.
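The arithmetic behind that 1.6× bound can be written out directly (the 50/50 work/GC split is the answer's simplifying assumption, not a measured constant):

```csharp
using System;

class SpeedupBound
{
    static void Main()
    {
        // Assume 50% of serial time is parallelisable work, 50% is GC
        // that stays serial and unchanged.
        double work = 0.5, gc = 0.5, cores = 4;

        double parallelTime = work / cores + gc; // 0.125 + 0.5 = 0.625 of original
        double speedup = 1.0 / parallelTime;     // 1 / 0.625 = 1.6x

        Console.WriteLine($"best-case time fraction: {parallelTime}, speedup: {speedup}");
    }
}
```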

However, we only see a 1.25× speedup because GC isn't multithreaded by default (in workstation GC). As per Fundamentals of Garbage Collection, all managed threads are paused during a Gen 0 or Gen 1 collection. (Concurrent and background GC, in .NET 4 and .NET 4.5, can collect Gen 2 on a background thread.) Your program experiences only a 1.25× speedup (and you see 30% CPU usage overall) because the threads spend most of their time being paused for GC (because the memory allocation pattern of this test program is very poor).

If you enable server GC, it will perform garbage collection on multiple threads. If I do this, the program runs 2× faster (with almost 100% CPU usage).
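For a .NET Framework application like this one, server GC is switched on in App.config (for modern .NET, the equivalent is the `ServerGarbageCollection` project property):

```xml
<!-- App.config -->
<configuration>
  <runtime>
    <gcServer enabled="true" />
  </runtime>
</configuration>
```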

When you run four instances of the program simultaneously, each has its own managed heap, and the garbage collection for the four processes can execute in parallel. This is why you see 100% CPU usage (each process is using 100% of one CPU). The slightly longer overall time (2.3s for all vs 2.05s for one) is possibly due to inaccuracies in measurement, contention for the disk, time taken to load the file, having to initialise the threadpool, overhead of context switching, or some other environment factor.
