Task Parallel is unstable, using 100% CPU at times


Problem description

I'm currently testing out Parallel in C#. Generally it works fine, and using Parallel is faster than the normal foreach loops. However, at times (like 1 out of 5 times), my CPU will reach 100% usage, causing the parallel tasks to be very slow. My setup is an i5-4570 with 8 GB RAM. Does anyone have any idea why this problem occurs?

Below is the code I used to test the function:

            // testData is the input List<int>, populated beforehand.

            // Using normal foreach
            ConcurrentBag<int> resultData = new ConcurrentBag<int>();
            Stopwatch sw = new Stopwatch();
            sw.Start();
            foreach (var item in testData)
            {
                if (item.Equals(1))
                {
                    resultData.Add(item);
                }
            }
            Console.WriteLine("Normal ForEach " + sw.ElapsedMilliseconds);

            // Using list parallel for (note: Parallel.For's upper bound is exclusive)
            resultData = new ConcurrentBag<int>();
            sw.Restart();
            System.Threading.Tasks.Parallel.For(0, testData.Count(), (i, loopState) =>
            {
                int data = testData[i];
                if (data.Equals(1))
                {
                    resultData.Add(data);
                }
            });
            Console.WriteLine("List Parallel For " + sw.ElapsedMilliseconds);

            // Using list parallel foreach
            resultData = new ConcurrentBag<int>();
            sw.Restart();
            System.Threading.Tasks.Parallel.ForEach(testData, (item, loopState) =>
            {
                if (item.Equals(1))
                {
                    resultData.Add(item);
                }
            });
            Console.WriteLine("List Parallel ForEach " + sw.ElapsedMilliseconds);

            // Using concurrent parallel for
            ConcurrentStack<int> resultData2 = new ConcurrentStack<int>();
            sw.Restart();
            System.Threading.Tasks.Parallel.For(0, testData.Count(), (i, loopState) =>
            {
                int data = testData[i];
                if (data.Equals(1))
                {
                    resultData2.Push(data);
                }
            });
            Console.WriteLine("Concurrent Parallel For " + sw.ElapsedMilliseconds);

            // Using concurrent parallel foreach
            resultData2.Clear();
            sw.Restart();
            System.Threading.Tasks.Parallel.ForEach(testData, (item, loopState) =>
            {
                if (item.Equals(1))
                {
                    resultData2.Push(item);
                }
            });
            Console.WriteLine("Concurrent Parallel ForEach " + sw.ElapsedMilliseconds);

Normal output:

Normal ForEach 493
List Parallel For 315
List Parallel ForEach 328
Concurrent Parallel For 286
Concurrent Parallel ForEach 292

Output when the problem occurs:

Normal ForEach 476
List Parallel For 8047
List Parallel ForEach 276
Concurrent Parallel For 281
Concurrent Parallel ForEach 3960

(This can occur during any of the parallel tasks; the above is only one instance.)

By using the PLINQ method provided by @willaien and running it 100 times, the problem no longer occurs. I still have no idea why the issue surfaced in the first place, though.

var resultData3 = testData.AsParallel().Where(x => x == 1).ToList();
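A minimal sketch of that repeated test (the timing loop is my illustration, reusing the Stopwatch from the code above):

    for (int run = 0; run < 100; run++)
    {
        sw.Restart();
        var plinqResult = testData.AsParallel().Where(x => x == 1).ToList();
        Console.WriteLine("PLINQ " + sw.ElapsedMilliseconds);
    }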

Recommended answer

First of all, be careful with Parallel - it doesn't shield you from thread-safety issues. In your original code, you used non-thread-safe code when filling the list of results. In general, you want to avoid sharing any state (although read-only access to the list is fine in a case like this). If you really want to use Parallel.For or Parallel.ForEach for filtering and aggregation (really, AsParallel is what you want in those cases), you should use the overload with thread-local state - you'd do the final result aggregation in the localFinally delegate (note that it still runs on a different thread, so you need to ensure thread safety; however, locking is fine in that case, since you're only doing it once per thread rather than on every iteration).
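For illustration, a minimal sketch of that thread-local-state overload, assuming testData is a List<int> as in the question (the variable names are mine):

    var merged = new List<int>();
    object mergeLock = new object();
    Parallel.For(0, testData.Count,
        () => new List<int>(),            // localInit: a fresh list per worker thread
        (i, loopState, local) =>          // body: accumulate into the thread-local list
        {
            if (testData[i] == 1)
                local.Add(testData[i]);
            return local;
        },
        local =>                          // localFinally: runs once per thread
        {
            lock (mergeLock)
                merged.AddRange(local);
        });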

Now, the obvious first thing to try in a problem like this is to use a profiler. So I've done that. The results are as follows:

  • There are hardly any memory allocations in either of those solutions. They are entirely dwarfed by the initial test-data allocation, even for relatively small test data (I used 1M, 10M, and 100M integers when testing).
  • The work being done is inside the Parallel.For and Parallel.ForEach machinery itself, not in your code (the simple if (data[i] == 1) results.Add(data[i])).

The first means we can say the GC probably isn't the culprit; indeed, it doesn't get any chance to run. The second is more curious - it means that in some cases the overhead of Parallel is way out of line, but it's seemingly random: sometimes everything works without a hitch, and sometimes it takes half a second. This would usually point to the GC, but we've ruled that out already.

I've tried using the overload without the loop state, but that didn't help. I've tried limiting MaxDegreeOfParallelism, but it only ever hurt things. Now, obviously, this code is absolutely dominated by cache access - there's hardly any CPU work and no I/O - which will always favour a single-threaded solution; but even a MaxDegreeOfParallelism of 1 doesn't help - indeed, 2 seems to be the fastest on my system. More is useless - again, cache access dominates. It's still curious: I'm using a server CPU for the tests, which has plenty of cache for all of the data at once, and while we're not doing 100% sequential access (which pretty much gets rid of the latency entirely), the access should be sequential enough. Regardless, we have a baseline of memory throughput in the single-threaded solution, and it's very close to the speed of the parallelised case when that works well (parallelised, I'm seeing 40% less runtime than single-threaded, on a four-core server CPU, for an embarrassingly parallel problem - again, obviously, memory access is the limit).
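For reference, capping the degree of parallelism looks roughly like this (a sketch; as discussed above, tuning this didn't fix the anomaly, and resultData stands in for a ConcurrentBag<int> as in the question's code):

    var options = new ParallelOptions { MaxDegreeOfParallelism = 2 };
    Parallel.For(0, testData.Count, options, i =>
    {
        if (testData[i] == 1)
            resultData.Add(testData[i]);
    });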

So, it's time to check the reference source for Parallel.For. In a case like this, it simply creates ranges based on the number of workers - one range for each. So it's not the ranges - there's no overhead from that. The core simply runs a task that iterates over the given range. There are a few interesting bits - for example, the task will get "suspended" if it takes too long. However, that doesn't seem to fit the data too well - why would something like this cause random delays unrelated to the data size? No matter how small the work item and no matter how low MaxDegreeOfParallelism is, we get "random" slowdowns. It might be a problem, but I have no idea how to check it.

The most interesting thing is that expanding the test data does nothing about the anomaly - while it makes the "good" parallel runs much faster (even getting close to perfect efficiency in my tests, oddly enough), the "bad" ones are still just as bad. In fact, in a few of my test runs, they are absurdly bad (up to ten times slower than the "normal" loop).

So, let's have a look at the threads. I've artificially bumped the number of threads in the ThreadPool to ensure that growing the thread pool isn't a bottleneck (it shouldn't be if everything worked well, but...). And here comes the first surprise - while the "good" runs simply use the 4-8 threads that make sense, the "bad" runs expand over all the available threads in the pool, even if there are a hundred of them. Oops?
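A sketch of the kind of bump described here - the specific value of 128 is my assumption, not the exact number used in the tests:

    // Raise the pool's minimum worker-thread count so that thread-injection
    // delays can't be what is being measured.
    ThreadPool.SetMinThreads(workerThreads: 128, completionPortThreads: 128);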

Let's dive into the source code once again. Parallel internally uses Task.RunSynchronously to run the root partitioned work job, and Waits on the result. When I look at the parallel stacks, there are 97 threads executing the loop body, and only one that actually has RunSynchronously on its stack (as expected - that's the main thread). The others are plain thread-pool threads. The task IDs also tell a story - there are thousands of individual tasks being created while doing the iteration. Obviously, something is very wrong here. Even if I remove the whole loop body, this still happens, so it's not some closure weirdness either.

Explicitly setting MaxDegreeOfParallelism offsets this somewhat - the number of threads used no longer explodes - however, the number of tasks still does. But we've already seen that the ranges are just the number of parallel tasks running - so why keep creating more and more tasks? Using the debugger confirms this: with a MaxDOP of 4, there are only five ranges (some alignment causes the fifth range). Interestingly, one of the completed ranges (how did the first one finish so far ahead of the rest?) has an index higher than the range it iterates over - this is because the "scheduler" assigns range-partitions in slices of up to 16.

The root task is self-replicating - instead of explicitly starting e.g. four tasks to handle the data, it waits for the scheduler to replicate the task to handle more data. It's kind of hard to read - we're talking about complex multi-threaded lock-less code - but it seems that it always assigns work in slices much smaller than the partitioned ranges. In my testing, the maximal size of a slice was 16 - a far cry from the millions of items I'm running. Sixteen iterations with a body like this take no time at all, which can lead to many issues with the algorithm (the biggest being the infrastructure taking more CPU work than the actual iterator body). In some cases, cache thrashing might impact performance even further (perhaps when there's a lot of variation in the body runtimes), but most of the time the access is sequential enough.

TL;DR

Don't use Parallel.For or Parallel.ForEach if your work per iteration is very short (on the order of milliseconds). AsParallel, or just running the iteration single-threaded, will most likely be much faster.

The slightly longer explanation:

Parallel.For and Parallel.ForEach seem to be designed for scenarios where the individual items you're iterating over take a substantial amount of time to execute (i.e. lots of work per item, not tiny amounts of work over a lot of items). They perform badly when the iterator body is too short. If you're not doing substantial work in the iterator body, use AsParallel instead of Parallel.*. The sweet spot seems to be somewhere under 150ms per slice (around 10ms per iteration). Otherwise, Parallel.* will spend tons of time in its own code and hardly any time doing your iteration (in my case, the usual number was somewhere around 5-10% of the time in the body - embarrassingly bad).

Sadly, I didn't find any warning about this on MSDN - there are even samples that go over substantial amounts of data, with no hint of the terrible performance hit of doing so. Testing the very same sample code on my computer, I found that it is indeed often slower than a single-threaded iteration, and at the best of times barely faster (around 30-40% time savings while running on four CPU cores - not very efficient).

Willaien found a mention on MSDN about this very issue and how to solve it: https://msdn.microsoft.com/en-us/library/dd560853(v=vs.110).aspx. The idea is to use a custom partitioner and iterate over each partition's range inside the Parallel.For body (i.e. a loop within Parallel.For's loop); a sketch of this appears at the end of this answer. However, for most cases, using AsParallel is probably still a better choice - simple loop bodies usually mean some kind of map/reduce operation, and AsParallel and LINQ in general are great at that. For example, your sample code could be rewritten simply as:

var result = testData.AsParallel().Where(i => i == 1).ToList();

The only case where using AsParallel is a bad idea is the same as with all other LINQ: when your loop body has side effects. Some might be tolerable, but it's safer to avoid them altogether.
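For completeness, here is a rough sketch of the custom-partitioner approach from the MSDN article linked above (my own adaptation, not the article's exact code). Partitioner.Create hands out index ranges, and each parallel body runs a plain sequential loop over its range, so each task invocation does enough work to amortize Parallel's overhead:

    // Requires using System.Collections.Concurrent and System.Threading.Tasks.
    var results = new ConcurrentBag<int>();
    Parallel.ForEach(Partitioner.Create(0, testData.Count), range =>
    {
        // range is a Tuple<int, int> covering the half-open interval [Item1, Item2).
        for (int i = range.Item1; i < range.Item2; i++)
        {
            if (testData[i] == 1)
                results.Add(testData[i]);
        }
    });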
