Parallel.For partitioning

Question

How would the work be partitioned for the following?

Parallel.For(0, buffer.Length, (i) => buffer[i] = 0);

My assumption was that on an n-core machine the work would be partitioned n ways and n threads would carry out the workload. That would mean, for example, with buffer.Length = 100 and n = 4, each thread would get one of the blocks 0-24, 25-49, 50-74, 75-99. (A 100-element array is used only to illustrate the partitioning; please consider an array of millions of items.)

Is this a fair assumption? Please discuss.
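
One way to check this assumption is to record which managed thread actually handles which indices. The following is a minimal diagnostic sketch; the buffer size and the min/max bookkeeping are illustrative choices, not part of the question itself:

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class PartitionProbe
{
    static void Main()
    {
        var buffer = new int[1_000_000];
        // Track the lowest and highest index seen by each worker thread.
        var byThread = new ConcurrentDictionary<int, (int Min, int Max)>();

        Parallel.For(0, buffer.Length, i =>
        {
            buffer[i] = 0;
            int id = Thread.CurrentThread.ManagedThreadId;
            byThread.AddOrUpdate(id,
                _ => (i, i),
                (_, r) => (Math.Min(r.Min, i), Math.Max(r.Max, i)));
        });

        foreach (var kv in byThread.OrderBy(kv => kv.Value.Min))
            Console.WriteLine($"Thread {kv.Key}: indices {kv.Value.Min}..{kv.Value.Max}");
    }
}

If the scheduler hands out work in dynamically sized chunks rather than n fixed blocks, the per-thread ranges will typically overlap instead of forming clean quarters.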

I noticed that Array.Clear(...) would perform much faster in this specific scenario. How do you rationalize this?
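
A rough way to time the two approaches is sketched below; the buffer size, the Stopwatch timing, and the lack of warm-up iterations are arbitrary choices, so treat it as an illustration rather than a rigorous benchmark:

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class ClearComparison
{
    static void Main()
    {
        var buffer = new byte[100_000_000];   // size chosen only for illustration
        var sw = new Stopwatch();

        sw.Restart();
        Parallel.For(0, buffer.Length, i => buffer[i] = 0);
        sw.Stop();
        Console.WriteLine($"Parallel.For: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        Array.Clear(buffer, 0, buffer.Length);
        sw.Stop();
        Console.WriteLine($"Array.Clear:  {sw.ElapsedMilliseconds} ms");
    }
}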

Answer

First for the easy part. A 100-element array is so small that it can easily fit in a core's cache. Besides, clearing the array is equivalent to setting a memory area to 0s, something that is available as a CPU command and therefore as fast as you can make it.

In fact, SSE commands and parallel-optimized memory controllers mean that the chipset can probably clear memory in parallel using only a single CPU command.

On the other hand, Parallel.For introduces some overhead. It has to partition the data, create the appropriate tasks to work on it, collect the results and return the final result. Beneath Parallel.For, the runtime has to copy the data to each core, handle memory synchronization, collect the results, etc. In your example, this can be significantly larger than the actual time needed to zero the memory locations.

In fact, for small sizes it is quite possible that 99.999% of the overhead is memory synchronization as each core tries to access the same memory page. Remember, memory locking is at the page level and you can fit 2K 16-bit ints in a 4K memory page.

As for how PLINQ schedules tasks - there are many different partitioning schemes used, depending on the operators you use. Check Partitioning in LINQ for a nice intro. In any case, the partitioner will try to determine whether there is any benefit to be gained from partitioning and may not partition the data at all.

In your case, the partitioner will probably use range partitioning. Your payload uses only a few CPU cycles, so all you see is the overhead of partitioning, creating tasks, managing synchronization and collecting the results.
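
If you want to keep the parallel version while amortizing the per-element delegate cost, one option (a sketch, not something prescribed above) is to hand Parallel.ForEach an explicit range partitioner so each task runs a tight sequential loop over its chunk:

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class RangePartitionSketch
{
    static void Main()
    {
        var buffer = new int[10_000_000];

        // Partitioner.Create(0, length) yields contiguous (fromInclusive, toExclusive)
        // ranges, so the delegate is invoked once per chunk instead of once per element.
        Parallel.ForEach(Partitioner.Create(0, buffer.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                buffer[i] = 0;
        });
    }
}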

A better benchmark would be to run some aggregations on a large array, e.g. counts, averages and the like.
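
A minimal sketch of such an aggregation, assuming a large double array and using the thread-local overload of Parallel.For so each worker keeps a private partial sum:

using System;
using System.Threading.Tasks;

class AggregationSketch
{
    static void Main()
    {
        var data = new double[10_000_000];
        var rng = new Random(42);
        for (int i = 0; i < data.Length; i++) data[i] = rng.NextDouble();

        double total = 0;
        object gate = new object();

        // localInit / body / localFinally: each thread accumulates into its own
        // partial sum; only the final merge into 'total' takes a lock.
        Parallel.For(0, data.Length,
            () => 0.0,
            (i, state, local) => local + data[i],
            local => { lock (gate) total += local; });

        Console.WriteLine($"Sum = {total}, Average = {total / data.Length}");
    }
}

Because the per-element work (a read plus an add) is still tiny, the benefit over a sequential loop only shows up once the array is large enough for the partitioning overhead to be amortized.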
