What's the best way to implement parallel asymmetric nested for loops?

Problem description

Is there a way to efficiently parallelize nested for loops like:

for (int i = 0; i < m; i++)
  for (int j = i + 1; j < n; j++)
    doSomethingWithIandJ(i, j);





in a way that each processing thread does approximately the same number of operations?

Recommended answer

Various parallel frameworks, such as OpenMP, offer ways to spread the workload automatically without you having to control it. If you're using one, check your preferred framework's documentation.
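
As a rough sketch of the framework route, assuming OpenMP from C++: the `schedule(dynamic, 1)` clause hands out outer-loop iterations one at a time to whichever thread is idle, which evens out the long early rows and the short late rows. `do_something_with_i_and_j` and `sum_pairs_openmp` are hypothetical names; the stand-in work just returns `i + j` so the per-pair results can be combined with a `reduction` clause.

```cpp
// Sketch: let OpenMP balance the triangular loop with a dynamic schedule.
// Compile with -fopenmp (GCC/Clang); without it the pragma is ignored and
// the loop runs serially, producing the same result.
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for doSomethingWithIandJ: returns a value so the
// results can be combined and checked.
inline std::int64_t do_something_with_i_and_j(int i, int j) {
    return static_cast<std::int64_t>(i) + j;
}

std::int64_t sum_pairs_openmp(int m, int n) {
    assert(m >= 0 && n >= 0);
    std::int64_t total = 0;
    // schedule(dynamic, 1): each idle thread grabs the next outer iteration,
    // so the long rows (small i) and short rows (large i) even out.
    // reduction(+ : total): each thread sums privately, OpenMP adds them up.
    #pragma omp parallel for schedule(dynamic, 1) reduction(+ : total)
    for (int i = 0; i < m; i++) {
        for (int j = i + 1; j < n; j++) {
            total += do_something_with_i_and_j(i, j);
        }
    }
    return total;
}
```

For example, `sum_pairs_openmp(100, 100)` returns 490050, the sum of `i + j` over all 4950 pairs. A chunk size larger than 1 (e.g. `schedule(dynamic, 8)`) trades a little balance for less scheduling overhead.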

Without a framework, a good pattern to use is the master/slave or worker pattern: one "master" thread is responsible for creating "work items", and "slave" or "worker" threads pick them up and process them. You can have the master delegate the work items to specific slaves, but it's often more efficient to use autonomous slaves that pick up the next work item from a pool whenever they're idle.
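
A minimal C++17 sketch of autonomous workers, assuming the simplest possible pool: a shared `std::atomic<int>` counter over the outer-loop rows. Each worker claims the next unclaimed row with `fetch_add` whenever it finishes the previous one, so no worker is tied to a fixed share. The function names and the `i + j` stand-in work are hypothetical, and in this sketch a work item is a whole row rather than a single pair.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical per-pair work (stand-in for doSomethingWithIandJ).
inline std::int64_t do_something_with_i_and_j(int i, int j) {
    return static_cast<std::int64_t>(i) + j;
}

// Autonomous workers: the "pool" is a shared atomic counter over rows.
// Returns the combined per-pair results; per_worker_pairs receives how many
// pairs each worker processed.
std::int64_t run_self_scheduling(int m, int n, int num_workers,
                                 std::vector<std::int64_t>& per_worker_pairs) {
    assert(num_workers > 0);
    std::atomic<int> next_row{0};
    std::vector<std::int64_t> partial(num_workers, 0);  // one slot per worker
    per_worker_pairs.assign(num_workers, 0);

    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            // Claim rows until the counter runs past m.
            for (int i = next_row.fetch_add(1); i < m; i = next_row.fetch_add(1)) {
                for (int j = i + 1; j < n; ++j) {
                    partial[w] += do_something_with_i_and_j(i, j);
                    ++per_worker_pairs[w];
                }
            }
        });
    }
    for (auto& t : workers) t.join();

    std::int64_t total = 0;
    for (auto p : partial) total += p;
    return total;
}
```

For example, `run_self_scheduling(100, 100, 4, counts)` returns 490050, and `counts` shows how many pairs each worker ended up processing. Each worker writes only its own slot of `partial` and `per_worker_pairs`, so no lock is needed for the results.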

In your case, each work item would consist of just one pair of ints, and possibly a reference to some data structure containing the results of the processing. The master would just pump out (put into the work pool) (i,j) work items, and the slaves would then pick them up as fast as they can process each one.
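
The master/worker arrangement with one `(i, j)` pair per work item might look like the following sketch (C++17, standard library only). `WorkPool` is a hypothetical mutex-and-condition-variable queue: the master thread pushes one item per pair, workers pop items as fast as they can process them, and `close()` tells them no more items are coming. Instead of a shared result structure, each worker accumulates into its own slot of `partial`, which avoids locking on the results.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical per-pair work (stand-in for doSomethingWithIandJ).
inline std::int64_t do_something_with_i_and_j(int i, int j) {
    return static_cast<std::int64_t>(i) + j;
}

// Minimal blocking work pool of (i, j) items.
class WorkPool {
public:
    void push(std::pair<int, int> item) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            items_.push(item);
        }
        cv_.notify_one();
    }
    void close() {  // no more items will arrive
        {
            std::lock_guard<std::mutex> lock(mu_);
            closed_ = true;
        }
        cv_.notify_all();
    }
    // Blocks until an item is available; false once closed and drained.
    bool pop(std::pair<int, int>& out) {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop();
        return true;
    }
private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::pair<int, int>> items_;
    bool closed_ = false;
};

std::int64_t run_master_worker(int m, int n, int num_workers) {
    assert(num_workers > 0);
    WorkPool pool;
    std::vector<std::int64_t> partial(num_workers, 0);  // one result slot per worker
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&pool, &partial, w] {
            std::pair<int, int> item;
            while (pool.pop(item))
                partial[w] += do_something_with_i_and_j(item.first, item.second);
        });
    }
    // The master (this thread) pumps out one work item per (i, j) pair.
    for (int i = 0; i < m; i++)
        for (int j = i + 1; j < n; j++)
            pool.push({i, j});
    pool.close();
    for (auto& t : workers) t.join();  // every item is processed once all workers exit
    std::int64_t total = 0;
    for (auto p : partial) total += p;
    return total;
}
```

Because every pair costs a lock acquisition on both the push and the pop side, this form only pays off when the per-pair work is expensive compared with the locking.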

Of course you need to manage the threads, and implement some way to notify the master when all work items have been processed.
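
Joining the worker threads is the simplest way to know that everything has been processed, but a long-lived pool keeps its workers alive between jobs, so the master needs an explicit notification. One common option, sketched here in C++17 with hypothetical names, is a countdown guarded by a mutex and condition variable: workers call `done()` after each work item (here, one outer-loop row claimed through an atomic counter), and the master blocks in `wait()` until the count reaches zero.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical per-pair work (stand-in for doSomethingWithIandJ).
inline std::int64_t do_something_with_i_and_j(int i, int j) {
    return static_cast<std::int64_t>(i) + j;
}

// Countdown that lets workers tell the master when every work item is done.
class CompletionCounter {
public:
    explicit CompletionCounter(std::int64_t remaining) : remaining_(remaining) {}
    void done() {  // called by a worker after finishing one work item
        std::lock_guard<std::mutex> lock(mu_);
        if (--remaining_ == 0) cv_.notify_all();
    }
    void wait() {  // called by the master; returns once the count reaches zero
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return remaining_ == 0; });
    }
private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::int64_t remaining_;
};

std::int64_t run_with_completion_signal(int m, int n, int num_workers) {
    assert(num_workers > 0 && m >= 0);
    std::vector<std::int64_t> row_result(m, 0);  // one result slot per row
    CompletionCounter pending(m);
    std::atomic<int> next_row{0};

    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&] {
            for (int i = next_row.fetch_add(1); i < m; i = next_row.fetch_add(1)) {
                std::int64_t s = 0;
                for (int j = i + 1; j < n; ++j) s += do_something_with_i_and_j(i, j);
                row_result[i] = s;  // written before done(), so visible after wait()
                pending.done();
            }
        });
    }
    pending.wait();  // the master is notified here, before any thread is joined
    std::int64_t total = 0;
    for (auto r : row_result) total += r;
    for (auto& t : workers) t.join();  // cleanup only; a long-lived pool would keep them
    return total;
}
```

The mutex inside `done()` and `wait()` is what makes each row's result, written before `done()`, visible to the master once `wait()` returns.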

Please note that parallelization creates overhead: If the doSomething() function is trivial, it may be prudent not to parallelize it, or maybe only parallelize the outer loop. Be sure to implement performance tests and verify that you actually gain performance!
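
If only the outer loop is parallelized, splitting the rows into contiguous blocks is badly unbalanced, because row `i` contains `n - i - 1` pairs: with m = n = 1000 and 4 threads, the first block holds 218625 pairs and the last only 31125. A simple fix, not part of the original answer, is a cyclic (round-robin) assignment where thread `t` takes rows `t`, `t + T`, `t + 2T`, and so on, so long and short rows alternate. The following sketch assumes C++17 `std::thread`; the names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical per-pair work (stand-in for doSomethingWithIandJ).
inline std::int64_t do_something_with_i_and_j(int i, int j) {
    return static_cast<std::int64_t>(i) + j;
}

struct OuterLoopResult {
    std::int64_t total = 0;                      // combined per-pair results
    std::vector<std::int64_t> pairs_per_thread;  // workload done by each thread
};

// Parallelize only the outer loop, assigning rows cyclically to threads.
OuterLoopResult run_outer_loop_cyclic(int m, int n, int num_threads) {
    assert(num_threads > 0);
    std::vector<std::int64_t> partial(num_threads, 0);
    OuterLoopResult result;
    result.pairs_per_thread.assign(num_threads, 0);

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t] {
            // Thread t takes rows t, t + T, t + 2T, ...
            for (int i = t; i < m; i += num_threads) {
                for (int j = i + 1; j < n; ++j) {
                    partial[t] += do_something_with_i_and_j(i, j);
                    ++result.pairs_per_thread[t];
                }
            }
        });
    }
    for (auto& th : threads) th.join();
    for (auto p : partial) result.total += p;
    return result;
}
```

For m = n = 1000 with 4 threads, the per-thread counts come out as 125250, 125000, 124750 and 124500 pairs, within about 1% of each other.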

You can search for more info with the keywords "master/slave pattern" or "worker threads" or "work pool".


P.S.:
One possible improvement on the master/slave pattern is to use whatever context information you have to calculate the "optimal" amount of work a slave should pick up, and then have each slave pick up p > 1 work items rather than one at a time. You'll probably need to do a number of performance tests before you can gauge what is optimal, or whether you can come up with a good formula dependent on n and m alone, but if the time to process doSomething for one pair (i,j) is roughly constant, this should be possible.
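
Taking p > 1 items at a time can be sketched by giving the pool a batched pop that moves up to `p` items out of the queue under a single lock. This is a hypothetical C++17 sketch: `BatchedWorkPool`, `run_batched` and the `i + j` stand-in work are illustrative names, and in practice `p` would come from the performance tests the answer mentions.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical per-pair work (stand-in for doSomethingWithIandJ).
inline std::int64_t do_something_with_i_and_j(int i, int j) {
    return static_cast<std::int64_t>(i) + j;
}

class BatchedWorkPool {
public:
    void push(std::pair<int, int> item) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            items_.push(item);
        }
        cv_.notify_one();
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            closed_ = true;
        }
        cv_.notify_all();
    }
    // Moves up to max_items into `out` under one lock; false once closed and drained.
    bool pop_batch(std::size_t max_items, std::vector<std::pair<int, int>>& out) {
        out.clear();
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return closed_ || !items_.empty(); });
        while (!items_.empty() && out.size() < max_items) {
            out.push_back(items_.front());
            items_.pop();
        }
        return !out.empty();
    }
private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::pair<int, int>> items_;
    bool closed_ = false;
};

std::int64_t run_batched(int m, int n, int num_workers, std::size_t p) {
    assert(num_workers > 0 && p >= 1);
    BatchedWorkPool pool;
    std::vector<std::int64_t> partial(num_workers, 0);  // one result slot per worker
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&pool, &partial, w, p] {
            std::vector<std::pair<int, int>> batch;
            while (pool.pop_batch(p, batch))  // one lock per batch, not per pair
                for (const auto& item : batch)
                    partial[w] += do_something_with_i_and_j(item.first, item.second);
        });
    }
    for (int i = 0; i < m; i++)
        for (int j = i + 1; j < n; j++)
            pool.push({i, j});
    pool.close();
    for (auto& t : workers) t.join();
    std::int64_t total = 0;
    for (auto v : partial) total += v;
    return total;
}
```

Only the worker side is batched here; the master still takes the lock once per pushed item and could be batched the same way.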

If you can't determine a good work amount a priori, you can also monitor performance at runtime and adapt the number of work items per worker dynamically. This could be much more scalable.
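
One possible way to adapt the amount of work at runtime, sketched under several assumptions: the master fills the pool as a `std::vector` of pairs up front, workers claim index ranges from it with an atomic counter, and each worker times its last batch with `std::chrono::steady_clock`, doubling the batch size while batches finish well under a target duration and halving it when they overshoot. The 200-microsecond default target, the doubling/halving rule and the `1 << 16` cap are hypothetical tuning choices, not something the answer prescribes.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical per-pair work (stand-in for doSomethingWithIandJ).
inline std::int64_t do_something_with_i_and_j(int i, int j) {
    return static_cast<std::int64_t>(i) + j;
}

std::int64_t run_adaptive(int m, int n, int num_workers,
                          std::chrono::microseconds target = std::chrono::microseconds(200)) {
    assert(num_workers > 0);
    // The master fills the pool up front; workers claim index ranges from it.
    std::vector<std::pair<int, int>> pool;
    for (int i = 0; i < m; i++)
        for (int j = i + 1; j < n; j++)
            pool.emplace_back(i, j);

    const std::size_t total_items = pool.size();
    std::atomic<std::size_t> next{0};
    std::vector<std::int64_t> partial(num_workers, 0);  // one result slot per worker
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            std::size_t batch = 1;                  // start small
            const std::size_t max_batch = 1 << 16;  // hypothetical upper bound
            while (true) {
                const std::size_t begin = next.fetch_add(batch);
                if (begin >= total_items) break;
                const std::size_t end = std::min(begin + batch, total_items);
                const auto t0 = std::chrono::steady_clock::now();
                for (std::size_t k = begin; k < end; ++k)
                    partial[w] += do_something_with_i_and_j(pool[k].first, pool[k].second);
                const auto elapsed = std::chrono::steady_clock::now() - t0;
                // Heuristic: grow while well under target, shrink on overshoot.
                if (elapsed < target / 2) batch = std::min(batch * 2, max_batch);
                else if (elapsed > target * 2) batch = std::max<std::size_t>(batch / 2, 1);
            }
        });
    }
    for (auto& t : workers) t.join();
    std::int64_t total = 0;
    for (auto v : partial) total += v;
    return total;
}
```

Because each worker tunes its own batch size, claiming work stays a single `fetch_add` no matter how large the batches become; the result does not depend on the timing, only the split between workers does.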

Better yet, you can use either of the above methods to increase the workload of each work item from a single (i,j) pair to multiple pairs. Doing so will greatly decrease the amount of communication and synchronization required to manage the threads.
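
Packing several pairs into one work item can be sketched by describing an item as a row segment `(i, j_begin, j_end)` holding at most `chunk` pairs. `WorkItem`, `make_work_items` and `run_chunked` are hypothetical names (C++17); workers claim whole items through an atomic index, so there is one synchronization per item instead of one per pair.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical per-pair work (stand-in for doSomethingWithIandJ).
inline std::int64_t do_something_with_i_and_j(int i, int j) {
    return static_cast<std::int64_t>(i) + j;
}

struct WorkItem {
    int i;        // outer index
    int j_begin;  // first inner index (inclusive)
    int j_end;    // last inner index (exclusive)
};

// The master splits each row's j range into segments of at most `chunk` pairs.
std::vector<WorkItem> make_work_items(int m, int n, int chunk) {
    assert(chunk > 0);
    std::vector<WorkItem> items;
    for (int i = 0; i < m; i++)
        for (int jb = i + 1; jb < n; jb += chunk)
            items.push_back({i, jb, std::min(jb + chunk, n)});
    return items;
}

std::int64_t run_chunked(int m, int n, int num_workers, int chunk) {
    assert(num_workers > 0);
    const std::vector<WorkItem> items = make_work_items(m, n, chunk);
    std::atomic<std::size_t> next{0};
    std::vector<std::int64_t> partial(num_workers, 0);  // one result slot per worker
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            // One fetch_add per item, i.e. per up-to-`chunk` pairs.
            for (std::size_t k = next.fetch_add(1); k < items.size(); k = next.fetch_add(1)) {
                const WorkItem& it = items[k];
                for (int j = it.j_begin; j < it.j_end; ++j)
                    partial[w] += do_something_with_i_and_j(it.i, j);
            }
        });
    }
    for (auto& t : workers) t.join();
    std::int64_t total = 0;
    for (auto v : partial) total += v;
    return total;
}
```

For example, with m = n = 100 and `chunk = 16` the master creates 357 items for the 4950 pairs, while `chunk = 1` degenerates to one item per pair.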

