Dividing loop iterations among threads

Question

I recently wrote a small number-crunching program that basically loops over an N-dimensional grid and performs some calculation at each point.

for (int i1 = 0; i1 < N; i1++)
  for (int i2 = 0; i2 < N; i2++)
    for (int i3 = 0; i3 < N; i3++)
      for (int i4 = 0; i4 < N; i4++)
        histogram[bin_index(i1, i2, i3, i4)] += 1; // see bottom of question

It worked fine, yadda yadda yadda, lovely graphs resulted ;-) But then I thought, I have 2 cores on my computer, why not make this program multithreaded so I could run it twice as fast?

Now, my loops run a total of, let's say, around a billion calculations, and I need some way to split them up among threads. I figure I should group the calculations into "tasks" - say each iteration of the outermost loop is a task - and hand out the tasks to threads. I've considered

  • just giving thread #n all iterations of the outermost loop where i1 % nthreads == n - essentially predetermining which tasks go to which threads
  • trying to set up some mutex-protected variable which holds the parameter(s) (i1 in this case) of the next task that needs executing - assigning tasks to threads dynamically

What reasons are there to choose one approach over the other? Or another approach I haven't thought about? Does it even matter?

By the way, I wrote this particular program in C, but I imagine I'll be doing the same kind of thing again in other languages as well, so answers need not be C-specific. (If anyone knows a C library for Linux that does this sort of thing, though, I'd love to know about it.)

EDIT: in this case bin_index is a deterministic function which doesn't change anything except its own local variables. Something like this:

#include <math.h>

int bin_index(int i1, int i2, int i3, int i4) {
    // w, d, h are constant floats
    float x1 = i1 * w / N,  x2 = i2 * w / N, y1 = i3 * d / N, y2 = i4 * d / N;
    // distance between the points (x1, y1, 0) and (x2, y2, h)
    float l = sqrtf((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2) + h * h);
    float th = acosf(h / l);
    // th_max is a constant float (previously computed as a function of w, d, h)
    return (int)(th / th_max);
}

(although I appreciate all the comments, even those which don't apply to a deterministic bin_index)

Recommended answer

The first approach is simple. It is also sufficient if you expect the load to be balanced evenly across the threads. In some cases, especially if the cost of bin_index depends heavily on the parameter values, one of the threads could end up with a much heavier share of the work than the rest. Remember: the whole job is finished only when the last thread finishes.
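The first approach could look roughly like the sketch below, using POSIX threads. It is only an illustration under stated assumptions: do_task, run_static, N and MAX_THREADS are hypothetical names, and do_task is a stand-in for the inner loops that writes one result per outer iteration so that no two threads touch the same memory.

```c
#include <assert.h>
#include <pthread.h>

#define N 16            /* outer-loop size (assumed, kept small for the demo) */
#define MAX_THREADS 8   /* arbitrary upper bound for this sketch */

/* Hypothetical per-task work: stands in for the inner loops. */
static long do_task(int i1) {
    long sum = 0;
    for (int i2 = 0; i2 < N; i2++)
        for (int i3 = 0; i3 < N; i3++)
            sum += (long)i1 * i2 + i3;
    return sum;
}

struct static_arg { int id, nthreads; long *out; };

static void *static_worker(void *p) {
    struct static_arg *a = p;
    /* Thread `id` takes every i1 with i1 % nthreads == id. */
    for (int i1 = a->id; i1 < N; i1 += a->nthreads)
        a->out[i1] = do_task(i1);   /* each task writes only its own slot */
    return NULL;
}

/* Fills out[0..N-1] using nthreads threads; returns 0 on success. */
int run_static(int nthreads, long out[N]) {
    pthread_t tid[MAX_THREADS];
    struct static_arg arg[MAX_THREADS];
    int created = 0, rc = 0;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    for (; created < nthreads; created++) {
        arg[created] = (struct static_arg){ created, nthreads, out };
        if (pthread_create(&tid[created], NULL, static_worker, &arg[created]) != 0) {
            rc = -1;    /* stop creating, but still join what was started */
            break;
        }
    }
    for (int t = 0; t < created; t++)
        pthread_join(tid[t], NULL);
    return rc;
}
```

Calling run_static(2, out) on a two-core machine gives each thread every other value of i1. On Linux this typically needs the -pthread compiler flag.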

The second approach is a bit more complicated, but it balances the load more evenly if the tasks are fine-grained enough (that is, the number of tasks is much larger than the number of threads).
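A minimal sketch of the second approach, again with POSIX threads: a mutex-protected counter holds the next i1 to hand out, and each thread claims a new task whenever it finishes one. The names claim_task, do_task, run_dynamic and N_TASKS are hypothetical, and do_task is deliberately made more expensive for larger i1 to mimic an uneven load.

```c
#include <assert.h>
#include <pthread.h>

#define N_TASKS 64      /* number of outer-loop iterations (assumed) */
#define MAX_THREADS 8   /* arbitrary upper bound for this sketch */

static pthread_mutex_t next_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_task;   /* next i1 to hand out; guarded by next_lock */

/* Returns the next unclaimed task index, or -1 when none are left. */
static int claim_task(void) {
    int i1 = -1;
    pthread_mutex_lock(&next_lock);
    if (next_task < N_TASKS)
        i1 = next_task++;
    pthread_mutex_unlock(&next_lock);
    return i1;
}

/* Hypothetical task whose cost grows with i1, so the load is uneven. */
static long do_task(int i1) {
    long sum = 0;
    for (long k = 0; k < (long)i1 * 1000; k++)
        sum += k % 7;
    return sum;
}

static void *dynamic_worker(void *p) {
    long *out = p;
    int i1;
    /* Keep claiming tasks until the counter runs out. */
    while ((i1 = claim_task()) >= 0)
        out[i1] = do_task(i1);      /* each task writes only its own slot */
    return NULL;
}

/* Fills out[0..N_TASKS-1] using nthreads threads; returns 0 on success. */
int run_dynamic(int nthreads, long out[N_TASKS]) {
    pthread_t tid[MAX_THREADS];
    int created = 0, rc = 0;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    next_task = 0;                  /* no worker is running yet, so no lock needed */
    for (; created < nthreads; created++) {
        if (pthread_create(&tid[created], NULL, dynamic_worker, out) != 0) {
            rc = -1;
            break;
        }
    }
    for (int t = 0; t < created; t++)
        pthread_join(tid[t], NULL);
    return rc;
}
```

The lock is taken once per task, so its cost only matters when individual tasks are very small. In that case, handing out several iterations per claim is one way to reduce contention.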

Note that you may have issues putting the calculations in separate threads. Make sure that bin_index works correctly when multiple threads execute it simultaneously. Beware of the use of global or static variables for intermediate results.
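As a small illustration of that warning, compare the two hypothetical functions below (neither is from the question). The first keeps its intermediate result in a static variable that all threads share. The second keeps it in a local variable, which is private to each call.

```c
#include <assert.h>

/* NOT thread-safe: `scratch` is a single variable shared by every caller,
   so two threads running this at once can overwrite each other's value. */
static int scratch;

int square_plus_one_shared(int x) {
    scratch = x * x;
    return scratch + 1;
}

/* Thread-safe: the intermediate value lives on the calling thread's stack. */
int square_plus_one_local(int x) {
    int tmp = x * x;
    return tmp + 1;
}
```

Both give the same answer when called from a single thread. Only the second is safe to call from several threads at the same time.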

Also, "histogram[bin_index(i1, i2, i3, i4)] += 1" is not atomic: the statement fetches the value, increments it, and stores the result back in the array, so another thread can update the same bin between the fetch and the store, and one of the increments is lost. You could introduce a local histogram for each thread and combine the results into a single histogram when all threads have finished. You could also make sure that only one thread modifies the histogram at a time, but that may cause the threads to block each other most of the time.
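The per-thread histogram idea might be sketched as follows. The names toy_bin_index, run_histogram, NBINS and MAX_THREADS are assumptions made for the example. In particular, toy_bin_index is a hypothetical stand-in for the real bin_index and is a pure function of its arguments.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define N 12            /* grid size per dimension (assumed, small for the demo) */
#define NBINS 8         /* number of histogram bins (assumed) */
#define MAX_THREADS 8   /* arbitrary upper bound for this sketch */

/* Hypothetical stand-in for bin_index: depends only on its arguments. */
static int toy_bin_index(int i1, int i2, int i3, int i4) {
    return (i1 + 2 * i2 + 3 * i3 + 5 * i4) % NBINS;
}

struct hist_arg { int id, nthreads; long local[NBINS]; };

static void *hist_worker(void *p) {
    struct hist_arg *a = p;
    memset(a->local, 0, sizeof a->local);
    for (int i1 = a->id; i1 < N; i1 += a->nthreads)
        for (int i2 = 0; i2 < N; i2++)
            for (int i3 = 0; i3 < N; i3++)
                for (int i4 = 0; i4 < N; i4++)
                    a->local[toy_bin_index(i1, i2, i3, i4)] += 1; /* private copy */
    return NULL;
}

/* Fills histogram[0..NBINS-1]; returns 0 on success. */
int run_histogram(int nthreads, long histogram[NBINS]) {
    pthread_t tid[MAX_THREADS];
    struct hist_arg arg[MAX_THREADS];
    int created = 0, rc = 0;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    memset(histogram, 0, NBINS * sizeof histogram[0]);
    for (; created < nthreads; created++) {
        arg[created].id = created;
        arg[created].nthreads = nthreads;
        if (pthread_create(&tid[created], NULL, hist_worker, &arg[created]) != 0) {
            rc = -1;
            break;
        }
    }
    for (int t = 0; t < created; t++) {
        pthread_join(tid[t], NULL);
        /* Merge only after the thread has finished, so no locking is needed. */
        for (int b = 0; b < NBINS; b++)
            histogram[b] += arg[t].local[b];
    }
    return rc;
}
```

The merge costs only NBINS additions per thread, which is negligible next to the N^4 increments done in the workers. The result should match a single-threaded run bin for bin.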
