How to multithread "tail call" recursion using TBB


Problem description

I am trying to use TBB to multithread an existing recursive algorithm. The single-threaded version uses tail-call recursion; structurally, it looks something like this:

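void my_recusive_func (int i);  // forward declaration so my_func compiles

void my_func() {
    my_recusive_func (0);
}

// Fills a, b, c with the next inputs; returns true while there is more work.
bool doSomeWork (int i, int& a, int& b, int& c) {
    // do some work
}

void my_recusive_func (int i) {
    int a, b, c;
    bool notDone = doSomeWork (i, a, b, c);
    if (notDone) {
        // recurse into each of the three sub-problems in turn
        my_recusive_func (a);
        my_recusive_func (b);
        my_recusive_func (c);
    }
}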

I am a TBB novice, so my first attempt used the parallel_invoke function:

#include <tbb/parallel_invoke.h>

void my_recusive_func (int i) {
    int a, b, c;
    bool notDone = doSomeWork (i, a, b, c);
    if (notDone) {
        // parallel_invoke returns only after all three calls have finished
        tbb::parallel_invoke (
                [a]{my_recusive_func (a);},
                [b]{my_recusive_func (b);},
                [c]{my_recusive_func (c);});
    }
}

This works and runs faster than the single-threaded version, but it doesn't seem to scale well with the number of cores. The machine I'm targeting has 16 cores (32 hyper-threads), so scalability is very important for this project. This version gets only about an 8x speedup at best on that machine, and many cores seem to be idle while the algorithm is running.

My theory is that TBB waits for the child tasks to complete after the parallel_invoke, so many tasks may be sitting idle, waiting unnecessarily. Would this explain the idle cores? Is there any way to get the parent task to return without waiting for its children? I was thinking of something like this, but I don't know enough about the scheduler yet to know whether it is OK:

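#include <tbb/task_group.h>

void my_recusive_func (int i, tbb::task_group& g);  // forward declaration

void my_func()
{
    tbb::task_group g;
    my_recusive_func (0, g);
    g.wait();  // single wait for every task added to the shared group
}

void my_recusive_func (int i, tbb::task_group& g) {
    int a, b, c;
    bool notDone = doSomeWork (i, a, b, c);
    if (notDone) {
        // hand two branches to the shared group, continue with the third here
        g.run([a,&g]{my_recusive_func(a, g);});
        g.run([b,&g]{my_recusive_func(b, g);});
        my_recusive_func (c, g);
    }
}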

My first question: is tbb::task_group::run() thread-safe? I couldn't figure that out from the documentation. Also, is there a better way to go about this? Perhaps I should be using the low-level scheduler calls instead?

(I typed this code without compiling so please forgive typos.)

Solution

There are really two questions here:

  1. Is the TBB implementation of task_group::run thread-safe? Yes. (We should document this more clearly.)
  2. Is having many threads invoke method run() on the same task_group scalable? No. (I believe the Microsoft documentation mentions this somewhere.) The reason is that the task_group becomes a centralized point of contention. The implementation is just a fetch-and-add, but that is still ultimately unscalable because the affected cache line has to bounce between cores.
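The contention described in point 2 can be illustrated without TBB. The following is a minimal sketch, not TBB's actual implementation: count_shared and count_local are hypothetical names, and plain std::thread stands in for TBB worker threads. The first function has every thread perform a fetch-and-add on one shared atomic, so its cache line keeps moving between cores; the second lets each thread count privately and touch the shared atomic once.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Every thread increments the same atomic on every iteration, so the
// cache line holding `counter` is handed from core to core.
inline std::uint64_t count_shared(unsigned num_threads, std::uint64_t per_thread) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counter, per_thread] {
            for (std::uint64_t k = 0; k < per_thread; ++k)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : threads) th.join();
    return counter.load();
}

// Each thread accumulates into a private variable and publishes its
// result with a single fetch-and-add at the end.
inline std::uint64_t count_local(unsigned num_threads, std::uint64_t per_thread) {
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&total, per_thread] {
            std::uint64_t local = 0;
            for (std::uint64_t k = 0; k < per_thread; ++k)
                ++local;
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& th : threads) th.join();
    return total.load();
}
```

Both functions return the same total. Timing them with the same arguments on a many-core machine is one way to observe the effect; the size of the gap depends on the hardware, so no figures are given here.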

It's generally best to spawn a small number of tasks from a task_group. If using recursive parallelism, give each level its own task_group, though the performance will likely not be any better than using parallel_invoke.

The low-level tbb::task interface is the best bet. You can even code the tail recursion in it, using the trick where task::execute returns a pointer to the tail-call task.

But I'm a bit concerned about the idling threads. I'm wondering whether there is enough work to keep the threads busy. Consider doing a work-span analysis first. If you are using the Intel compiler (or gcc 4.9), you might try experimenting with a Cilk version first. If that does not speed things up, then even the low-level tbb::task interface is unlikely to help, and higher-level issues (work and span) need to be examined.
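A rough work-span estimate can be obtained by instrumenting the serial recursion. The sketch below assumes that every doSomeWork call costs one unit of time, which will not hold for a real workload. WorkSpan, analyze, max_speedup and balanced_toy are hypothetical names introduced for illustration. Work is the total number of calls, span is the longest chain of dependent calls, and the speedup on P cores cannot exceed min(P, work / span).

```cpp
#include <algorithm>
#include <cstdint>

struct WorkSpan {
    std::uint64_t work;  // total number of doSomeWork calls
    std::uint64_t span;  // longest chain of calls that must run one after another
};

// Walks the same recursion tree as the serial algorithm and measures it.
// `do_work` must have the signature bool(int, int&, int&, int&).
template <typename Work>
WorkSpan analyze(int i, const Work& do_work) {
    int a = 0, b = 0, c = 0;
    WorkSpan result{1, 1};  // this call itself: one unit of work, one unit of span
    if (do_work(i, a, b, c)) {
        const WorkSpan wa = analyze(a, do_work);
        const WorkSpan wb = analyze(b, do_work);
        const WorkSpan wc = analyze(c, do_work);
        result.work += wa.work + wb.work + wc.work;             // children add up
        result.span += std::max({wa.span, wb.span, wc.span});   // only the deepest child counts
    }
    return result;
}

// Upper bound on speedup from the work-span law.
inline double max_speedup(const WorkSpan& ws, unsigned cores) {
    const double parallelism = static_cast<double>(ws.work) / static_cast<double>(ws.span);
    return std::min(parallelism, static_cast<double>(cores));
}

// Toy stand-in for doSomeWork: a ternary tree numbered level by level,
// where a node has children only if all three indices are at most 12.
inline bool balanced_toy(int i, int& a, int& b, int& c) {
    a = 3 * i + 1;
    b = 3 * i + 2;
    c = 3 * i + 3;
    return c <= 12;
}
```

Calling analyze(0, balanced_toy) gives a work of 13 and a span of 3, so this toy tree cannot run more than about 4.3 times faster than the serial version on any number of cores. If the real algorithm's ratio comes out well below 16, the idle cores are explained by the shape of the recursion rather than by the scheduler.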
