Splitting up a program into 4 threads is slower than a single thread


Problem Description

I've been writing a raytracer the past week, and have come to a point where it's doing enough that multi-threading would make sense. I have tried using OpenMP to parallelize it, but running it with more threads is actually slower than running it with one.

Reading over other similar questions, especially about OpenMP, one suggestion was that gcc optimizes serial code better. However, running the compiled code below with export OMP_NUM_THREADS=1 is twice as fast as with export OMP_NUM_THREADS=4. I.e., it's the same compiled code on both runs.

Running the program with time:

> export OMP_NUM_THREADS=1; time ./raytracer
real    0m34.344s
user    0m34.310s
sys     0m0.008s


> export OMP_NUM_THREADS=4; time ./raytracer
real    0m53.189s
user    0m20.677s
sys     0m0.096s

User time is a lot smaller than real, which is unusual when using multiple cores: user should be larger than real, since several cores are running at the same time.

Code that I have parallelized using OpenMP

void Raytracer::render( Camera& cam ) {

    // let the camera know to use this raytracer for probing the scene
    cam.setSamplingFunc(getSamplingFunction());

    int i, j;

    #pragma omp parallel private(i, j)
    {

        // Construct a ray for each pixel.
        #pragma omp for schedule(dynamic, 4)
        for (i = 0; i < cam.height(); ++i) {
            for (j = 0; j < cam.width(); ++j) {
                cam.computePixel(i, j);
            }
        }
    }
}
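As an aside, the private(i, j) clause can be dropped by declaring the indices inside the loops themselves; variables declared there are automatically private to each thread. A minimal equivalent sketch (same behaviour, just tighter scoping):

void Raytracer::render( Camera& cam ) {

    cam.setSamplingFunc(getSamplingFunction());

    // Loop indices declared here are local to each thread,
    // so no explicit private() clause is needed.
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < cam.height(); ++i) {
        for (int j = 0; j < cam.width(); ++j) {
            cam.computePixel(i, j);
        }
    }
}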

When reading this question I thought I had found my answer. It talks about the implementation of glibc rand() synchronizing calls to itself to preserve state for random number generation between threads. I am using rand() quite a lot for Monte Carlo sampling, so I thought that was the problem. I got rid of the calls to rand, replacing them with a single value, but using multiple threads was still slower. EDIT: oops, turns out I didn't test this correctly, it was the random values!

Now that those are out of the way, I will give an overview of what's being done on each call to computePixel, so hopefully a solution can be found.

In my raytracer I essentially have a scene tree with all objects in it. This tree is traversed a lot during computePixel when objects are tested for intersection; however, no writes are done to this tree or to any objects. computePixel essentially reads the scene a bunch of times, calling methods on the objects (all of which are const methods), and at the very end writes a single value to its own pixel array. This is the only part where, as far as I am aware, more than one thread will try to write to the same member variable. There is no synchronization anywhere, since no two threads can write to the same cell in the pixel array.

Can anyone suggest places where there could be some kind of contention? Things to try?

Thank you in advance.

EDIT: Sorry, it was stupid of me not to provide more info on my system.

  • Compiler gcc 4.6 (with -O2 optimization)
  • Ubuntu Linux 11.10
  • OpenMP 3
  • Intel i3-2310M, 2.1 GHz (2 cores / 4 threads with Hyper-Threading; on my laptop at the moment)

Code for compute pixel:

class Camera {

    // constructors, destructors ...
    private:
        // this is the array that is being written to, but not read from.
        Colour* _sensor; // allocated using new at construction.
};

void Camera::computePixel(int i, int j) const {

    Colour col;

    // simple code to construct appropriate ray for the pixel
    Ray3D ray(/* params */);
    col += _sceneSamplingFunc(ray); // calls a const method that traverses scene. 

    _sensor[i*_scrWidth+j] += col;
}

From the suggestions, it might be the tree traversal that causes the slow-down. Some other aspects: there is quite a lot of recursion involved once the sampling function is called (recursive bouncing of rays). Could this cause these problems?

Solution

Thanks everyone for the suggestions, but after further profiling, and getting rid of other contributing factors, random-number generation did turn out to be the culprit.

As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads are trying to modify this state, it would cause a race condition, so the default implementation in glibc is to lock on every call, to make the function thread-safe. This is terrible for performance.
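Conceptually, the behaviour is as if every call went through one global lock (this is an illustrative sketch, not glibc's actual source):

#include <pthread.h>

// Illustrative sketch of a thread-safe rand() -- not glibc's real code.
static unsigned long g_state = 1;
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;

int locked_rand() {
    pthread_mutex_lock(&g_lock);                 // every thread serializes here
    g_state = g_state * 1103515245UL + 12345UL;  // simple LCG step
    int r = (int)((g_state / 65536UL) % 32768UL);
    pthread_mutex_unlock(&g_lock);
    return r;
}

With several threads hammering this from a tight Monte Carlo loop, the lock (and the cache line holding the shared state) becomes a serialization point.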

Unfortunately the solutions to this problem that I've seen on stackoverflow are all local, i.e. deal with the problem in the scope where rand() is called. Instead I propose a "quick and dirty" solution that anyone can use in their program to implement independent random number generation for each thread, requiring no synchronization.

I have tested the code, and it works: there is no locking, and no noticeable slowdown as a result of calls to threadrand. Feel free to point out any blatant mistakes.

threadrand.h

#ifndef _THREAD_RAND_H_
#define _THREAD_RAND_H_

// max number of thread states to store
const int maxThreadNum = 100;

void init_threadrand();

// requires openmp, for thread number
int threadrand();

#endif // _THREAD_RAND_H_

threadrand.cpp

#include "threadrand.h"
#include <cstdlib>
#include <boost/scoped_ptr.hpp>
#include <omp.h>

// can be replaced with array of ordinary pointers, but need to
// explicitly delete previous pointer allocations, and do null checks.
//
// Importantly, the double indirection tries to avoid putting all the
// thread states on the same cache line, which would cause cache invalidations
// to occur on other cores every time rand_r would modify the state.
// (i.e. false sharing)
// A better implementation would be to store each state in a structure
// that is the size of a cache line
static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];

// reinitialize the array of thread state pointers, with random
// seed values.
void init_threadrand() {
    for (int i = 0; i < maxThreadNum; ++i) {
        randThreadStates[i].reset(new unsigned int(std::rand()));
    }
}

// requires openmp, for thread number, to index into array of states.
int threadrand() {
    int i = omp_get_thread_num();
    return rand_r(randThreadStates[i].get());
}
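For what it's worth, the cache-line-sized structure mentioned in the comment above could look something like the sketch below. It assumes 64-byte cache lines and C++11 alignas (older gcc, including 4.6, would need __attribute__((aligned(64))) instead):

#include <cstdlib>
#include <omp.h>

// Sketch: pad each thread's state to its own 64-byte cache line so no
// two states can ever share a line (avoids both false sharing and the
// double indirection). Assumes the same maxThreadNum as threadrand.h.
struct alignas(64) PaddedState {
    unsigned int state;
};

static PaddedState paddedStates[maxThreadNum];

int threadrand_padded() {
    return rand_r(&paddedStates[omp_get_thread_num()].state);
}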

Now you can initialize the random states for threads from main using init_threadrand(), and subsequently get a random number using threadrand() when using several threads in OpenMP.
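A hypothetical end-to-end sketch (the Monte Carlo loop here is made up for illustration; compile with g++ -fopenmp main.cpp threadrand.cpp):

#include "threadrand.h"
#include <cstdio>

int main() {
    init_threadrand(); // seed one independent state per thread slot, once

    long even = 0;
    // Each thread calls threadrand(), which only touches its own state --
    // no lock, no contention.
    #pragma omp parallel for reduction(+:even)
    for (int n = 0; n < 10000000; ++n) {
        if (threadrand() % 2 == 0) ++even;
    }
    std::printf("even draws: %ld\n", even);
    return 0;
}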
