C++ OpenMP slower than serial with default thread count


Problem description

I tried using OpenMP to parallelize some of the for-loops in my program, but failed to get a significant speed improvement (an actual slowdown was observed). My target machine will have 4-6 cores, and I currently rely on the OpenMP runtime to pick the thread count for me, so I haven't tried any thread-count combinations yet.

  • Target/development platform: Windows 64-bit
  • using MinGW64 4.7.2 (rubenvb build)

Sample output with OpenMP

Thread count: 4
Dynamic :0
OMP_GET_NUM_PROCS: 4
OMP_IN_PARALLEL: 1
5.612 // <- returned by omp_get_wtime()
5.627 (sec) // <- returned by clock()
Wall time elapsed: 5.62703

Sample output without OpenMP

2.415 (sec) // <- returned by clock()
Wall time elapsed: 2.415

How I measure the time

struct timeval start, end;
gettimeofday(&start, NULL);

#ifdef _OPENMP
    double t1 = (double) clock();
    double wt = omp_get_wtime();
    sim->resetEnvironment(run);
    tout << omp_get_wtime() - wt << std::endl;
    timeEnd(tout, t1);
#else
    double t1 = (double) clock();
    sim->resetEnvironment(run);
    timeEnd(tout, t1);
#endif

gettimeofday(&end, NULL);
tout << "Wall time elapsed: "
     << ((end.tv_sec - start.tv_sec) * 1000000u + (end.tv_usec - start.tv_usec)) / 1.e6
     << std::endl;

The code

void Simulator::resetEnvironment(int run)
{
    #pragma omp parallel
    {
        // (a)
        #pragma omp for schedule(dynamic)
        for (size_t i = 0; i < vector_1.size(); i++) // size ~ 20
            reset(vector_1[i]);
        #pragma omp for schedule(dynamic)
        for (size_t i = 0; i < vector_2.size(); i++) // size ~ 2.3M
            reset(vector_2[i]);
        #pragma omp for schedule(dynamic)
        for (size_t i = 0; i < vector_3.size(); i++) // size ~ 0.3M
            reset(vector_3[i]);
        for (int level = 0; level < level_count; level++) // (b) level = 3
        {
            #pragma omp for schedule(dynamic)
            for (size_t i = 0; i < vector_4[level].size(); i++) // size ~500 - 1K
                reset(vector_4[level][i]);
        }

        #pragma omp for schedule(dynamic)
        for (long i = 0; i < populationSize; i++) // size ~7M
            resetAgent(agents[i]);
    } // end #parallel
} // end: Simulator::resetEnvironment()

Randomness: Inside the reset() function calls, I use an RNG to seed some agents for subsequent tasks. Below is my RNG implementation; as I saw suggested, it uses one RNG engine per thread for thread safety.

class RNG {
public:
    typedef std::mt19937 Engine;

    RNG()
        : real_uni_dist_(0.0, 1.0)
#ifdef _OPENMP
        , engines()
#endif
    {
#ifdef _OPENMP
        int threads = std::max(1, omp_get_max_threads());
        for (int seed = 0; seed < threads; ++seed)
            engines.push_back(Engine(seed));
#else
        engine_.seed(time(NULL));
#endif
    } // end_ctor(RNG)

    /** @return next value of the uniform distribution */
    double operator()()
    {
#ifdef _OPENMP
        return real_uni_dist_(engines[omp_get_thread_num()]);
#else
        return real_uni_dist_(engine_);
#endif
    }

private:
    std::uniform_real_distribution<double> real_uni_dist_;
#ifdef _OPENMP
    std::vector<Engine> engines;
#else
    std::mt19937 engine_;
#endif
}; // end_class(RNG)
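
For reference, a minimal sketch (not from the original post) of how such an RNG would be drawn from inside a parallel loop; the global rng instance and the agents vector are assumptions for illustration:

#include <omp.h>
#include <vector>

RNG rng; // assumed: constructed once, before any parallel region

void seedAgentsExample(std::vector<double>& agents)
{
    // Each thread indexes its own engine internally via omp_get_thread_num(),
    // so concurrent calls to rng() never touch the same engine.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long) agents.size(); i++)
        agents[i] = rng();
}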

Questions:

  • At (a), is it good practice not to use the shortcut 'parallel for', so as to avoid the overhead of creating the team repeatedly?
  • Which part of the implementation could be the cause of the performance degradation?
  • Why are the times reported by clock() and omp_get_wtime() so similar? I expected clock() to be somewhat longer than omp_get_wtime().

  • At (b), my intention in putting the OpenMP directive on the inner loop is that the outer loop has so few iterations (only 3) that I thought I could skip it and parallelize the inner loop over vector_4[level] directly. Is this considered inappropriate, or will it instruct OpenMP to replicate the outer loop across threads and hence actually run the inner loops 12 times instead of 3 (say the current thread count is 4)?

Thanks

Answer

If the measured wall-clock time (as reported by omp_get_wtime()) is close to the total CPU time (as reported by clock()), it could mean one of two things:

  • the code is running single-threaded, but then the total CPU time would be lower than the wall-clock time;
  • there is very high synchronisation and cache coherency overhead, and it is huge in comparison to the actual work being done by the threads (a minimal demonstration follows this list).
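
A minimal sketch of the distinction, assuming a POSIX-style clock() that accumulates CPU time across all threads (note that some Windows CRTs implement clock() as wall time instead): with the work spread over N busy threads, the CPU time should come out roughly N times the wall time, and near-equal values despite multiple threads point to time lost in synchronisation rather than computation.

#include <cstdio>
#include <ctime>
#include <omp.h>

int main()
{
    double wt = omp_get_wtime();
    clock_t ct = clock();

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < 200000000L; i++)
        sum += i * 1e-9;

    printf("checksum : %f\n", sum); // keeps the loop from being optimised away
    printf("wall time: %.3f s (omp_get_wtime)\n", omp_get_wtime() - wt);
    printf("CPU time : %.3f s (clock)\n", (double)(clock() - ct) / CLOCKS_PER_SEC);
    return 0;
}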

Your case is the second one, and the reason is that you use schedule(dynamic). Dynamic scheduling should only be used when each iteration can take a varying amount of time: if such iterations were statically distributed among the threads, work imbalance could occur. schedule(dynamic) takes care of this by giving each task (in your case, each single iteration of the loop) to the next thread that has finished its work and become idle. There is a certain overhead in synchronising the threads and bookkeeping the distribution of the work items, so it should only be used when the amount of work per thread is large in comparison to that overhead. OpenMP allows you to group more iterations into iteration blocks, with the size specified like schedule(dynamic,100) - this would make each thread execute a block (or chunk) of 100 consecutive iterations before asking for a new one. The default block size for dynamic scheduling is 1, i.e. each vector element is processed by a separate thread. I have no idea how much processing is done in reset() and what kind of elements the vector_* hold, but given the serial run time, it is not much at all.
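
Applied to the largest loop from the question, the chunked variant would look like this (the chunk size of 100 is illustrative and should be tuned for the actual workload):

// Each thread now grabs 100 consecutive iterations at a time, cutting the
// scheduling overhead roughly 100-fold compared to the default chunk of 1.
#pragma omp for schedule(dynamic, 100)
for (size_t i = 0; i < vector_2.size(); i++) // size ~ 2.3M
    reset(vector_2[i]);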

Another source of slowdown is the loss of data locality when you use dynamic scheduling. Depending on the type of the elements of those vectors, processing neighbouring elements on different threads leads to false sharing. That means that, e.g., vector_1[i] lies in the same cache line as some other elements of vector_1, e.g. vector_1[i-1] and vector_1[i+1]. When thread 1 modifies vector_1[i], the cache line is reloaded in all the other cores that work on the neighbouring elements. If vector_1[] is only written to, the compiler could be smart enough to generate non-temporal stores (these bypass the cache), but that only works with vectorised stores, and having each core do a single iteration at a time means no vectorisation at all. Data locality can be improved by either switching to static scheduling or, if reset() really takes a varying amount of time, by setting a reasonable chunk size in the schedule(dynamic) clause. The best chunk size is usually processor-dependent, and one often has to tune it to get the best performance.
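
As a rough illustration of keeping dynamic scheduling while limiting false sharing (the 8-byte element size and 64-byte cache line here are assumptions): choose a chunk size that covers whole cache lines, so that two threads can only contend at chunk boundaries.

// With 8-byte elements and 64-byte cache lines, 8 elements share a line;
// a chunk of 64 iterations therefore spans 8 whole lines, so at most the
// first and last line of a chunk can be shared with another thread.
#pragma omp for schedule(dynamic, 64)
for (size_t i = 0; i < vector_3.size(); i++) // size ~ 0.3M
    reset(vector_3[i]);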

So I would strongly suggest that you first switch to static scheduling by replacing every schedule(dynamic) with schedule(static), and then try to optimise further. You don't have to specify a chunk size in the static case, as the default is simply the total number of iterations divided by the number of threads, i.e. each thread gets one contiguous block of iterations.
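
A sketch of resetEnvironment() from the question with this suggestion applied (only the schedule clauses change):

void Simulator::resetEnvironment(int run)
{
    #pragma omp parallel
    {
        // Static scheduling: each thread gets one contiguous block of
        // iterations, preserving data locality and avoiding the
        // per-iteration scheduling overhead of schedule(dynamic).
        #pragma omp for schedule(static)
        for (size_t i = 0; i < vector_1.size(); i++)
            reset(vector_1[i]);

        #pragma omp for schedule(static)
        for (size_t i = 0; i < vector_2.size(); i++)
            reset(vector_2[i]);

        #pragma omp for schedule(static)
        for (size_t i = 0; i < vector_3.size(); i++)
            reset(vector_3[i]);

        for (int level = 0; level < level_count; level++)
        {
            #pragma omp for schedule(static)
            for (size_t i = 0; i < vector_4[level].size(); i++)
                reset(vector_4[level][i]);
        }

        #pragma omp for schedule(static)
        for (long i = 0; i < populationSize; i++)
            resetAgent(agents[i]);
    } // end #parallel
}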

