Red-Black Gauss Seidel and OpenMP

Problem description

I was trying to prove a point with OpenMP compared to MPICH, and I cooked up the following example to demonstrate how easy it is to get high performance out of OpenMP.

The Gauss-Seidel iteration is split into two separate sweeps, such that within each sweep every update can be performed in any order, and there is no dependency between the tasks. So, in theory, no processor should ever have to wait for another to perform any kind of synchronization.
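To spell out why the sweeps are independent: the update performed in the code below is the standard Gauss-Seidel step for a 1D Poisson-type problem (my reading of the code; the question does not state the PDE explicitly),

$$x_i \leftarrow \frac{h^2}{2}\bigl(b_i - \sigma_i\bigr), \qquad \sigma_i = -\frac{x_{i-1} + x_{i+1}}{h^2}, \qquad \text{i.e.} \quad x_i \leftarrow \frac{x_{i-1} + x_{i+1} + h^2 b_i}{2},$$

so a red (odd i) point reads only its black (even i) neighbours and vice versa, which is why all updates within one sweep can run concurrently.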

The problem I am encountering is that, independent of problem size, I find only a weak speed-up with 2 processors, and with more than 2 processors it might even be slower. For many other linear parallel routines I can obtain very good scaling, but this one is tricky.

My fear is that I am unable to "explain" to the compiler that the operations I perform on the array are thread-safe, so that it cannot really be effective.

See the example below.

Does anyone have any clue how to make this more effective with OpenMP?

void redBlackSmooth(std::vector<double> const & b,
                    std::vector<double> & x,
                    double h)
{
    // Setup relevant constants.
    double const invh2 = 1.0/(h*h);
    double const h2 = (h*h);
    int const N = static_cast<int>(x.size());
    double sigma = 0;

    // Setup some boundary conditions.
    x[0] = 0.0;
    x[N-1] = 0.0;

    // Red sweep.
    #pragma omp parallel for shared(b, x) private(sigma)
    for (int i = 1; i < N-1; i+=2)
    {
        sigma = -invh2*(x[i-1] + x[i+1]);
        x[i] = (h2/2.0)*(b[i] - sigma);
    }

    // Black sweep.
    #pragma omp parallel for shared(b, x) private(sigma)
    for (int i = 2; i < N-1; i+=2)
    {
        sigma = -invh2*(x[i-1] + x[i+1]);
        x[i] = (h2/2.0)*(b[i] - sigma);
    }
}

Addition: I have now also tried a raw-pointer implementation, and it shows the same behavior as the STL container, so it can be ruled out that this is some pseudo-critical behavior coming from the STL.

Recommended answer

First of all, make sure that the x vector is aligned to cache boundaries. I did some testing, and I get something like a 100% improvement with your code on my machine (Core Duo) if I force the alignment of the memory:

#include <stdlib.h>   // posix_memalign (POSIX); release the buffer later with free(x)

double * x;
const size_t CACHE_LINE_SIZE = 256;
posix_memalign(reinterpret_cast<void**>(&x), CACHE_LINE_SIZE, sizeof(double) * N);
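If you would rather keep the std::vector<double> interface from the question, a minimal sketch (my own addition, assuming C++17 and its aligned operator new; AlignedAllocator is a hypothetical name) of a cache-line-aligned allocator could look like this:

#include <cstddef>
#include <new>
#include <vector>

// Allocator that hands out storage aligned to Alignment bytes (C++17).
template <typename T, std::size_t Alignment>
struct AlignedAllocator
{
    using value_type = T;

    AlignedAllocator() noexcept = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n)
    {
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t(Alignment)));
    }
    void deallocate(T* p, std::size_t) noexcept
    {
        ::operator delete(p, std::align_val_t(Alignment));
    }
};

template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return false; }

// 64 bytes is a typical cache-line size; the snippet above uses 256.
using AlignedVector = std::vector<double, AlignedAllocator<double, 64>>;

redBlackSmooth would then take AlignedVector instead of std::vector<double>.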

Second, you can try to assign more computation to each thread (that way you keep the cache lines separated), but I suspect that OpenMP already does something like this under the hood, so it may be worthless for large N.
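One way to express that explicitly (my own sketch, not from the original answer, reusing the variables from the red sweep in the question) is to hand each thread contiguous chunks whose size corresponds to whole cache lines, via OpenMP's schedule clause; because the loop starts at i = 1 the chunks are not perfectly line-aligned, but neighbouring threads then share at most one line per boundary:

// With stride 2, a chunk of 128 iterations spans 256 doubles, i.e. 32 cache lines
// (assuming 64-byte lines), so adjacent threads overlap on at most one line.
const int chunk = 128;

#pragma omp parallel for schedule(static, chunk)
for (int i = 1; i < N - 1; i += 2)
{
    const double sigma = -invh2 * (x[i-1] + x[i+1]);
    x[i] = (h2 / 2.0) * (b[i] - sigma);
}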

In my case this implementation is much faster when x is not cache-aligned.

const int workGroupSize = CACHE_LINE_SIZE / sizeof(double);
assert(N % workGroupSize == 0); // Need to tweak the code a bit to let it work with any N
const int workgroups = N / workGroupSize;

#pragma omp parallel for shared(b, x)
for (int j = 0; j < workgroups; j++)
{
    const int base = j * workGroupSize;
    for (int k = 0; k < workGroupSize; k += 2)
    {
        const int i = base + k + (redSweep ? 1 : 0);
        if (i == 0 || i + 1 == N) continue;
        const double sigma = -invh2 * (x[i-1] + x[i+1]);
        x[i] = (h2 / 2.0) * (b[i] - sigma);
    }
}
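For completeness: this fragment assumes the invh2, h2 and N constants from the function in the question, plus a bool redSweep selecting the colour. A plausible calling pattern (my assumption; redBlackSmoothBlocked is a hypothetical wrapper name, not from the original answer) would be:

// One smoothing step = one red pass followed by one black pass.
redBlackSmoothBlocked(b, x, h, true);   // red sweep: odd indices
redBlackSmoothBlocked(b, x, h, false);  // black sweep: even indices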

In conclusion, you definitely have a problem of cache contention, but given the way OpenMP works (sadly, I am not familiar with it) it should be enough to work with properly allocated buffers.
