Why doesn't this code scale linearly?

Question

I wrote this SOR solver code. Don't bother too much about what the algorithm does; that is not the concern here. But just for the sake of completeness: it may solve a linear system of equations, depending on how well conditioned the system is.

I run it with an ill-conditioned 2097152-row sparse matrix (that never converges), with at most 7 non-zero columns per row.

Translating the code (shown below): the outer do-while loop will perform 10000 iterations (the value I pass as max_iters), the middle for will perform 2097152 iterations, split in chunks of work_line, divided among the OpenMP threads. The innermost for loop will have 7 iterations, except in very few cases (less than 1%) where it can be fewer.

There is a data dependency among the threads in the values of the sol array. Each iteration of the middle for updates one element but reads up to 6 other elements of the array. Since SOR is not an exact algorithm, when reading, it can see either the previous or the current value at that position (if you are familiar with solvers, this is a Gauss-Seidel that tolerates Jacobi behavior in some places for the sake of parallelism).

#include <math.h>      /* fabs */
#include <stdbool.h>
#include <stddef.h>    /* size_t */

typedef double real;   /* `real` is not defined in the post; double is assumed */

typedef struct{
    size_t size;

    unsigned int *col_buffer;
    unsigned int *row_jumper;
    real *elements;
} Mat;

int work_line;

// Assumes there are no null elements on main diagonal
unsigned int solve(const Mat* matrix, const real *rhs, real *sol, real sor_omega, unsigned int max_iters, real tolerance)
{
    real *coefs = matrix->elements;
    unsigned int *cols = matrix->col_buffer;
    unsigned int *rows = matrix->row_jumper;
    int size = matrix->size;
    real compl_omega = 1.0 - sor_omega;
    unsigned int count = 0;
    bool done;

    do {
        done = true;
        #pragma omp parallel shared(done)
        {
            bool tdone = true;

            #pragma omp for nowait schedule(dynamic, work_line)
            for(int i = 0; i < size; ++i) {
                real new_val = rhs[i];
                real diagonal;
                real residual;
                unsigned int end = rows[i+1];

                for(int j = rows[i]; j < end; ++j) {
                    unsigned int col = cols[j];
                    if(col != i) {
                        real tmp;
                        #pragma omp atomic read
                        tmp = sol[col];

                        new_val -= coefs[j] * tmp;
                    } else {
                        diagonal = coefs[j];
                    }
                }

                residual = fabs(new_val - diagonal * sol[i]);
                if(residual > tolerance) {
                    tdone = false;
                }

                new_val = sor_omega * new_val / diagonal + compl_omega * sol[i];
                #pragma omp atomic write
                sol[i] = new_val;
            }

            #pragma omp atomic update
            done &= tdone;
        }
    } while(++count < max_iters && !done);

    return count;
}

As you can see, there is no lock inside the parallel region, so, according to what we are always taught, it is the kind of 100% parallel problem. That is not what I see in practice.

All my tests were run on an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 2 processors, 10 cores each, with hyper-threading enabled, summing up to 40 logical cores.

In my first set of runs, work_line was fixed at 2048, and the number of threads varied from 1 to 40 (40 runs in total). This is the graph with the execution time of each run (seconds x number of threads):

The surprise was the logarithmic curve, so I thought that since the work line was so large, the shared caches were not being used very well, so I dug up the virtual file /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size, which told me this processor's L1 cache synchronizes updates in groups of 64 bytes (8 doubles in the array sol). So I set work_line to 8:
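As a side note, the cache-line size can also be queried programmatically instead of reading sysfs. A minimal sketch, assuming Linux/glibc (the _SC_LEVEL1_DCACHE_LINESIZE constant is a glibc extension):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* L1 data cache line size in bytes (0 or -1 if unknown) */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if(line > 0)
        printf("L1 cache line: %ld bytes (%ld doubles)\n",
               line, line / (long)sizeof(double));
    return 0;
}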

Then I thought 8 was too low to avoid NUMA stalls and set work_line to 16:

While running the above, I thought "Who am I to predict what work_line is good? Let's just see...", and scheduled runs with every work_line from 8 to 2048, in steps of 8 (i.e. every multiple of the cache line, from 1 to 256). The results for 20 and 40 threads (seconds x size of the split of the middle for loop, divided among the threads):
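For reference, such a sweep can be driven by a small harness along these lines. This is only a sketch under assumptions: it is compiled together with the code above (so Mat, real, work_line and solve() are visible), load_matrix() and load_rhs() are hypothetical loaders not shown in the question, and the sor_omega/tolerance values are made up:

#include <stdio.h>
#include <string.h>
#include <omp.h>

int main(void)
{
    Mat matrix;
    real *rhs, *sol;
    load_matrix(&matrix);                             /* hypothetical loader */
    load_rhs(&rhs, &sol, matrix.size);                /* hypothetical loader */

    for(work_line = 8; work_line <= 2048; work_line += 8) {
        memset(sol, 0, matrix.size * sizeof(real));   /* same starting guess every run */
        double t0 = omp_get_wtime();
        solve(&matrix, rhs, sol, 1.0, 10000, 1e-9);   /* sor_omega and tolerance assumed */
        printf("%d %f\n", work_line, omp_get_wtime() - t0);
    }
    return 0;
}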

I believe the cases with low work_line suffer badly from cache synchronization, while bigger work_line offers no benefit beyond a certain number of threads (I assume because the memory pathway is the bottleneck). It is very sad that a problem that seems 100% parallel presents such bad behavior on a real machine. So, before I become convinced that multi-core systems are a very well-sold lie, I am asking you here first:

How can I make this code scale linearly with the number of cores? What am I missing? Is there something in the problem that makes it not as good as it seems at first?

Update

Following the suggestions, I tested both static and dynamic scheduling, but removed the atomic read/write on the array sol. For reference, the blue and orange lines are the same as in the previous graph (just up to work_line = 248). The yellow and green lines are the new ones. From what I could see: static makes a significant difference for low work_line, but after 96 the benefits of dynamic outweigh its overhead, making it faster. The atomic operations make no difference at all.

Answer

Sparse matrix-vector multiplication is memory bound (see http://stackoverflow.com/questions/13636464/slow-sparse-matrix-vector-product-csr-using-open-mp) and this can be shown with a simple roofline model. Memory-bound problems benefit from the higher memory bandwidth of multi-socket NUMA systems, but only if the data initialisation is done in such a way that the data is distributed between the two NUMA domains. I have some reason to believe that you are loading the matrix serially and therefore all its memory is allocated on a single NUMA node. In that case you won't benefit from the double memory bandwidth available on a dual-socket system, and it really doesn't matter if you use schedule(dynamic) or schedule(static). What you could do is enable the memory interleaving NUMA policy in order to have the memory allocation spread between both NUMA nodes. Thus each thread would end up with 50% local memory accesses and 50% remote memory accesses, instead of having all threads on the second CPU hit by 100% remote memory accesses. The easiest way to enable the policy is by using numactl:

$ OMP_NUM_THREADS=... OMP_PROC_BIND=1 numactl --interleave=all ./program ...

OMP_PROC_BIND=1 enables thread pinning and should improve the performance a bit.
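The alternative to interleaving that the answer alludes to ("only if the data initialisation is done in such a way that the data is distributed...") is NUMA-aware first-touch initialisation. A minimal sketch, under the assumptions that real is double and that fill_rhs_entry() stands in for however the data is actually loaded (neither is from the original post):

/* First-touch initialisation: on Linux, a page is physically placed on the
 * NUMA node of the thread that first writes it, so initialising the arrays
 * in parallel with a static schedule puts each chunk on the node that will
 * later work on it (this pairs best with schedule(static) in the solver). */
typedef double real;                      /* assumption, as above */

extern real fill_rhs_entry(int i);        /* hypothetical loader for one entry */

void first_touch_init(real *rhs, real *sol, int size)
{
    #pragma omp parallel for schedule(static)
    for(int i = 0; i < size; ++i) {
        rhs[i] = fill_rhs_entry(i);       /* the first write decides page placement */
        sol[i] = 0.0;
    }
}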

I would also like to point out that this:

done = true;
#pragma omp parallel shared(done)
{
    bool tdone = true;

    // ...

    #pragma omp atomic update
    done &= tdone;
}

is probably a not-very-efficient re-implementation of:

done = true;
#pragma omp parallel reduction(&:done)
{
    // ...
        if(residual > tolerance) {
            done = false;
        }
    // ...
}

There won't be a notable performance difference between the two implementations because of the amount of work done in the inner loop, but it is still not a good idea to reimplement existing OpenMP primitives, for the sake of portability and readability.
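For concreteness, a minimal compilable sketch of how the outer loop looks with the suggested reduction clause; the residual computation is stubbed out here, while in the real code it is the body of solve() above:

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    const unsigned int max_iters = 10000;
    const int size = 1 << 21;               /* 2097152 rows, as in the question */
    unsigned int count = 0;
    bool done;

    do {
        done = true;
        /* each thread gets a private `done` initialised by the reduction;
           the private copies are AND-ed together when the region ends */
        #pragma omp parallel for schedule(dynamic, 2048) reduction(&:done)
        for(int i = 0; i < size; ++i) {
            double residual = 0.0;           /* stub: the real solver computes the SOR residual */
            if(residual > 1e-9)
                done = false;
        }
    } while(++count < max_iters && !done);

    printf("stopped after %u iteration(s)\n", count);
    return 0;
}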
