How to speed up my sparse matrix solver?


Question


I'm writing a sparse matrix solver using the Gauss-Seidel method. By profiling, I've determined that about half of my program's time is spent inside the solver. The performance-critical part is as follows:

size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
    for (size_t x = 1; x < d_nx - 1; ++x) {
        d_x[ic] = d_b[ic]
                - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
                - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
        ++ic; ++iw; ++ie; ++is; ++in;
    }
    ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}


All arrays involved are of float type. Actually, they are not arrays but objects with an overloaded [] operator, which (I think) should be optimized away, but is defined as follows:

inline float &operator[](size_t i) { return d_cells[i]; }
inline float const &operator[](size_t i) const { return d_cells[i]; }


For d_nx = d_ny = 128, this can be run about 3500 times per second on an Intel i7 920. This means that the inner loop body runs 3500 * 128 * 128 = 57 million times per second. Since only some simple arithmetic is involved, that strikes me as a low number for a 2.66 GHz processor.


Maybe it's not limited by CPU power, but by memory bandwidth? Well, one 128 * 128 float array eats 65 kB, so all 6 arrays should easily fit into the CPU's L3 cache (which is 8 MB). Assuming that nothing is cached in registers, I count 15 memory accesses in the inner loop body. On a 64-bit system this is 120 bytes per iteration, so 57 million * 120 bytes = 6.8 GB/s. The L3 cache runs at 2.66 GHz, so it's the same order of magnitude. My guess is that memory is indeed the bottleneck.


To speed this up, I've attempted the following:


  • Compiling with g++ -O3.


  • Parallelizing over 4 cores using OpenMP pragmas. I have to change to the Jacobi algorithm to avoid reads from and writes to the same array. This requires that I do twice as many iterations, leading to a net result of about the same speed.


  • Fiddling with implementation details of the loop body, such as using pointers instead of indices. No effect.
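A common alternative to the Jacobi switch mentioned above is red-black (checkerboard) ordering: cells where x + y is even only read cells where it is odd, and vice versa, so each colour can be swept in parallel with plain OpenMP and no read/write conflict on the solution array. A minimal sketch, assuming the square-grid layout from the question (row stride d_nx; the function name and flat-array signature are my own, not from the question):

```cpp
#include <cstddef>

// Hypothetical red-black Gauss-Seidel sweep, a parallel-friendly
// variant of step(). Within one colour, no cell reads another cell
// of the same colour, so the rows can be split across threads.
void step_redblack(float *x, const float *b,
                   const float *w, const float *e,
                   const float *s, const float *n,
                   std::size_t nx, std::size_t ny) {
    for (int colour = 0; colour < 2; ++colour) {
        #pragma omp parallel for  // harmless no-op if OpenMP is off
        for (long y = 1; y < (long)ny - 1; ++y) {
            // First x in this row with (x + y) % 2 == colour.
            std::size_t x0 = 1 + ((std::size_t(y) + colour + 1) & 1);
            for (std::size_t xx = x0; xx < nx - 1; xx += 2) {
                std::size_t ic = std::size_t(y) * nx + xx;
                x[ic] = b[ic] - w[ic] * x[ic - 1]  - e[ic] * x[ic + 1]
                              - s[ic] * x[ic - nx] - n[ic] * x[ic + nx];
            }
        }
    }
}
```

Unlike the Jacobi rewrite, this keeps the Gauss-Seidel convergence behaviour (up to the reordering) and needs no second array.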


What's the best approach to speed this guy up? Would it help to rewrite the inner body in assembly (I'd have to learn that first)? Should I run this on the GPU instead (which I know how to do, but it's such a hassle)? Any other bright ideas?


(N.B. I do take "no" for an answer, as in: "it can't be done significantly faster, because...")


Update: as requested, here's a full program:

#include <iostream>
#include <cstdlib>
#include <cstring>

using namespace std;

size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;

void step() {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            d_x[ic] = d_b[ic]
                - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
                - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}

void solve(size_t iters) {
    for (size_t i = 0; i < iters; ++i) {
        step();
    }
}

void clear(float *a) {
    memset(a, 0, d_nx * d_ny * sizeof(float));
}

int main(int argc, char **argv) {
    size_t n = d_nx * d_ny;
    d_x = new float[n]; clear(d_x);
    d_b = new float[n]; clear(d_b);
    d_w = new float[n]; clear(d_w);
    d_e = new float[n]; clear(d_e);
    d_s = new float[n]; clear(d_s);
    d_n = new float[n]; clear(d_n);
    solve(atoi(argv[1]));
    cout << d_x[0] << endl; // prevent the thing from being optimized away
}

I compile and run it as follows:

$ g++ -o gstest -O3 gstest.cpp
$ time ./gstest 8000
0

real    0m1.052s
user    0m1.050s
sys     0m0.010s


(It does 8000 instead of 3500 iterations per second because my "real" program does a lot of other stuff too. But it's representative.)


Update 2: I've been told that uninitialized values may not be representative because NaN and Inf values may slow things down. Now clearing the memory in the example code. It makes no difference for me in execution speed, though.

Answer

A few ideas:



  1. Use SIMD. You could load 4 floats at a time from each array into a SIMD register (e.g. SSE on Intel, VMX on PowerPC). The disadvantage of this is that some of the d_x values will be "stale", so your convergence rate will suffer (but not as badly as a Jacobi iteration); it's hard to say whether the speedup offsets it.
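As a sketch of idea 1, here is what one SSE row pass might look like (the function name and flat-index arguments are my own; unaligned loads keep it simple, and the west neighbours read within each group of 4 are exactly the "stale" values mentioned above):

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Illustrative SSE version of the inner row update: 4 cells at a time.
// Within each group of 4, x[i-1] is loaded before the store, so those
// west values are stale relative to scalar Gauss-Seidel.
void row_sse(float *x, const float *b, const float *w, const float *e,
             const float *s, const float *n,
             std::size_t ic0, std::size_t stride, std::size_t count) {
    std::size_t i = ic0;
    std::size_t end = ic0 + (count & ~std::size_t(3));
    for (; i < end; i += 4) {
        __m128 r = _mm_loadu_ps(b + i);
        r = _mm_sub_ps(r, _mm_mul_ps(_mm_loadu_ps(w + i), _mm_loadu_ps(x + i - 1)));
        r = _mm_sub_ps(r, _mm_mul_ps(_mm_loadu_ps(e + i), _mm_loadu_ps(x + i + 1)));
        r = _mm_sub_ps(r, _mm_mul_ps(_mm_loadu_ps(s + i), _mm_loadu_ps(x + i - stride)));
        r = _mm_sub_ps(r, _mm_mul_ps(_mm_loadu_ps(n + i), _mm_loadu_ps(x + i + stride)));
        _mm_storeu_ps(x + i, r);
    }
    for (; i < ic0 + count; ++i)  // scalar tail for leftover cells
        x[i] = b[i] - w[i] * x[i - 1] - e[i] * x[i + 1]
                    - s[i] * x[i - stride] - n[i] * x[i + stride];
}
```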


  2. Use SOR. It's simple, doesn't add much computation, and can improve your convergence rate quite well, even for a relatively conservative relaxation value (say 1.5).
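For idea 2, SOR is a one-line change to the existing update: compute the Gauss-Seidel value, then blend it with the old value using the relaxation factor omega. A sketch, assuming the question's square-grid layout (the function name and omega parameter are illustrative):

```cpp
#include <cstddef>

// Successive over-relaxation applied to the question's 5-point update.
// omega = 1 recovers plain Gauss-Seidel; omega around 1.5 is the
// conservative value suggested above.
void step_sor(float *x, const float *b, const float *w, const float *e,
              const float *s, const float *n,
              std::size_t nx, std::size_t ny, float omega) {
    for (std::size_t y = 1; y < ny - 1; ++y) {
        for (std::size_t xx = 1; xx < nx - 1; ++xx) {
            std::size_t ic = y * nx + xx;
            float gs = b[ic] - w[ic] * x[ic - 1]  - e[ic] * x[ic + 1]
                             - s[ic] * x[ic - nx] - n[ic] * x[ic + nx];
            x[ic] = (1.0f - omega) * x[ic] + omega * gs;  // blend old/new
        }
    }
}
```

The extra multiply-add per cell is cheap next to the memory traffic, so the per-iteration cost barely changes.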


  3. Use conjugate gradient. If this is for the projection step of a fluid simulation (i.e. enforcing incompressibility), you should be able to apply CG and get a much better convergence rate. A good preconditioner helps even more.
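For idea 3, an unpreconditioned CG sketch against the system the question implies (unit diagonal, off-diagonal coefficients w/e/s/n). CG assumes that matrix is symmetric positive definite, which holds for a standard Poisson discretization; all helper names here are illustrative:

```cpp
#include <cstddef>
#include <vector>
#include <cmath>

using Vec = std::vector<float>;

// y = A * p for the 5-point system: unit diagonal, off-diagonals
// w, e, s, n. Boundary cells are treated as identity rows.
static void apply_A(Vec &y, const Vec &p, const Vec &w, const Vec &e,
                    const Vec &s, const Vec &n,
                    std::size_t nx, std::size_t ny) {
    y = p;  // identity part (also covers boundary rows)
    for (std::size_t j = 1; j < ny - 1; ++j)
        for (std::size_t i = 1; i < nx - 1; ++i) {
            std::size_t ic = j * nx + i;
            y[ic] = p[ic] + w[ic] * p[ic - 1]  + e[ic] * p[ic + 1]
                          + s[ic] * p[ic - nx] + n[ic] * p[ic + nx];
        }
}

static float dot(const Vec &a, const Vec &b) {
    float r = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) r += a[i] * b[i];
    return r;
}

// Unpreconditioned conjugate gradient; returns iterations used.
int cg_solve(Vec &x, const Vec &b, const Vec &w, const Vec &e,
             const Vec &s, const Vec &n, std::size_t nx, std::size_t ny,
             int max_iter, float tol) {
    Vec r(b), p, Ap(b.size());
    apply_A(Ap, x, w, e, s, n, nx, ny);
    for (std::size_t i = 0; i < r.size(); ++i) r[i] -= Ap[i];  // r = b - A x
    p = r;
    float rr = dot(r, r);
    for (int it = 0; it < max_iter; ++it) {
        if (std::sqrt(rr) < tol) return it;          // converged
        apply_A(Ap, p, w, e, s, n, nx, ny);
        float alpha = rr / dot(p, Ap);               // step length
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        float rr_new = dot(r, r);
        float beta = rr_new / rr;                    // direction update
        rr = rr_new;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
    }
    return max_iter;
}
```

Each CG iteration costs roughly one Gauss-Seidel sweep plus a few dot products, but the iteration count for a Poisson-like system drops dramatically.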


  4. Use a specialized solver. If the linear system arises from the Poisson equation, you can do even better than conjugate gradient using an FFT-based method.


If you can explain more about what the system you're trying to solve looks like, I can probably give some more advice on #3 and #4.

