How to speed up my sparse matrix solver?


Question


I'm writing a sparse matrix solver using the Gauss-Seidel method. By profiling, I've determined that about half of my program's time is spent inside the solver. The performance-critical part is as follows:

size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
    for (size_t x = 1; x < d_nx - 1; ++x) {
        d_x[ic] = d_b[ic]
                - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
                - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
        ++ic; ++iw; ++ie; ++is; ++in;
    }
    ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}


All arrays involved are of float type. Actually, they are not arrays but objects with an overloaded [] operator, which (I think) should be optimized away, but is defined as follows:

inline float &operator[](size_t i) { return d_cells[i]; }
inline float const &operator[](size_t i) const { return d_cells[i]; }


For d_nx = d_ny = 128, this can be run about 3500 times per second on an Intel i7 920. This means that the inner loop body runs 3500 * 128 * 128 = 57 million times per second. Since only some simple arithmetic is involved, that strikes me as a low number for a 2.66 GHz processor.


Maybe it's not limited by CPU power, but by memory bandwidth? Well, one 128 * 128 float array eats 65 kB, so all 6 arrays should easily fit into the CPU's L3 cache (which is 8 MB). Assuming that nothing is cached in registers, I count 15 memory accesses in the inner loop body. On a 64-bit system this is 120 bytes per iteration, so 57 million * 120 bytes = 6.8 GB/s. The L3 cache runs at 2.66 GHz, so it's the same order of magnitude. My guess is that memory is indeed the bottleneck.


To speed this up, I've attempted the following:


  • Compiling with g++ -O3.


  • Parallelizing over 4 cores using OpenMP pragmas. I have to change to the Jacobi algorithm to avoid reads from and writes to the same array. This requires that I do twice as many iterations, leading to a net result of about the same speed.


  • Fiddling with implementation details of the loop body, such as using pointers instead of indices. No effect.
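A common alternative to the Jacobi switch mentioned above is red-black (checkerboard) ordering: cells where x + y is even only read cells where it is odd, and vice versa, so each colour can be swept in parallel with plain OpenMP and no read/write conflict on the solution array. A minimal sketch, assuming the square-grid layout from the question (row stride d_nx; the function name and flat-array signature are my own, not from the question):

```cpp
#include <cstddef>

// Hypothetical red-black Gauss-Seidel sweep, a parallel-friendly
// variant of step(). Within one colour, no cell reads another cell
// of the same colour, so the rows can be split across threads.
void step_redblack(float *x, const float *b,
                   const float *w, const float *e,
                   const float *s, const float *n,
                   std::size_t nx, std::size_t ny) {
    for (int colour = 0; colour < 2; ++colour) {
        #pragma omp parallel for  // harmless no-op if OpenMP is off
        for (long y = 1; y < (long)ny - 1; ++y) {
            // First x in this row with (x + y) % 2 == colour.
            std::size_t x0 = 1 + ((std::size_t(y) + colour + 1) & 1);
            for (std::size_t xx = x0; xx < nx - 1; xx += 2) {
                std::size_t ic = std::size_t(y) * nx + xx;
                x[ic] = b[ic] - w[ic] * x[ic - 1]  - e[ic] * x[ic + 1]
                              - s[ic] * x[ic - nx] - n[ic] * x[ic + nx];
            }
        }
    }
}
```

Unlike the Jacobi rewrite, this keeps the Gauss-Seidel convergence behaviour (up to the reordering) and needs no second array.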


What's the best approach to speed this guy up? Would it help to rewrite the inner body in assembly (I'd have to learn that first)? Should I run this on the GPU instead (which I know how to do, but it's such a hassle)? Any other bright ideas?


(N.B. I do take "no" for an answer, as in: "it can't be done significantly faster, because...")


Update: as requested, here's a full program:

#include <iostream>
#include <cstdlib>
#include <cstring>

using namespace std;

size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;

void step() {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            d_x[ic] = d_b[ic]
                - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
                - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}

void solve(size_t iters) {
    for (size_t i = 0; i < iters; ++i) {
        step();
    }
}

void clear(float *a) {
    memset(a, 0, d_nx * d_ny * sizeof(float));
}

int main(int argc, char **argv) {
    size_t n = d_nx * d_ny;
    d_x = new float[n]; clear(d_x);
    d_b = new float[n]; clear(d_b);
    d_w = new float[n]; clear(d_w);
    d_e = new float[n]; clear(d_e);
    d_s = new float[n]; clear(d_s);
    d_n = new float[n]; clear(d_n);
    solve(atoi(argv[1]));
    cout << d_x[0] << endl; // prevent the thing from being optimized away
}

I compile and run it as follows:

$ g++ -o gstest -O3 gstest.cpp
$ time ./gstest 8000
0

real    0m1.052s
user    0m1.050s
sys     0m0.010s


(It does 8000 instead of 3500 iterations per second because my "real" program does a lot of other stuff too. But it's representative.)


Update 2: I've been told that uninitialized values may not be representative because NaN and Inf values may slow things down. Now clearing the memory in the example code. It makes no difference for me in execution speed, though.

Answer

A few ideas:



  1. Use SIMD. You could load 4 floats at a time from each array into a SIMD register (e.g. SSE on Intel, VMX on PowerPC). The disadvantage of this is that some of the d_x values will be "stale", so your convergence rate will suffer (but not as badly as a Jacobi iteration); it's hard to say whether the speedup offsets it.
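As a sketch of idea 1, here is what one SSE row pass might look like (the function name and flat-index arguments are my own; unaligned loads keep it simple, and the west neighbours read within each group of 4 are exactly the "stale" values mentioned above):

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Illustrative SSE version of the inner row update: 4 cells at a time.
// Within each group of 4, x[i-1] is loaded before the store, so those
// west values are stale relative to scalar Gauss-Seidel.
void row_sse(float *x, const float *b, const float *w, const float *e,
             const float *s, const float *n,
             std::size_t ic0, std::size_t stride, std::size_t count) {
    std::size_t i = ic0;
    std::size_t end = ic0 + (count & ~std::size_t(3));
    for (; i < end; i += 4) {
        __m128 r = _mm_loadu_ps(b + i);
        r = _mm_sub_ps(r, _mm_mul_ps(_mm_loadu_ps(w + i), _mm_loadu_ps(x + i - 1)));
        r = _mm_sub_ps(r, _mm_mul_ps(_mm_loadu_ps(e + i), _mm_loadu_ps(x + i + 1)));
        r = _mm_sub_ps(r, _mm_mul_ps(_mm_loadu_ps(s + i), _mm_loadu_ps(x + i - stride)));
        r = _mm_sub_ps(r, _mm_mul_ps(_mm_loadu_ps(n + i), _mm_loadu_ps(x + i + stride)));
        _mm_storeu_ps(x + i, r);
    }
    for (; i < ic0 + count; ++i)  // scalar tail for leftover cells
        x[i] = b[i] - w[i] * x[i - 1] - e[i] * x[i + 1]
                    - s[i] * x[i - stride] - n[i] * x[i + stride];
}
```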


  2. Use SOR. It's simple, doesn't add much computation, and can improve your convergence rate quite well, even for a relatively conservative relaxation value (say 1.5).
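For idea 2, SOR is a one-line change to the existing update: compute the Gauss-Seidel value, then blend it with the old value using the relaxation factor omega. A sketch, assuming the question's square-grid layout (the function name and omega parameter are illustrative):

```cpp
#include <cstddef>

// Successive over-relaxation applied to the question's 5-point update.
// omega = 1 recovers plain Gauss-Seidel; omega around 1.5 is the
// conservative value suggested above.
void step_sor(float *x, const float *b, const float *w, const float *e,
              const float *s, const float *n,
              std::size_t nx, std::size_t ny, float omega) {
    for (std::size_t y = 1; y < ny - 1; ++y) {
        for (std::size_t xx = 1; xx < nx - 1; ++xx) {
            std::size_t ic = y * nx + xx;
            float gs = b[ic] - w[ic] * x[ic - 1]  - e[ic] * x[ic + 1]
                             - s[ic] * x[ic - nx] - n[ic] * x[ic + nx];
            x[ic] = (1.0f - omega) * x[ic] + omega * gs;  // blend old/new
        }
    }
}
```

The extra multiply-add per cell is cheap next to the memory traffic, so the per-iteration cost barely changes.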


  3. Use conjugate gradient. If this is for the projection step of a fluid simulation (i.e. enforcing incompressibility), you should be able to apply CG and get a much better convergence rate. A good preconditioner helps even more.
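For idea 3, an unpreconditioned CG sketch against the system the question implies (unit diagonal, off-diagonal coefficients w/e/s/n). CG assumes that matrix is symmetric positive definite, which holds for a standard Poisson discretization; all helper names here are illustrative:

```cpp
#include <cstddef>
#include <vector>
#include <cmath>

using Vec = std::vector<float>;

// y = A * p for the 5-point system: unit diagonal, off-diagonals
// w, e, s, n. Boundary cells are treated as identity rows.
static void apply_A(Vec &y, const Vec &p, const Vec &w, const Vec &e,
                    const Vec &s, const Vec &n,
                    std::size_t nx, std::size_t ny) {
    y = p;  // identity part (also covers boundary rows)
    for (std::size_t j = 1; j < ny - 1; ++j)
        for (std::size_t i = 1; i < nx - 1; ++i) {
            std::size_t ic = j * nx + i;
            y[ic] = p[ic] + w[ic] * p[ic - 1]  + e[ic] * p[ic + 1]
                          + s[ic] * p[ic - nx] + n[ic] * p[ic + nx];
        }
}

static float dot(const Vec &a, const Vec &b) {
    float r = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) r += a[i] * b[i];
    return r;
}

// Unpreconditioned conjugate gradient; returns iterations used.
int cg_solve(Vec &x, const Vec &b, const Vec &w, const Vec &e,
             const Vec &s, const Vec &n, std::size_t nx, std::size_t ny,
             int max_iter, float tol) {
    Vec r(b), p, Ap(b.size());
    apply_A(Ap, x, w, e, s, n, nx, ny);
    for (std::size_t i = 0; i < r.size(); ++i) r[i] -= Ap[i];  // r = b - A x
    p = r;
    float rr = dot(r, r);
    for (int it = 0; it < max_iter; ++it) {
        if (std::sqrt(rr) < tol) return it;          // converged
        apply_A(Ap, p, w, e, s, n, nx, ny);
        float alpha = rr / dot(p, Ap);               // step length
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        float rr_new = dot(r, r);
        float beta = rr_new / rr;                    // direction update
        rr = rr_new;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
    }
    return max_iter;
}
```

Each CG iteration costs roughly one Gauss-Seidel sweep plus a few dot products, but the iteration count for a Poisson-like system drops dramatically.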


  4. Use a specialized solver. If the linear system arises from the Poisson equation, you can do even better than conjugate gradient using an FFT-based method.


If you can explain more about what the system you're trying to solve looks like, I can probably give some more advice on #3 and #4.

