解决C ++中稀疏线性系统的最佳方法-GPU可能吗? [英] Best way of solving sparse linear systems in C++ - GPU Possible?

查看:426
本文介绍了解决C ++中稀疏线性系统的最佳方法-GPU可能吗?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在一个需要解决的项目中

I am currently working on a project where we need to solve

|Ax - b|^2.

在这种情况下,A是一个非常稀疏的矩阵,而A'A每行最多具有5个非零元素.

In this case, A is a very sparse matrix and A'A has at most 5 nonzero elements in each row.

我们正在处理图像,并且A'A的尺寸为NxN,其中N是像素数.在这种情况下,N = 76800.我们计划转到RGB,然后尺寸将为3Nx3N.

We are working with images and the dimension of A'A is NxN where N is the number of pixels. In this case N = 76800. We plan to go to RGB and then the dimension will be 3Nx3N.

在matlab中,使用double求解(A'A)\(A'b)大约需要0.15 s.

In matlab solving (A'A)\(A'b) takes about 0.15 s, using doubles.

我现在已经对Eigens稀疏求解器进行了一些实验.我已经尝试过:

I have now done some experimenting with Eigens sparse solvers. I have tried:

SimplicialLLT
SimplicialLDLT
SparseQR
ConjugateGradient

和一些不同的顺序.迄今为止最好的是

and some different orderings. The by far best so far is

SimplicialLDLT

,使用AMDOrdering大约需要0.35 - 0.5.

例如,当我使用ConjugateGradient时,大约需要花费6 s,而将0用作初始化.

When I for example use ConjugateGradient it takes roughly 6 s, using 0 as initilization.

用于解决该问题的代码是:

The code for solving the problem is:

    A_tot.makeCompressed();
     // Create solver
    Eigen::SimplicialLDLT<Eigen::SparseMatrix<float>, Eigen::Lower, Eigen::AMDOrdering<int> > solver;
    // Eigen::ConjugateGradient<Eigen::SparseMatrix<float>, Eigen::Lower> cg;
    solver.analyzePattern(A_tot);
    t1 = omp_get_wtime();
    solver.compute(A_tot);
    if (solver.info() != Eigen::Success)
     {
         std::cerr << "Decomposition Failed" << std::endl;
         getchar();
     }
    Eigen::VectorXf opt = solver.solve(b_tot);
    t2 = omp_get_wtime();
    std::cout << "Time for normal equations: " << t2 - t1 << std::endl;

这是我第一次在C ++及其求解器中使用稀疏矩阵.对于此项目,速度至关重要,低于0.1 s是最低速度.

This is the first time I use sparse matrices in C++ and its solvers. For this project speed is crucial and below 0.1 s is a minimum.

我想获得一些反馈,这将是最好的策略.例如,应该可以使用Eigen中的SuiteSparseOpenMP.您对这些类型的问题有何经验?例如,是否有一种提取结构的方法?而且conjugateGradient真的应该那么慢吗?

I would like to get some feedback on which would be the best strategy here. For example one is supposed to be able to use SuiteSparse and OpenMP in Eigen. What are your experiences about these types of problems? Is there a way of extracting the structure for example? And should conjugateGradient really be that slow?

感谢您提供宝贵的评论!今晚,我一直在Nvidia上阅读有关cuSparse的内容.它似乎能够分解甚至求解系统.特别是似乎可以重用模式等等.问题是,速度有多快?可能的开销是多少?

Thanks for som valuable comments! Tonight I have been reading a bit about cuSparse on Nvidia. It seems to be able to do factorisation an even solve systems. In particular it seems one can reuse pattern and so forth. The question is how fast could this be and what is the possible overhead?

鉴于我的矩阵A中的数据量与图像中的数据量相同,因此内存复制不应该是这样的问题.几年前,我曾做过用于实时3D重建的软件,然后您复制每一帧的数据,而慢速版本仍以50 Hz以上的频率运行.因此,如果分解速度更快,是否有可能加快速度?特别是,该项目的其余部分将位于GPU上,因此,如果有人可以直接在那里解决它并保留该解决方案,那我就没有缺点了.

Given that the amount of data in my matrix A is the same as in an image, the memory copying should not be such an issue. I did some years ago software for real-time 3D reconstruction and then you copy data for each frame and a slow version still runs in over 50 Hz. So if the factorization is much faster it is a possible speed-up? In particualar the rest of the project will be on the GPU, so if one can solve it there directly and keep the solution it is no drawback I guess.

库达(Cuda)领域发生了很多事情,但我还不是最新.

A lot has happened in the field of Cuda and I am not really up to date.

我找到了两个链接:基准?

Here are two links I found: Benchmark?, MatlabGPU

推荐答案

您的矩阵非常稀疏,对应于2D域上的离散化,因此预计SimplicialLDLT是最快的.由于稀疏模式是固定的,因此请先调用analyzePattern,然后再调用factorize,而不是compute.这应该节省一些毫秒.此外,由于您使用的是常规网格,因此您也可以尝试使用NaturalOrdering绕过重新排序的步骤(不是100%的确定,您必须进行测试).如果仍然不够快,则可以搜索为天际矩阵量身定制的Cholesky解算器(在这种情况下,Cholesky因式分解要简单得多,因此也更快).

Your matrix is extremely sparse and corresponds to a discretization on a 2D domain, so it is expected that SimplicialLDLT is the fastest here. Since the sparsity pattern is fixed, call analyzePattern once, and then factorize instead of compute. This should save some milliseconds. Moreover, since you're working on a regular grid, you might also try to bypass the re-ordering step using NaturalOrdering (not 100% sure, you have to bench). If that's still not fast enough, you might search for a Cholesky solver tailored for skyline matrices (the Cholesky factorization is much simpler and thus faster in this case).

这篇关于解决C ++中稀疏线性系统的最佳方法-GPU可能吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆