Eigen 3.3 Conjugate Gradient is slower when multi-threaded with GCC compiler optimization
Problem description
I've been using the ConjugateGradient solver in Eigen 3.2 and decided to try upgrading to Eigen 3.3.3, hoping to benefit from the new multi-threading features.
Sadly, the solver seems slower (~10%) when I enable -fopenmp with GCC 4.8.4. Looking at xosview, I see that all 8 CPUs are being used, yet performance is slower...
After some testing, I discovered that if I disable compiler optimization (use -O0 instead of -O3), then -fopenmp does speed up the solver by ~50%.
Of course, it's not really worth disabling optimization just to benefit from multi-threading, since that would be even slower overall.
Following advice from https://stackoverflow.com/a/42135567/7974125, I am storing the full sparse matrix and passing Lower|Upper as the UpLo parameter.
I've also tried each of the 3 preconditioners and tried using RowMajor matrices, to no avail.
Is there anything else to try to get the full benefits of both multi-threading and compiler optimization?
I cannot post my actual code, but this is a quick test using the Laplacian example from Eigen's documentation, changed to use ConjugateGradient instead of SimplicialCholesky. (Both of these solvers work with SPD matrices.)
#include <Eigen/Sparse>
#include <bench/BenchTimer.h>
#include <iostream>
#include <vector>

using namespace Eigen;
using namespace std;

// Use RowMajor to make use of multi-threading
typedef SparseMatrix<double, RowMajor> SpMat;
typedef Triplet<double> T;

// Assemble sparse matrix from
// https://eigen.tuxfamily.org/dox/TutorialSparse_example_details.html
void insertCoefficient(int id, int i, int j, double w, vector<T>& coeffs,
                       VectorXd& b, const VectorXd& boundary)
{
    int n = int(boundary.size());
    int id1 = i+j*n;

    if(i==-1 || i==n) b(id) -= w * boundary(j);      // constrained coefficient
    else if(j==-1 || j==n) b(id) -= w * boundary(i); // constrained coefficient
    else coeffs.push_back(T(id,id1,w));              // unknown coefficient
}

void buildProblem(vector<T>& coefficients, VectorXd& b, int n)
{
    b.setZero();
    ArrayXd boundary = ArrayXd::LinSpaced(n, 0,M_PI).sin().pow(2);
    for(int j=0; j<n; ++j)
    {
        for(int i=0; i<n; ++i)
        {
            int id = i+j*n;
            insertCoefficient(id, i-1,j, -1, coefficients, b, boundary);
            insertCoefficient(id, i+1,j, -1, coefficients, b, boundary);
            insertCoefficient(id, i,j-1, -1, coefficients, b, boundary);
            insertCoefficient(id, i,j+1, -1, coefficients, b, boundary);
            insertCoefficient(id, i,j,    4, coefficients, b, boundary);
        }
    }
}

int main()
{
    int n = 300;  // size of the image
    int m = n*n;  // number of unknowns (=number of pixels)

    // Assembly:
    vector<T> coefficients;  // list of non-zero coefficients
    VectorXd b(m);           // the right-hand-side vector resulting from the constraints
    buildProblem(coefficients, b, n);

    SpMat A(m,m);
    A.setFromTriplets(coefficients.begin(), coefficients.end());

    // Solving:
    // Use ConjugateGradient with Lower|Upper as the UpLo template parameter to make use of multi-threading
    BenchTimer t;
    t.reset(); t.start();
    ConjugateGradient<SpMat, Lower|Upper> solver(A);
    VectorXd x = solver.solve(b);  // run the iterative solve for the given right-hand side
    t.stop();

    cout << "Real time: " << t.value(1) << endl;  // 0=CPU_TIMER, 1=REAL_TIMER

    return 0;
}
Resulting output:
// No optimization, without OpenMP
g++ cg.cpp -O0 -I./eigen -o cg
./cg
Real time: 23.9473
// No optimization, with OpenMP
g++ cg.cpp -O0 -I./eigen -fopenmp -o cg
./cg
Real time: 17.6621
// -O3 optimization, without OpenMP
g++ cg.cpp -O3 -I./eigen -o cg
./cg
Real time: 0.924272
// -O3 optimization, with OpenMP
g++ cg.cpp -O3 -I./eigen -fopenmp -o cg
./cg
Real time: 1.04809
Recommended answer
Your problem is too small to expect any benefit from multi-threading. Sparse matrices are expected to be at least one order of magnitude larger. Eigen's code should be adjusted to reduce the number of threads in this case.
Moreover, I guess that you have only 4 physical cores, so running with OMP_NUM_THREADS=4 ./cg might help.