Very slow performance of cusparse csrsv_analysis


Problem description


I wrote a conjugate gradient solver (for linear systems of equations) with LU preconditioning, using Dr. Maxim Naumov's papers on NVIDIA's research community site as a guideline. The residual update step, which requires solving a lower triangular system and then an upper triangular system, is divided into two phases:

  1. analysis phase (which exploits the sparsity pattern and decides the parallelization level).
  2. the solution phase itself.

According to all posts related to this topic, and to Naumov's paper itself, the analysis phase is significantly slower than the solve phase, but it is executed only once, so it should not be an issue when the whole execution time is taken into consideration. In my program, however, the analysis phase takes ~35-45% of the whole solution time (the time needed to perform all iterations!), which is quite annoying. Another thing is that I am fairly sure the matrix's sparsity pattern does not allow a great deal of parallelization, since nearly all elements require previous elements to be known (in CFD each node needs at least the 6 adjacent nodes around it (hexahedral volumes), and that is repeated on every row), so the analysis phase will not be very useful anyway!

matrixLU in this code contains both the upper and lower triangular factors: the upper triangular factor keeps the original matrix diagonal, while the lower triangular factor has a unit diagonal (LU factorization).

// z = inv(matrixLU)*r
// Descriptors: L is unit-diagonal lower triangular, U is non-unit-diagonal
// upper triangular; both view the same combined 1-based CSR storage
// (matrixLU, iRow, jCol).
cusparseMatDescr_t descrL = 0;
cusparseMatDescr_t descrU = 0;
cusparseStatus = cusparseCreateMatDescr(&descrL);
cusparseStatus = cusparseCreateMatDescr(&descrU);

cusparseSetMatType(descrL, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatDiagType(descrL, CUSPARSE_DIAG_TYPE_UNIT);
cusparseSetMatFillMode(descrL, CUSPARSE_FILL_MODE_LOWER);

cusparseSetMatType(descrU, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descrU, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatDiagType(descrU, CUSPARSE_DIAG_TYPE_NON_UNIT);
cusparseSetMatFillMode(descrU, CUSPARSE_FILL_MODE_UPPER);

cusparseSolveAnalysisInfo_t inforL = 0;
cusparseSolveAnalysisInfo_t inforU = 0;
cusparseStatus = cusparseCreateSolveAnalysisInfo(&inforL);
cusparseStatus = cusparseCreateSolveAnalysisInfo(&inforU);

// Time the two analysis calls (one per triangular factor).
double startFA = omp_get_wtime();
cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, NZ, descrL, matrixLU, iRow, jCol, inforL);
if (cusparseStatus != CUSPARSE_STATUS_SUCCESS) printf("%s\n\n", "cusparseDcsrsv_analysis1 Error !");
cusparseStatus = cusparseDcsrsv_analysis(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, NZ, descrU, matrixLU, iRow, jCol, inforU);
if (cusparseStatus != CUSPARSE_STATUS_SUCCESS) printf("%s\n\n", "cusparseDcsrsv_analysis2 Error !");
double finishFA = omp_get_wtime();
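
The analysis info is then consumed once per CG iteration by the corresponding triangular solves. A minimal sketch of that step, assuming the CUDA 4.1+ legacy API (where alpha is passed by pointer) and device vectors d_r, d_t, d_z of length N (names chosen here for illustration only):

// Per-iteration preconditioner application: z = inv(U) * inv(L) * r.
const double one = 1.0;

// Forward substitution with the unit-diagonal lower factor: solve L*t = r.
cusparseStatus = cusparseDcsrsv_solve(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, &one, descrL, matrixLU, iRow, jCol, inforL, d_r, d_t);

// Backward substitution with the upper factor: solve U*z = t.
cusparseStatus = cusparseDcsrsv_solve(cusparseHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, &one, descrU, matrixLU, iRow, jCol, inforU, d_t, d_z);

Unlike the analysis calls, these solves run once per iteration, which is why the analysis cost is normally expected to amortize over many iterations.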

So, does anybody have a clue why the analysis phase is so slow, and how it may be accelerated? Is the (analysis phase execution time)/(solve phase execution time) ratio GPU dependent?

Edit: I tried this solver on many cases, and the results were similar, but the specific case I am concerned about has the following conditions:

  • Size (N): ~860,000 × 860,000
  • Number of non-zeros (NZ): ~6,000,000
  • Number of iterations required for convergence: 10
  • Analysis phase execution time: 210 ms
  • Solve phase execution time (summing all iterations): 350 ms
  • All floating point operations are performed in double precision
  • GPU: GeForce GTX 550 Ti
  • OS: Windows 7 Ultimate, 64-bit

Solution

I contacted our linear algebra libraries team at NVIDIA and they provided the following feedback.

  1. With respect to the performance results:

    1. The CG iterative method performs only 10 iterations until the solution of the linear system is found. It seems that in this case this is not enough iterations to amortize the time consumed by the analysis phase (indeed, 210 ms of analysis against 10 × 35 ms of solves makes the analysis roughly 37% of the 560 ms total, matching the ~35-45% reported in the question). (Edit:) The number of steps required to converge to the solution varies across different applications. In some cases unpreconditioned iterative methods do not converge (or take thousands of iterations), while preconditioned iterative methods take tens (or hundreds) of iterations to converge to a solution. The fact that in this particular application the iterative method converges in 10 iterations is not necessarily representative of all applications.

    2. The ratio of the analysis and solve phases is 6 = 210/35 (350 ms of total solve time over 10 iterations gives 35 ms per solve), which is in line with our experiments on other matrices, where it usually falls in the interval [4,10]. It is not clear how the time for the analysis and solve phases compares to the time taken by the sparse triangular solve on the CPU. (This would be interesting information to know.)

  2. With respect to the algorithm:

    1. Unfortunately, there are only a few things you can do to recover a little bit of performance (there is no real way to speed up the analysis phase). For example, you can split matrixLU into separate lower and upper triangular parts; computing with separate triangular parts is in general faster than computing with triangular parts embedded in a general matrix (a rough sketch of such a split follows this list). If you are using an incomplete factorization with 0 fill-in as a preconditioner, you can also take advantage of the incomplete-LU/Cholesky with 0 fill-in on the GPU, which has recently become available in the CUSPARSE library. (Edit:) The incomplete-LU factorization with 0 fill-in is available in the CUDA Toolkit 5.0 release. Currently, the early-access version of this release can be obtained by registered developers on the NVIDIA website.
    2. We have never looked at how the ratio of the analysis and solve phases varies across different GPUs; all our experiments were done on a C2050.
    3. The analysis phase is slow because it has to collect information on which rows can be processed together into levels. The solve phase is faster because it only processes the levels (see this paper).
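
To illustrate the first suggestion above, here is a minimal host-side sketch of splitting the combined factor into separate L and U CSR matrices. splitLU and its output arrays are hypothetical helpers, not part of the original post; the input follows the question's 1-based CSR arrays, and the caller is assumed to have allocated valL/colL with room for the strictly lower non-zeros plus N unit-diagonal entries, and valU/colU for the rest:

/* Split the combined factor (1-based CSR) into separate L and U. */
void splitLU(int N, const double *lu, const int *iRow, const int *jCol,
             double *valL, int *rowL, int *colL,
             double *valU, int *rowU, int *colU)
{
    int kL = 0, kU = 0;
    rowL[0] = rowU[0] = 1;                        /* 1-based row pointers */
    for (int i = 0; i < N; ++i) {
        for (int k = iRow[i] - 1; k < iRow[i + 1] - 1; ++k) {
            if (jCol[k] < i + 1) {                /* strictly lower -> L  */
                valL[kL] = lu[k];  colL[kL] = jCol[k];  ++kL;
            } else {                              /* diagonal, upper -> U */
                valU[kU] = lu[k];  colU[kU] = jCol[k];  ++kU;
            }
        }
        valL[kL] = 1.0;  colL[kL] = i + 1;  ++kL; /* explicit unit diagonal */
        rowL[i + 1] = kL + 1;
        rowU[i + 1] = kU + 1;
    }
}

With the factors separated, L can carry an explicit unit diagonal (as here) and be described with CUSPARSE_DIAG_TYPE_NON_UNIT, or those entries can be dropped and CUSPARSE_DIAG_TYPE_UNIT kept.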

Finally, you also mentioned that you think there is not enough parallelism in the problem. But the amount of parallelism can be hard to guess just from looking at the matrix. If there is indeed little parallelism (every row depends on the previous row), then the CUSPARSE sparse triangular solve might not be the right algorithm for this problem. You can also try running the CG iterative method without preconditioning, or with other types of preconditioners that might be more suitable to your problem.
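
To make the notion of "levels" and available parallelism concrete, here is a rough host-side sketch of the idea (an illustration only, not CUSPARSE's actual implementation) for a lower triangular matrix in the question's 1-based CSR format. Rows within one level have no dependencies on each other and can be solved in parallel; N divided by the returned level count is a crude estimate of the average parallelism, and a level count close to N means the solve is essentially sequential:

/* Compute the dependency level of each row; return the number of levels. */
int countLevels(int N, const int *iRow, const int *jCol, int *level)
{
    int nLevels = 0;
    for (int i = 0; i < N; ++i) {
        int lv = 0;                               /* no dependency -> level 0 */
        for (int k = iRow[i] - 1; k < iRow[i + 1] - 1; ++k) {
            int j = jCol[k] - 1;                  /* 0-based column index */
            if (j < i && level[j] + 1 > lv)       /* depends on an earlier row */
                lv = level[j] + 1;
        }
        level[i] = lv;
        if (lv + 1 > nLevels) nLevels = lv + 1;
    }
    return nLevels;
}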

As mentioned earlier, it would be interesting to know how the performance of this code compares on the GPU and CPU.

Edit: Regarding some comments... As mentioned earlier, the ultimate number of iterations for a preconditioned iterative method varies across different applications. In some cases it can be very small (for example, 10, as it is in your application), but in others it can be relatively large (measured in the 100s). The paper focuses not on the "triumph", but rather on the tradeoffs of the algorithm. It attempts to give the user a deeper understanding of the routine so that it can be used effectively. Again, if the sparsity pattern at hand does not have any parallelism, this is not the right algorithm for you.

With respect to the CPU and GPU comparisons, there are certain problems (such as dense matrix-matrix or sparse matrix-vector multiplication) for which the GPU is great, and there are others (such as traversing a linked list or executing a completely sequential piece of code) for which it is not. The solution of sparse triangular linear systems lies somewhere in between those algorithms (depending on the sparsity pattern of the coefficient matrix and on other properties of the linear system, it might or might not work well for you). Also, the decision to use the GPU is not based solely on a single algorithm, such as the sparse triangular solve, but on the entire set of algorithms used by the application. Ultimately, GPUs are often very successful in accelerating and improving the performance of the overall application.

Finally, with respect to trsm vs. csrsv, it does not surprise us that for small matrices performing operations in dense storage is faster than in sparse storage (even though these small matrices might be sparse). This would usually be true not only for triangular solves, but for other linear algebra operations as well (though the crossover might happen at different points depending on the operation and the sparsity pattern of the matrix). Also, once again, the sparse triangular solve algorithm was designed to run in an iterative setting where the analysis is done once and the solve is performed multiple times. So it is not surprising that running it only once (and counting the analysis phase in) results in a higher crossover point before this operation becomes more effective in sparse rather than in dense format.
