Parallelization of Jacobi algorithm using Eigen C++ and OpenMP

Problem description

I have implemented the Jacobi algorithm based on the routine described in the book Numerical Recipes, but since I plan to work with very large matrices I am trying to parallelize it using OpenMP.

#include <cmath>
#include <Eigen/Dense>

using namespace Eigen;

// Apply a single Jacobi rotation to the element pair (i,j), (k,l) of a.
void ROTATE(MatrixXd &a, int i, int j, int k, int l, double s, double tau)
{
    double g,h;
    g=a(i,j);
    h=a(k,l);
    a(i,j)=g-s*(h+g*tau);
    a(k,l)=h+s*(g-h*tau);
}

// Jacobi eigenvalue routine for a symmetric n-by-n matrix a:
// on return, d holds the eigenvalues and the columns of v the eigenvectors.
void jacobi(int n, MatrixXd &a, MatrixXd &v, VectorXd &d)
{
    int j,iq,ip,i;
    double tresh,theta,tau,t,sm,s,h,g,c;

    VectorXd b(n);
    VectorXd z(n);

    v.setIdentity();
    z.setZero();

    #pragma omp parallel for
    for (ip=0;ip<n;ip++)
    {
        d(ip)=a(ip,ip);
        b(ip)=d(ip);
    }

    for (i=0;i<50;i++)
    {
        // Sum of off-diagonal magnitudes: the convergence measure.
        sm=0.0;
        for (ip=0;ip<n-1;ip++)
        {
            #pragma omp parallel for reduction (+:sm)
            for (iq=ip+1;iq<n;iq++)
                sm += fabs(a(ip,iq));
        }
        if (sm == 0.0) {
            break;
        }

        if (i < 3)
            tresh=0.2*sm/(n*n);
        else
            tresh=0.0;

        // iq and j are also private: every thread needs its own loop counters.
        #pragma omp parallel for private (ip,iq,j,g,h,t,theta,c,s,tau)
        for (ip=0;ip<n-1;ip++)
        {
        //#pragma omp parallel for private (g,h,t,theta,c,s,tau)
            for (iq=ip+1;iq<n;iq++)
            {
                g=100.0*fabs(a(ip,iq));
                if (i > 3 && (fabs(d(ip))+g) == fabs(d(ip)) && (fabs(d(iq))+g) == fabs(d(iq)))
                    a(ip,iq)=0.0;
                else if (fabs(a(ip,iq)) > tresh)
                {
                    h=d(iq)-d(ip);
                    if ((fabs(h)+g) == fabs(h))
                    {
                        t=(a(ip,iq))/h;
                    }
                    else
                    {
                        theta=0.5*h/(a(ip,iq));
                        t=1.0/(fabs(theta)+sqrt(1.0+theta*theta));
                        if (theta < 0.0)
                        {
                            t = -t;
                        }
                    }
                    // Rotation parameters, needed for either branch above.
                    c=1.0/sqrt(1+t*t);
                    s=t*c;
                    tau=s/(1.0+c);
                    h=t*a(ip,iq);

                    #pragma omp critical
                    {
                        z(ip)=z(ip)-h;
                        z(iq)=z(iq)+h;
                        d(ip)=d(ip)-h;
                        d(iq)=d(iq)+h;
                        a(ip,iq)=0.0;

                        for (j=0;j<ip;j++)
                            ROTATE(a,j,ip,j,iq,s,tau);
                        for (j=ip+1;j<iq;j++)
                            ROTATE(a,ip,j,j,iq,s,tau);
                        for (j=iq+1;j<n;j++)
                            ROTATE(a,ip,j,iq,j,s,tau);
                        for (j=0;j<n;j++)
                            ROTATE(v,j,ip,j,iq,s,tau);
                    }
                }
            }
        }

        // End of sweep: fold the accumulated corrections into d and reset z.
        for (ip=0;ip<n;ip++)
        {
            b(ip)=b(ip)+z(ip);
            d(ip)=b(ip);
            z(ip)=0.0;
        }
    }
}

I wanted to parallelize the loop that does most of the calculations, and these two pragmas:

 //#pragma omp parallel for private (ip,g,h,t,theta,c,s,tau)
 //#pragma omp parallel for private (g,h,t,theta,c,s,tau)

are my attempts at it. Unfortunately both of them end up producing incorrect results. I suspect the problem may be in this block:

z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;

because usually this sort of accumulation would need a reduction, but since each thread accesses a different part of the array, I am not certain whether that applies here.
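
If a reduction were needed, I imagine it would look something like this sketch (plain arrays and placeholder names, assuming OpenMP 4.5 array-section reductions, which as far as I know do not apply directly to Eigen objects):

void accumulate(double *z, const double *h,
                const int *ip, const int *iq, int m, int n)
{
    // Each thread gets a private copy of z[0:n], combined after the loop.
    #pragma omp parallel for reduction(+:z[0:n])
    for (int k = 0; k < m; k++)
    {
        z[ip[k]] -= h[k];
        z[iq[k]] += h[k];
    }
}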

I am not really sure if I am doing the parallelization correctly, because I have only recently started working with OpenMP, so any suggestions or recommendations would also be welcome.

Sidenote: I know there are faster algorithms for eigenvalue and eigenvector determination, including the SelfAdjointEigenSolver in Eigen, but those are not giving me the precision I need in the eigenvectors, and this algorithm is.
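
For reference, a minimal usage sketch of the SelfAdjointEigenSolver I tried (the small random symmetric test matrix is just a placeholder):

#include <iostream>
#include <Eigen/Dense>
using namespace Eigen;

int main()
{
    MatrixXd tmp = MatrixXd::Random(4, 4);
    MatrixXd m = tmp + tmp.transpose();      // small symmetric test matrix

    SelfAdjointEigenSolver<MatrixXd> es(m);  // solves on construction
    std::cout << "eigenvalues:\n"  << es.eigenvalues()  << "\n";
    std::cout << "eigenvectors:\n" << es.eigenvectors() << "\n";
}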

My thanks in advance.

Edit: I considered the correct answer to be the one provided by The Quantum Physicist, because what I did does not reduce the computation time for systems of sizes up to 4096x4096. In any case I corrected the code in order to make it work, and maybe for big enough systems it could be of some use. I would advise the use of timers to test whether the

#pragma omp for

actually decreases the computation time.
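
For example, a minimal timing sketch with omp_get_wtime, assuming the jacobi routine above is in scope (the matrix size and the random symmetric test matrix are just placeholders):

#include <iostream>
#include <omp.h>
#include <Eigen/Dense>
using namespace Eigen;

int main()
{
    const int n = 1024;                       // placeholder size
    MatrixXd tmp = MatrixXd::Random(n, n);
    MatrixXd a = tmp + tmp.transpose();       // symmetric test matrix
    MatrixXd v(n, n);
    VectorXd d(n);

    double start = omp_get_wtime();           // OpenMP wall-clock timer
    jacobi(n, a, v, d);                       // the routine above
    double elapsed = omp_get_wtime() - start;
    std::cout << "jacobi took " << elapsed << " s\n";
}

Comparing runs with OMP_NUM_THREADS=1 against several threads shows whether the pragmas actually pay off.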

Solution

I'll try to help, but I'm not sure this is the answer to your question.

There are tons of problems with your code. My friendly advice for you is: Don't do parallel things if you don't understand the implications of what you're doing.

For some reason, you seem to think that putting #pragma omp parallel for on everything will make it faster. This is VERY wrong, because spawning threads is an expensive thing to do and costs (relatively) lots of memory and time. So if you redo that #pragma omp parallel for for every loop, you'll respawn the threads for every loop, which will significantly reduce the speed of your program... UNLESS your matrices are REALLY huge and the computation time is >> the cost of spawning the threads.
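
To illustrate with a toy sketch (the loops here are hypothetical): spawn the team once and reuse it for several work-sharing loops, instead of opening a new parallel region per loop:

void process(double *x, double *y, long n)
{
    #pragma omp parallel    // the thread team is created once here...
    {
        #pragma omp for     // ...and reused by each work-sharing loop
        for (long i = 0; i < n; i++)
            x[i] *= 2.0;

        #pragma omp for     // no new threads are spawned for this loop
        for (long i = 0; i < n; i++)
            y[i] += x[i];
    }
}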

I ran into a similar issue when I wanted to multiply huge matrices element-wise (I then needed the sum for some expectation value in quantum mechanics). To use OpenMP for that, I had to flatten the matrices into linear arrays, distribute the array chunks among the threads, and then run a for loop in which every iteration uses elements that are guaranteed to be independent of the others, and I let them all evolve independently. This was quite fast. Why? Because I never had to respawn the threads.
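
Schematically, it looked something like this (a simplified sketch with placeholder names, not my actual code):

#include <vector>

// Element-wise product of two flattened matrices, plus the sum of the
// products; every iteration touches elements no other iteration touches.
double multiply_and_sum(const std::vector<double> &a,
                        const std::vector<double> &b,
                        std::vector<double> &out)   // same size as a and b
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long k = 0; k < (long)a.size(); k++)
    {
        out[k] = a[k] * b[k];
        sum += out[k];
    }
    return sum;
}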

Why are you getting wrong results? I believe the reason is that you're not respecting the shared memory rules. You have some variable(s) being modified by multiple threads simultaneously. It's hiding somewhere, and you have to find it! For example, what does z(ip) do? Does it hand out a reference? What I see here:

z(ip)=z(ip)-h;
z(iq)=z(iq)+h;
d(ip)=d(ip)-h;
d(iq)=d(iq)+h;

Looks VERY thread-unsafe, and I don't understand what you're doing. Are you returning a reference that you then modify? That's a recipe for thread-unsafety. Why don't you create clean arrays and deal with those instead?
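
For instance, a hedged sketch (placeholder names): give every thread its own clean array and merge once at the end, instead of handing out references into shared storage:

#include <vector>

void accumulate(std::vector<double> &z, const std::vector<double> &h,
                const std::vector<int> &ip, const std::vector<int> &iq)
{
    #pragma omp parallel
    {
        std::vector<double> local(z.size(), 0.0);   // clean private array

        #pragma omp for
        for (long k = 0; k < (long)h.size(); k++)
        {
            local[ip[k]] -= h[k];   // no other thread touches `local`
            local[iq[k]] += h[k];
        }

        #pragma omp critical        // merge each private array exactly once
        for (long i = 0; i < (long)z.size(); i++)
            z[i] += local[i];
    }
}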

How to debug: start with a small example (a 2x2 matrix, maybe) and only 2 threads, and try to understand what's going on. Use a debugger, define break points, and check what information is shared between the threads.

Also consider using a mutex to check which data gets corrupted when it becomes shared. Here is how to do it.
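
A minimal sketch of the idea with std::mutex (placeholder names):

#include <mutex>
#include <vector>

std::mutex m;
double shared_total = 0.0;   // shared between all threads

void accumulate(const std::vector<double> &chunk)
{
    double local = 0.0;
    for (double x : chunk)
        local += x;          // thread-private work needs no lock

    std::lock_guard<std::mutex> lock(m);   // guard the shared update
    shared_total += local;
}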

My recommendation: don't use OpenMP unless you plan to spawn the threads ONLY ONCE. I actually believe that OpenMP is going to die very soon because of C++11. OpenMP was beautiful back when C++ didn't have any native multi-threading support. So learn how to use std::thread, and use it; and if you need to run many things in threads, learn how to create a thread pool with std::thread. This is a good book for learning multithreading.
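
As a starting point, here is a minimal std::thread sketch (a hypothetical worker; each thread owns a disjoint slice, and the threads are spawned exactly once):

#include <functional>
#include <thread>
#include <vector>

// Scale a disjoint slice of the vector; slices share no elements.
void scale(std::vector<double> &v, long begin, long end, double f)
{
    for (long i = begin; i < end; i++)
        v[i] *= f;
}

int main()
{
    std::vector<double> data(1 << 20, 1.0);
    unsigned nthreads = std::thread::hardware_concurrency();
    if (nthreads == 0) nthreads = 2;          // fallback if unknown

    std::vector<std::thread> workers;
    long chunk = (long)data.size() / nthreads;
    for (unsigned t = 0; t < nthreads; t++)
    {
        long begin = t * chunk;
        long end = (t + 1 == nthreads) ? (long)data.size() : begin + chunk;
        workers.emplace_back(scale, std::ref(data), begin, end, 2.0);
    }
    for (auto &w : workers)
        w.join();                             // spawned once, joined once
}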
