OpenMPI strange output error

Question

I am using OpenMPI 1.3 on a small cluster.

This is the function I am calling:

void invertColor_Parallel(struct image *im, int size, int rank)
{
     int i,j,aux,r;

     int total_pixels = (*im).ih.width * (*im).ih.height;
     int qty = total_pixels/(size-1);
     int rest = total_pixels % (size-1);
     MPI_Status status;

     //printf("\n%d\n", rank);

     if(rank == 0)
     {
         for(i=1; i<size; i++){
             j = i*qty - qty;
             aux = j;

             if(rest != 0 && i==size-1) {qty=qty+rest;} // to distribute the whole load
             //printf("\nj: %d  qty: %d  rest: %d\n", j, qty, rest);

             MPI_Send(&aux, 1, MPI_INT, i, MASTER_TO_SLAVE_TAG+1, MPI_COMM_WORLD);
             MPI_Send(&qty, 1, MPI_INT, i, MASTER_TO_SLAVE_TAG+2, MPI_COMM_WORLD);

             MPI_Send(&(*im).array[j], qty*3, MPI_BYTE, i, MASTER_TO_SLAVE_TAG, MPI_COMM_WORLD);
         }
     }
     else
     {
        MPI_Recv(&aux, 1, MPI_INT, MPI_ANY_SOURCE, MASTER_TO_SLAVE_TAG+1, MPI_COMM_WORLD,&status);
        MPI_Recv(&qty, 1, MPI_INT, MPI_ANY_SOURCE, MASTER_TO_SLAVE_TAG+2, MPI_COMM_WORLD,&status);

        pixel *arreglo = (pixel *)calloc(qty, sizeof(pixel));
        MPI_Recv(&arreglo[0], qty*3, MPI_BYTE, MPI_ANY_SOURCE, MASTER_TO_SLAVE_TAG, MPI_COMM_WORLD,&status);
        //printf("Receiving node=%d, message=%d\n", rank, aux);

        for(i=0;i<qty;i++)
        {
            arreglo[i].R = 255-arreglo[i].R;
            arreglo[i].G = 255-arreglo[i].G;
            arreglo[i].B = 255-arreglo[i].B;
        }

        MPI_Send(&aux, 1, MPI_INT, 0, SLAVE_TO_MASTER_TAG+1, MPI_COMM_WORLD);
        MPI_Send(&qty, 1, MPI_INT, 0, SLAVE_TO_MASTER_TAG+2, MPI_COMM_WORLD);
        MPI_Send(&arreglo[0], qty*3, MPI_BYTE, 0, SLAVE_TO_MASTER_TAG, MPI_COMM_WORLD);

        free(arreglo);
     }


    if (rank==0){
        //printf("\nrank: %d\n", rank);
        for (i=1; i<size; i++) // until all slaves have handed back the processed data
        {
            MPI_Recv(&aux, 1, MPI_INT, MPI_ANY_SOURCE, SLAVE_TO_MASTER_TAG+1, MPI_COMM_WORLD,&status);
            MPI_Recv(&qty, 1, MPI_INT, MPI_ANY_SOURCE, SLAVE_TO_MASTER_TAG+2, MPI_COMM_WORLD,&status);
            MPI_Recv(&(*im).array[aux], qty*3, MPI_BYTE, MPI_ANY_SOURCE, SLAVE_TO_MASTER_TAG, MPI_COMM_WORLD,&status);
        }
    }
}

int main(int argc, char *argv[])
{
    //////////time counter
    clock_t begin;


    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Status status;

    int op = (int)atof(argv[1]);
    char filename_toload[50];
    int bright_number=0;
    struct image image2;

    if (rank==0)
    {
        loadImage(&image2, argv[2]);
    }

    //Broadcast the user's choice to all other ranks
    MPI_Bcast(&op, 1, MPI_INT, 0, MPI_COMM_WORLD);


    switch(op)
    {
        case 1:
                if (rank==0) {begin = clock();}
                MPI_Barrier(MPI_COMM_WORLD);
                invertColor_Parallel(&image2, size, rank);
                MPI_Barrier(MPI_COMM_WORLD);
                if (rank==0) {runningTime(begin, clock()); printf("Se invirtieron los colores de la imagen\n\n");}
                break;
    }

    MPI_Barrier(MPI_COMM_WORLD);

    if (rank==0)
    {
        saveImage(&image2, argv[3]);
        free(image2.array);
    }

    MPI_Finalize();

    return 0;
}

And sometimes I get the following error:

cluster@maestro:/mpi$ mpirun -np 60 -hostfile /home/hostfile paralelo 1 image.bmp out.bmp

cluster@nodo1's password:
[maestro:5194] *** An error occurred in MPI_Recv
[maestro:5194] *** on communicator MPI_COMM_WORLD
[maestro:5194] *** MPI_ERR_TRUNCATE: message truncated
[maestro:5194] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 5194 on node maestro
exiting without calling "finalize". This may have caused other
processes in the application to be terminated by signals sent by
mpirun (as reported here).
--------------------------------------------------------------------------
[nodo1] [[49223,1],55][../../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

It depends on the process count whether it will throw an error or not; e.g. with -np 99 it works pretty well.

Any idea what is going on?

Answer

This code is probably the culprit:

if (rank==0){
    //printf("\nrank: %d\n", rank);
    for (i=1; i<size; i++) // until all slaves have handed back the processed data
    {
        MPI_Recv(&aux, 1, MPI_INT, MPI_ANY_SOURCE, SLAVE_TO_MASTER_TAG+1, MPI_COMM_WORLD,&status);
        MPI_Recv(&qty, 1, MPI_INT, MPI_ANY_SOURCE, SLAVE_TO_MASTER_TAG+2, MPI_COMM_WORLD,&status);
        MPI_Recv(&(*im).array[aux], qty*3, MPI_BYTE, MPI_ANY_SOURCE, SLAVE_TO_MASTER_TAG, MPI_COMM_WORLD,&status);
    }
}

Since you are (ab-)using MPI_ANY_SOURCE, you are essentially creating the perfect conditions for message reception races. It is entirely possible that the first MPI_Recv matches a message from rank i, the second one matches a message from rank j and the third one matches a message from rank k, where i, j, and k have completely different values. Therefore it is possible that you receive the wrong number of pixels into the wrong image slot. Also, if it happens that rank k sends more pixels than the value of qty from rank j specifies, you'll get a truncation error (and you are actually getting it). In this particular code only the last worker sends qty+rest pixels, so the abort occurs when that larger message gets matched against the smaller qty received from another rank; this also helps explain why the failure comes and goes with the process count, since rest (and with it the size mismatch) changes with the number of workers, and when rest is 0 no truncation can happen at all. A word of advice: never use MPI_ANY_SOURCE frivolously unless absolutely sure that the algorithm is correct and no races could occur.

Either rewrite the code as:

if (rank==0){
    //printf("\nrank: %d\n", rank);
    for (i=1; i<size; i++) // until all slaves have handed back the processed data
    {
        MPI_Recv(&aux, 1, MPI_INT, i, SLAVE_TO_MASTER_TAG+1, MPI_COMM_WORLD, &status);
        MPI_Recv(&qty, 1, MPI_INT, i, SLAVE_TO_MASTER_TAG+2, MPI_COMM_WORLD, &status);
        MPI_Recv(&(*im).array[aux], qty*3, MPI_BYTE, i, SLAVE_TO_MASTER_TAG, MPI_COMM_WORLD, &status);
    }
}

or even better as:

if (rank==0){
    //printf("\nrank: %d\n", rank);
    for (i=1; i<size; i++) // until all slaves have handed back the processed data
    {
        MPI_Recv(&aux, 1, MPI_INT, MPI_ANY_SOURCE, SLAVE_TO_MASTER_TAG+1,
                 MPI_COMM_WORLD, &status);
        MPI_Recv(&qty, 1, MPI_INT, status.MPI_SOURCE, SLAVE_TO_MASTER_TAG+2,
                 MPI_COMM_WORLD, &status);
        MPI_Recv(&(*im).array[aux], qty*3, MPI_BYTE, status.MPI_SOURCE, SLAVE_TO_MASTER_TAG,
                 MPI_COMM_WORLD, &status);
    }
}

That way the three receives will always get messages from the same process and the race condition will be eliminated. The way the second version works is that it first receives a message from any rank, but then uses the status.MPI_SOURCE field to get the actual rank and uses it for the following receives.
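
For completeness, the same collection loop can also be hardened against mismatched message sizes by probing each message before receiving it. The sketch below is only an illustration of that pattern; collectResults is a hypothetical helper, and struct image, pixel and SLAVE_TO_MASTER_TAG are assumed to be the definitions from the question:

#include <mpi.h>

// Minimal sketch of a probe-based collector for rank 0. It pins all receives
// for one worker to the rank found by the first probe and sizes the pixel
// receive from the actual incoming message, so neither the source race nor
// MPI_ERR_TRUNCATE can occur.
void collectResults(struct image *im, int size)
{
    int i, aux, qty, nbytes, src;
    MPI_Status status;

    for (i=1; i<size; i++)
    {
        // wait for any worker's offset message and remember who sent it
        MPI_Probe(MPI_ANY_SOURCE, SLAVE_TO_MASTER_TAG+1, MPI_COMM_WORLD, &status);
        src = status.MPI_SOURCE;

        // from here on, receive only from that worker
        MPI_Recv(&aux, 1, MPI_INT, src, SLAVE_TO_MASTER_TAG+1, MPI_COMM_WORLD, &status);
        MPI_Recv(&qty, 1, MPI_INT, src, SLAVE_TO_MASTER_TAG+2, MPI_COMM_WORLD, &status);

        // check the size of the pixel message before posting the receive
        MPI_Probe(src, SLAVE_TO_MASTER_TAG, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_BYTE, &nbytes);
        if (nbytes != qty*3)
            MPI_Abort(MPI_COMM_WORLD, 1);   // inconsistent message sizes

        MPI_Recv(&(*im).array[aux], nbytes, MPI_BYTE, src, SLAVE_TO_MASTER_TAG,
                 MPI_COMM_WORLD, &status);
    }
}

With this structure the count posted to the last MPI_Recv always matches the message that is actually about to arrive, so the truncation abort cannot be triggered no matter in which order the workers finish.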
