MPI_ERR_BUFFER when performing MPI I/O


Problem description



I am testing MPI I/O.

  subroutine save_vtk
    integer :: filetype, fh, unit
    integer(MPI_OFFSET_KIND) :: pos
    real(RP),allocatable :: buffer(:,:,:)
    integer :: ie

    if (master) then
      open(newunit=unit,file="out.vtk", &
           access='stream',status='replace',form="unformatted",action="write")
      ! write the header
      close(unit)
    end if

    call MPI_Barrier(mpi_comm,ie)

    call MPI_File_open(mpi_comm,"out.vtk", MPI_MODE_APPEND + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ie)

    call MPI_Type_create_subarray(3, int(ng), int(nxyz), int(off), &
       MPI_ORDER_FORTRAN, MPI_RP, filetype, ie)

    call MPI_type_commit(filetype, ie)

    call MPI_Barrier(mpi_comm,ie)
    call MPI_File_get_position(fh, pos, ie)
    call MPI_Barrier(mpi_comm,ie)

    call MPI_File_set_view(fh, pos, MPI_RP, filetype, "native", MPI_INFO_NULL, ie)

    buffer = BigEnd(Phi(1:nx,1:ny,1:nz))

    call MPI_File_write_all(fh, buffer, nx*ny*nz, MPI_RP, MPI_STATUS_IGNORE, ie)

    call MPI_File_close(fh, ie)

  end subroutine

The undefined variables come from host association, some error checking removed. I receive this error when running it on a national academic cluster:

*** An error occurred in MPI_Isend
*** reported by process [3941400577,18036219417246826496]
*** on communicator MPI COMMUNICATOR 20 DUP FROM 0
*** MPI_ERR_BUFFER: invalid buffer pointer
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

The error is triggered by the call to MPI_File_write_all. I suspect it may be connected with the size of the buffer, which is the full nx*ny*nz and is on the order of 10^5 to 10^6, but I cannot exclude a programming error on my side, as I have no prior experience with MPI I/O.

The MPI implementation used is OpenMPI 1.8.0, with Intel Fortran 14.0.2.

Do you know how to make it work and write the file?

--- Edit2 ---

Testing a simplified version, the important code remains the same; full source is here. Notice that it works with gfortran and fails with different MPIs with Intel. I wasn't able to compile it with PGI. Also, I was wrong in that it fails only on different nodes: it fails even in a single-process run.

>module ad gcc-4.8.1
>module ad openmpi-1.8.0-gcc
>mpif90 save.f90
>./a.out 
 Trying to decompose in           1           1           1 process grid.
>mpirun a.out
 Trying to decompose in           1           1           2 process grid.

>module rm openmpi-1.8.0-gcc
>module ad openmpi-1.8.0-intel
>mpif90 save.f90
>./a.out 
 Trying to decompose in           1           1           1 process grid.
 ERROR write_all
 MPI_ERR_IO: input/output error                                                 



>module rm openmpi-1.8.0-intel
>module ad openmpi-1.6-intel
>mpif90 save.f90
>./a.out 
 Trying to decompose in           1           1           1 process grid.
 ERROR write_all
 MPI_ERR_IO: input/output error                                                 



[luna24.fzu.cz:24260] *** An error occurred in MPI_File_set_errhandler
[luna24.fzu.cz:24260] *** on a NULL communicator
[luna24.fzu.cz:24260] *** Unknown error
[luna24.fzu.cz:24260] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason:     After MPI_FINALIZE was invoked
  Local host: luna24.fzu.cz
  PID:        24260
--------------------------------------------------------------------------
>module rm openmpi-1.6-intel
>module ad mpich2-intel
>mpif90 save.f90
>./a.out 
 Trying to decompose in           1           1           1 process grid.
 ERROR write_all
 Other I/O error , error stack:
ADIOI_NFS_WRITECONTIG(70): Other I/O error Bad address

Solution

In line

 buffer = BigEnd(Phi(1:nx,1:ny,1:nz))

the array buffer should be allocated automatically to the shape of the right-hand side according to the Fortran 2003 standard (this is not the case in Fortran 95). Intel Fortran, as of version 14, does not do this by default; it requires the option

-assume realloc_lhs

to do that. This option is also included (along with other options) in

-standard-semantics

Because this option was not in effect when the code in the question was tested, the program accessed an unallocated array, and the resulting undefined behavior led to the crash.
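
A portable workaround that does not depend on compiler flags is to allocate the array explicitly before the assignment, so the code no longer relies on automatic allocation on assignment. A minimal sketch, reusing buffer, nx, ny, nz, Phi and BigEnd exactly as they appear in the question's subroutine:

    ! Explicit allocation: valid under Fortran 95 semantics as well,
    ! so it does not depend on -assume realloc_lhs being in effect.
    if (.not. allocated(buffer)) allocate(buffer(nx, ny, nz))
    buffer = BigEnd(Phi(1:nx, 1:ny, 1:nz))

Alternatively, compiling with the flag, e.g. mpif90 -assume realloc_lhs save.f90, enables the Fortran 2003 behavior so the original assignment allocates buffer automatically.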
