Best way to write a large array to file in Fortran? Text vs. other


Problem description

    I want to know the best way to write a large Fortran array (5000 x 5000 single-precision real numbers) to a file. I am trying to save the results of a numerical calculation for later use so that the calculation does not need to be repeated. By my calculation, 5000 x 5000 x 4 bytes per number is 100 MB; is it possible to save this in a form that is only 100 MB? Is there a way to save Fortran arrays as a binary file and read them back in for later use?

    I've noticed that saving numbers to a text file produces a file much larger than the size of the data type being saved. Is this because the numbers are being saved as characters?

    The only way I am familiar with to write to file is

    open (unit=41, file='outfile.txt')
    
    do  i=1,len
        do j=1,len
    
            write(41,*) Array(i,j)
        end do
    end do
    

    Although I'd imagine there is a better way to do it. If anyone could point me to some resources or examples to improve my ability to write and read larger files efficiently (in terms of memory), that would be great. Thanks!

    Solution

    Write data files in binary, unless you're going to actually be reading the output - and you're not going to be reading a 25 million-element array.

    The reasons for using binary are threefold, in decreasing importance:

    • Accuracy
    • Performance
    • Data size

    Accuracy concerns may be the most obvious. When you are converting a (binary) floating point number to a string representation of the decimal number, you are inevitably going to truncate at some point. That's ok if you are sure that when you read the text value back into a floating point value, you are certainly going to get the same value; but that is actually a subtle question and requires choosing your format carefully. Using default formatting, various compilers perform this task with varying degrees of quality. This blog post, written from the point of view of a games programmer, does a good job of covering the issues.

    Let's consider a little program which, for a variety of formats, writes a single-precision real number out to a string, and then reads it back in again, keeping track of the maximum error it encounters. We'll just go from 0 to 1, in units of machine epsilon. The code follows:

    program testaccuracy
    
        character(len=128) :: teststring
        integer, parameter :: nformats=4
        character(len=20), parameter :: formats(nformats) =   &
            [ '( E11.4)', '( E13.6)', '( E15.8)', '(E17.10)' ]
        real, dimension(nformats) :: errors
    
        real :: output, back
        real, parameter :: delta=epsilon(output)
        integer :: i
    
        errors = 0
        output = 0
        do while (output < 1)
            do i=1,nformats
                write(teststring,FMT=formats(i)) output
                read(teststring,*) back
                if (abs(back-output) > errors(i)) errors(i) = abs(back-output)
            enddo
            output = output + delta
        end do
    
        print *, 'Maximum errors: '
        print *, formats
        print *, errors
    
        print *, 'Trying with default format: '
    
        errors = 0
        output = 0
        do while (output < 1)
            write(teststring,*) output
            read(teststring,*) back
            if (abs(back-output) > errors(1)) errors(1) = abs(back-output)
            output = output + delta
        end do
    
        print *, 'Error = ', errors(1)
    
    end program testaccuracy
    

    and when we run it, we get:

    $ ./accuracy 
     Maximum errors: 
     ( E11.4)            ( E13.6)            ( E15.8)            (E17.10)            
      5.00082970E-05  5.06639481E-07  7.45058060E-09   0.0000000    
     Trying with default format: 
     Error =   7.45058060E-09
    

    Note that even using a format with 8 digits after the decimal point - which we might think would be plenty, given that single precision reals are only accurate to 6-7 significant decimal digits - we don't get exact copies back; the values are off by approximately 1e-8. And this compiler's default format does not give us accurate round-trip floating point values either; some error is introduced! If you're a video-game programmer, that level of accuracy may well be enough. If you're doing time-dependent simulations of turbulent fluids, however, that might absolutely not be ok, particularly if there's some bias to where the error is introduced, or if the error occurs in what is supposed to be a conserved quantity.
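The gap between the E15.8 and E17.10 results can be probed directly. An IEEE single-precision value needs at most 9 significant decimal digits to survive a text round trip, provided the runtime rounds correctly in both directions. The sketch below is not part of the original test; the program name, the sample count, and the use of random_number in place of a full sweep are assumptions made to keep it quick.

```fortran
program roundtrip_digits
    implicit none
    integer, parameter :: nsamples = 200000   ! illustrative sample count
    character(len=32) :: buffer
    real :: x, back, maxerr8, maxerr9
    integer :: i

    maxerr8 = 0.0
    maxerr9 = 0.0
    do i = 1, nsamples
        call random_number(x)                 ! uniform in [0,1)

        ! ES16.7: 1 digit before the point + 7 after = 8 significant digits
        write(buffer, '(ES16.7)') x
        read(buffer, *) back
        maxerr8 = max(maxerr8, abs(back - x))

        ! ES16.8: 1 digit before the point + 8 after = 9 significant digits
        write(buffer, '(ES16.8)') x
        read(buffer, *) back
        maxerr9 = max(maxerr9, abs(back - x))
    end do

    print '(a,es12.4)', 'max error, 8 significant digits: ', maxerr8
    print '(a,es12.4)', 'max error, 9 significant digits: ', maxerr9
    if (maxerr9 == 0.0) print '(a)', 'round trip exact with 9 digits'
end program roundtrip_digits
```

If text output is unavoidable, a format with 9 significant digits such as ES16.8 is therefore a safer choice for default reals than list-directed output, at the cost of longer lines.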

    If you try running this code, you'll notice that it takes a surprisingly long time to finish. That's because performance is another real issue with text output of floating point numbers. Consider the following simple program, which just writes out your example of a 5000 × 5000 real array as text and as unformatted binary:

    program testarray
        implicit none
        integer, parameter :: asize=5000
        real, dimension(asize,asize) :: array
    
        integer :: i, j
        integer :: time, u
    
        forall (i=1:asize, j=1:asize) array(i,j)=i*asize+j
    
        call tick(time)
        open(newunit=u,file='test.txt')
        do i=1,asize
            write(u,*) (array(i,j), j=1,asize)
        enddo
        close(u)
        print *, 'ASCII: time = ', tock(time)
    
        call tick(time)
        open(newunit=u,file='test.dat',form='unformatted')
        write(u) array
        close(u)
        print *, 'Binary: time = ', tock(time)
    
    
    contains
        subroutine tick(t)
            integer, intent(OUT) :: t
            call system_clock(t)
        end subroutine tick
    
        ! returns time in seconds from now to time described by t 
        real function tock(t)
            integer, intent(in) :: t
            integer :: now, clock_rate
            call system_clock(now,clock_rate)
            tock = real(now - t)/real(clock_rate)
        end function tock
    
    end program testarray
    

    Here are the timing outputs, for writing to disk or to ramdisk:

    Disk:
     ASCII: time =    41.193001    
     Binary: time =   0.11700000    
    Ramdisk
     ASCII: time =    40.789001    
     Binary: time =   5.70000000E-02
    

    Note that when writing to disk, the binary output is 352 times as fast as ASCII, and to ramdisk it's closer to 700 times. There are two reasons for this - one is that you can write out data all at once, rather than having to loop; the other is that generating the string decimal representation of a floating point number is a surprisingly subtle operation which requires a significant amount of computing for each value.

    Finally, there is data size; the text file in the above example comes out (on my system) to about 4 times the size of the binary file.
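The question also asks how to read the array back and whether the file can be exactly as large as the data. A minimal sketch follows, assuming a Fortran 2003 compiler; the file name array.bin and the reduced size n = 1000 are illustrative choices. It uses access='stream', which writes the raw bytes only, whereas the default sequential unformatted access typically adds a few bytes of record markers.

```fortran
program binary_roundtrip
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none
    integer, parameter :: n = 1000     ! set to 5000 for the array in the question
    real, allocatable :: a(:,:), b(:,:)
    integer :: u, i, j
    integer(int64) :: fsize, expected

    allocate(a(n,n), b(n,n))
    do j = 1, n
        do i = 1, n
            a(i,j) = real(i)*n + real(j)
        end do
    end do

    ! Write the whole array in one statement, with no record markers.
    open(newunit=u, file='array.bin', form='unformatted', access='stream', &
         status='replace', action='write')
    write(u) a
    close(u)

    ! Compare the size on disk with n*n*(bytes per element).
    inquire(file='array.bin', size=fsize)
    expected = int(n, int64) * n * (storage_size(a) / 8)
    print '(a,i0,a,i0)', 'file size in bytes: ', fsize, ', expected: ', expected

    ! Read it back; the reader must already know the type and shape.
    open(newunit=u, file='array.bin', form='unformatted', access='stream', &
         status='old', action='read')
    read(u) b
    close(u)

    if (all(a == b)) then
        print '(a)', 'read-back matches exactly'
    else
        print '(a)', 'read-back differs'
    end if
end program binary_roundtrip
```

For the 5000 x 5000 case the expected size is 100,000,000 bytes, which is about 95.4 MiB and agrees with the 96M that du reports for the binary file.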

    Now, there are real problems with binary output. In particular, raw Fortran (or, for that matter, C) binary output is very brittle. If you change platforms, or your data size changes, your output may no longer be any good. Adding new variables to the output will break the file format unless you always add new data at the end of the file, and you have no way of knowing ahead of time what variables are in a binary blob you get from your collaborator (who might be you, three months ago). Most of the downsides of binary output are avoided by using libraries like NetCDF, which write self-describing binary files that are much more "future proof" than raw binary. Better still, since it's a standard, many tools read NetCDF files.
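To make the brittleness concrete: a raw binary file carries no record of its own shape or element size, so the reader has to guess. The sketch below shows a hand-rolled header as a stopgap; it is not a substitute for NetCDF, and the module name tagged_binary, the routine names, and the magic value are all hypothetical. A file written with the opposite byte order would fail the tag check rather than be converted.

```fortran
module tagged_binary
    implicit none
    private
    public :: write_tagged, read_tagged
    integer, parameter :: magic = 20240101    ! hypothetical format tag
contains
    subroutine write_tagged(filename, a)
        character(len=*), intent(in) :: filename
        real, intent(in) :: a(:,:)
        integer :: u
        open(newunit=u, file=filename, form='unformatted', access='stream', &
             status='replace', action='write')
        ! Header: tag, bytes per element, then the two extents.
        write(u) magic, storage_size(a)/8, size(a,1), size(a,2)
        write(u) a
        close(u)
    end subroutine write_tagged

    subroutine read_tagged(filename, a, ok)
        character(len=*), intent(in) :: filename
        real, allocatable, intent(out) :: a(:,:)
        logical, intent(out) :: ok
        integer :: u, tag, nbytes, n1, n2, ios
        ok = .false.
        open(newunit=u, file=filename, form='unformatted', access='stream', &
             status='old', action='read', iostat=ios)
        if (ios /= 0) return
        read(u, iostat=ios) tag, nbytes, n1, n2
        ! Refuse files with a different tag or element size.
        if (ios == 0 .and. tag == magic .and. nbytes == storage_size(1.0)/8) then
            allocate(a(n1,n2))
            read(u, iostat=ios) a
            ok = (ios == 0)
        end if
        close(u)
    end subroutine read_tagged
end module tagged_binary

program tagged_demo
    use tagged_binary
    implicit none
    real :: a(3,4)
    real, allocatable :: b(:,:)
    logical :: ok
    integer :: i, j

    do j = 1, 4
        do i = 1, 3
            a(i,j) = 10.0*i + j
        end do
    end do

    call write_tagged('tagged.bin', a)
    call read_tagged('tagged.bin', b, ok)
    if (ok) then
        print '(a,i0,a,i0)', 'read array of shape ', size(b,1), ' x ', size(b,2)
        if (all(a == b)) print '(a)', 'contents match'
    else
        print '(a)', 'file rejected'
    end if
end program tagged_demo
```

Every extra field (variable names, units, more arrays) means extending this header by hand, which is exactly the bookkeeping a self-describing format takes over.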

    There are many NetCDF tutorials on the internet; ours is here. A simple example using NetCDF gives similar times to raw binary:

    $ ./array 
     ASCII: time =    40.676998    
     Binary: time =   4.30000015E-02
     NetCDF: time =   0.16000000  
    

    but gives you a nice self-describing file:

    $ ncdump -h test.nc
    netcdf test {
    dimensions:
        X = 5000 ;
        Y = 5000 ;
    variables:
        float Array(Y, X) ;
            Array:units = "ergs" ;
    }
    

    and file sizes about the same as raw binary:

    $ du -sh test.*
    96M test.dat
    96M test.nc
    382M    test.txt
    

    The code follows:

    program testarray
        implicit none
        integer, parameter :: asize=5000
        real, dimension(asize,asize) :: array
    
        integer :: i, j
        integer :: time, u
    
        forall (i=1:asize, j=1:asize) array(i,j)=i*asize+j
    
        call tick(time)
        open(newunit=u,file='test.txt')
        do i=1,asize
            write(u,*) (array(i,j), j=1,asize)
        enddo
        close(u)
        print *, 'ASCII: time = ', tock(time)
    
        call tick(time)
        open(newunit=u,file='test.dat',form='unformatted')
        write(u) array
        close(u)
        print *, 'Binary: time = ', tock(time)
    
        call tick(time)
        call writenetcdffile(array)
        print *, 'NetCDF: time = ', tock(time)
    
    
    contains
        subroutine tick(t)
            integer, intent(OUT) :: t
            call system_clock(t)
        end subroutine tick
    
        ! returns time in seconds from now to time described by t 
        real function tock(t)
            integer, intent(in) :: t
            integer :: now, clock_rate
            call system_clock(now,clock_rate)
            tock = real(now - t)/real(clock_rate)
        end function tock
    
        subroutine writenetcdffile(array)
            use netcdf
            implicit none
            real, intent(IN), dimension(:,:) :: array
    
            integer :: file_id, xdim_id, ydim_id
            integer :: array_id
            integer, dimension(2) :: arrdims
            character(len=*), parameter :: arrunit = 'ergs'
    
            integer :: i, j
            integer :: ierr
    
            i = size(array,1)
            j = size(array,2)
    
            ! create the file
            ierr = nf90_create(path='test.nc', cmode=NF90_CLOBBER, ncid=file_id)
    
            ! define the dimensions
            ierr = nf90_def_dim(file_id, 'X', i, xdim_id)
            ierr = nf90_def_dim(file_id, 'Y', j, ydim_id)
    
            ! now that the dimensions are defined, we can define variables on them,...
            arrdims = (/ xdim_id, ydim_id /)
            ierr = nf90_def_var(file_id, 'Array',  NF90_REAL, arrdims, array_id)
    
            ! ...and assign units to them as an attribute 
            ierr = nf90_put_att(file_id, array_id, "units", arrunit)
    
            ! done defining
            ierr = nf90_enddef(file_id)
    
            ! Write out the values
            ierr = nf90_put_var(file_id, array_id, array)
    
            ! close; done
            ierr = nf90_close(file_id)
            return
        end subroutine writenetcdffile
    end program testarray
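The listing above stores every return code in ierr but never inspects it, and it only writes. A sketch of the read-back side with error checking follows, assuming the netcdf Fortran 90 module is installed and linked; the program name, the file name small.nc, the 4 x 3 test array, and the check helper are illustrative and not part of the original answer. The nf90_* calls are the standard NetCDF Fortran 90 interface.

```fortran
program netcdf_roundtrip
    use netcdf
    implicit none
    integer, parameter :: nx = 4, ny = 3      ! small illustrative array
    real :: array(nx, ny)
    real, allocatable :: back(:,:)
    integer :: file_id, xdim_id, ydim_id, array_id
    integer :: nx_in, ny_in, i, j

    do j = 1, ny
        do i = 1, nx
            array(i,j) = real(i*ny + j)
        end do
    end do

    ! Write a small file, checking every return code.
    call check( nf90_create('small.nc', NF90_CLOBBER, file_id) )
    call check( nf90_def_dim(file_id, 'X', nx, xdim_id) )
    call check( nf90_def_dim(file_id, 'Y', ny, ydim_id) )
    call check( nf90_def_var(file_id, 'Array', NF90_REAL, &
                             [xdim_id, ydim_id], array_id) )
    call check( nf90_enddef(file_id) )
    call check( nf90_put_var(file_id, array_id, array) )
    call check( nf90_close(file_id) )

    ! Read it back, taking the shape from the file instead of hard-coding it.
    call check( nf90_open('small.nc', NF90_NOWRITE, file_id) )
    call check( nf90_inq_dimid(file_id, 'X', xdim_id) )
    call check( nf90_inquire_dimension(file_id, xdim_id, len=nx_in) )
    call check( nf90_inq_dimid(file_id, 'Y', ydim_id) )
    call check( nf90_inquire_dimension(file_id, ydim_id, len=ny_in) )
    allocate(back(nx_in, ny_in))
    call check( nf90_inq_varid(file_id, 'Array', array_id) )
    call check( nf90_get_var(file_id, array_id, back) )
    call check( nf90_close(file_id) )

    print '(a,i0,a,i0,a)', 'read back ', nx_in, ' x ', ny_in, ' array'
    if (all(back == array)) print '(a)', 'contents match'

contains
    ! Stop with the library's own message if a call failed.
    subroutine check(status)
        integer, intent(in) :: status
        if (status /= NF90_NOERR) then
            print '(a)', trim(nf90_strerror(status))
            error stop 1
        end if
    end subroutine check
end program netcdf_roundtrip
```

Because the dimensions are looked up by name, the same reading code should keep working if the array size changes or further variables are added to the file later.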
    
