Best way to write a large array to file in Fortran? Text vs. other

Question

I want to know the best way to write a large Fortran array (5000 x 5000 single-precision reals) to a file. I am trying to save the results of a numerical calculation for later use so that the calculation does not need to be repeated. At 4 bytes per number, 5000 x 5000 values come to 100 MB; is it possible to save the array in a form that takes only 100 MB? Is there a way to save Fortran arrays as a binary file and read them back in later?

I've noticed that saving numbers to a text file produces a file much larger than the size of the data type being saved. Is this because the numbers are being saved as characters?

The only way I am familiar with to write to file is

open (unit=41, file='outfile.txt')

do  i=1,len
    do j=1,len

        write(41,*) Array(i,j)
    end do
end do

I imagine there is a better way to do it, though. If anyone could point me to some resources or examples to improve my ability to write and read larger files efficiently (in terms of memory), that would be great. Thanks!

Recommended answer

Write data files in binary, unless you're going to actually be reading the output - and you're not going to be reading a 25 million-element array.
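As a minimal sketch of that advice, the module below writes a 2-D default-real array with unformatted stream access and reads it back. The names `array_io`, `save_array` and `load_array` are made up for illustration, and `access='stream'` assumes a Fortran 2003 compiler. The timing program further down uses the default sequential unformatted access instead, which behaves the same way for this purpose but typically adds small record markers to the file.

```fortran
module array_io
    implicit none
    private
    public :: save_array, load_array
contains

    ! Write a 2-D default-real array as raw bytes (no record markers).
    subroutine save_array(filename, a)
        character(len=*), intent(in) :: filename
        real, intent(in) :: a(:,:)
        integer :: u

        open(newunit=u, file=filename, form='unformatted', access='stream', &
             status='replace', action='write')
        write(u) a            ! one statement writes the whole array
        close(u)
    end subroutine save_array

    ! Read the array back. The caller must already know the shape,
    ! because the file holds nothing but the values themselves.
    subroutine load_array(filename, a)
        character(len=*), intent(in) :: filename
        real, intent(out) :: a(:,:)
        integer :: u

        open(newunit=u, file=filename, form='unformatted', access='stream', &
             status='old', action='read')
        read(u) a
        close(u)
    end subroutine load_array

end module array_io
```

A caller would `use array_io`, then `call save_array('results.bin', array)` after the computation and `call load_array('results.bin', array)` in a later run, with `array` declared at the same shape both times. For a 5000 x 5000 array of 4-byte reals the stream file should come out at 5000 x 5000 x 4 = 100,000,000 bytes, and the values read back are bit-for-bit the ones written, provided the same platform and real kind are used.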

The reasons for using binary are threefold, in decreasing order of importance:

  • Accuracy
  • Performance
  • Data size

Accuracy concerns may be the most obvious. When you are converting a (binary) floating point number to a string representation of the decimal number, you are inevitably going to truncate at some point. That's ok if you are sure that when you read the text value back into a floating point value, you are certainly going to get the same value; but that is actually a subtle question and requires choosing your format carefully. Using default formatting, various compilers perform this task with varying degrees of quality. This blog post, written from the point of view of a games programmer, does a good job of covering the issues.

Let's consider a little program which, for a variety of formats, writes a single-precision real number out to a string, and then reads it back in again, keeping track of the maximum error it encounters. We'll just go from 0 to 1, in units of machine epsilon. The code follows:

program testaccuracy

    character(len=128) :: teststring
    integer, parameter :: nformats=4
    character(len=20), parameter :: formats(nformats) =   &
        [ '( E11.4)', '( E13.6)', '( E15.8)', '(E17.10)' ]
    real, dimension(nformats) :: errors

    real :: output, back
    real, parameter :: delta=epsilon(output)
    integer :: i

    errors = 0
    output = 0
    do while (output < 1)
        do i=1,nformats
            write(teststring,FMT=formats(i)) output
            read(teststring,*) back
            if (abs(back-output) > errors(i)) errors(i) = abs(back-output)
        enddo
        output = output + delta
    end do

    print *, 'Maximum errors: '
    print *, formats
    print *, errors

    print *, 'Trying with default format: '

    errors = 0
    output = 0
    do while (output < 1)
        write(teststring,*) output
        read(teststring,*) back
        if (abs(back-output) > errors(1)) errors(1) = abs(back-output)
        output = output + delta
    end do

    print *, 'Error = ', errors(1)

end program testaccuracy

When we run it, we get:

$ ./accuracy 
 Maximum errors: 
 ( E11.4)            ( E13.6)            ( E15.8)            (E17.10)            
  5.00082970E-05  5.06639481E-07  7.45058060E-09   0.0000000    
 Trying with default format: 
 Error =   7.45058060E-09

Note that even using a format with 8 digits after the decimal place - which we might think would be plenty, given that single precision reals are only accurate to 6-7 decimal places - we don't get exact copies back; the values are off by approximately 1e-8. And this compiler's default format does not give us accurate round-trip floating point values either; some error is introduced! If you're a video-game programmer, that level of accuracy may well be enough. If you're doing time-dependent simulations of turbulent fluids, however, that might absolutely not be ok, particularly if there's some bias to where the error is introduced, or if the error occurs in what is supposed to be a conserved quantity.
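One way to see why 8 digits falls short: the `Ew.d` edit descriptor prints a leading zero and puts all `d` digits after the decimal point, so `E15.8` carries 8 significant digits, whereas an IEEE single-precision value needs 9 significant decimal digits to be recovered exactly. The sketch below counts round-trip failures for a given format. The module and function names are hypothetical, the test values are pseudo-random rather than the exhaustive sweep used above, and the result relies on the compiler's runtime rounding correctly in both directions.

```fortran
module roundtrip
    implicit none
contains

    ! Count how many of n pseudo-random reals in [0,1) fail to survive
    ! being written to a string with format fmt and read back again.
    integer function count_failures(fmt, n) result(nfail)
        character(len=*), intent(in) :: fmt
        integer, intent(in) :: n
        character(len=64) :: buf
        real :: x, back
        integer :: i

        nfail = 0
        do i = 1, n
            call random_number(x)
            write(buf, fmt) x        ! real -> decimal text
            read(buf, *) back        ! decimal text -> real
            if (back /= x) nfail = nfail + 1
        end do
    end function count_failures

end module roundtrip
```

Calling `count_failures('(E15.8)', 100000)` should report some failures, while `count_failures('(ES16.8)', 100000)` should report none on a runtime with correctly rounded conversions, because `ES16.8` prints one digit before the decimal point and 8 after, for 9 significant digits in total. That makes text output safe for accuracy if the format is chosen with care, but it does nothing for the performance and size costs discussed next.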

If you try running this code, you'll notice that it takes a surprisingly long time to finish. That's because performance is another real issue with text output of floating point numbers. Consider the following simple program, which just writes out your example of a 5000 × 5000 real array as text and as unformatted binary:

program testarray
    implicit none
    integer, parameter :: asize=5000
    real, dimension(asize,asize) :: array

    integer :: i, j
    integer :: time, u

    forall (i=1:asize, j=1:asize) array(i,j)=i*asize+j

    call tick(time)
    open(newunit=u,file='test.txt')
    do i=1,asize
        write(u,*) (array(i,j), j=1,asize)
    enddo
    close(u)
    print *, 'ASCII: time = ', tock(time)

    call tick(time)
    open(newunit=u,file='test.dat',form='unformatted')
    write(u) array
    close(u)
    print *, 'Binary: time = ', tock(time)


contains
    subroutine tick(t)
        integer, intent(OUT) :: t
        call system_clock(t)
    end subroutine tick

    ! returns time in seconds from now to time described by t 
    real function tock(t)
        integer, intent(in) :: t
        integer :: now, clock_rate
        call system_clock(now,clock_rate)
        tock = real(now - t)/real(clock_rate)
    end function tock

end program testarray

Here are the timing outputs, for writing to disk or to ramdisk:

Disk:
 ASCII: time =    41.193001    
 Binary: time =   0.11700000    
Ramdisk
 ASCII: time =    40.789001    
 Binary: time =   5.70000000E-02

Note that when writing to disk, the binary output is 352 times as fast as ASCII, and to ramdisk it's closer to 700 times. There are two reasons for this - one is that you can write out data all at once, rather than having to loop; the other is that generating the string decimal representation of a floating point number is a surprisingly subtle operation which requires a significant amount of computing for each value.

Finally, there is data size: the text file in the above example comes out (on my system) to about 4 times the size of the binary file.

Now, there are real problems with binary output. In particular, raw Fortran (or, for that matter, C) binary output is very brittle. If you change platforms, or your data size changes, your output may no longer be any good. Adding new variables to the output will break the file format unless you always add new data at the end of the file, and you have no way of knowing ahead of time what variables are in a binary blob you get from your collaborator (who might be you, three months ago). Most of the downsides of binary output are avoided by using libraries like NetCDF, which write self-describing binary files that are much more "future proof" than raw binary. Better still, since it's a standard, many tools read NetCDF files.
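To make the brittleness concrete, here is a sketch of the least you can do by hand if a library is not an option: put a small header (a tag and the array extents) in front of the raw data so a reader can check what it has been given. Everything here is an assumption made for illustration: the module name, the routine names, and the tag value are invented, and this is not a substitute for what NetCDF provides. It does not handle different real kinds or byte orders, although a byte-swapped file would at least fail the tag check.

```fortran
module tagged_io
    use iso_fortran_env, only: int32
    implicit none
    ! Hypothetical tag identifying "our" file layout.
    integer(int32), parameter :: magic = 20240101_int32
contains

    ! Write tag, extents, then the raw values.
    subroutine save_tagged(filename, a)
        character(len=*), intent(in) :: filename
        real, intent(in) :: a(:,:)
        integer :: u

        open(newunit=u, file=filename, form='unformatted', access='stream', &
             status='replace', action='write')
        write(u) magic, int(shape(a), int32)
        write(u) a
        close(u)
    end subroutine save_tagged

    ! Read the header first; allocate and read the data only if the
    ! tag matches. On a mismatch, ok is false and a stays unallocated.
    subroutine load_tagged(filename, a, ok)
        character(len=*), intent(in) :: filename
        real, allocatable, intent(out) :: a(:,:)
        logical, intent(out) :: ok
        integer(int32) :: tag, dims(2)
        integer :: u

        open(newunit=u, file=filename, form='unformatted', access='stream', &
             status='old', action='read')
        read(u) tag, dims
        ok = (tag == magic)
        if (ok) then
            allocate(a(dims(1), dims(2)))
            read(u) a
        end if
        close(u)
    end subroutine load_tagged

end module tagged_io
```

The reader no longer has to know the array shape in advance, which removes one failure mode, but every further piece of metadata (variable names, units, several arrays per file) would need more hand-written header code.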

There are many NetCDF tutorials on the internet; ours is here. A simple example using NetCDF gives similar times to raw binary:

$ ./array 
 ASCII: time =    40.676998    
 Binary: time =   4.30000015E-02
 NetCDF: time =   0.16000000  

but gives you a nice self-describing file:

$ ncdump -h test.nc
netcdf test {
dimensions:
    X = 5000 ;
    Y = 5000 ;
variables:
    float Array(Y, X) ;
        Array:units = "ergs" ;
}

and file sizes about the same as raw binary:

$ du -sh test.*
96M test.dat
96M test.nc
382M    test.txt

The code follows:

program testarray
    implicit none
    integer, parameter :: asize=5000
    real, dimension(asize,asize) :: array

    integer :: i, j
    integer :: time, u

    forall (i=1:asize, j=1:asize) array(i,j)=i*asize+j

    call tick(time)
    open(newunit=u,file='test.txt')
    do i=1,asize
        write(u,*) (array(i,j), j=1,asize)
    enddo
    close(u)
    print *, 'ASCII: time = ', tock(time)

    call tick(time)
    open(newunit=u,file='test.dat',form='unformatted')
    write(u) array
    close(u)
    print *, 'Binary: time = ', tock(time)

    call tick(time)
    call writenetcdffile(array)
    print *, 'NetCDF: time = ', tock(time)


contains
    subroutine tick(t)
        integer, intent(OUT) :: t
        call system_clock(t)
    end subroutine tick

    ! returns time in seconds from now to time described by t 
    real function tock(t)
        integer, intent(in) :: t
        integer :: now, clock_rate
        call system_clock(now,clock_rate)
        tock = real(now - t)/real(clock_rate)
    end function tock

    subroutine writenetcdffile(array)
        use netcdf
        implicit none
        real, intent(IN), dimension(:,:) :: array

        integer :: file_id, xdim_id, ydim_id
        integer :: array_id
        integer, dimension(2) :: arrdims
        character(len=*), parameter :: arrunit = 'ergs'

        integer :: i, j
        integer :: ierr

        i = size(array,1)
        j = size(array,2)

        ! create the file
        ierr = nf90_create(path='test.nc', cmode=NF90_CLOBBER, ncid=file_id)

        ! define the dimensions
        ierr = nf90_def_dim(file_id, 'X', i, xdim_id)
        ierr = nf90_def_dim(file_id, 'Y', j, ydim_id)

        ! now that the dimensions are defined, we can define variables on them,...
        arrdims = (/ xdim_id, ydim_id /)
        ierr = nf90_def_var(file_id, 'Array',  NF90_REAL, arrdims, array_id)

        ! ...and assign units to them as an attribute 
        ierr = nf90_put_att(file_id, array_id, "units", arrunit)

        ! done defining
        ierr = nf90_enddef(file_id)

        ! Write out the values
        ierr = nf90_put_var(file_id, array_id, array)

        ! close; done
        ierr = nf90_close(file_id)
    return
    end subroutine writenetcdffile
end program testarray
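The program above only writes the file. Since the question also asks about reading the data back, here is a sketch of the reverse direction using the netcdf-fortran (`nf90_*`) interface. It assumes a file laid out like `test.nc` above, with dimensions named `X` and `Y` and a variable named `Array`. The module name and the `check` helper are additions for illustration; unlike the writing code, the sketch stops on the first failed call rather than ignoring `ierr`.

```fortran
module netcdf_readback
    use netcdf
    implicit none
contains

    ! Stop with the library's own message if a NetCDF call failed.
    subroutine check(status)
        integer, intent(in) :: status
        if (status /= NF90_NOERR) then
            print *, trim(nf90_strerror(status))
            error stop 'NetCDF call failed'
        end if
    end subroutine check

    ! Open the file read-only, size the array from the stored
    ! dimensions, and read the variable 'Array' into it.
    subroutine read_array(path, a)
        character(len=*), intent(in) :: path
        real, allocatable, intent(out) :: a(:,:)
        integer :: ncid, xdim_id, ydim_id, var_id, nx, ny

        call check(nf90_open(path, NF90_NOWRITE, ncid))

        call check(nf90_inq_dimid(ncid, 'X', xdim_id))
        call check(nf90_inq_dimid(ncid, 'Y', ydim_id))
        call check(nf90_inquire_dimension(ncid, xdim_id, len=nx))
        call check(nf90_inquire_dimension(ncid, ydim_id, len=ny))

        call check(nf90_inq_varid(ncid, 'Array', var_id))
        allocate(a(nx, ny))
        call check(nf90_get_var(ncid, var_id, a))

        call check(nf90_close(ncid))
    end subroutine read_array

end module netcdf_readback
```

A later run would `use netcdf_readback` and `call read_array('test.nc', array)` with `array` declared as `real, allocatable :: array(:,:)`. Because the dimensions are stored in the file, the reading program does not need to hard-code 5000.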
