Inconsistent record marker while reading Fortran unformatted file


Question

I'm trying to read a very big Fortran unformatted binary file with Python. The file contains 2^30 integers.

I find the record markers confusing (the first one is -2147483639). Anyway, I have managed to recover the data structure (the wanted integers are all similar to each other, and thus differ from the record markers) and wrote the code below (with help from here).

However, we can see that the markers at the beginning and the end of each record are not the same. Why is that?

Is it because the data is too long (536870910 = (2^30 - 4) / 2)? But (2^31 - 1) / 4 = 536870911 > 536870910.

Or did the author of the data file simply make a mistake?

Another question: what is the type of the marker at the beginning of a record, int or unsigned int?

import struct
import numpy as np

fp = open(file_path, "rb")

# Record 1: leading marker, 536870910 big-endian integers, trailing marker
rec_len1, = struct.unpack('>i', fp.read(4))
data1 = np.fromfile(fp, '>i', 536870910)
rec_end1, = struct.unpack('>i', fp.read(4))

# Record 2: same layout
rec_len2, = struct.unpack('>i', fp.read(4))
data2 = np.fromfile(fp, '>i', 536870910)
rec_end2, = struct.unpack('>i', fp.read(4))

# Record 3: the remaining 4 integers
rec_len3, = struct.unpack('>i', fp.read(4))
data3 = np.fromfile(fp, '>i', 4)
rec_end3, = struct.unpack('>i', fp.read(4))
data = np.concatenate([data1, data2, data3])

print((rec_len1, rec_end1, rec_len2, rec_end2, rec_len3, rec_end3))

Here are the record length values read as shown above:

(-2147483639, -2176, 2406, 589824, 1227787, -18)

Recommended answer

Finally, things seem to be clearer.

The Intel Fortran Compiler User and Reference Guides explain this in the section Record Types: Variable-Length Records, which states:

For a record length greater than 2,147,483,639 bytes, the record is divided into subrecords. The subrecord can be of any length from 1 to 2,147,483,639, inclusive.

The sign bit of the leading length field indicates whether the record is continued or not. The sign bit of the trailing length field indicates the presence of a preceding subrecord. The position of the sign bit is determined by the endian format of the file.

A subrecord that is continued has a leading length field with a sign bit value of 1. The last subrecord that makes up a record has a leading length field with a sign bit value of 0. A subrecord that has a preceding subrecord has a trailing length field with a sign bit value of 1. The first subrecord that makes up a record has a trailing length field with a sign bit value of 0. If the value of the sign bit is 1, the length of the record is stored in twos-complement notation.
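The sign-bit rules quoted above can be sketched in a few lines of Python. This is only an illustration: the helper names `decode_marker` and `describe_subrecord` are hypothetical, and big-endian byte order is assumed because that is what the code in the question uses.

```python
import struct

def decode_marker(raw, byteorder=">"):
    """Decode a 4-byte length field into (length, sign_bit_set).

    Hypothetical helper: the field is read as a signed 32-bit integer,
    so a sign bit of 1 shows up as a negative value.
    """
    value, = struct.unpack(byteorder + "i", raw)
    return abs(value), value < 0

def describe_subrecord(leading_raw, trailing_raw):
    """Apply the quoted rules to one subrecord's two length fields."""
    length, continued = decode_marker(leading_raw)
    trailing_length, has_preceding = decode_marker(trailing_raw)
    return {
        "length": length,                  # payload size in bytes
        "continued": continued,            # leading sign bit is 1
        "has_preceding": has_preceding,    # trailing sign bit is 1
        "lengths_match": length == trailing_length,
    }

# Markers of a first subrecord that is followed by more subrecords
first = describe_subrecord(struct.pack(">i", -2147483639),
                           struct.pack(">i", 2147483639))
print(first)
```

A first subrecord reports `continued` as true and `has_preceding` as false; a last subrecord reports the opposite.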

After many attempts, I realized that I had been misled by "twos-complement notation": the record marker simply changes sign according to the rules above, instead of being changed to its twos-complement notation (as I had understood it) when the sign bit is 1. It is also possible that my data was created with a different compiler.

Here is the solution.

The data is larger than 2 GB, so it is divided into several subrecords. The first record start marker is -2147483639, so the length of the first subrecord is 2147483639, which is exactly the maximum subrecord length. It is not 2147483640 as I first thought, nor 2147483638, which I had taken to be the twos-complement notation of -2147483639.
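To make the numbers concrete, the following sketch prints the raw bytes of the marker and a few ways of interpreting them. It assumes a big-endian 32-bit field, as in the question's code. Negating the signed value gives the subrecord length, while bitwise inversion gives 2147483638, which is shown only for comparison.

```python
import struct

marker = -2147483639

raw = struct.pack(">i", marker)        # big-endian signed 32-bit field
print(raw.hex())                       # 80000009

unsigned, = struct.unpack(">I", raw)   # same bytes read as unsigned int
print(unsigned)                        # 2147483657

negated = -marker                      # plain sign change
print(negated)                         # 2147483639

inverted = ~marker                     # bitwise inversion, for comparison
print(inverted)                        # 2147483638
```

Reading the field as a signed int and taking the absolute value therefore recovers the length directly. Reading it as an unsigned int yields a value above 2^31 that then needs further handling.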

If we skip 2147483639 bytes and read the record end marker, we get 2147483639, because this is the first subrecord and so its end marker is positive.

Below is the code to check the record markers:

import struct

with open(file_path, "rb") as fp:
    while True:
        prefix, = struct.unpack('>i', fp.read(4))
        fp.seek(abs(prefix), 1)    # skip the payload, or read abs(prefix) bytes of data instead
        suffix, = struct.unpack('>i', fp.read(4))
        print(prefix, suffix)
        if abs(suffix) != abs(prefix):
            print("suffix != prefix!")
            break
        if prefix > 0:    # positive leading marker: last subrecord of the record
            break

And the output printed to the screen:

-2147483639 2147483639
-2147483639 -2147483639
18 -18

We can see that the begin and end markers of each subrecord are always the same apart from the sign. The three subrecords are 2147483639, 2147483639, and 18 bytes long, so their lengths are not necessarily multiples of 4. The first subrecord therefore ends with the first 3 bytes of some integer, and the second begins with the remaining 1 byte.
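The three lengths add up to 2 × 2147483639 + 18 = 2^32 bytes, which is exactly 2^30 four-byte integers. Because integers can straddle subrecord boundaries, one way to recover them is to join the subrecord payloads first and only then interpret the bytes. The sketch below assumes big-endian markers and data. The names `read_exact`, `read_record`, and `pack_subrecord` are hypothetical helpers, and the demo splits a tiny payload at artificial 7-byte boundaries only to imitate the situation. A real file written this way would only be split at 2147483639 bytes.

```python
import io
import struct
import numpy as np

def read_exact(fp, n):
    """Read exactly n bytes or raise EOFError."""
    data = fp.read(n)
    if len(data) != n:
        raise EOFError("expected %d bytes, got %d" % (n, len(data)))
    return data

def read_record(fp, byteorder=">"):
    """Read one logical record, joining its subrecords into one bytes object."""
    chunks = []
    while True:
        prefix, = struct.unpack(byteorder + "i", read_exact(fp, 4))
        length = abs(prefix)
        chunks.append(read_exact(fp, length))
        suffix, = struct.unpack(byteorder + "i", read_exact(fp, 4))
        if abs(suffix) != length:
            raise ValueError("marker mismatch: %d vs %d" % (prefix, suffix))
        if prefix >= 0:    # leading sign bit 0: this was the last subrecord
            break
    return b"".join(chunks)

def pack_subrecord(payload, continued, has_preceding):
    """Build a synthetic subrecord for the demo (not a real Fortran writer)."""
    n = len(payload)
    lead = struct.pack(">i", -n if continued else n)
    trail = struct.pack(">i", -n if has_preceding else n)
    return lead + payload + trail

# Six integers (24 bytes) split into subrecords of 7, 7 and 10 bytes
payload = np.arange(6, dtype=">i4").tobytes()
demo = io.BytesIO(
    pack_subrecord(payload[:7], True, False)
    + pack_subrecord(payload[7:14], True, True)
    + pack_subrecord(payload[14:], False, True)
)
values = np.frombuffer(read_record(demo), dtype=">i4")
print(values.tolist())  # [0, 1, 2, 3, 4, 5]
```

For a 4 GB record, joining the chunks temporarily doubles memory use. Preallocating a buffer and filling it with `readinto` would avoid that, at the cost of a slightly longer function.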
