Inconsistent record markers while reading a Fortran unformatted file


Problem description

I'm trying to read a very large Fortran unformatted binary file with Python. The file contains 2^30 integers.

I find the record markers confusing (the first one is -2147483639). Still, I managed to recover the data structure (the wanted integers are all similar to each other, so they stand out from the record markers) and wrote the code below (with help from here).

However, the markers at the beginning and the end of each record are not the same. Why is that?

Is it because the data is too long (536870910 = (2^30 - 4) / 2)? But (2^31 - 1) / 4 = 536870911 (rounded down), which is greater than 536870910.

Or is it just a mistake made by the author of the data file?

Another question: what is the type of the marker at the beginning of a record, int or unsigned int?

import struct
import numpy as np

# file_path is the path to the Fortran unformatted file (big-endian)
fp = open(file_path, "rb")

# first block: 4-byte leading marker, payload, 4-byte trailing marker
rec_len1, = struct.unpack('>i', fp.read(4))
data1 = np.fromfile(fp, '>i', 536870910)
rec_end1, = struct.unpack('>i', fp.read(4))

# second block
rec_len2, = struct.unpack('>i', fp.read(4))
data2 = np.fromfile(fp, '>i', 536870910)
rec_end2, = struct.unpack('>i', fp.read(4))

# third block: the remaining 4 integers
rec_len3, = struct.unpack('>i', fp.read(4))
data3 = np.fromfile(fp, '>i', 4)
rec_end3, = struct.unpack('>i', fp.read(4))
data = np.concatenate([data1, data2, data3])

print(rec_len1, rec_end1, rec_len2, rec_end2, rec_len3, rec_end3)

Here are the record-marker values read by the code above:

(-2147483639, -2176, 2406, 589824, 1227787, -18)

Solution

Finally, things seem clearer.

The relevant reference is the Intel Fortran Compiler User and Reference Guides; see the section Record Types: Variable-Length Records.

For a record length greater than 2,147,483,639 bytes, the record is divided into subrecords. The subrecord can be of any length from 1 to 2,147,483,639, inclusive.

The sign bit of the leading length field indicates whether the record is continued or not. The sign bit of the trailing length field indicates the presence of a preceding subrecord. The position of the sign bit is determined by the endian format of the file.

A subrecord that is continued has a leading length field with a sign bit value of 1. The last subrecord that makes up a record has a leading length field with a sign bit value of 0. A subrecord that has a preceding subrecord has a trailing length field with a sign bit value of 1. The first subrecord that makes up a record has a trailing length field with a sign bit value of 0. If the value of the sign bit is 1, the length of the record is stored in twos-complement notation.
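To make the quoted rule concrete, here is a minimal sketch of decoding a leading length field. The helper name `decode_leading_marker` is hypothetical, and the sketch assumes a big-endian file with 4-byte markers, as in the question. It also answers the "int or unsigned int" question in practice: reading the field as a signed 32-bit integer exposes the sign bit directly as a negative value.

```python
import struct

def decode_leading_marker(raw, byteorder=">"):
    """Return (length_in_bytes, is_continued) for a 4-byte leading length field.

    Assumption: the field is read as a signed 32-bit integer; a negative
    value means the sign bit is set, i.e. another subrecord follows.
    """
    (value,) = struct.unpack(byteorder + "i", raw)
    return abs(value), value < 0

# The first marker from the question, packed back into 4 big-endian bytes.
first_marker = struct.pack(">i", -2147483639)
print(decode_leading_marker(first_marker))  # (2147483639, True)
```

Reading the same four bytes as unsigned (`'>I'`) would give a value above 2^31, which is not a valid subrecord length, so the signed interpretation is the convenient one here.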

After many attempts, I realized that I had been misled by the mention of twos-complement notation. When read as a signed integer, the record marker simply has its sign changed according to the rules above; I did not need to do any further twos-complement conversion when the sign bit is 1. It is also possible that my data was created with a different compiler.

Below is the solution.

The data is larger than 2 GB, so it is divided into several subrecords. The first record start marker is -2147483639, so the length of the first subrecord is 2147483639, which is exactly the maximum subrecord length. It is not 2147483640, as I first thought, nor 2147483638, which I had derived by misapplying twos-complement notation to -2147483639.

If we skip 2147483639 bytes and read the record end marker, we get 2147483639: this is the first subrecord, so its end marker is positive.
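The subrecord lengths can be checked with a little arithmetic. The sketch below assumes 4-byte integers and assumes the writer fills every subrecord to the 2,147,483,639-byte maximum before starting the next one; `split_lengths` is a hypothetical helper, not part of any library.

```python
MAX_SUBRECORD = 2147483639  # maximum subrecord length quoted from the Intel guide

def split_lengths(total_bytes, max_len=MAX_SUBRECORD):
    """Split a record of total_bytes into subrecord lengths (assumed greedy fill)."""
    lengths = []
    remaining = total_bytes
    while remaining > max_len:
        lengths.append(max_len)
        remaining -= max_len
    lengths.append(remaining)
    return lengths

total = 2**30 * 4  # 2^30 integers of 4 bytes each = 2^32 bytes
print(split_lengths(total))  # [2147483639, 2147483639, 18]
```

Two full subrecords plus a final 18-byte subrecord account for all 2^32 bytes of payload.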

Below is the code to check the record markers:

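import struct

# file_path is the path to the Fortran unformatted file (big-endian)
fp = open(file_path, "rb")
while True:
    # leading length field; negative means the record is continued
    prefix, = struct.unpack('>i', fp.read(4))
    fp.seek(abs(prefix), 1)    # skip the payload, or read |prefix| bytes of data instead
    # trailing length field; negative means a subrecord precedes this one
    suffix, = struct.unpack('>i', fp.read(4))
    print(prefix, suffix)
    if abs(suffix) != abs(prefix):
        print("suffix != prefix!")
        break
    if prefix > 0:             # last subrecord of the record
        break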

The output is:

-2147483639 2147483639
-2147483639 -2147483639
18 -18

The begin marker and end marker of each subrecord are always the same except for the sign. The lengths of the three subrecords are 2147483639, 2147483639, and 18 bytes, so a subrecord length does not have to be a multiple of 4. The first subrecord therefore ends with the first 3 bytes of some integer, and the second subrecord begins with the remaining 1 byte.
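Because an integer can straddle a subrecord boundary, the payloads have to be joined before they are interpreted as integers. The following is a minimal sketch of that idea, assuming big-endian 4-byte markers and the sign convention observed above; `read_record` is a hypothetical helper, and the demo uses a tiny in-memory file with artificially short subrecords (7, 7, and 10 bytes) instead of the real 4 GiB data.

```python
import io
import struct

def read_record(fp, byteorder=">"):
    """Read one logical record by joining its subrecords; return the payload bytes."""
    chunks = []
    while True:
        head = fp.read(4)
        if len(head) < 4:
            raise EOFError("incomplete leading marker")
        (prefix,) = struct.unpack(byteorder + "i", head)
        length = abs(prefix)
        payload = fp.read(length)
        tail = fp.read(4)
        if len(payload) != length or len(tail) < 4:
            raise EOFError("truncated subrecord")
        (suffix,) = struct.unpack(byteorder + "i", tail)
        if abs(suffix) != length:
            raise ValueError("leading and trailing markers disagree")
        chunks.append(payload)
        if prefix >= 0:  # sign bit clear: this was the last subrecord
            break
    return b"".join(chunks)

# Demo: six big-endian integers split across three subrecords.
raw = struct.pack(">6i", 0, 1, 2, 3, 4, 5)
pieces = [raw[:7], raw[7:14], raw[14:]]
buf = io.BytesIO()
for i, piece in enumerate(pieces):
    lead = len(piece) if i == len(pieces) - 1 else -len(piece)
    trail = len(piece) if i == 0 else -len(piece)
    buf.write(struct.pack(">i", lead) + piece + struct.pack(">i", trail))
buf.seek(0)

values = struct.unpack(">6i", read_record(buf))
print(values)  # (0, 1, 2, 3, 4, 5)
```

For the real file, the joined bytes could be passed to `np.frombuffer(payload, dtype='>i4')` instead of `struct.unpack`. Note that this sketch holds the whole record in memory, which for a 4 GiB record may call for a chunked or memory-mapped approach instead.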
