Is it possible to map discontinuous data on disk to an array with python?


Problem Description

I want to map a big Fortran record (12 GB) on hard disk to a NumPy array (mapping instead of loading, to save memory).

The data stored in the Fortran record is not contiguous, because it is divided by record markers. The record structure is "marker, data, marker, data, ..., data, marker". The lengths of the data regions and of the markers are known.

The length of the data between markers is not a multiple of 4 bytes; otherwise I could map each data region to an array.

The first marker can be skipped by setting offset in memmap. Is it possible to skip the other markers as well and map the data to an array?
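A minimal sketch of the offset trick on a toy file (the file name, 4-byte markers, and float32 payload here are illustrative assumptions, not taken from the actual 12 GB file):

```python
import numpy as np

# Build a tiny mock Fortran-style record: 4-byte marker, payload, 4-byte marker.
# Fortran record markers conventionally store the payload length in bytes.
payload = np.arange(6, dtype=np.float32)
marker = np.int32(payload.nbytes).tobytes()
with open('record.bin', 'wb') as f:
    f.write(marker + payload.tobytes() + marker)

# offset=4 skips the leading marker, so the map covers only the data region.
data = np.memmap('record.bin', dtype=np.float32, mode='r',
                 offset=4, shape=(6,))
print(data)  # the six payload values, no marker bytes
```

This works for the first marker because a memmap view starts at a single byte offset; the question is precisely that one view cannot also jump over the markers in the middle of the file.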

Apologies for any ambiguous wording, and thanks for any solution or suggestion.

Edit, May 15

These are Fortran unformatted files. The data stored in the record is a (1024^3)*3 float32 array (12 GB).

The record layout of variable-length records that are greater than 2 gigabytes is shown in the figure below:

(For details see here -> the section [Record Types] -> [Variable-Length Records].)

In my case, each subrecord except the last has a length of 2147483639 bytes and is separated from the next by 8 bytes (as you can see in the figure above: the end marker of the previous subrecord plus the begin marker of the following one, 8 bytes in total).

Since 2147483639 mod 4 = 3, the first subrecord ends with the first 3 bytes of some float value, and the second subrecord begins with its remaining 1 byte.
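The byte positions of the subrecord payloads follow directly from that layout. A hedged sketch of computing them (assuming 4-byte markers, 2147483639-byte subrecords, and the (1024^3)*3 float32 payload described above):

```python
SUB = 2147483639            # payload bytes per subrecord (except the last)
MARKER = 4                  # bytes per record marker
TOTAL = 3 * 1024**3 * 4     # total payload: (1024^3)*3 float32 values

# (start, length) of each subrecord's payload within the file
spans = []
pos = MARKER                # skip the begin marker of the first subrecord
remaining = TOTAL
while remaining > 0:
    n = min(SUB, remaining)
    spans.append((pos, n))
    pos += n + 2 * MARKER   # end marker of this subrecord + begin marker of the next
    remaining -= n

# The payload starts are not 4-byte aligned relative to each other
# (2147483639 % 4 == 3), so no single memmap view can cover them all;
# the subrecords would have to be read or copied piecewise.
```

For this file the loop yields seven spans: six full 2147483639-byte subrecords and a final 54-byte one.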

Recommended Answer

I posted another answer because, for the example given here, numpy.memmap worked:

import numpy as np

offset = 0
data1 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                  offset=offset, shape=(size1,))
offset += size1*byte_size
data2 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                  offset=offset, shape=(size2,))
offset += size2*byte_size  # advance past the second block (was size1 by mistake)
data3 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                  offset=offset, shape=(size3,))

For int32, byte_size = 32/8 = 4; for int16, byte_size = 16/8 = 2, and so forth.
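The running-offset pattern above can be exercised on a small throwaway file; np.dtype(...).itemsize gives byte_size without hand computation (the file name and sizes here are illustrative):

```python
import numpy as np

size1, size2, size3 = 3, 4, 5
byte_size = np.dtype('i').itemsize   # 4 bytes for a 32-bit int

# Build a file holding three consecutive int32 blocks: 0..2, 3..6, 7..11.
np.arange(size1 + size2 + size3, dtype='i').tofile('tmp')

offset = 0
data1 = np.memmap('tmp', dtype='i', mode='r', offset=offset, shape=(size1,))
offset += size1 * byte_size
data2 = np.memmap('tmp', dtype='i', mode='r', offset=offset, shape=(size2,))
offset += size2 * byte_size
data3 = np.memmap('tmp', dtype='i', mode='r', offset=offset, shape=(size3,))

print(list(data2))  # the middle block: [3, 4, 5, 6]
```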

If the sizes are constant, you can load the data into a 2D array, for example:

shape = (total_length//size, size)  # integer division for the row count
data = np.memmap('tmp', dtype='i', mode='r+', order='F', shape=shape)
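A small runnable instance of the 2D form (names illustrative; note that with order='F' the file is interpreted column-major):

```python
import numpy as np

size = 4           # elements per record
total_length = 12  # elements in the whole file
np.arange(total_length, dtype='i').tofile('tmp')

shape = (total_length // size, size)  # integer division: 3 rows of 4
data = np.memmap('tmp', dtype='i', mode='r', order='F', shape=shape)

# Column-major layout: data[:, 0] is the first 3 values in the file.
print(list(data[:, 0]))  # [0, 1, 2]
```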

You can create as many memmap objects as you want. It is even possible to make arrays that share the same elements; in that case, changes made through one are automatically visible through the other.
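A quick check of that sharing behavior, again with a throwaway file (mode 'r+' so writes go through; on typical platforms both views map the same file pages):

```python
import numpy as np

np.zeros(8, dtype='i').tofile('tmp')

# Two memmap views over overlapping bytes of the same file.
a = np.memmap('tmp', dtype='i', mode='r+', shape=(8,))
b = np.memmap('tmp', dtype='i', mode='r+', shape=(4,), offset=4 * 4)  # last 4 elements

a[6] = 42      # write through one view...
print(b[2])    # ...and the same element is visible through the other
```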

Other references:

The numpy.memmap documentation is here.
