Is it possible to map discontinuous data on disk to an array with Python?


Question

I want to map a big Fortran record (12 GB) on hard disk to a NumPy array. (Mapping rather than loading, to save memory.)

The data stored in the Fortran record is not contiguous because it is divided by record markers. The record structure is "marker, data, marker, data, ..., data, marker". The lengths of the data regions and of the markers are known.

The length of the data between markers is not a multiple of 4 bytes; otherwise I could map each data region to an array.

The first marker can be skipped by setting offset in memmap. Is it possible to skip the other markers as well and map the data to a single array?
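A minimal sketch of the offset approach mentioned above. It assumes a hypothetical toy file with one 4-byte length marker on each side of three float32 values; the file name, marker size, and values are made up for illustration and are not taken from the real 12 GB file.

```python
import os
import struct
import tempfile

import numpy as np

MARKER_LEN = 4  # assumed marker size in bytes (illustrative)
values = np.array([1.0, 2.0, 3.0], dtype=np.float32)

# Write a toy record: marker, data, marker (the marker holds the data length).
path = os.path.join(tempfile.mkdtemp(), "toy_record.bin")
marker = struct.pack("<i", values.nbytes)
with open(path, "wb") as f:
    f.write(marker + values.astype("<f4").tobytes() + marker)

# offset skips the leading marker; shape stops before the trailing one.
mapped = np.memmap(path, dtype="<f4", mode="r",
                   offset=MARKER_LEN, shape=(values.size,))
print(mapped)
```

This only handles a single data region. Markers in the middle of the data are the harder case the question asks about.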

Apologies for any ambiguous wording, and thanks for any solution or suggestion.

Edit, May 15

These are Fortran unformatted files. The data stored in the record is a (1024^3)*3 float32 array (12 GB).

The record layout of variable-length records that are greater than 2 gigabytes is shown below:

(For details see http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/fortran-win/index.htm#GUID-64D43E4C-68E7-4C48-8B50-B49F1F7DA46C.htm -> the section [Record Types] -> [Variable-Length Records].)

In my case, every subrecord except the last one has a length of 2147483639 bytes, and consecutive subrecords are separated by 8 bytes (as you can see in the figure above: an end marker of the previous subrecord and a begin marker of the following one, 8 bytes in total).

We can see that the first subrecord ends with the first 3 bytes of a float number and the second subrecord begins with the remaining 1 byte, since 2147483639 mod 4 = 3.
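The layout arithmetic described above can be sketched as follows. The helper name `data_segments` is hypothetical, and the 4-byte marker size is inferred from the 8-byte gap (end marker plus begin marker) stated in the question. The sketch only computes byte offsets and does not touch any file.

```python
def data_segments(total_data_bytes, subrecord_len, marker_len=4):
    """Return (file_offset, length) pairs, one per data region.

    Assumes every subrecord is wrapped in a begin and an end marker:
    marker, data, marker, marker, data, marker, ...
    """
    segments = []
    offset = marker_len              # skip the first begin marker
    remaining = total_data_bytes
    while remaining > 0:
        length = min(subrecord_len, remaining)
        segments.append((offset, length))
        # Jump over the data, its end marker, and the next begin marker.
        offset += length + 2 * marker_len
        remaining -= length
    return segments


SUBRECORD_LEN = 2147483639           # subrecord length from the question
TOTAL_BYTES = 1024**3 * 3 * 4        # (1024^3)*3 float32 values

segments = data_segments(TOTAL_BYTES, SUBRECORD_LEN)
for file_offset, length in segments:
    print(file_offset, length)

print(SUBRECORD_LEN % 4)             # 3: a float straddles each boundary
```

Each pair could be used with a byte-level (uint8) mapping or a seek-and-read call. Because the regions are not 4-byte aligned, a single float32 memmap with regular strides cannot cover them without copying the bytes somewhere contiguous.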

Recommended answer

I posted another answer because, for the example given here, numpy.memmap worked:

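import numpy as np

# size1, size2, size3: number of elements in each region.
# byte_size: number of bytes per element of the chosen dtype.
offset = 0
data1 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                  offset=offset, shape=(size1,))
offset += size1*byte_size
data2 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                  offset=offset, shape=(size2,))
offset += size2*byte_size  # advance by the region just mapped
data3 = np.memmap('tmp', dtype='i', mode='r+', order='F',
                  offset=offset, shape=(size3,))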
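For a self-contained version of the same idea, here is a sketch that first creates a small toy file and then maps three consecutive regions from it. The file name and the region sizes are made-up values for illustration only.

```python
import os
import tempfile

import numpy as np

dtype = np.dtype("int32")
byte_size = dtype.itemsize           # 32 / 8 = 4 bytes per element
sizes = [3, 5, 2]                    # hypothetical region lengths (elements)

# Create a toy file holding the integers 0..9 as int32.
path = os.path.join(tempfile.mkdtemp(), "tmp.bin")
np.arange(sum(sizes), dtype=dtype).tofile(path)

regions = []
offset = 0
for size in sizes:
    regions.append(np.memmap(path, dtype=dtype, mode="r",
                             offset=offset, shape=(size,)))
    offset += size * byte_size       # advance by the region just mapped

for region in regions:
    print(region)
```

Using a loop keeps the running offset in one place, which avoids advancing by the wrong region size.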

For int32, byte_size = 32/8; for int16, byte_size = 16/8; and so on...

If the sizes are constant, you can load the data into a 2D array like this:

# Integer division keeps the shape integral under Python 3.
shape = (total_length // size, size)
data = np.memmap('tmp', dtype='i', mode='r+', order='F', shape=shape)
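The following sketch shows how the 2D mapping lays out a toy file, assuming hypothetical values for `size` and `total_length`. It also contrasts the default C order with `order='F'`, since the two place consecutive file elements along different axes.

```python
import os
import tempfile

import numpy as np

size = 4                             # hypothetical elements per block
total_length = 12                    # hypothetical total element count

path = os.path.join(tempfile.mkdtemp(), "tmp2d.bin")
np.arange(total_length, dtype=np.int32).tofile(path)

# Default C order: each row is one contiguous block of `size` elements.
rows = np.memmap(path, dtype=np.int32, mode="r",
                 shape=(total_length // size, size))

# order='F': consecutive file elements run down the columns instead.
cols = np.memmap(path, dtype=np.int32, mode="r", order="F",
                 shape=(total_length // size, size))

print(rows[1])       # second contiguous block of the file
print(cols[:, 1])    # second column under Fortran order
```

Which order is appropriate depends on how the file was written, so it is worth checking a few known values before relying on either layout.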

You can modify the memmap object as much as you want. It is even possible to create arrays that share the same elements. In that case, changes made in one are automatically reflected in the other.
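A small sketch of arrays sharing the same elements, assuming a hypothetical toy file of six int32 zeros. Here the sharing comes from slicing a memmap, which yields a view rather than a copy.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "shared.bin")
np.zeros(6, dtype=np.int32).tofile(path)

whole = np.memmap(path, dtype=np.int32, mode="r+", shape=(6,))
tail = whole[3:]                     # a view: no data is copied

tail[0] = 42                         # write through the view
print(whole[3])                      # the parent array sees the change
print(np.shares_memory(whole, tail))

whole.flush()                        # push the change to the file on disk
reopened = np.memmap(path, dtype=np.int32, mode="r", shape=(6,))
print(reopened[3])
```

Calling `flush()` before reopening makes sure the modification has reached the file rather than only the in-memory mapping.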

Other references:

See the numpy.memmap documentation.
