How do I lazily concatenate "numpy ndarray"-like objects for sequential reading?


Question

I have a list of several large hdf5 files, each with a 4D dataset. I would like to obtain a concatenation of them on the first axis, as in, an array-like object that would be used as if all datasets were concatenated. My final intent is to sequentially read chunks of the data along the same axis (e.g. [0:100,:,:,:], [100:200,:,:,:], ...), multiple times.

Datasets in h5py share a significant part of the numpy array API, which allows me to call numpy.concatenate to get the job done:

import h5py as h5
import numpy as np
files = [h5.File(name, 'r') for name in filenames]
X = np.concatenate([f['data'] for f in files], axis=0)

On the other hand, the memory layout is not the same, and memory cannot be shared among them (related question). Alas, concatenate will eagerly copy the entire content of each array-like object into a new array, which I cannot accept in my use case. The source code of the array concatenation function confirms this.

How can I obtain a concatenated view over multiple array-like objects, without eagerly reading them to memory? As far as this view is concerned, slicing and indexing over this view would behave just as if I had a concatenated array.

I can imagine that writing a custom wrapper would work, but I would like to know whether such an implementation already exists as a library, or whether another solution to the problem is available and just as feasible. My searches so far have yielded nothing of this sort. I am also willing to accept solutions specific to h5py.
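
For illustration, a minimal sketch of the kind of wrapper I have in mind (the name LazyConcat is made up, and it only handles non-empty, step-1 slices along the first axis; files and np are from the snippet above):

import numpy as np

class LazyConcat:
    # Read-only stand-in for the concatenation of several datasets along
    # axis 0; data is read only when __getitem__ is called.
    def __init__(self, datasets):
        self.datasets = list(datasets)
        self.offsets = np.cumsum([d.shape[0] for d in self.datasets])  # end offset of each dataset
        self.shape = (int(self.offsets[-1]),) + self.datasets[0].shape[1:]

    def __getitem__(self, key):
        first = key[0] if isinstance(key, tuple) else key
        rest = key[1:] if isinstance(key, tuple) else ()
        start, stop, step = first.indices(self.shape[0])
        assert step == 1, "sketch only supports contiguous step-1 slices"
        parts, prev_end = [], 0
        for d, end in zip(self.datasets, self.offsets):
            end = int(end)
            lo, hi = max(start, prev_end), min(stop, end)
            if lo < hi:
                # translate global row numbers into this dataset's local rows
                parts.append(d[(slice(lo - prev_end, hi - prev_end),) + rest])
            prev_end = end
        return parts[0] if len(parts) == 1 else np.concatenate(parts, axis=0)

X = LazyConcat([f['data'] for f in files])
chunk = X[0:100, :, :, :]   # only these 100 rows are read from disk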

Answer

flist = [f['data'] for f in files] is a list of dataset objects. The actual data lives in the h5 files and is accessible as long as those files remain open.

When you do

arr = np.concatenate(flist, axis=0)

I'm guessing that concatenate first does

temp = [np.asarray(a) for a in flist]

that is, it constructs a list of numpy arrays. I assume np.asarray(f['data']) is the same as f['data'].value or f['data'][:] (as I discussed 2 yrs ago in the linked SO question). I should do some time tests comparing that with

arr = np.concatenate([a.value for a in flist], axis=0)
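
A rough, single-run timing sketch for that comparison (assumptions: files is the list of open h5py.File objects from the question, and d[:] is used in place of the older d.value):

import time
import numpy as np

def timed(label, func):
    # crude one-shot timer, enough for an order-of-magnitude comparison
    t0 = time.perf_counter()
    out = func()
    print(f"{label}: {time.perf_counter() - t0:.3f} s")
    return out

flist = [f['data'] for f in files]
a1 = timed("concatenate(flist)", lambda: np.concatenate(flist, axis=0))
a2 = timed("concatenate([d[:]...])", lambda: np.concatenate([d[:] for d in flist], axis=0))
assert np.array_equal(a1, a2)   # both approaches should give the same array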

flist is a kind of lazy compilation of these datasets, in that the data still resides in the files and is accessed only when you do something more.

[a.value[:,:,:10] for a in flist]

would load a portion of each of those data sets into memory; I expect that a concatenate on that list would be the equivalent of arr[:,:,:10].
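
A quick sanity check of that claim (assuming arr and flist from above; a[:, :, :10] slices the dataset directly and returns the same values as a.value[:, :, :10]):

part = np.concatenate([a[:, :, :10] for a in flist], axis=0)
assert np.array_equal(part, arr[:, :, :10])   # same as slicing the full concatenation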

Generators or generator expressions are a form of lazy evaluation, but I think they have to be turned into lists before being used in concatenate. In any case, the result of concatenate is always an array with all the data in one contiguous block of memory. It is never blocks of data residing in the files.

You need to tell us more about what you intend to do with this large concatenated collection of datasets. In outline, I think you can construct arrays that contain slices of all the datasets. You could also perform other actions, as I demonstrated in the previous answer, but with an access-time cost.
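
As a sketch of that idea, assuming the goal is the sequential [0:100,:,:,:], [100:200,:,:,:], ... reads from the question, a generator like the following (iter_chunks and process are hypothetical names) walks the virtual concatenation in fixed-size blocks without ever building the full array:

import numpy as np

def iter_chunks(datasets, chunk_len=100):
    # Yield consecutive blocks of chunk_len rows from the virtual
    # concatenation of the datasets along axis 0 (last block may be shorter).
    buf, have = [], 0
    for d in datasets:
        pos = 0
        while pos < d.shape[0]:
            take = min(chunk_len - have, d.shape[0] - pos)
            buf.append(d[pos:pos + take])   # reads only this slice from disk
            have += take
            pos += take
            if have == chunk_len:
                yield buf[0] if len(buf) == 1 else np.concatenate(buf, axis=0)
                buf, have = [], 0
    if buf:
        yield buf[0] if len(buf) == 1 else np.concatenate(buf, axis=0)

for block in iter_chunks([f['data'] for f in files], chunk_len=100):
    process(block)   # placeholder for the per-chunk work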
