How do I lazily concatenate "numpy ndarray"-like objects for sequential reading?
Question
I have a list of several large hdf5 files, each with a 4D dataset. I would like to obtain a concatenation of them on the first axis, as in, an array-like object that would be used as if all datasets were concatenated. My final intent is to sequentially read chunks of the data along the same axis (e.g. [0:100,:,:,:], [100:200,:,:,:], ...), multiple times.
Datasets in h5py share a significant part of the numpy array API, which allows me to call numpy.concatenate to get the job done:
import h5py as h5
import numpy as np

files = [h5.File(name, 'r') for name in filenames]
X = np.concatenate([f['data'] for f in files], axis=0)
On the other hand, the memory layout is not the same, and memory cannot be shared among them (related question). Alas, concatenate will eagerly copy the entire content of each array-like object into a new array, which I cannot accept in my use case. The source code of the array concatenation function confirms this.
How can I obtain a concatenated view over multiple array-like objects, without eagerly reading them into memory? As far as this view is concerned, slicing and indexing it would behave just as if I had a concatenated array.
I can imagine that writing a custom wrapper would work, but I would like to know whether such an implementation already exists as a library, or whether another solution to the problem is available and just as feasible. My searches so far have yielded nothing of this sort. I am also willing to accept solutions specific to h5py.
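The custom wrapper mentioned above can be sketched roughly as follows. This is a minimal sketch, not an existing library: the class name `LazyConcat` is made up for illustration, it assumes all parts share the same trailing shape, and it only supports contiguous (step-1) slices on the first axis.

```python
import numpy as np

class LazyConcat:
    """Read-only stand-in for np.concatenate(parts, axis=0) that copies
    only the rows a slice actually touches. Works with any objects that
    expose .shape and numpy-style slicing (h5py datasets, np.memmap, ...)."""

    def __init__(self, parts):
        self.parts = list(parts)
        lengths = [p.shape[0] for p in self.parts]
        self.ends = np.cumsum(lengths)          # cumulative end offsets on axis 0
        self.starts = self.ends - lengths       # start offset of each part
        self.shape = (int(self.ends[-1]),) + tuple(self.parts[0].shape[1:])

    def __getitem__(self, key):
        # Split the key into the first-axis part and the rest of the axes.
        first, rest = (key[0], key[1:]) if isinstance(key, tuple) else (key, ())
        if not isinstance(first, slice):
            raise TypeError("only slices are supported on the first axis")
        lo, hi, step = first.indices(self.shape[0])
        if step != 1:
            raise ValueError("only step-1 slices are supported")
        pieces = []
        for part, s, e in zip(self.parts, self.starts, self.ends):
            a, b = max(lo, s), min(hi, e)
            if a < b:  # this part overlaps the requested range
                pieces.append(part[(slice(a - s, b - s),) + rest])
        if not pieces:
            return np.empty((0,) + self.shape[1:])
        return np.concatenate(pieces, axis=0)

# Demo with plain ndarrays standing in for h5py datasets:
x = np.arange(12).reshape(4, 3)
y = np.arange(12, 24).reshape(4, 3)
lc = LazyConcat([x, y])
ref = np.concatenate([x, y], axis=0)
```

With h5py datasets in `parts`, each `part[...]` indexing step reads only the requested rows from disk, so memory usage is bounded by the chunk being requested rather than by the full concatenation.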
Answer
flist = [f['data'] for f in files]

is a list of dataset objects. The actual data stays in the h5 files and is accessible as long as those files remain open.
When you do

arr = np.concatenate(flist, axis=0)

I imagine that concatenate first does

temp = [np.asarray(a) for a in flist]

that is, constructs a list of numpy arrays. I assume np.asarray(f['data']) is the same as f['data'].value or f['data'][:] (as I discussed 2 yrs ago in the linked SO question). I should do some time tests comparing that with

arr = np.concatenate([a.value for a in flist], axis=0)
flist is a kind of lazy compilation of these data sets, in that the data still resides in the file and is accessed only when you do something more.
[a.value[:,:,:10] for a in flist]
would load a portion of each of those data sets into memory; I expect that a concatenate on that list would be the equivalent of arr[:,:,:10].
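That equivalence is easy to check with plain ndarrays standing in for the datasets (and plain slicing in place of `.value`, which was removed in h5py 3.0): slicing the trailing axis of each part and concatenating gives the same result as slicing the full concatenation.

```python
import numpy as np

# Two stand-in "datasets" with the same trailing shape (4, 12).
parts = [np.arange(240).reshape(5, 4, 12),
         np.arange(240, 576).reshape(7, 4, 12)]

# Eager, full concatenation (what np.concatenate(flist) would build).
arr = np.concatenate(parts, axis=0)

# Slice each part first, then concatenate: only the sliced portion
# of each part ever needs to be loaded into memory.
sliced = np.concatenate([a[:, :, :10] for a in parts], axis=0)
```

Because slicing happens per part before concatenation, peak memory is the size of the slices, not of the full datasets.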
Generators or generator comprehensions are a form of lazy evaluation, but I think they have to be turned into lists before use in concatenate. In any case, the result of concatenate is always an array with all the data in a contiguous block of memory; it is never blocks of data residing in files.
You need to tell us more about what you intend to do with this large concatenated array of data sets. As outlined, I think you can construct arrays that contain slices of all the data sets. You could also perform other actions, as I demonstrated in the previous answer, but at an access-time cost.