h5py: Correct way to slice array datasets


Question

I'm a bit confused here:

As far as I understand, h5py's .value method reads an entire dataset and dumps it into an array, which is slow and discouraged (it should generally be replaced by [()]). The correct way is to use numpy-esque slicing.
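For reference, a minimal sketch of the two whole-dataset reads mentioned above (the file and dataset names are illustrative); both return the data as a NumPy array:

import h5py
import numpy as np

with h5py.File("example.hdf5", "w") as f:
    dset = f.create_dataset("data", data=np.arange(10))
    whole = dset[()]   # preferred: reads the entire dataset into memory
    # dset.value did the same thing, but it is deprecated
    # (and was removed entirely in h5py 3.0)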

However, I'm getting irritating results (with h5py 2.2.1):

>>> import h5py
>>> import numpy as np
>>> file = h5py.File("test.hdf5",'w')
# Just fill a test file with a numpy array test dataset
>>> file["test"] = np.arange(0,300000)

# This is TERRIBLY slow?!
>>> file["test"][range(0,300000)]
array([     0,      1,      2, ..., 299997, 299998, 299999])
# This is fast
>>> file["test"].value[range(0,300000)]
array([     0,      1,      2, ..., 299997, 299998, 299999])
# This is also fast
>>> file["test"].value[np.arange(0,300000)]
array([     0,      1,      2, ..., 299997, 299998, 299999])
# This crashes
>>> file["test"][np.arange(0,300000)]

I guess that my dataset is so small that .value doesn't hinder performance significantly, but how can the first option be that slow? What is the preferred version here?

Thanks!

UPDATE: It seems that I wasn't clear enough, sorry. I do know that .value copies the whole dataset into memory while slicing only retrieves the appropriate subpart. What I'm wondering is why slicing in the file is slower than copying the whole array and then slicing in memory. I always thought hdf5/h5py was implemented specifically so that slicing subparts would always be the fastest.
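A rough timing sketch of the comparison in question (file and dataset names are illustrative, and absolute numbers will vary by machine and h5py version), contrasting fancy indexing on the dataset with a single bulk read followed by in-memory indexing:

import timeit
import h5py
import numpy as np

with h5py.File("test.hdf5", "w") as f:
    f["test"] = np.arange(300000)
    idx = list(range(300000))

    # Fancy indexing directly on the dataset: h5py builds the point
    # selection in Python, which is slow for long index lists.
    t_file = timeit.timeit(lambda: f["test"][idx], number=1)

    # One bulk read with [()], then indexing the in-memory NumPy array.
    t_mem = timeit.timeit(lambda: f["test"][()][idx], number=1)

    print(f"fancy indexing in file: {t_file:.2f}s")
    print(f"read whole, then index: {t_mem:.2f}s")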

Answer

For fast slicing with h5py, stick to the "plain-vanilla" slice notation:

file['test'][0:300000]

or, for example, reading every other element:

file['test'][0:300000:2]

Simple slicing (slice objects and single integer indices) should be very fast, as it translates directly into HDF5 hyperslab selections.

The expression file['test'][range(300000)] invokes h5py's version of "fancy indexing", namely, indexing via an explicit list of indices. There's no native way to do this in HDF5, so h5py implements a (slower) method in Python, which unfortunately has abysmal performance when the lists are > 1000 elements. Likewise for file['test'][np.arange(300000)], which is interpreted in the same way.
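One common workaround, sketched below with hypothetical index values: instead of handing h5py a long index list, read the enclosing contiguous range with one plain slice (a single hyperslab selection), then fancy-index the result in memory with NumPy:

import h5py
import numpy as np

idx = np.array([5, 17, 120, 4999])    # hypothetical target indices
with h5py.File("test.hdf5", "r") as f:
    lo, hi = idx.min(), idx.max() + 1
    block = f["test"][lo:hi]          # fast: one hyperslab read
    values = block[idx - lo]          # fast: NumPy fancy indexing

This trades extra I/O (the unwanted elements between lo and hi) for avoiding the per-index Python overhead, which is usually a win unless the indices are very sparse over a huge range.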

See also:

[1] http://docs.h5py.org/en/latest/high/dataset.html#fancy-indexing

[2] https://github.com/h5py/h5py/issues/293

