h5py: how to read selected rows of an hdf5 file?

Question

Is it possible to read a given set of rows from an hdf5 file without loading the whole file? I have quite big hdf5 files with lots of datasets. Here is an example of what I had in mind to reduce time and memory usage:

#! /usr/bin/env python

import numpy as np
import h5py

infile = 'field1.87.hdf5'
f = h5py.File(infile,'r')
group = f['Data']

mdisk = group['mdisk'].value

val = 2.*pow(10.,10.)
ind = np.where(mdisk>val)[0]

m = group['mcold'][ind]
print m

ind doesn't give consecutive rows but rather scattered ones.

The above code fails, but it follows the standard way of slicing an hdf5 dataset. The error message I get is:

Traceback (most recent call last):
  File "./read_rows.py", line 17, in <module>
    m = group['mcold'][ind]
  File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/dataset.py", line 425, in __getitem__
    selection = sel.select(self.shape, args, dsid=self.id)
  File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/selections.py", line 71, in select
    sel[arg]
  File "/cosma/local/Python/2.7.3/lib/python2.7/site-packages/h5py-2.3.1-py2.7-linux-x86_64.egg/h5py/_hl/selections.py", line 209, in __getitem__
    raise TypeError("PointSelection __getitem__ only works with bool arrays")
TypeError: PointSelection __getitem__ only works with bool arrays

Recommended answer

I have a sample h5py file with:

data = f['data']
#  <HDF5 dataset "data": shape (3, 6), type "<i4">
# is arange(18).reshape(3,6)
ind=np.where(data[:]%2)[0]
# array([0, 0, 0, 1, 1, 1, 2, 2, 2], dtype=int32)
data[ind]  # getitem only works with boolean arrays error
data[ind.tolist()] # can't read data (Dataset: Read failed) error

This last error is caused by repeated values in the list.

But indexing with a list of unique values works fine:

In [150]: data[[0,2]]
Out[150]: 
array([[ 0,  1,  2,  3,  4,  5],
       [12, 13, 14, 15, 16, 17]])

In [151]: data[:,[0,3,5]]
Out[151]: 
array([[ 0,  3,  5],
       [ 6,  9, 11],
       [12, 15, 17]])

So does an array with the proper dimension slicing:

In [157]: data[ind[[0,3,6]],:]
Out[157]: 
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17]])
In [165]: f['data'][:2,np.array([0,3,5])]
Out[165]: 
array([[ 0,  3,  5],
       [ 6,  9, 11]])
In [166]: f['data'][[0,1],np.array([0,3,5])]
# error about only one indexing array allowed
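
Since only one indexing list is accepted per selection, one possible workaround is to let h5py select the rows and then pick the columns from the resulting in-memory NumPy array. The sketch below assumes a throwaway file (the name `sample.h5` is made up) holding the same `arange(18).reshape(3, 6)` dataset as in the session above.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file holding the same 3x6 dataset as the example above.
path = os.path.join(tempfile.mkdtemp(), "sample.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=np.arange(18, dtype="i4").reshape(3, 6))

with h5py.File(path, "r") as f:
    data = f["data"]
    rows = data[[0, 1]]       # h5py selection with a single index list
    sub = rows[:, [0, 3, 5]]  # column selection on the NumPy array, in memory

print(sub)
```

The second step works on an ordinary NumPy array, so it is no longer subject to the one-list restriction.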

So if the indexing is right (unique values that match the array dimensions), it should work.
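
If the index array does contain repeats, as `ind` does in the sample above, one way around the read error is to hand h5py only the sorted unique rows and rebuild the repeats afterwards in NumPy. This is a minimal sketch, not something from the h5py API: `read_rows` is a hypothetical helper, the file name is made up, and an empty index is not handled.

```python
import os
import tempfile

import h5py
import numpy as np


def read_rows(dset, rows):
    """Return the rows of `dset` named by a 1-D integer index.

    Hypothetical helper (not part of h5py). The index may contain
    repeats and need not be sorted.
    """
    rows = np.asarray(rows)
    # Sorted unique row numbers, plus the map back to the requested order.
    uniq, inverse = np.unique(rows, return_inverse=True)
    block = dset[uniq.tolist()]  # a single h5py read with a unique, sorted list
    return block[inverse]        # repeats are restored in memory


# Scratch file with the same 3x6 dataset as the sample above.
path = os.path.join(tempfile.mkdtemp(), "sample.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=np.arange(18, dtype="i4").reshape(3, 6))

with h5py.File(path, "r") as f:
    data = f["data"]
    ind = np.where(data[:] % 2)[0]  # each row number appears three times
    result = read_rows(data, ind)

expected = np.arange(18).reshape(3, 6)[ind]
print(np.array_equal(result, expected))
```

The helper trades one extra in-memory copy for a selection that h5py accepts.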

My simple example doesn't test how much of the array is loaded. The documentation sounds as though elements are selected from the file without loading the whole array into memory.
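
Applied to the layout in the question, the same idea might look like the sketch below. The group and dataset names (`Data`, `mdisk`, `mcold`) and the threshold come from the question; the file name and the random values are invented for illustration. `mdisk` is read with slicing rather than `.value`, which newer h5py releases no longer provide. Like the example above, it does not measure how much data is actually read from disk.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small stand-in file; the values are random placeholders.
path = os.path.join(tempfile.mkdtemp(), "field_sample.hdf5")
rng = np.random.default_rng(0)
with h5py.File(path, "w") as f:
    g = f.create_group("Data")
    g.create_dataset("mdisk", data=rng.uniform(0.0, 4e10, size=1000))
    g.create_dataset("mcold", data=rng.uniform(0.0, 1e9, size=1000))

with h5py.File(path, "r") as f:
    group = f["Data"]
    mdisk = group["mdisk"][:]           # the filter column is still read in full
    ind = np.flatnonzero(mdisk > 2e10)  # unique row numbers in increasing order
    m = group["mcold"][ind.tolist()]    # scattered rows selected through h5py
    mcold_all = group["mcold"][:]       # read only to check the result below

print(m.shape, np.array_equal(m, mcold_all[ind]))
```

Because `np.flatnonzero` on a 1-D condition returns each matching position once, the list passed to h5py has no repeats.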
