How to compress lists/nested lists in HDF5

Question

I recently learned about HDF5 compression and have started working with it. It has some advantages over .npz/.npy when working with very large files. I tried it out on a small list, since I sometimes work with lists of strings like the following:

def write():
    test_array = ['a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2']
    

    with  h5py.File('example_file.h5', 'w') as f:
        f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9) 
        f.close()
    

However, I got this error:

f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py", line 136, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/dataset.py", line 118, in make_new_dset
    tid = h5t.py_create(dtype, logical=1)
  File "h5py/h5t.pyx", line 1634, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1656, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1689, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1508, in h5py.h5t._c_string
ValueError: Size must be positive (size must be positive)

After searching the net for hours for a better way to do this, I couldn't find one. Is there a better way to compress lists with HDF5?

Recommended answer

This is a more general answer for nested lists where each sublist has a different length. It also works for the simpler case where all sublists have the same length. There are two solutions: one with h5py and one with PyTables.
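For the flat list in the question itself, no padding is needed. The sketch below is one possible approach, not the original poster's code: it converts the list to a NumPy array of fixed-length byte strings first, so that the dataset gets a string dtype with a real length. The error in the question appears to come from combining a single repr() string with the bare dtype='S', which carries no length. The temporary file path and the name example_flat.h5 are assumptions made for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

test_array = ['a1', 'a2'] * 12

# Let NumPy pick the fixed string width (here 2 bytes per element).
arr = np.array(test_array, dtype='S')

# Hypothetical location: a scratch directory so the sketch leaves no files behind.
path = os.path.join(tempfile.mkdtemp(), 'example_flat.h5')

with h5py.File(path, 'w') as f:
    f.create_dataset('test3', data=arr, compression='gzip', compression_opts=9)

# Read the dataset back and decode the bytes into Python strings.
with h5py.File(path, 'r') as f:
    restored = [s.decode('ascii') for s in f['test3'][:]]
```

With only 24 short strings the gzip filter has little to work on; the benefit shows up on much larger arrays with repetitive content.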

h5py example
An h5py dataset with a fixed-length string dtype must be rectangular, so a ragged nested list has to be padded: size the array to the longest sublist and fill the shorter sublists with extra elements. Each array position that has no corresponding value in the nested list will hold 'None' (or a truncated piece of it, when the string width is less than 4). Take care with the dtype= entry. The code below finds the length of the longest string in the list (slen) and uses it to build the dtype 'S<slen>'.

import h5py
import numpy as np

test_list = [['a01','a02','a03','a04','a05','a06'], 
             ['a11','a12','a13','a14','a15','a16','a17'], 
             ['a21','a22','a23','a24','a25','a26','a27','a28']]

# arrlen and test_array from answer to SO #10346336 - Option 3:
# Ref: https://stackoverflow.com/a/26224619/10462884    
slen = max(len(item) for sublist in test_list for item in sublist)
arrlen = max(map(len, test_list))
test_array = np.array([tl+[None]*(arrlen-len(tl)) for tl in test_list], dtype='S'+str(slen))
  
with h5py.File('example_nested.h5', 'w') as f:
     f.create_dataset('test3', data=test_array, compression='gzip')
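The padding step in the h5py example above can be made reversible. The following is a minimal sketch that pads with empty strings instead of None, so the filler is easy to recognise and drop after reading. The helper names pad_nested and unpad_nested are hypothetical, and the approach assumes the real data contains no empty strings and only ASCII text.

```python
import numpy as np


def pad_nested(nested, fill=''):
    """Pad a ragged list of string lists into a rectangular 'S' array."""
    slen = max(len(item) for sub in nested for item in sub)  # longest string
    arrlen = max(map(len, nested))                           # longest sublist
    rows = [sub + [fill] * (arrlen - len(sub)) for sub in nested]
    return np.array(rows, dtype='S' + str(slen))


def unpad_nested(padded):
    """Drop the empty filler cells and decode bytes back to str."""
    return [[cell.decode('ascii') for cell in row if cell != b'']
            for row in padded]


test_list = [['a01', 'a02'], ['a11', 'a12', 'a13']]
padded = pad_nested(test_list)
restored = unpad_nested(padded)
```

The padded array can be passed as data= to create_dataset in the same way as test_array in the example above, and unpad_nested can be applied to the array read back from the file.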

PyTables example
PyTables supports ragged 2-D arrays as VLArrays (variable length). This avoids the complication of padding the shorter sublists with 'None' values. You also don't have to determine the array length in advance, because the number of rows is not defined when the VLArray is created (rows are appended after creation). Again, take care with the dtype= entry; it is built the same way as above.

import tables as tb
import numpy as np

test_list = [['a01','a02','a03','a04','a05','a06'], 
             ['a11','a12','a13','a14','a15','a16','a17'], 
             ['a21','a22','a23','a24','a25','a26','a27','a28']]
   
slen = max(len(item) for sublist in test_list for item in sublist)

with tb.open_file('example_nested_tb.h5', 'w') as h5f:
    # one variable-length row per sublist; each element is a fixed-width string
    vlarray = h5f.create_vlarray('/', 'vla_test', tb.StringAtom(slen))
    for slist in test_list:
        arr = np.array(slist,dtype='S'+str(slen))
        vlarray.append(arr)

    print('-->', vlarray.name)
    for row in vlarray:
        print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, row))
