How to create variable-length columns in an HDF5 file?

Question

I am using the h5py package to create an HDF5 file for my training set.

I want the first column to have variable length: for example, [1,2,3] as the first entry in the column, [1,2,3,4,5] as the second entry, and so on. The other 5 columns in the same dataset should have data type int and a fixed length, i.e. 1.

I tried the following code for this scenario:

dt = h5py.special_dtype(vlen=np.dtype('int32'))
datatype = np.dtype([('FieldA', dt), ('FieldB', dt1), ('FieldC', dt1), ('FieldD', dt1), ('FieldE', dt1), ('FieldF', dt1)])

But in the output, I got only an empty array for each of the columns listed above.

And when I tried the following code:

dt = h5py.special_dtype(vlen=np.dtype('int32'))
data = db.create_dataset("data1", (5000,), dtype=dt)

This gives me a dataset with only one column of variable-length entries, but I want all 6 columns in the same dataset, with the first column holding variable-length entries as described above.

I am totally confused about how to solve this. Any help would be highly appreciated.

Answer

Do you want variable-length (ragged) columns, or just a column that can hold an array of data (up to the dtype limit)? The second is pretty straightforward. See the code below. (It's a simple example with 2 fields to demonstrate the method.)

import h5py
import numpy as np

# FieldA holds a fixed-size array of 4 int32 values; FieldB holds one int32.
my_dt = np.dtype([('FieldA', 'int32', (4,)), ('FieldB', 'int32')])

with h5py.File('SO_57260167.h5', 'w') as h5f:
    data = h5f.create_dataset("testdata", (10,), dtype=my_dt)

    for cnt in range(10):
        arr = np.random.randint(1, 1000, size=4)
        print(arr)
        # Write one field of one row at a time.
        data[cnt, 'FieldA'] = arr
        data[cnt, 'FieldB'] = arr[0]
        print(data[cnt]['FieldB'])
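
If the entries have different lengths but a known upper bound, one way to use a fixed-size array field like FieldA is to pad each entry and store its true length next to it. The following is only a sketch of that idea using NumPy alone; MAX_LEN, PAD, the FieldA_len field, and the helpers pack_rows and unpack_entry are hypothetical names that are not part of the original answer.

```python
import numpy as np

MAX_LEN = 4   # assumed upper bound on the length of one entry
PAD = -1      # assumed padding value that never occurs in real data

# Same layout idea as my_dt above, plus a hypothetical length field.
padded_dt = np.dtype([('FieldA', 'int32', (MAX_LEN,)),
                      ('FieldA_len', 'int32'),
                      ('FieldB', 'int32')])


def pack_rows(entries, field_b):
    """Pad each variable-length entry to MAX_LEN and record its true length."""
    rows = np.zeros(len(entries), dtype=padded_dt)
    rows['FieldA'] = PAD
    for i, (entry, b) in enumerate(zip(entries, field_b)):
        if len(entry) > MAX_LEN:
            raise ValueError(f"entry {i} is longer than MAX_LEN={MAX_LEN}")
        rows['FieldA'][i, :len(entry)] = entry
        rows['FieldA_len'][i] = len(entry)
        rows['FieldB'][i] = b
    return rows


def unpack_entry(row):
    """Return the unpadded values stored in one row."""
    return row['FieldA'][:row['FieldA_len']]


rows = pack_rows([[1, 2, 3], [7], [4, 5, 6, 8]], [10, 20, 30])
print(unpack_entry(rows[0]))
```

The trade-off is wasted space for short entries and a hard limit on entry length, so this only fits data whose maximum length is known in advance.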

If you want a variable-length ("ragged") column, I'm 99% sure you are limited to a single column when using the special dtype in a dataset. Also, I don't think you can name the fields/columns. (I couldn't get it to work, and couldn't find any examples.)
The code below modifies the example above to put the variable-length column data in dataset vl_data and the rest of the integer data in dataset fx_data.

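import h5py
import numpy as np

# Variable-length int32 entries go in their own dataset.
vl_dt = h5py.special_dtype(vlen=np.dtype('int32'))
# The five fixed-length integer columns share one compound dtype.
my_dt = np.dtype([('FieldB', 'int32'), ('FieldC', 'int32'), ('FieldD', 'int32'),
                  ('FieldE', 'int32'), ('FieldF', 'int32')])

with h5py.File('SO_57260167_vl.h5', 'w') as h5f:
    vl_data = h5f.create_dataset("testdata_vl", (10,), dtype=vl_dt)
    fx_data = h5f.create_dataset("testdata", (10,), dtype=my_dt)

    for cnt in range(10):
        # Row cnt gets an array of length cnt+2, so every row differs in length.
        arr = np.random.randint(1, 1000, size=cnt+2)
#        print(arr)
        vl_data[cnt] = arr
        print(vl_data[cnt])
        # Only FieldB and FieldF are written in this demo.
        fx_data[cnt, 'FieldB'] = arr[0]
        fx_data[cnt, 'FieldF'] = arr[-1]
        print(fx_data[cnt])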
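
With the data split across two datasets, row i of one corresponds to row i of the other, so a logical row can be reassembled by reading both at the same index. The following is a minimal sketch of that read-back step. It reuses the dataset names from the code above, but the read_row helper, the two-field dtype, the file name, and the in-memory "core" driver (used so nothing is left on disk) are assumptions made for illustration.

```python
import h5py
import numpy as np

vl_dt = h5py.special_dtype(vlen=np.dtype('int32'))
fx_dt = np.dtype([('FieldB', 'int32'), ('FieldC', 'int32')])


def read_row(h5f, idx):
    """Return (variable-length array, fixed-width record) for one logical row."""
    return h5f['testdata_vl'][idx], h5f['testdata'][idx]


# Build the fixed-width part as a NumPy structured array first.
fixed_rows = np.zeros(3, dtype=fx_dt)

# driver='core' with backing_store=False keeps the file in memory only.
with h5py.File('sketch_vl.h5', 'w', driver='core', backing_store=False) as h5f:
    vl_data = h5f.create_dataset('testdata_vl', (3,), dtype=vl_dt)
    for i in range(3):
        arr = np.arange(1, i + 3, dtype='int32')   # lengths 2, 3, 4
        vl_data[i] = arr
        fixed_rows['FieldB'][i] = arr[0]
        fixed_rows['FieldC'][i] = arr[-1]
    h5f.create_dataset('testdata', data=fixed_rows)

    ragged, fixed = read_row(h5f, 2)
    row_lengths = [len(vl_data[i]) for i in range(3)]

print(ragged, fixed['FieldB'], fixed['FieldC'], row_lengths)
```

Nothing in the file enforces that the two datasets stay the same length, so the code that writes them has to keep them in step.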
