How to create variable length columns in an HDF5 file?
Problem description
I am using the h5py package to create an HDF5 file for my training set.
I want the first column to have a variable length: for example, [1,2,3] as the 1st entry in the column, [1,2,3,4,5] as the 2nd entry, and so on. The other 5 columns in the same dataset should have data type int and a fixed length, i.e. 1.
I tried the code below to handle this scenario:
dt = h5py.special_dtype(vlen=np.dtype('int32'))
datatype = np.dtype([('FieldA', dt), ('FieldB', dt1), ('FieldC', dt1), ('FieldD', dt1), ('FieldE', dt1), ('FieldF', dt1)])
But in the output I got only an empty array for each of the columns listed above in this dataset.
And, when I tried the below code:
dt = h5py.special_dtype(vlen=np.dtype('int32'))
data = db.create_dataset("data1", (5000,), dtype=dt)
This gives me a dataset with only one column of variable-length entries, but I want all 6 columns in the same dataset, with the 1st column holding variable-length entries as described above.
I am totally confused about how to solve this. Any help would be highly appreciated.
Do you want variable length (ragged) columns, or just a column that can hold an array of data (up to the dtype limit)? The second is pretty straightforward. See the code below. (It's a simple example with 2 fields to demonstrate the method.)
import h5py
import numpy as np

# Compound dtype: FieldA holds a fixed-size array of 4 ints, FieldB a single int.
my_dt = np.dtype([('FieldA', 'int32', (4,)), ('FieldB', 'int32')])
with h5py.File('SO_57260167.h5', 'w') as h5f:
    data = h5f.create_dataset("testdata", (10,), dtype=my_dt)
    for cnt in range(10):
        arr = np.random.randint(1, 1000, size=4)
        print(arr)
        # Write one field of one row at a time.
        data[cnt, 'FieldA'] = arr
        data[cnt, 'FieldB'] = arr[0]
        print(data[cnt]['FieldB'])
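If the rows are ragged but have a known maximum length, one way to reuse the fixed-size field above is to pad each row and store its true length in an extra field. This is only a rough sketch under assumptions that are not in the original answer: a maximum length of 5, zero as the padding value, a hypothetical `Length` field, a hypothetical helper `pack_rows`, and an in-memory file so nothing is written to disk.

```python
import h5py
import numpy as np

MAX_LEN = 5  # assumed upper bound on row length (not from the original post)

# Hypothetical layout: padded array field, true row length, one plain int field.
padded_dt = np.dtype([('FieldA', 'int32', (MAX_LEN,)),
                      ('Length', 'int32'),
                      ('FieldB', 'int32')])


def pack_rows(rows):
    """Pack non-empty ragged integer rows into a structured array, zero-padding FieldA."""
    out = np.zeros(len(rows), dtype=padded_dt)
    for i, row in enumerate(rows):
        if len(row) > MAX_LEN:
            raise ValueError(f"row {i} is longer than MAX_LEN={MAX_LEN}")
        out['FieldA'][i, :len(row)] = row
        out['Length'][i] = len(row)
        out['FieldB'][i] = row[0]
    return out


rows = [[1, 2, 3], [1, 2, 3, 4, 5]]
packed = pack_rows(rows)

# driver='core' with backing_store=False keeps the file in memory only.
with h5py.File('padded_demo.h5', 'w', driver='core', backing_store=False) as h5f:
    data = h5f.create_dataset('testdata', data=packed)
    stored = data[:]

# Slice off the padding using the stored length.
recovered = [rec['FieldA'][:rec['Length']].tolist() for rec in stored]
print(recovered)
```

The trade-off is wasted space for short rows, and it only works if you can bound the row length in advance.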
If you want a variable length ("ragged") column, I'm 99% sure you are limited to a single column when using the special dtype in a dataset. Also, I don't think you can name the fields/columns. (I couldn't get it to work, and couldn't find any examples.)
The code below shows the example above, modified to put the variable-length column data in dataset vl_data and the rest of the integer data in dataset fx_data.
import h5py
import numpy as np

# Variable-length int32 dtype for the ragged column.
vl_dt = h5py.special_dtype(vlen=np.dtype('int32'))
# Compound dtype for the five fixed-length integer columns.
my_dt = np.dtype([('FieldB', 'int32'), ('FieldC', 'int32'), ('FieldD', 'int32'),
                  ('FieldE', 'int32'), ('FieldF', 'int32')])
with h5py.File('SO_57260167_vl.h5', 'w') as h5f:
    vl_data = h5f.create_dataset("testdata_vl", (10,), dtype=vl_dt)
    fx_data = h5f.create_dataset("testdata", (10,), dtype=my_dt)
    for cnt in range(10):
        # Row length grows with cnt, so every row has a different size.
        arr = np.random.randint(1, 1000, size=cnt + 2)
        # print(arr)
        vl_data[cnt] = arr
        print(vl_data[cnt])
        fx_data[cnt, 'FieldB'] = arr[0]
        fx_data[cnt, 'FieldF'] = arr[-1]
        print(fx_data[cnt])
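With the data split across two datasets, row i of the variable-length dataset and row i of the fixed-length dataset together make up one logical record. Here is a minimal sketch of reading them back as a pair. The helper `read_row` is hypothetical, and the example is trimmed to two fixed fields, deterministic data, and an in-memory file to keep it short.

```python
import h5py
import numpy as np

vl_dt = h5py.special_dtype(vlen=np.dtype('int32'))
fx_dt = np.dtype([('FieldB', 'int32'), ('FieldF', 'int32')])


def read_row(h5f, idx):
    """Combine row idx of both datasets into one dict (hypothetical helper)."""
    row = {'FieldA': h5f['testdata_vl'][idx]}  # variable-length int32 array
    fixed = h5f['testdata'][idx]               # one compound record
    for name in fixed.dtype.names:
        row[name] = int(fixed[name])
    return row


# driver='core' with backing_store=False keeps the file in memory only.
with h5py.File('paired_demo.h5', 'w', driver='core', backing_store=False) as h5f:
    vl_data = h5f.create_dataset('testdata_vl', (3,), dtype=vl_dt)
    fx_data = h5f.create_dataset('testdata', (3,), dtype=fx_dt)
    for cnt in range(3):
        arr = np.arange(1, cnt + 3, dtype='int32')  # lengths 2, 3, 4
        vl_data[cnt] = arr
        fx_data[cnt, 'FieldB'] = arr[0]
        fx_data[cnt, 'FieldF'] = arr[-1]
    second = read_row(h5f, 1)

print(second['FieldA'], second['FieldB'], second['FieldF'])
```

Keeping both datasets the same length and always indexing them together is what ties the "columns" back into one table. Nothing in HDF5 enforces that link, so it is a convention your own code has to maintain.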