python h5py: can I store a dataset whose columns have different types?


Question

Suppose I have a table with many columns. Only a few columns are float type; the others are small integers, for example:

col1, col2, col3, col4
1.31   1      2     3
2.33   3      5     4
...

How can I store this efficiently? If I use np.float32 for the whole dataset, storage is wasted, because the other columns only hold small integers and don't need that much space. If I use np.int16, the float column is no longer exact, which is not what I want either. How do I deal with a situation like this?

Suppose I also have a string column, which confuses me even more. How should I store the data then?

col1, col2, col3, col4, col5
1.31   1      2     3    "a"
2.33   3      5     4    "b"
...

To make things simpler, let's suppose the string column holds fixed-length strings only, for example of length 3.

Recommended answer

I'm going to demonstrate the structured array approach:

I'm guessing you are starting with a csv file 'table'. If not, it's still the easiest way to turn your sample into an array:

In [40]: txt = '''col1, col2, col3, col4, col5
    ...: 1.31   1      2     3    "a"
    ...: 2.33   3      5     4    "b"
    ...: '''


In [42]: data = np.genfromtxt(txt.splitlines(), names=True, dtype=None, encoding=None)

In [43]: data
Out[43]: 
array([(1.31, 1, 2, 3, '"a"'), (2.33, 3, 5, 4, '"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', '<U3')])

With these parameters, genfromtxt takes care of creating a structured array. Note that it is a 1d array with 5 fields. The dtype of each field is determined from the data.
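To make the "1d array with fields" idea concrete, here is a minimal sketch that builds the same kind of array by hand instead of parsing text. The dtype mirrors the genfromtxt output above; building it manually is only an illustration, not part of the original session.

```python
import numpy as np

# Hand-built equivalent of the genfromtxt result (illustrative only).
dt = np.dtype([('col1', 'f8'), ('col2', 'i8'), ('col3', 'i8'),
               ('col4', 'i8'), ('col5', 'U3')])
data = np.array([(1.31, 1, 2, 3, '"a"'), (2.33, 3, 5, 4, '"b"')], dtype=dt)

print(data.shape)         # one record per table row
print(data.dtype.names)   # field names act as column labels
print(data['col1'])       # select a whole column by field name
print(data[0])            # select a whole record by index
```

Each record is stored as one packed block of bytes, so every field keeps its own type instead of the whole table being forced into a single dtype.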

In [44]: import h5py
...

In [46]: f = h5py.File('struct.h5', 'w')

In [48]: ds = f.create_dataset('data',data=data)
...
TypeError: No conversion path for dtype: dtype('<U3')

But h5py cannot save NumPy's fixed-width unicode strings (the 'U' dtype, which is the default string type in py3). There may be ways around that, but here it is simpler to convert the string dtype to bytestrings. Besides, that is more compact: 'S3' takes 3 bytes per value, while '<U3' takes 12.

To convert it, I'll make a new dtype and use astype. Alternatively, I could specify the dtypes in the genfromtxt call.

In [49]: data.dtype
Out[49]: dtype([('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', '<U3')])

In [50]: data.dtype.descr
Out[50]: 
[('col1', '<f8'),
 ('col2', '<i8'),
 ('col3', '<i8'),
 ('col4', '<i8'),
 ('col5', '<U3')]

In [51]: dt1 = data.dtype.descr

In [52]: dt1[-1] = ('col5', 'S3')

In [53]: data.astype(dt1)
Out[53]: 
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])
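The manual edit of the last field can be generalized. The following is a rough sketch of a helper, with the hypothetical name unicode_fields_to_bytes, that replaces every 'U' field of a structured array with an 'S' field of the same character length. It assumes the strings are plain ASCII; astype will raise an error on characters that cannot be encoded.

```python
import numpy as np

def unicode_fields_to_bytes(arr):
    """Return a copy of a structured array with every 'U' field cast to 'S'.

    Hypothetical helper for illustration; assumes ASCII-only strings.
    """
    new_descr = []
    for name in arr.dtype.names:
        field = arr.dtype[name]
        if field.kind == 'U':
            # A 'U' field uses 4 bytes per character, an 'S' field uses 1.
            new_descr.append((name, 'S%d' % (field.itemsize // 4)))
        else:
            new_descr.append((name, field))
    return arr.astype(new_descr)

# Small made-up sample with one unicode column.
sample = np.array([(1.31, 1, 'a'), (2.33, 3, 'b')],
                  dtype=[('col1', 'f8'), ('col2', 'i8'), ('col5', 'U3')])
converted = unicode_fields_to_bytes(sample)
print(converted.dtype)
```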

Now the array is saved without problems:

In [54]: data1 = data.astype(dt1)

In [55]: data1
Out[55]: 
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])

In [56]: ds = f.create_dataset('data',data=data1)

In [57]: ds
Out[57]: <HDF5 dataset "data": shape (2,), type "|V35">

In [58]: ds[:]
Out[58]: 
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])
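The other route mentioned earlier, giving the dtype to genfromtxt directly so that no astype step is needed, might look like the sketch below. It skips the header line with skip_header=1 and takes the field names from the dtype instead; that choice is an assumption made here for simplicity, not something the original session did.

```python
import numpy as np

txt = '''col1, col2, col3, col4, col5
1.31   1      2     3    "a"
2.33   3      5     4    "b"
'''

# Same layout as the converted array above: 8 + 8 + 8 + 8 + 3 = 35 bytes.
dt = np.dtype([('col1', 'f8'), ('col2', 'i8'), ('col3', 'i8'),
               ('col4', 'i8'), ('col5', 'S3')])

# skip_header=1 drops the header row; names come from dt instead.
data = np.genfromtxt(txt.splitlines(), skip_header=1, dtype=dt, encoding=None)

print(data.dtype)
print(data['col5'])
```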

I could make further modifications, shortening one or more of the int fields:

In [60]: dt1[1] = ('col2','i2')    
In [61]: dt1[2] = ('col3','i2')

In [62]: dt1
Out[62]: 
[('col1', '<f8'),
 ('col2', 'i2'),
 ('col3', 'i2'),
 ('col4', '<i8'),
 ('col5', 'S3')]

In [63]: data1 = data.astype(dt1)

In [64]: data1
Out[64]: 
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i2'), ('col3', '<i2'), ('col4', '<i8'), ('col5', 'S3')])

In [65]: ds1 = f.create_dataset('data1',data=data1)

ds1 has more compact storage, 'V23' vs 'V35': each record takes 23 bytes (8+2+2+8+3) instead of 35 (8+8+8+8+3).

In [67]: ds1
Out[67]: <HDF5 dataset "data1": shape (2,), type "|V23">

In [68]: ds1[:]
Out[68]: 
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i2'), ('col3', '<i2'), ('col4', '<i8'), ('col5', 'S3')])
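Putting the pieces together, a self-contained version of the whole round trip could look like the following sketch. The file name struct_demo.h5 and the temporary directory are made up for the example, and shrinking col4 to 'i2' as well is an assumption that its values fit in a 16-bit integer.

```python
import os
import tempfile

import h5py
import numpy as np

# Compact record layout: 8 + 2 + 2 + 2 + 3 = 17 bytes per row.
# Using 'i2' for col4 too assumes its values fit in int16.
dt = np.dtype([('col1', 'f8'), ('col2', 'i2'), ('col3', 'i2'),
               ('col4', 'i2'), ('col5', 'S3')])
data = np.array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')], dtype=dt)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name inside a throwaway directory.
    path = os.path.join(tmp, 'struct_demo.h5')

    # Write the structured array as a single compound dataset.
    with h5py.File(path, 'w') as f:
        f.create_dataset('data', data=data)

    # Read it back, once as a whole table and once as a single column.
    with h5py.File(path, 'r') as f:
        ds = f['data']
        restored = ds[:]      # whole table as a structured array
        col1 = ds['col1']     # one column, read by field name

print(restored.dtype.itemsize)
print(col1)
```

Reading a single field by name avoids loading the other columns into the result, which can be convenient when the table has many columns.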
