Python/PyTables: Is it possible to have different data types for different columns of an array?

Question

I create an expandable EArray with Nx4 columns. Some columns require the float64 datatype; the others can be managed with int32. Is it possible to vary the data types among the columns? Right now I just use one (float64, below) for all of them, but it takes huge disk space for large (>10 GB) files.

For example, how can I ensure that the elements of columns 1-2 are int32 and those of columns 3-4 are float64?

import tables
f1 = tables.open_file("table.h5", "w")
a = f1.create_earray(f1.root, "dataset_1", atom=tables.Float64Atom(), shape=(0, 4))  # one dtype (float64) for all 4 columns

Here is a simplified version of how I am appending using the EArray:

import numpy as np

# counter, s, length, left and right are maintained by the surrounding
# read loop (elided here); chunk2 is the input np.ndarray and a is the
# EArray created above.
Matrix = np.ones(shape=(10**6, 4))

if counter <= 10**6:  # keep appending to Matrix until 10**6 rows
    Matrix[s:s+length, 0:4] = chunk2[left:right]
    s += length

# save to disk when rows = 10**6
if counter > 10**6:
    a.append(Matrix[:s])
    del Matrix
    Matrix = np.ones(shape=(10**6, 4))

What would be the downsides of the following approach?

import tables as tb
import numpy as np

filename = 'foo.h5'
f = tb.open_file(filename, mode='w')
int_app = f.create_earray(f.root, "col1", atom=tb.Int32Atom(), shape=(0,2), chunkshape=(3,2))
float_app = f.create_earray(f.root, "col2", atom=tb.Float64Atom(), shape=(0,2), chunkshape=(3,2))

# array containing ints... in reality it will be 10**6 x 2
arr1 = np.array([[1, 1],
                 [2, 2],
                 [3, 3]], dtype=np.int32)

# array containing floats... in reality it will be 10**6 x 2
arr2 = np.array([[1.1,1.2],
                 [1.1,1.2],
                 [1.1,1.2]], dtype=np.float64)

for i in range(3):
    int_app.append(arr1)
    float_app.append(arr2)

f.close()

print('\n*********************************************************')
print("\t\t Reading Now=> ")
print('*********************************************************')
c = tb.open_file('foo.h5', mode='r')
chunks1 = c.root.col1
chunks2 = c.root.col2
chunk1 = chunks1.read()
chunk2 = chunks2.read()
print(chunk1)
print(chunk2)
c.close()

Answer

No and yes. All PyTables array types (Array, CArray, EArray, VLArray) hold homogeneous datatypes (similar to a NumPy ndarray). If you want to mix datatypes, you need to use a Table. Tables are extendable; they have an .append() method to add rows of data.

The creation process is similar to this answer (only the dtype is different): PyTables create_array fails to save numpy array. You only define the datatypes for a row; you don't define the shape or number of rows, as that is implied when you add data to the table. If you already have your data in a NumPy recarray, you can reference it with the description= entry, and the Table will use its dtype for the table and populate it with the data. More info here: PyTables Table Class

Your code would look something like this:

import tables as tb
import numpy as np
table_dt = np.dtype(
           {'names': ['int1', 'int2', 'float1', 'float2'],
            'formats': [np.int32, np.int32, np.float64, np.float64]} )
# Create some random data:
i1 = np.random.randint(0,1000, (10**6,) )
i2 = np.random.randint(0,1000, (10**6,) )
f1 = np.random.rand(10**6)
f2 = np.random.rand(10**6)

with tb.File('table.h5', 'w') as h5f:
    a = h5f.create_table('/', 'dataset_1', description=table_dt)

    # Method 1: create an empty recarray 'Matrix', then add data:
    Matrix = np.recarray( (10**6,), dtype=table_dt)
    Matrix['int1'] = i1
    Matrix['int2'] = i2
    Matrix['float1'] = f1
    Matrix['float2'] = f2
    # Append Matrix to the table
    a.append(Matrix)

    # Method 2: create recarray 'Matrix' with data in 1 step:
    Matrix = np.rec.fromarrays([i1, i2, f1, f2], dtype=table_dt)
    # Append Matrix to the table (note: running both methods as written
    # appends the same data twice, for 2*10**6 rows total)
    a.append(Matrix)
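
Once the table is written, each field keeps its own dtype on disk. As a short usage sketch of reading it back (Table.col() and Table.read() are standard PyTables calls; the file and field names are the ones used above):

with tb.open_file('table.h5', 'r') as h5f:
    tbl = h5f.root.dataset_1
    ints = tbl.col('int1')       # returns an int32 array
    floats = tbl.col('float1')   # returns a float64 array
    first10 = tbl.read(0, 10)    # first 10 rows as a structured array
    print(ints.dtype, floats.dtype, first10.dtype)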

You mentioned creating a very large file, but did not say how many rows (obviously way more than 10**6). Here are some additional thoughts based on comments in other threads.

The .create_table() method has an optional parameter: expectedrows=. This parameter is used to optimize the HDF5 B-Tree and the amount of memory used. The default value is set in tables/parameters.py (look for EXPECTED_ROWS_TABLE; it's only 10000 in my installation). I highly suggest you set this to a larger value if you are creating 10**6 (or more) rows.
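
For example, a minimal sketch of setting it at creation time, reusing the dtype from above (the 10**7 row estimate is just an illustrative assumption):

import tables as tb
import numpy as np

table_dt = np.dtype({'names': ['int1', 'int2', 'float1', 'float2'],
                     'formats': [np.int32, np.int32, np.float64, np.float64]})

with tb.open_file('table.h5', 'w') as h5f:
    # A realistic row estimate lets PyTables pick better chunk sizes
    # and B-Tree parameters than the 10000-row default.
    a = h5f.create_table('/', 'dataset_1', description=table_dt,
                         expectedrows=10**7)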

Also, you should consider file compression. There's a trade-off: compression reduces the file size, but also reduces I/O performance (increases access time). There are a few options:

  1. Enable compression when you create the file (add the filters= parameter when you create the file). Start with tb.Filters(complevel=1); see the sketch after this list.
  2. Use the HDF Group utility h5repack - run it against an HDF5 file to create a new file (useful to go from uncompressed to compressed, or vice versa).
  3. Use the PyTables utility ptrepack - it works similarly to h5repack and is delivered with PyTables.
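
A minimal sketch of option 1, assuming the same table layout as above (complevel=1 with the default zlib compressor is just a starting point; tune complevel and complib to your data):

import tables as tb
import numpy as np

table_dt = np.dtype({'names': ['int1', 'int2', 'float1', 'float2'],
                     'formats': [np.int32, np.int32, np.float64, np.float64]})

# complevel ranges from 0 (off) to 9; higher levels trade speed for size.
filters = tb.Filters(complevel=1, complib='zlib')

with tb.open_file('table_zlib.h5', 'w') as h5f:
    a = h5f.create_table('/', 'dataset_1', description=table_dt,
                         filters=filters)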

I tend to work with uncompressed files for the best I/O performance. Then, when done, I convert to a compressed format for long-term archiving.
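
A hedged sketch of that conversion from Python (I am assuming tb.copy_file forwards the filters= keyword down to the copied nodes; the ptrepack command-line utility mentioned above is the more common way to do this):

import tables as tb

# Copy the working (uncompressed) file into a compressed archive copy.
# Assumption: filters= is passed through to every node that is copied.
tb.copy_file('table.h5', 'table_archive.h5', overwrite=True,
             filters=tb.Filters(complevel=5, complib='zlib'))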
