pytables writes much faster than h5py. Why?


Problem Description

I noticed that writing .h5 files takes much longer if I use the h5py library instead of the pytables library. What is the reason? This is also true when the shape of the array is known in advance. Further, I use the same chunk size and no compression filter.

The following script:

import h5py
import tables
import numpy as np
from time import time

dim1, dim2 = 64, 1527416

# append columns
print("PYTABLES: append columns")
print("=" * 32)
f = tables.open_file("/tmp/test.h5", "w")
a = f.create_earray(f.root, "time_data", tables.Float32Atom(), shape=(0, dim1))
t1 = time()
zeros = np.zeros((1, dim1), dtype="float32")
for i in range(dim2):
    a.append(zeros)
tcre = round(time() - t1, 3)
thcre = round(dim1 * dim2 * 4 / (tcre * 1024 * 1024), 1)
print("Time to append %d columns: %s sec (%s MB/s)" % (i+1, tcre, thcre))
print("=" * 32)
chunkshape = a.chunkshape
f.close()

print("H5PY: append columns")
print("=" * 32)
f = h5py.File(name="/tmp/test.h5", mode='w')
a = f.create_dataset(name='time_data', shape=(0, dim1),
                     maxshape=(None, dim1), dtype='f', chunks=chunkshape)
t1 = time()
zeros = np.zeros((1, dim1), dtype="float32")
samplesWritten = 0
for i in range(dim2):
    a.resize((samplesWritten+1, dim1))
    a[samplesWritten:(samplesWritten+1),:] = zeros
    samplesWritten += 1
tcre = round(time() - t1, 3)
thcre = round(dim1 * dim2 * 4 / (tcre * 1024 * 1024), 1)
print("Time to append %d columns: %s sec (%s MB/s)" % (i+1, tcre, thcre))
print("=" * 32)
f.close()

returns on my machine:

PYTABLES: append columns
================================
Time to append 1527416 columns: 22.679 sec (16.4 MB/s)
================================
H5PY: append columns
================================
Time to append 1527416 columns: 158.894 sec (2.3 MB/s)
================================
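(For scale: 158.894 s / 22.679 s ≈ 7.0, i.e. roughly a 7x gap between the two libraries on this run.)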

If I flush after every append inside the loop, like:

for i in range(dim2):
    a.append(zeros)
    f.flush()
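The corresponding change in the h5py loop (a sketch of the presumed variant; the post only shows the PyTables loop) flushes through the file handle the same way:

for i in range(dim2):
    a.resize((samplesWritten + 1, dim1))
    a[samplesWritten:(samplesWritten + 1), :] = zeros
    samplesWritten += 1
    f.flush()  # force HDF5 buffers to disk after every write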

I get:

PYTABLES: append columns
================================
Time to append 1527416 columns: 67.481 sec (5.5 MB/s)
================================
H5PY: append columns
================================
Time to append 1527416 columns: 193.644 sec (1.9 MB/s)
================================

Recommended Answer

This is an interesting comparison of PyTables and h5py write performance. Typically I use them to read HDF5 files (usually a few reads of large datasets), so I hadn't noticed this difference. My thoughts align with @max9111's: performance should improve as the number of write operations decreases and the size of each written block increases. To that end, I reworked your code to write N rows of data in fewer loops. (Code is at the end.)
Results were surprising (to me). Key findings:
1. Total time to write all of the data was a linear function of the number of loops (for both PyTables and h5py).
2. The performance difference between PyTables and h5py only improved slightly as the dataset I/O size increased.
3. PyTables was 5.4x faster writing 1 row at a time (1,527,416 writes), and 3.5x faster writing 88 rows at a time (17,357 writes).

Here is a plot comparing performance:
[Chart with the values from the table above; image not reproduced here.]

Also, I noticed your code comments say "append columns", but you are extending the first dimension (the rows of an HDF5 table/dataset). I rewrote your code to test performance when extending the second dimension (adding columns to the HDF5 file), and saw very similar performance.
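For reference, a minimal sketch of what that column-extending variant can look like in h5py (sizes and file name here are illustrative, not the answer's actual test; the PyTables analogue puts the 0 in the second position of the EArray shape):

import h5py
import numpy as np

nrows, col_loops = 64, 1000            # illustrative sizes, not the answer's
col_block = np.ones((nrows, 1), dtype="float32")

f = h5py.File("colapp_test.h5", "w")
# the dataset is extendable along the second axis (columns) this time
a = f.create_dataset("time_data", shape=(nrows, 0),
                     maxshape=(nrows, None), dtype="f")
for i in range(col_loops):
    a.resize((nrows, i + 1))           # grow by one column
    a[:, i:i + 1] = col_block
f.close()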

Initially I thought the I/O bottleneck was due to resizing the datasets, so I rewrote the example to size the array up front to hold all of the rows. This did NOT improve performance (and it significantly degraded h5py performance). That was very surprising; I'm not sure what to make of it.
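A minimal sketch of that pre-sized h5py variant (file name and chunks=True are my assumptions; the answer's actual test reused the PyTables chunkshape). The dataset is created at its final size, so the loop never calls resize():

import h5py
import numpy as np

cdim, block_size, row_loops = 64, 4, 381854
total_rows = block_size * row_loops
vals = np.ones((block_size, cdim), dtype="float32")

f = h5py.File("presized_test.h5", "w")
# full-size dataset up front; chunks=True lets h5py auto-chunk
a = f.create_dataset("time_data", shape=(total_rows, cdim),
                     dtype="f", chunks=True)
for i in range(row_loops):
    a[i * block_size:(i + 1) * block_size] = vals
f.close()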

Here is my example. It uses 3 variables that size the array (as data is added):

  • cdim: number of columns (fixed)
  • row_loops: number of write loops
  • block_size: number of rows in the block written on each loop
  • row_loops * block_size = total number of rows written (see the sketch just below)
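All of the test points mentioned above write the same 1,527,416 rows in total, so the two knobs trade off directly:

total_rows = 1527416
for block_size in (1, 4, 88):  # block sizes referenced in this answer
    row_loops = total_rows // block_size
    print("block_size=%2d -> row_loops=%7d" % (block_size, row_loops))
# block_size= 1 -> row_loops=1527416
# block_size= 4 -> row_loops= 381854
# block_size=88 -> row_loops=  17357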

I also made a small change to write Ones instead of Zeros (to verify the data was written), and moved the array creation to the top, out of the timed loops.

My code is here:

import h5py
import tables
import numpy as np
from time import time

cdim, block_size, row_loops = 64, 4, 381854 
vals = np.ones((block_size, cdim), dtype="float32")

# append rows
print("PYTABLES: append rows: %d blocks with: %d rows" % (row_loops, block_size))
print("=" * 32)
f = tables.open_file("rowapp_test_tb.h5", "w")
a = f.create_earray(f.root, "time_data", atom=tables.Float32Atom(), shape=(0, cdim))
t1 = time()
for i in range(row_loops):
    a.append(vals)
tcre = round(time() - t1, 3)
thcre = round(cdim * block_size * row_loops * 4 / (tcre * 1024 * 1024), 1)
print("Time to append %d rows: %s sec (%s MB/s)" % (block_size * row_loops, tcre, thcre))
print("=" * 32)
chunkshape = a.chunkshape
f.close()

print("H5PY: append rows %d blocks with: %d rows" % (row_loops, block_size))
print("=" * 32)
f = h5py.File(name="rowapp_test_h5.h5", mode='w')
a = f.create_dataset(name='time_data', shape=(0, cdim),
                     maxshape=(block_size*row_loops, cdim),
                     dtype='f', chunks=chunkshape)
t1 = time()
samplesWritten = 0
for i in range(row_loops):
    a.resize(((i+1)*block_size, cdim))
    a[samplesWritten:samplesWritten+block_size] = vals
    samplesWritten += block_size
tcre = round(time() - t1, 3)
thcre = round(cdim * block_size * row_loops * 4 / (tcre * 1024 * 1024), 1)
print("Time to append %d rows: %s sec (%s MB/s)" % (block_size * row_loops, tcre, thcre))
print("=" * 32)
f.close()

