What is the recommended compression for HDF5 for fast read/write performance (in Python/pandas)?


Question

I have read several times that turning on compression in HDF5 can lead to better read/write performance.

I wonder what the ideal settings are to achieve good read/write performance with:

 data_df.to_hdf(..., format='fixed', complib=..., complevel=..., chunksize=...)

I'm already using fixed format (i.e. h5py) as it's faster than table. I have strong processors and do not care much about disk space.

I often store DataFrames of float64 and str types in files of approx. 2500 rows x 9000 columns.
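
For concreteness, a hedged sketch of the kind of call being asked about; the file name and the complib/complevel values are only placeholders to experiment with, not a known-good recommendation:

    import numpy as np
    import pandas as pd

    # Hypothetical DataFrame roughly matching the shape described above.
    data_df = pd.DataFrame(np.random.randn(2500, 9000))

    # Example values for the placeholders; requires PyTables to be installed.
    data_df.to_hdf('data.h5', key='data_df', format='fixed',
                   complib='blosc', complevel=9)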

Answer

There are a couple of possible compression filters that you could use. Since HDF5 version 1.8.11, you can easily register third-party compression filters.

It probably depends on your access pattern, because you want to define your chunk dimensions so that they align well with that pattern; otherwise your performance will suffer a lot. For example, if you know that you usually access one column and all rows, you should define your chunk shape accordingly, e.g. (2500, 1) for the 2500 x 9000 data described above. See here, here and here for more information.
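
pandas' fixed format does not expose the chunk shape directly, but if you write the array yourself with h5py, a minimal sketch of chunking aligned with a column-wise access pattern might look like this (file and dataset names are made up, and the (2500, 1) chunk shape assumes a plain row-major (rows, columns) array layout):

    import h5py
    import numpy as np

    values = np.random.randn(2500, 9000)

    with h5py.File('chunked.h5', 'w') as f:
        # Each chunk holds all rows of a single column, so reading one full
        # column touches exactly one chunk; adjust to your own access pattern.
        f.create_dataset('values', data=values, chunks=(2500, 1),
                         compression='gzip', compression_opts=4)

    with h5py.File('chunked.h5', 'r') as f:
        col = f['values'][:, 42]   # reads (and decompresses) one chunk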

However, AFAIK pandas usually ends up loading the entire HDF5 file into memory unless you use read_table and an iterator (see here) or do the partial IO yourself (see here), so it doesn't really benefit that much from defining a good chunk size.
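
As a rough sketch of that iterator-based partial IO: note that the chunked/queryable interface works with format='table' rather than 'fixed', so this assumes the data was written as a table (the file name, key and column labels are made up):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(2500, 4), columns=list('abcd'))

    # Partial IO needs the queryable 'table' format rather than 'fixed'.
    df.to_hdf('table.h5', key='df', format='table', complib='blosc', complevel=9)

    # Iterate over the stored data in chunks instead of loading it all at once.
    for chunk in pd.read_hdf('table.h5', 'df', chunksize=500):
        print(chunk.shape)

    # Or read only a subset of columns via HDFStore.select.
    with pd.HDFStore('table.h5', mode='r') as store:
        subset = store.select('df', columns=['a', 'b'])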

Nevertheless, you might still benefit from compression, because loading the compressed data into memory and decompressing it on the CPU is probably faster than loading the uncompressed data.

I would recommend taking a look at Blosc. It is a multi-threaded meta-compressor library that supports various compression filters (a short usage sketch follows the list):


  • BloscLZ: internal default compressor, heavily based on FastLZ.
  • LZ4: a compact, very popular and fast compressor.
  • LZ4HC: a tweaked version of LZ4, produces better compression ratios at the expense of speed.
  • Snappy: a popular compressor used in many places.
  • Zlib: a classic; somewhat slower than the previous ones, but achieving better compression ratios.
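
If you go through pandas, the Blosc sub-compressors above can be selected in reasonably recent pandas/PyTables versions with the 'blosc:<name>' syntax; a small sketch, assuming your PyTables/Blosc build ships these filters:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(2500, 9000))

    # complib='blosc' uses the default BloscLZ; a specific sub-compressor can
    # be requested explicitly (availability depends on the PyTables build).
    df.to_hdf('blosclz.h5', key='df', format='fixed', complib='blosc', complevel=9)
    df.to_hdf('lz4.h5', key='df', format='fixed', complib='blosc:lz4', complevel=9)
    df.to_hdf('snappy.h5', key='df', format='fixed', complib='blosc:snappy', complevel=9)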

These have different strengths, and the best approach is to benchmark them with your data and see which one works best.
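
A minimal benchmarking sketch along those lines; the filter list, file names and data are placeholders, so extend it with whatever your installation actually supports:

    import time
    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(2500, 9000))

    for complib in [None, 'zlib', 'blosc', 'blosc:lz4', 'blosc:snappy']:
        fname = 'bench_{}.h5'.format(complib or 'none').replace(':', '_')
        level = 0 if complib is None else 9

        t0 = time.perf_counter()
        df.to_hdf(fname, key='df', format='fixed', complib=complib, complevel=level)
        t_write = time.perf_counter() - t0

        t0 = time.perf_counter()
        pd.read_hdf(fname, 'df')
        t_read = time.perf_counter() - t0

        print('{:>16}: write {:.2f}s, read {:.2f}s'.format(
            complib or 'uncompressed', t_write, t_read))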
