HDF5 taking more space than CSV?


Question

Consider the following example:

import string
import random

import numpy as np
import pandas as pd

matrix = np.random.random((100, 3000))  # 100 rows x 3000 float64 columns
my_cols = [random.choice(string.ascii_uppercase) for x in range(matrix.shape[1])]
mydf = pd.DataFrame(matrix, columns=my_cols)
mydf['something'] = 'hello_world'       # plus one string column

Save it to HDF5 with the highest compression level:

store = pd.HDFStore('myfile.h5', complevel=9, complib='bzip2')
store['mydf'] = mydf
store.close()

And save the same frame as CSV:

mydf.to_csv('myfile.csv', sep=':')

The result:

  • myfile.csv is 5.6 MB
  • myfile.h5 is 11 MB

The difference grows bigger as the datasets get larger.

I have tried other compression methods and levels. Is this a bug? (I am using pandas 0.11 with the latest stable versions of HDF5 and Python.)

Answer

This question was also filed as a pandas issue: https://github.com/pydata/pandas/issues/3651

Your sample really is too small. HDF5 has a fair amount of overhead at really small sizes (even 300k entries is on the smaller side). The following comparison uses no compression on either side. Floats are represented much more efficiently in binary than as text.
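
A quick back-of-the-envelope check of the binary-versus-text point (my sketch, not part of the original answer):

import numpy as np

x = np.random.random()
print(len(str(x)))           # typically ~18 characters, i.e. ~18 bytes as CSV text
print(np.float64(x).nbytes)  # always 8 bytes in a native binary layout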

In addition, HDF5 is row based. You get much better efficiency with tables that are not too wide but fairly long. (Hence your example is not very efficient in HDF5 at all; in this case, store it transposed, as in the sketch below.)
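
A minimal sketch of the transposed layout, reusing the `matrix` array from the question (the string column is left out so the frame stays pure float64; the filename is illustrative):

import numpy as np
import pandas as pd

matrix = np.random.random((100, 3000))
long_df = pd.DataFrame(matrix.T)  # 3000 rows x 100 columns: long and narrow
store = pd.HDFStore('myfile_long.h5', complevel=9, complib='bzip2')
store['long_df'] = long_df
store.close()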

I routinely have tables with 10M+ rows, and query times can be in the milliseconds. Even the example below is small. Having 10+ GB files is quite common (not to mention the astronomy folks, for whom 10 GB is a few seconds of data!).

-rw-rw-r--  1 jreback users 203200986 May 19 20:58 test.csv
-rw-rw-r--  1 jreback users  88007312 May 19 20:59 test.h5

In [1]: df = DataFrame(randn(1000000,10))

In [9]: df
Out[9]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0    1000000  non-null values
1    1000000  non-null values
2    1000000  non-null values
3    1000000  non-null values
4    1000000  non-null values
5    1000000  non-null values
6    1000000  non-null values
7    1000000  non-null values
8    1000000  non-null values
9    1000000  non-null values
dtypes: float64(10)

In [5]: %timeit df.to_csv('test.csv',mode='w')
1 loops, best of 3: 12.7 s per loop

In [6]: %timeit df.to_hdf('test.h5','df',mode='w')
1 loops, best of 3: 825 ms per loop

In [7]: %timeit pd.read_csv('test.csv',index_col=0)
1 loops, best of 3: 2.35 s per loop

In [8]: %timeit pd.read_hdf('test.h5','df')
10 loops, best of 3: 38 ms per loop
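
The millisecond query times mentioned above come from HDF5's queryable "table" format rather than the default fixed format. Here is a hedged sketch using the modern pandas spelling (the `format=` and string `where=` arguments shown here postdate pandas 0.11):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 10))

# 'table' format builds a queryable on-disk table (slower to write than 'fixed')
df.to_hdf('test_table.h5', key='df', mode='w', format='table')

# Pull back only the matching rows; the full file is never read into memory
subset = pd.read_hdf('test_table.h5', 'df', where='index < 1000')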

I really wouldn't worry about the size (I suspect you aren't, but are merely curious, which is fine). The point of HDF5 is that disk is cheap and CPU is cheap, but you can't have everything in memory at once, so we optimize by chunking.
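
A hedged illustration of the chunked workflow the answer alludes to; the filenames, key, and chunk size are all illustrative:

import pandas as pd

store = pd.HDFStore('big.h5', mode='w')
for chunk in pd.read_csv('huge.csv', chunksize=100000):  # stream the CSV in pieces
    store.append('data', chunk)  # append each piece to one growing on-disk table
store.close()

# Later: iterate over the stored table in pieces instead of loading it whole
for piece in pd.read_hdf('big.h5', 'data', chunksize=100000):
    pass  # process each chunk here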
