Converting CSV file to HDF5 using pandas


Question

When I use pandas to convert csv files to hdf5 files, the resulting file is extremely large. For example, a test csv file (23 columns, 1.3 million rows) of 170Mb results in an hdf5 file of 2Gb. However, if pandas is bypassed and the hdf5 file is written directly (using pytables), it is only 20Mb. In the following code (that is used to do the conversion in pandas) the values of the object columns in the dataframe are explicitly converted to string objects (to prevent pickling):

import pandas as pd

# Open the csv file as a pandas data frame
data = pd.read_csv(csvfilepath, sep=delimiter, low_memory=False)

# Write the resulting data frame to the hdf5 file
data.to_hdf(hdf5_file_path, table_name, format='table', complevel=9,
            complib='lzo')

This is the hdf5 file inspected (using vitables):

What seems odd to me is that the values are represented as a (python?) list per data type (values_block0: int, values_block1: float and values_block2: string) instead of one dedicated column for every column in the csv file. I'm wondering if this causes the large file size and what the impact will be on query times?

Given that about 1Tb has to be converted, I would like to know what can be done to reduce the size of the resulting hdf5 file?

P.S. I'm aware of this question, but it states that the large hdf5 file size is caused by the HDF5 format itself, which can't be the cause in this case given that the hdf5 file resulting from bypassing pandas is much smaller.

P.P.S. Using data.iloc instead of data.loc as suggested by joris doesn't make any difference. I've removed the 'conversion'; it doesn't make a difference. The info on the read dataframe as requested by Jeff:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1303331 entries, 0 to 1303330
Columns: 23 entries, _PlanId to ACTIVITY_Gratis
dtypes: float64(1), int64(5), object(17)
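
For comparison, here is a minimal, hedged sketch of the "bypassing pandas" route mentioned in the question: writing the HDF5 table directly with PyTables, one typed, fixed-width column per field. The column names, string width and row count are made up for illustration; lzo is used only because the question does (swap in 'zlib' or 'blosc' if LZO is not installed).

import numpy as np
import tables

# Hypothetical record layout: one typed, fixed-width column per CSV field
class Record(tables.IsDescription):
    plan_id  = tables.Int64Col(pos=0)
    amount   = tables.Float64Col(pos=1)
    activity = tables.StringCol(16, pos=2)   # fixed string width for this column

filters = tables.Filters(complevel=9, complib='lzo')
with tables.open_file('direct_pytables.h5', mode='w', filters=filters) as h5:
    table = h5.create_table('/', 'data', Record)
    row = table.row
    for i in range(1000):                    # stand-in for the parsed CSV rows
        row['plan_id']  = i
        row['amount']   = np.random.randn()
        row['activity'] = ('act_%d' % (i % 10)).encode('ascii')  # StringCol stores bytes
        row.append()
    table.flush()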

Answer

Here's an informal comparison of times/sizes for various IO methods.

Using pandas 0.13.1 on 64-bit Linux.

Setup

In [1]: import numpy as np

In [2]: from pandas import DataFrame

In [3]: N = 1000000

In [4]: df = DataFrame(dict([ ("int{0}".format(i),np.random.randint(0,10,size=N)) for i in range(5) ]))

In [5]: df['float'] = np.random.randn(N)

In [6]: from random import randrange

In [8]: for i in range(10):
   ...:     df["object_1_{0}".format(i)] = ['%08x'%randrange(16**8) for _ in range(N)]
   ...:     

In [9]: for i in range(7):
   ...:     df["object_2_{0}".format(i)] = ['%15x'%randrange(16**15) for _ in range(N)]
   ...:     

In [11]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 23 columns):
int0          1000000 non-null int64
int1          1000000 non-null int64
int2          1000000 non-null int64
int3          1000000 non-null int64
int4          1000000 non-null int64
float         1000000 non-null float64
object_1_0    1000000 non-null object
object_1_1    1000000 non-null object
object_1_2    1000000 non-null object
object_1_3    1000000 non-null object
object_1_4    1000000 non-null object
object_1_5    1000000 non-null object
object_1_6    1000000 non-null object
object_1_7    1000000 non-null object
object_1_8    1000000 non-null object
object_1_9    1000000 non-null object
object_2_0    1000000 non-null object
object_2_1    1000000 non-null object
object_2_2    1000000 non-null object
object_2_3    1000000 non-null object
object_2_4    1000000 non-null object
object_2_5    1000000 non-null object
object_2_6    1000000 non-null object
dtypes: float64(1), int64(5), object(17)

Saving with various methods

In [12]: df.to_hdf('test_fixed.h5','data',format='fixed')

In [13]: df.to_hdf('test_table_no_dc.h5','data',format='table')

In [14]: df.to_hdf('test_table_dc.h5','data',format='table',data_columns=True)

In [15]: df.to_hdf('test_fixed_compressed.h5','data',format='fixed',complib='blosc',complevel=9)

In [16]: !ls -ltr *.h5
-rw-rw-r-- 1 jreback users 361093304 Apr 28 09:20 test_fixed.h5
-rw-rw-r-- 1 jreback users 311475690 Apr 28 09:21 test_table_no_dc.h5
-rw-rw-r-- 1 jreback users 351316525 Apr 28 09:22 test_table_dc.h5
-rw-rw-r-- 1 jreback users 317467870 Apr 28  2014 test_fixed_compressed.h5

The size on disk is going to be a function of the string size that is selected for each column; if you use NO data_columns then it's the longest size for ANY string. So writing with data_columns can potentially reduce the size here (balanced by the fact that you have more columns, so it takes more space per column). You probably want to specify min_itemsize to control this; see here.
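
As a hedged illustration (not part of the original answer): min_itemsize reserves a fixed byte width for the named string columns, whereas by default the width is inferred from the longest string seen at write time, which matters in particular when appending in chunks. The file name and width below are made up.

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'bah'], 'B': [1, 2, 3]})

# Reserve 20 bytes for the strings of column 'A' on disk, so later appends
# with longer values still fit.
df.to_hdf('test_min_itemsize.h5', 'data', mode='w', format='table',
          data_columns=True, min_itemsize={'A': 20})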

Here is an example of the on-disk structure:

In [8]: DataFrame(dict(A = ['foo','bar','bah'], B = [1,2,3], C = [1.0,2.0,3.0], D=[4.0,5.0,6.0])).to_hdf('test.h5','data',mode='w',format='table')

In [9]: !ptdump -avd test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/data (Group) ''
  /data._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := [],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := [(1, ['A', 'B', 'C', 'D'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['values_block_0', 'values_block_1', 'values_block_2']]
/data/table (Table(3,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
  "values_block_2": StringCol(itemsize=3, shape=(1,), dflt='', pos=3)}
  byteorder := 'little'
  chunkshape := (1872,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /data/table._v_attrs (AttributeSet), 19 attributes:
   [CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := 0.0,
    FIELD_1_NAME := 'values_block_0',
    FIELD_2_FILL := 0,
    FIELD_2_NAME := 'values_block_1',
    FIELD_3_FILL := '',
    FIELD_3_NAME := 'values_block_2',
    NROWS := 3,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer',
    values_block_0_dtype := 'float64',
    values_block_0_kind := ['C', 'D'],
    values_block_1_dtype := 'int64',
    values_block_1_kind := ['B'],
    values_block_2_dtype := 'string24',
    values_block_2_kind := ['A']]
  Data dump:
[0] (0, [1.0, 4.0], [1], ['foo'])
[1] (1, [2.0, 5.0], [2], ['bar'])
[2] (2, [3.0, 6.0], [3], ['bah'])

Dtypes are grouped into blocks (if you have data_columns then they are stored separately). These are just printed this way; they are stored array-like.
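
A small sketch (assumed, not from the original answer) of the data_columns case: the listed columns get their own on-disk columns instead of being packed into a values_block_*, and they become queryable with where=.

import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'bah'],
                   'B': [1, 2, 3],
                   'C': [1.0, 2.0, 3.0]})

# 'A' and 'B' are stored (and indexed) as individual columns on disk;
# 'C' still ends up in a values block.
df.to_hdf('test_dc.h5', 'data', mode='w', format='table',
          data_columns=['A', 'B'])

# Only rows matching the condition on the data column are read back.
subset = pd.read_hdf('test_dc.h5', 'data', where='B > 1')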
