pandas pytables append: performance and increase in file size


Problem Description


I have more than 500 PyTables stores that contain about 300Mb of data each. I would like to merge these files into a big store, using pandas append as in the code below.

from pandas import HDFStore

def merge_hdfs(file_list, merged_store):
    # append the 'data' table from each source store into the merged store
    for file in file_list:
        store = HDFStore(file, mode='r')
        merged_store.append('data', store.data)
        store.close()

The append operation is very slow (it is taking up to 10 minutes to append a single store to merged_store), and strangely the file size of merged_store seems to be increasing by 1Gb for each appended store.

I have indicated the total number of expected rows, which according to the documentation should improve performance, and having read Improve pandas (PyTables?) HDF5 table write performance I was expecting long write times, but almost 10 minutes for every 300Mb seems too slow, and I cannot understand why the file size grows so much.
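For reference, this is a minimal sketch of how I hint the expected row count at append time via the expectedrows keyword; the file name, row counts, and the stand-in DataFrame here are made up for illustration:

import pandas as pd

# hypothetical sizes: ~500 stores of roughly 272,734 rows each (as in the dump below)
total_rows = 500 * 272734

df = pd.DataFrame({'id': ['a', 'b', 'c'], 'value': [1.0, 2.0, 3.0]})  # stand-in chunk

with pd.HDFStore('merged.h5', mode='w') as merged_store:
    # expectedrows hints the final table size so PyTables can pick a sensible chunkshape
    merged_store.append('data', df, expectedrows=total_rows)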

I wonder if I am missing something?

For additional information, here is a description of one of the 500 PyTables.

/data/table (Table(272734,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(6,), dflt=0.0, pos=1),
  "id": StringCol(itemsize=11, shape=(), dflt='', pos=2),
  "datetaken": Int64Col(shape=(), dflt=0, pos=3),
  "owner": StringCol(itemsize=15, shape=(), dflt='', pos=4),
  "machine_tags": StringCol(itemsize=100, shape=(), dflt='', pos=5),
  "title": StringCol(itemsize=200, shape=(), dflt='', pos=6),
  "country": StringCol(itemsize=3, shape=(), dflt='', pos=7),
  "place_id": StringCol(itemsize=18, shape=(), dflt='', pos=8),
  "url_s": StringCol(itemsize=80, shape=(), dflt='', pos=9),
  "url_o": StringCol(itemsize=80, shape=(), dflt='', pos=10),
  "ownername": StringCol(itemsize=50, shape=(), dflt='', pos=11),
  "tags": StringCol(itemsize=505, shape=(), dflt='', pos=12)}
  byteorder := 'little'
  chunkshape := (232,)

Solution

This is basically the same as the answer here, which I gave recently.

Bottom line is this: you need to turn off indexing on each append with store.append('df', df, index=False). Then, when the store is fully created, index it once at the end.

Furthermore, turn off compression when merging the tables as well.

Indexing is a fairly expensive operation and IIRC only uses a single processor.

Finally, make sure that you create the merged store with mode='w', as all of the subsequent operations are appends and you want to start with a clean new file.
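Putting those points together, a rough sketch of what the merge loop might look like (merge_hdfs and merged.h5 are just the names from the question; the chunksize value is an arbitrary example):

import pandas as pd

def merge_hdfs(file_list, merged_path='merged.h5'):
    # mode='w' starts from a clean file; everything after this is an append
    with pd.HDFStore(merged_path, mode='w') as merged_store:
        for file in file_list:
            with pd.HDFStore(file, mode='r') as store:
                # index=False skips index creation on every append,
                # which is the expensive part of the original loop
                merged_store.append('data', store['data'], index=False,
                                    chunksize=1000000)
        # build the index once, after all the data has been written
        merged_store.create_table_index('data', optlevel=9, kind='full')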

I also would NOT specify the chunksize upfront. Rather, after you have created the final index, perform the compression using ptrepack and specify chunksize=auto which will compute it for you. I don't think this will affect write performance but will optimize query performance.
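As a rough illustration of that repacking step (in the ptrepack CLI the option is spelled --chunkshape; the file names and blosc settings here are just examples):

# repack the merged file once it is indexed: recompute the chunkshape and compress
ptrepack --chunkshape=auto --propindexes --complevel=9 --complib=blosc merged.h5 merged_packed.h5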

You might try tweaking the chunksize parameter to append (this is the writing chunksize) to a larger number as well.

Obviously, make sure that each of the appended tables has exactly the same structure (append will raise if this is not the case).

I created this issue for an enhancement to do this 'internally': https://github.com/pydata/pandas/issues/6837
