pandas pytables append: performance and increase in file size

Question
I have more than 500 PyTables
stores that contain about 300Mb of data each. I would like to merge these files into a big store, using pandas append
as in the code below.
from pandas import HDFStore

def merge_hdfs(file_list, merged_store):
    for file in file_list:
        store = HDFStore(file, mode='r')
        merged_store.append('data', store.data)
        store.close()
The append operation is very slow (it is taking up to 10 minutes to append a single store to merged_store
), and strangely the file size of merged_store
seems to be increasing by 1Gb for each appended store.
I have indicated the total number of expected rows, which according to the documentation should improve performance, and having read Improve pandas (PyTables?) HDF5 table write performance I was expecting long write times, but almost 10 minutes for every 300Mb seems too slow, and I cannot understand why the file size increases so much.
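For reference, the "expected rows" hint mentioned above is passed through the expectedrows keyword of HDFStore.append. A minimal sketch (the file path and the 500-file total are illustrative assumptions, not values from the question):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# A small stand-in frame; the real stores hold ~272,734 rows each.
path = os.path.join(tempfile.mkdtemp(), "merged.h5")
df = pd.DataFrame(np.random.randn(1000, 2), columns=["a", "b"])

with pd.HDFStore(path, mode="w") as merged_store:
    # Hint the eventual total (e.g. 500 files * ~272,734 rows) so
    # PyTables can size the table up front.
    merged_store.append("data", df, expectedrows=500 * 272734)
    nrows = merged_store.get_storer("data").nrows
```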
I wonder if I am missing something?
For additional information, here is a description of one of the 500 PyTables.
/data/table (Table(272734,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(6,), dflt=0.0, pos=1),
"id": StringCol(itemsize=11, shape=(), dflt='', pos=2),
"datetaken": Int64Col(shape=(), dflt=0, pos=3),
"owner": StringCol(itemsize=15, shape=(), dflt='', pos=4),
"machine_tags": StringCol(itemsize=100, shape=(), dflt='', pos=5),
"title": StringCol(itemsize=200, shape=(), dflt='', pos=6),
"country": StringCol(itemsize=3, shape=(), dflt='', pos=7),
"place_id": StringCol(itemsize=18, shape=(), dflt='', pos=8),
"url_s": StringCol(itemsize=80, shape=(), dflt='', pos=9),
"url_o": StringCol(itemsize=80, shape=(), dflt='', pos=10),
"ownername": StringCol(itemsize=50, shape=(), dflt='', pos=11),
"tags": StringCol(itemsize=505, shape=(), dflt='', pos=12)}
byteorder := 'little'
chunkshape := (232,)
This is basically the answer here, which I gave recently.
Bottom line is this: you need to turn off indexing while building the store, using store.append('df', df, index=False), and then index it once at the end.
Furthermore, turn off compression while merging the tables.
Indexing is a fairly expensive operation and IIRC only uses a single processor.
Finally, make sure that you create the merged store with mode='w',
as all of the subsequent operations are appends and you want to start with a clean new file.
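The advice so far can be sketched as follows. This is a self-contained toy version: the two small input files stand in for the 500 real stores, and the node name 'data' is taken from the question.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Build two small input stores to stand in for the 500 real ones.
tmp = tempfile.mkdtemp()
file_list = []
for i in range(2):
    path = os.path.join(tmp, f"in_{i}.h5")
    pd.DataFrame(np.random.randn(100, 3), columns=list("abc")).to_hdf(
        path, key="data", format="table")
    file_list.append(path)

merged_path = os.path.join(tmp, "merged.h5")

# mode='w' starts from a clean file; index=False skips the per-append
# index maintenance that makes repeated appends slow.
with pd.HDFStore(merged_path, mode="w") as merged_store:
    for f in file_list:
        with pd.HDFStore(f, mode="r") as store:
            merged_store.append("data", store.select("data"), index=False)
    # Index once, at the very end.
    merged_store.create_table_index("data", optlevel=9, kind="full")
    total = merged_store.get_storer("data").nrows
```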
I also would NOT specify the chunksize
upfront. Rather, after you have created the final index, perform the compression using ptrepack
and specify chunksize=auto
which will compute it for you. I don't think this will affect write performance but will optimize query performance.
You might also try raising the chunksize
parameter to append
(this is the writing chunksize) to a larger number.
Obviously, make sure that each of the appended tables has exactly the same structure (pandas will raise if this is not the case).
I created this issue for an enhancement to do this 'internally': https://github.com/pydata/pandas/issues/6837