pandas pytables append: performance and increase in file size

Question
I have more than 500 PyTables
stores that contain about 300Mb of data each. I would like to merge these files into a big store, using pandas append
as in the code below.
from pandas import HDFStore

def merge_hdfs(file_list, merged_store):
    for file in file_list:
        store = HDFStore(file, mode='r')
        merged_store.append('data', store.data)
        store.close()
The append operation is very slow (it is taking up to 10 minutes to append a single store to merged_store
), and strangely the file size of merged_store
seems to be increasing by 1Gb for each appended store.
I have indicated the total number of expected rows, which according to the documentation should improve performance, and having read Improve pandas (PyTables?) HDF5 table write performance I was expecting long write times, but almost 10 minutes for every 300Mb seems too slow, and I cannot understand why the file size increases so much.
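For reference, the "expected rows" hint mentioned above is passed through the expectedrows keyword of HDFStore.append. A minimal sketch (the file path and the 500-file total are illustrative assumptions, not values from the question):

```python
import os
import tempfile

import numpy as np
import pandas as pd

# A small stand-in frame; the real stores hold ~272,734 rows each.
path = os.path.join(tempfile.mkdtemp(), "merged.h5")
df = pd.DataFrame(np.random.randn(1000, 2), columns=["a", "b"])

with pd.HDFStore(path, mode="w") as merged_store:
    # Hint the eventual total (e.g. 500 files * ~272,734 rows) so
    # PyTables can size the table up front.
    merged_store.append("data", df, expectedrows=500 * 272734)
    nrows = merged_store.get_storer("data").nrows
```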
I wonder if I am missing something?
For additional information, here is a description of one of the 500 PyTables.
/data/table (Table(272734,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(6,), dflt=0.0, pos=1),
"id": StringCol(itemsize=11, shape=(), dflt='', pos=2),
"datetaken": Int64Col(shape=(), dflt=0, pos=3),
"owner": StringCol(itemsize=15, shape=(), dflt='', pos=4),
"machine_tags": StringCol(itemsize=100, shape=(), dflt='', pos=5),
"title": StringCol(itemsize=200, shape=(), dflt='', pos=6),
"country": StringCol(itemsize=3, shape=(), dflt='', pos=7),
"place_id": StringCol(itemsize=18, shape=(), dflt='', pos=8),
"url_s": StringCol(itemsize=80, shape=(), dflt='', pos=9),
"url_o": StringCol(itemsize=80, shape=(), dflt='', pos=10),
"ownername": StringCol(itemsize=50, shape=(), dflt='', pos=11),
"tags": StringCol(itemsize=505, shape=(), dflt='', pos=12)}
byteorder := 'little'
chunkshape := (232,)
This is basically the answer here, which I gave recently.
Bottom line is this: you need to turn off indexing while building the store, using store.append('df', df, index=False), and then index it once at the end.
Furthermore, turn off compression while merging the tables.
Indexing is a fairly expensive operation and IIRC only uses a single processor.
Finally, make sure that you create the merged store with mode='w',
as all of the subsequent operations are appends and you want to start with a clean new file.
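The advice so far can be sketched as follows. This is a self-contained toy version: the two small input files stand in for the 500 real stores, and the node name 'data' is taken from the question.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Build two small input stores to stand in for the 500 real ones.
tmp = tempfile.mkdtemp()
file_list = []
for i in range(2):
    path = os.path.join(tmp, f"in_{i}.h5")
    pd.DataFrame(np.random.randn(100, 3), columns=list("abc")).to_hdf(
        path, key="data", format="table")
    file_list.append(path)

merged_path = os.path.join(tmp, "merged.h5")

# mode='w' starts from a clean file; index=False skips the per-append
# index maintenance that makes repeated appends slow.
with pd.HDFStore(merged_path, mode="w") as merged_store:
    for f in file_list:
        with pd.HDFStore(f, mode="r") as store:
            merged_store.append("data", store.select("data"), index=False)
    # Index once, at the very end.
    merged_store.create_table_index("data", optlevel=9, kind="full")
    total = merged_store.get_storer("data").nrows
```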
I also would NOT specify the chunksize
upfront. Rather, after you have created the final index, perform the compression using ptrepack
and specify chunksize=auto
which will compute it for you. I don't think this will affect write performance but will optimize query performance.
You might also try raising the chunksize
parameter to append
(this is the writing chunksize) to a larger number.
Obviously, make sure that each of the appended tables has exactly the same structure (pandas will raise if this is not the case).
I created this issue for an enhancement to do this 'internally': https://github.com/pydata/pandas/issues/6837