pandas HDFStore:查询不匹配的字符串时速度较慢 [英] Pandas HDFStore: slow on query for non-matching string

查看:69
本文介绍了 pandas HDFStore:查询不匹配的字符串时速度较慢的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的问题是,当我尝试查找DataFrame中不包含的字符串(存储在hdf5文件中)时,需要很长时间才能完成查询.例如:

My issue is that when I try to look for a string that is NOT contained in the DataFrame (which is stored in an hdf5 file), it takes a very long time to complete the query. For example:

我有一个包含2 * 10 ^ 9行的df.它存储在HDF5文件中.我有一个名为"code"的字符串列,它被标记为"data_column"(因此已被索引).

I have a df that contains 2*10^9 rows. It is stored in an HDF5 file. I have a string column named "code", that was marked as "data_column" (therefore it is indexed).

当我搜索数据集中存在的代码(store.select('df','code = valid_code'))时,大约需要10秒钟才能获得7万行.

When I search for a code that exists in the dataset ( store.select('df', 'code=valid_code') ) it takes around 10 seconds to get 70K rows.

但是,当我搜索数据集中不存在的代码(store.select('df','code = not_valid_code'))时,大约需要980秒才能获得查询结果(0行)

However, when I search for a code that does NOT exist in the dataset ( store.select('df', 'code=not_valid_code') ) it takes around 980 seconds to get the result of the query (0 rows).

我创建商店的方式如下: store = pd.HDFStore('data.h5',complevel = 1,complib ='zlib') 和第一个追加是这样的: store.append('df',chunk,data_columns = ['code'],expectedrows = 2318185498)

I create the store like: store = pd.HDFStore('data.h5', complevel=1, complib='zlib') And the first append is like: store.append('df', chunk, data_columns=['code'], expectedrows=2318185498)

这是正常现象还是发生了什么错误?

Is this behavior normal or is there something wrong going on?

谢谢!

PS:此问题可能与此其他问题

PS: this question is probably related with this other question

更新:

按照Jeff的建议,我复制了他的实验,并在Mac上获得了以下结果.这是生成的表:

Following Jeff's advice, I replicated his experiment, and I got the following results on a Mac. This is the table that was generated:

!ptdump -av test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['A'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := [(1, ['A'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['A']]
/df/table (Table(50000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": StringCol(itemsize=8, shape=(), dflt='', pos=1)}
  byteorder := 'little'
  chunkshape := (8192,)
  autoindex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df/table._v_attrs (AttributeSet), 11 attributes:
   [A_dtype := 'string64',
    A_kind := ['A'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := '',
    FIELD_1_NAME := 'A',
    NROWS := 50000000,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']

这些是结果:

In [8]: %timeit pd.read_hdf('test.h5','df',where='A = "foo00002"')
1 loops, best of 3: 277 ms per loop

In [9]: %timeit pd.read_hdf('test_zlib.h5','df',where='A = "foo00002"')
1 loops, best of 3: 391 ms per loop

In [10]: %timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 533 ms per loop

In [11]: %timeit pd.read_hdf('test_zlib2.h5','df',where='A = "bar"')
1 loops, best of 3: 504 ms per loop

由于差异可能不够大,因此我尝试了相同的实验,但是数据框更大.另外,我在另一台装有Linux的计算机上进行了此实验.

Since the differences were maybe not big enough, I tried the same experiment but with a bigger dataframe. Also, I did this experiment on a different machine, one with Linux.

这是代码(我只是将原始数据集乘以10):

This is the code (I just multiplied the original dataset by 10):

import pandas as pd

df = pd.DataFrame({'A' : [ 'foo%05d' % i for i in range(500000) ]})

df = pd.concat([ df ] * 20)

store = pd.HDFStore('test.h5',mode='w')

for i in range(50):
    print "%s" % i
    store.append('df',df,data_columns=['A'])

这是表格:

!ptdump -av test.h5
/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/df (Group) ''
  /df._v_attrs (AttributeSet), 14 attributes:
   [CLASS := 'GROUP',
    TITLE := '',
    VERSION := '1.0',
    data_columns := ['A'],
    encoding := None,
    index_cols := [(0, 'index')],
    info := {1: {'type': 'Index', 'names': [None]}, 'index': {}},
    levels := 1,
    nan_rep := 'nan',
    non_index_axes := [(1, ['A'])],
    pandas_type := 'frame_table',
    pandas_version := '0.10.1',
    table_type := 'appendable_frame',
    values_cols := ['A']]
/df/table (Table(500000000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "A": StringCol(itemsize=9, shape=(), dflt='', pos=1)}
  byteorder := 'little'
  chunkshape := (15420,)
  autoindex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}
  /df/table._v_attrs (AttributeSet), 11 attributes:
   [A_dtype := 'string72',
    A_kind := ['A'],
    CLASS := 'TABLE',
    FIELD_0_FILL := 0,
    FIELD_0_NAME := 'index',
    FIELD_1_FILL := '',
    FIELD_1_NAME := 'A',
    NROWS := 500000000,
    TITLE := '',
    VERSION := '2.7',
    index_kind := 'integer']

这些是文件:

-rw-rw-r-- 1 user user 8.2G Oct  5 14:00 test.h5
-rw-rw-r-- 1 user user 9.9G Oct  5 14:30 test_zlib.h5

这些是结果:

In [9]:%timeit pd.read_hdf('test.h5','df',where='A = "foo00002"')
1 loops, best of 3: 1.02 s per loop

In [10]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "foo00002"')
1 loops, best of 3: 980 ms per loop

In [11]:%timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 7.02 s per loop

In [12]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "bar"')
1 loops, best of 3: 7.27 s per loop

这些是我的Pandas和Pytables版本:

These are my versions of Pandas and Pytables:

user@host:~/$ pip show tables
---
Name: tables
Version: 3.1.1
Location: /usr/local/lib/python2.7/dist-packages
Requires: 

user@host:~/$ pip show pandas
---
Name: pandas
Version: 0.14.1
Location: /usr/local/lib/python2.7/dist-packages
Requires: python-dateutil, pytz, numpy

尽管我非常确定该问题与Pandas无关,因为在仅使用不带Pandas的Pytables时,我观察到了类似的行为.

Although I am quite sure that the issue is not related with Pandas, since I have observed similar behavior when using only Pytables without Pandas.

更新2:

我已切换到Pytables 3.0.0,问题已解决.这使用的是与Pytables 3.1.1生成的文件相同的文件.

I have switched to Pytables 3.0.0 and the problem got fixed. This is using the same files that were generated with Pytables 3.1.1.

In [4]:%timeit pd.read_hdf('test.h5','df',where='A = "bar"')
1 loops, best of 3: 205 ms per loop

In [4]:%timeit pd.read_hdf('test_zlib.h5','df',where='A = "bar"')
10 loops, best of 3: 101 ms per loop

推荐答案

感谢Jeff的帮助,我通过将Pytables降级为3.0.0版来解决了该问题.该问题已报告给Pytables的开发人员.

Thanks to Jeff's help I fixed the issue by downgrading Pytables to the version 3.0.0. The issue has been reported to the devs of Pytables.

这篇关于 pandas HDFStore:查询不匹配的字符串时速度较慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆