HDFStore.append(string, DataFrame) fails when string column contents are longer than those already there


Problem description

I have a Pandas DataFrame stored via an HDFStore that essentially stores summary rows about test runs I am doing.

Several of the fields in each row contain descriptive strings of variable length.

When I do a test run, I create a new DataFrame with a single row in it:

def export_as_df(self):
    return pd.DataFrame(data=[self._to_dict()], index=[datetime.datetime.now()])

I then call HDFStore.append(string, DataFrame) to add the new row to the existing DataFrame.

This works fine, except when the contents of one of the string columns are longer than the longest value already stored, in which case I get the following error:

File "<ipython-input-302-a33c7955df4a>", line 516, in save_pytables
store.append('tests', test.export_as_df())
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 532, in append
self._write_to_group(key, value, table=True, append=True, **kwargs)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 788, in _write_to_group
s.write(obj = value, append=append, complib=complib, **kwargs)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 2491, in write
min_itemsize=min_itemsize, **kwargs)
File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 2254, in create_axes
raise Exception("cannot find the correct atom type -> [dtype->%s,items->%s] %s" % (b.dtype.name, b.items, str(detail)))
Exception: cannot find the correct atom type -> [dtype->object,items->Index([bp, id, inst, per, sp, st, title], dtype=object)] [values_block_3] column has a min_itemsize of [51] but itemsize [46] is required!

I can't find any documentation about how to specify string length when creating a DataFrame. What is the solution here?

Update:

Code that is failing:

        store = pd.HDFStore(pytables_store)            
        for test in self.backtests:
            try:
                min_itemsizes = { 'buy_pattern' : 60, 'sell_pattern': 60, 'strategy': 60, 'title': 60 }
                store.append('tests', test.export_as_df(), min_itemsize = min_itemsizes)

Here's the error under 0.11rc1:

File "<ipython-input-110-492b7b6603d7>", line 522, in save_pytables
  store.append('tests', test.export_as_df(), min_itemsize = min_itemsizes)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 610, in append
  self._write_to_group(key, value, table=True, append=True, **kwargs)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 871, in _write_to_group
  s.write(obj = value, append=append, complib=complib, **kwargs)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 2707, in write
  min_itemsize=min_itemsize, **kwargs)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 2447, in create_axes
  self.validate_min_itemsize(min_itemsize)
File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 2184, in validate_min_itemsize
  raise ValueError("min_itemsize has [%s] which is not an axis or data_column" % k)
ValueError: min_itemsize has [buy_pattern] which is not an axis or data_column

Data sample:

                           all_day              buy_pattern  \
2013-04-14 12:11:44.377695   False  Hammer() and LowerLow()   

                                                           id instrument  \
2013-04-14 12:11:44.377695  tafdcc96ba4eb11e2a86d14109fcecd49     EURUSD   

                            open_margin periodicity sell_pattern strategy  \
2013-04-14 12:11:44.377695       0.0001     1:00:00                 Tsl()   

                           title  top_bottom  wick_body  
2013-04-14 12:11:44.377695   tsl         0.5          2 

dtypes:

print prob_test.export_as_df().get_dtype_counts() 

    bool       1
    float64    2
    int64      1
    object     7
    dtype: int64

I am deleting the h5 file each time because I want clean results. Could it be failing for a reason as silly as the df not yet existing in the h5 file (and hence none of its columns either) at the time of the first append?

Solution

Here is the link to the new docs section about this: http://pandas.pydata.org/pandas-docs/dev/io.html#string-columns

The issue is that you are specifying a column in min_itemsize that is not a data_column. A simple workaround is to add data_columns=True to your append statement, but I have also updated the code to automatically create the data_columns if you pass a valid column name. I think this makes sense: you want a minimum column size, so let it happen.
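A minimal sketch of this workaround, assuming pandas with PyTables installed. The file path, column names, values, and the width of 60 are illustrative only and are not taken from the asker's store. The width is reserved on the first append, when the table is created, so later rows can hold longer strings:

```python
import os
import tempfile

import pandas as pd

# Hypothetical throwaway file; the real store path would go here.
path = os.path.join(tempfile.mkdtemp(), "tests.h5")

# Illustrative rows: the second title is longer than the first.
first = pd.DataFrame({"title": ["tsl"], "wick_body": [2]})
second = pd.DataFrame({"title": ["a much longer descriptive title"], "wick_body": [3]})

with pd.HDFStore(path, mode="w") as store:
    # data_columns=True makes every column a data column, so 'title'
    # is a valid key for min_itemsize; 60 bytes are reserved for it.
    store.append("tests", first, data_columns=True, min_itemsize={"title": 60})
    # The table already exists, so this append reuses its layout.
    store.append("tests", second)
    result = store.select("tests")

print(result)
```

Without the reserved width, the string column would be sized from the first row alone, which is the situation that produces the error in the question.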

I also created a new doc section, String Columns, to show a more complete example with explanation (docs will be updated soon).

# this is the new behavior (after code updates)
In [340]: dfs = DataFrame(dict(A = 'foo', B = 'bar'),index=range(5))

In [341]: dfs
Out[341]: 
     A    B
0  foo  bar
1  foo  bar
2  foo  bar
3  foo  bar
4  foo  bar

# A and B have a size of 30
In [342]: store.append('dfs', dfs, min_itemsize = 30)

In [343]: store.get_storer('dfs').table
Out[343]: 
/dfs/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": StringCol(itemsize=30, shape=(2,), dflt='', pos=1)}
  byteorder := 'little'
  chunkshape := (963,)
  autoIndex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

# A is created as a data_column with a size of 30
# B's size is calculated from the data
In [344]: store.append('dfs2', dfs, min_itemsize = { 'A' : 30 })

In [345]: store.get_storer('dfs2').table
Out[345]: 
/dfs2/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": StringCol(itemsize=3, shape=(1,), dflt='', pos=1),
  "A": StringCol(itemsize=30, shape=(), dflt='', pos=2)}
  byteorder := 'little'
  chunkshape := (1598,)
  autoIndex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
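When a suitable width is hard to guess up front, one option is to measure the strings already in hand and add some headroom. The helper below is a hypothetical sketch, not part of pandas: the name `suggest_min_itemsize`, the `headroom` factor, and the sample data are all assumptions made for illustration. Its result is a dict that could be passed as `min_itemsize` on the first append.

```python
import math

import pandas as pd
from pandas.api.types import is_string_dtype


def suggest_min_itemsize(df, headroom=2.0, encoding="utf-8"):
    """Hypothetical helper: map each string column to a padded byte width."""
    sizes = {}
    for name in df.columns:
        col = df[name]
        if not is_string_dtype(col):
            continue  # only string columns have an itemsize to reserve
        # Widths are counted in encoded bytes, not characters.
        lengths = [len(str(value).encode(encoding)) for value in col.dropna()]
        if lengths:
            sizes[name] = int(math.ceil(max(lengths) * headroom))
    return sizes


# Illustrative data only.
sample = pd.DataFrame(
    {"title": ["tsl", "Hammer() and LowerLow()"], "wick_body": [2, 3]}
)
widths = suggest_min_itemsize(sample)
print(widths)
```

The headroom factor is a guess about how much longer future strings might be; a string that still exceeds the reserved width will fail on append in the same way.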
