HDFStore.append(string, DataFrame) fails when string column contents are longer than those already there
I have a Pandas DataFrame stored via an HDFStore that essentially stores summary rows about test runs I am doing.
Several of the fields in each row contain descriptive strings of variable length.
When I do a test run, I create a new DataFrame with a single row in it:
def export_as_df(self):
    return pd.DataFrame(data=[self._to_dict()], index=[datetime.datetime.now()])
I then call
HDFStore.append(string, DataFrame)
to add the new row to the existing DataFrame. This works fine unless one of the string columns holds a value longer than the longest one already stored, in which case I get the following error:
  File "<ipython-input-302-a33c7955df4a>", line 516, in save_pytables
    store.append('tests', test.export_as_df())
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 532, in append
    self._write_to_group(key, value, table=True, append=True, **kwargs)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 788, in _write_to_group
    s.write(obj = value, append=append, complib=complib, **kwargs)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 2491, in write
    min_itemsize=min_itemsize, **kwargs)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/pandas/io/pytables.py", line 2254, in create_axes
    raise Exception("cannot find the correct atom type -> [dtype->%s,items->%s] %s" % (b.dtype.name, b.items, str(detail)))
Exception: cannot find the correct atom type -> [dtype->object,items->Index([bp, id, inst, per, sp, st, title], dtype=object)] [values_block_3] column has a min_itemsize of [51] but itemsize [46] is required!
I can't find any documentation about how to specify string length when creating a DataFrame. What is the solution here?
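For context, PyTables stores strings in fixed-width columns, so the width chosen when the table is first written caps every later row. The following is only a rough sketch for spotting the offending fields before appending; the helper name `fields_over_limit` and the sample row are hypothetical and not part of pandas.

```python
def fields_over_limit(row, limit, encoding="utf-8"):
    """Return {field: byte length} for string values longer than `limit` bytes."""
    over = {}
    for name, value in row.items():
        # Only string values are subject to the fixed column width.
        if isinstance(value, str):
            size = len(value.encode(encoding))
            if size > limit:
                over[name] = size
    return over


# Hypothetical row, shaped like the dict a test run might export.
row = {"title": "tsl", "buy_pattern": "Hammer() and LowerLow()", "wick_body": 2}
print(fields_over_limit(row, limit=10))  # {'buy_pattern': 23}
```

Running a check like this against the width already stored in the table shows which fields of a new row would trigger the error.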
Update:
Code that is failing:
store = pd.HDFStore(pytables_store)
for test in self.backtests:
    try:
        min_itemsizes = { 'buy_pattern' : 60, 'sell_pattern': 60, 'strategy': 60, 'title': 60 }
        store.append('tests', test.export_as_df(), min_itemsize = min_itemsizes)
Here's the error under 0.11rc1:
  File "<ipython-input-110-492b7b6603d7>", line 522, in save_pytables
    store.append('tests', test.export_as_df(), min_itemsize = min_itemsizes)
  File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 610, in append
    self._write_to_group(key, value, table=True, append=True, **kwargs)
  File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 871, in _write_to_group
    s.write(obj = value, append=append, complib=complib, **kwargs)
  File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 2707, in write
    min_itemsize=min_itemsize, **kwargs)
  File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 2447, in create_axes
    self.validate_min_itemsize(min_itemsize)
  File "/Users/admin/dev/pandas/pandas-0.11.0rc1/pandas/io/pytables.py", line 2184, in validate_min_itemsize
    raise ValueError("min_itemsize has [%s] which is not an axis or data_column" % k)
ValueError: min_itemsize has [buy_pattern] which is not an axis or data_column
Data sample:
all_day  buy_pattern  \
2013-04-14 12:11:44.377695  False  Hammer() and LowerLow()

id  instrument  \
2013-04-14 12:11:44.377695  tafdcc96ba4eb11e2a86d14109fcecd49  EURUSD

open_margin  periodicity  sell_pattern  strategy  \
2013-04-14 12:11:44.377695  0.0001  1:00:00  Tsl()

title  top_bottom  wick_body
2013-04-14 12:11:44.377695  tsl  0.5  2
dtypes:
print prob_test.export_as_df().get_dtype_counts()
bool       1
float64    2
int64      1
object     7
dtype: int64
I am deleting the h5 file each time as I want clean results. Could it be failing for a reason as silly as the df not existing in the h5 file (and hence none of its columns existing either) at the time of the first append?
Solution
Here is the link to the new docs section about this: http://pandas.pydata.org/pandas-docs/dev/io.html#string-columns
The issue is that you are specifying a column in min_itemsize that is not a data_column. A simple workaround is to add
data_columns=True
to your append statement, but I have also updated the code to automatically create the data_columns if you pass a valid column name. I think this makes sense: you want to have a minimum column size, so let it happen. I also created a new doc section, String Columns, to show a more complete example with explanation (docs will be updated soon).
# this is the new behavior (after code updates)
In [340]: dfs = DataFrame(dict(A = 'foo', B = 'bar'),index=range(5))

In [341]: dfs
Out[341]:
     A    B
0  foo  bar
1  foo  bar
2  foo  bar
3  foo  bar
4  foo  bar

# A and B have a size of 30
In [342]: store.append('dfs', dfs, min_itemsize = 30)

In [343]: store.get_storer('dfs').table
Out[343]:
/dfs/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": StringCol(itemsize=30, shape=(2,), dflt='', pos=1)}
  byteorder := 'little'
  chunkshape := (963,)
  autoIndex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

# A is created as a data_column with a size of 30
# B's size is calculated
In [344]: store.append('dfs2', dfs, min_itemsize = { 'A' : 30 })

In [345]: store.get_storer('dfs2').table
Out[345]:
/dfs2/table (Table(5,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": StringCol(itemsize=3, shape=(1,), dflt='', pos=1),
  "A": StringCol(itemsize=30, shape=(), dflt='', pos=2)}
  byteorder := 'little'
  chunkshape := (1598,)
  autoIndex := True
  colindexes := {
    "A": Index(6, medium, shuffle, zlib(1)).is_CSI=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}
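The same idea can be sketched as a small self-contained script for a current pandas with PyTables installed. The helper names `append_all` and `demo`, the file name, and the sample frames are made up for illustration; `min_itemsize` and `data_columns` are the `HDFStore.append` parameters discussed above. Treat it as a sketch under those assumptions rather than the definitive way to do this.

```python
import os
import tempfile

import pandas as pd


def append_all(path, frames, key="tests", widths=None):
    """Append each frame to `key` in the HDF5 file at `path`; return the stored table.

    `widths` maps string column names to the number of bytes to reserve.
    Those columns are also declared as data columns so each gets its own width.
    """
    columns = list(widths) if widths else None
    with pd.HDFStore(path) as store:
        for frame in frames:
            # The reserved width is fixed when the table is first created.
            store.append(key, frame, data_columns=columns, min_itemsize=widths)
        return store.select(key)


def demo():
    """Write a short title first, then a longer one that still fits in 60 bytes."""
    short = pd.DataFrame({"title": ["tsl"], "wick_body": [2]})
    longer = pd.DataFrame({"title": ["a much longer descriptive title"], "wick_body": [3]})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tests.h5")  # hypothetical file name
        return append_all(path, [short, longer], widths={"title": 60})
```

Calling `demo()` returns both rows with the longer title intact. Because the width is fixed at table creation, a table that was first written without enough headroom keeps its original width and has to be rewritten to hold longer strings.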