How to trouble-shoot HDFStore Exception: cannot find the correct atom type

Question

I am looking for some general guidance on what kinds of data scenarios can cause this exception. I have tried massaging my data in various ways to no avail.

I have googled this exception for days now, gone through several Google group discussions, and come up with no solution for debugging HDFStore Exception: cannot find the correct atom type. I am reading in a simple CSV file of mixed data types:

Int64Index: 401125 entries, 0 to 401124
Data columns:
SalesID                     401125  non-null values
SalePrice                   401125  non-null values
MachineID                   401125  non-null values
ModelID                     401125  non-null values
datasource                  401125  non-null values
auctioneerID                380989  non-null values
YearMade                    401125  non-null values
MachineHoursCurrentMeter    142765  non-null values
UsageBand                   401125  non-null values
saledate                    401125  non-null values
fiModelDesc                 401125  non-null values
Enclosure_Type              401125  non-null values
...................................................
Stick_Length                401125  non-null values
Thumb                       401125  non-null values
Pattern_Changer             401125  non-null values
Grouser_Type                401125  non-null values
Backhoe_Mounting            401125  non-null values
Blade_Type                  401125  non-null values
Travel_Controls             401125  non-null values
Differential_Type           401125  non-null values
Steering_Controls           401125  non-null values
dtypes: float64(2), int64(6), object(45)

Code to store the dataframe:

In [30]: store = pd.HDFStore('test0.h5','w')
In [31]: for chunk in pd.read_csv('Train.csv', chunksize=10000):
   ....:     store.append('df', chunk, index=False)

Note that if I use store.put on a dataframe imported in one shot, I can store it successfully, albeit slowly (I believe this is due to the pickling for object dtypes, even though the object is just string data).
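
For reference, a minimal sketch of that one-shot path, assuming the whole file fits in memory; put with the default fixed format accepts the object columns but serializes them slowly:

import pandas as pd

df = pd.read_csv('Train.csv')             # load everything in one shot
store = pd.HDFStore('test0.h5', 'w')
store.put('df', df)                        # fixed format: works for object columns, but slowly
store.close()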

Are there NaN value considerations that could be throwing this exception?

The exception:

Exception: cannot find the correct atom type -> [dtype->object,items->Index([Usa
geBand, saledate, fiModelDesc, fiBaseModel, fiSecondaryDesc, fiModelSeries, fiMo
delDescriptor, ProductSize, fiProductClassDesc, state, ProductGroup, ProductGrou
pDesc, Drive_System, Enclosure, Forks, Pad_Type, Ride_Control, Stick, Transmissi
on, Turbocharged, Blade_Extension, Blade_Width, Enclosure_Type, Engine_Horsepowe
r, Hydraulics, Pushblock, Ripper, Scarifier, Tip_Control, Tire_Size, Coupler, Co
upler_System, Grouser_Tracks, Hydraulics_Flow, Track_Type, Undercarriage_Pad_Wid
th, Stick_Length, Thumb, Pattern_Changer, Grouser_Type, Backhoe_Mounting, Blade_
Type, Travel_Controls, Differential_Type, Steering_Controls], dtype=object)] lis
t index out of range

Update 1

Jeff's tip about lists stored in the dataframe led me to investigate embedded commas. pandas.read_csv is correctly parsing the file and there are indeed some embedded commas within double-quotes. So these fields are not python lists per se but do have commas in the text. Here are some examples:

3     Hydraulic Excavator, Track - 12.0 to 14.0 Metric Tons
6     Hydraulic Excavator, Track - 21.0 to 24.0 Metric Tons
8       Hydraulic Excavator, Track - 3.0 to 4.0 Metric Tons
11      Track Type Tractor, Dozer - 20.0 to 75.0 Horsepower
12    Hydraulic Excavator, Track - 19.0 to 21.0 Metric Tons

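A small sanity check, using a hypothetical two-row inline sample rather than Train.csv, showing that read_csv keeps a quoted value with embedded commas as one plain string, not a list (assumes Python 3 and a recent pandas):

import io
import pandas as pd

sample = io.StringIO(
    'SalesID,fiProductClassDesc\n'
    '1,"Hydraulic Excavator, Track - 12.0 to 14.0 Metric Tons"\n')
df = pd.read_csv(sample)
print(df.loc[0, 'fiProductClassDesc'])        # -> Hydraulic Excavator, Track - 12.0 to 14.0 Metric Tons
print(type(df.loc[0, 'fiProductClassDesc']))  # -> <class 'str'>, not a list
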
However, when I drop this column from the pd.read_csv chunks and append to my HDFStore, I still get the same Exception. When I try to append each column individually I get the following new exception:

In [6]: for chunk in pd.read_csv('Train.csv', header=0, chunksize=50000):
   ...:     for col in chunk.columns:
   ...:         store.append(col, chunk[col], data_columns=True)

Exception: cannot properly create the storer for: [_TABLE_MAP] [group->/SalesID
(Group) '',value-><class 'pandas.core.series.Series'>,table->True,append->True,k
wargs->{'data_columns': True}]

I'll continue to troubleshoot. Here's a link to several hundred records:

https://docs.google.com/spreadsheet/ccc?key=0AutqBaUiJLbPdHFvaWNEMk5hZ1NTNlVyUVduYTZTeEE&usp=sharing

Update 2

Ok, I tried the following on my work computer and got the following result:

In [4]: store = pd.HDFStore('test0.h5','w')

In [5]: for chunk in pd.read_csv('Train.csv', chunksize=10000):
   ...:     store.append('df', chunk, index=False, data_columns=True)
   ...:

Exception: cannot find the correct atom type -> [dtype->object,items->Index([fiB
aseModel], dtype=object)] [fiBaseModel] column has a min_itemsize of [13] but it
emsize [9] is required!

I think I know what's going on here. If I take the max length of the field fiBaseModel for the first chunk I get this:

In [16]: lens = df.fiBaseModel.apply(lambda x: len(x))

In [17]: max(lens[:10000])
Out[17]: 9

The second chunk:

In [18]: max(lens[10001:20000])
Out[18]: 13

So the store table is created with 9 bytes for this column because that's the maximum of the first chunk. When it encounters a longer text field in a subsequent chunk, it throws the exception.

My suggestion for this would be to either truncate the data in subsequent chunks (with a warning), or allow the user to specify a maximum storage size for the column and truncate anything that exceeds it. Maybe pandas can do this already; I haven't had time to dive deeply into HDFStore yet.
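
For what it's worth, pandas does expose something close to the second suggestion: append takes a min_itemsize argument, either a single number or a per-column dict, to pre-size string columns so longer values in later chunks still fit (the 50 below is illustrative, not measured from the full file; per the pandas docs, naming a column in the dict also turns it into a data column):

store = pd.HDFStore('test0.h5', 'w')
for chunk in pd.read_csv('Train.csv', chunksize=10000):
    # reserve 50 bytes for fiBaseModel up front instead of letting the
    # first chunk's maximum length (9) fix the column width
    store.append('df', chunk, index=False, min_itemsize={'fiBaseModel': 50})
store.close()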

Update 3

Trying to import the csv dataset using pd.read_csv. I pass a dictionary mapping every column to object via the dtype parameter. I then iterate over the file and store each chunk into the HDFStore, passing a large value for min_itemsize. I get the following exception:

AttributeError: 'NoneType' object has no attribute 'itemsize'

My simple code:

import pandas as pd

store = pd.HDFStore('test0.h5', 'w')
# header is the list of column names from Train.csv; read every column as object
objects = dict((col, 'object') for col in header)

for chunk in pd.read_csv('Train.csv', header=0, dtype=objects,
                         chunksize=10000, na_filter=False):
    store.append('df', chunk, min_itemsize=200)

I've tried to debug and inspected the items in the stack trace. This is what the table looks like at the exception:

ipdb> self.table
/df/table (Table(10000,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": StringCol(itemsize=200, shape=(53,), dflt='', pos=1)}
  byteorder := 'little'
  chunkshape := (24,)
  autoIndex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_CSI=False}

Update 4

Now I'm trying to iteratively determine the length of the longest string in object columns of my dataframe. This is how I do it:

def f(x):
    # return the length of the longest string in an object column, else None
    if x.dtype != 'object':
        return
    else:
        return len(max(x.fillna(''), key=lambda s: len(str(s))))

lengths = pd.DataFrame([chunk.apply(f) for chunk in pd.read_csv('Train.csv', chunksize=50000)])
lens = lengths.max().dropna().to_dict()

In [255]: lens
Out[255]:
{'Backhoe_Mounting': 19.0,
 'Blade_Extension': 19.0,
 'Blade_Type': 19.0,
 'Blade_Width': 19.0,
 'Coupler': 19.0,
 'Coupler_System': 19.0,
 'Differential_Type': 12.0
 ... etc... }

Once I have the dict of max string-column lengths, I try to pass it to append via the min_itemsize argument:

In [262]: for chunk in pd.read_csv('Train.csv', chunksize=50000, dtype=types):
   .....:     store.append('df', chunk, min_itemsize=lens)

Exception: cannot find the correct atom type -> [dtype->object,items->Index([Usa
geBand, saledate, fiModelDesc, fiBaseModel, fiSecondaryDesc, fiModelSeries, fiMo
delDescriptor, ProductSize, fiProductClassDesc, state, ProductGroup, ProductGrou
pDesc, Drive_System, Enclosure, Forks, Pad_Type, Ride_Control, Stick, Transmissi
on, Turbocharged, Blade_Extension, Blade_Width, Enclosure_Type, Engine_Horsepowe
r, Hydraulics, Pushblock, Ripper, Scarifier, Tip_Control, Tire_Size, Coupler, Co
upler_System, Grouser_Tracks, Hydraulics_Flow, Track_Type, Undercarriage_Pad_Wid
th, Stick_Length, Thumb, Pattern_Changer, Grouser_Type, Backhoe_Mounting, Blade_
Type, Travel_Controls, Differential_Type, Steering_Controls], dtype=object)] [va
lues_block_2] column has a min_itemsize of [64] but itemsize [58] is required!

The offending column was passed a min_itemsize of 64, yet the exception states that an itemsize of 58 is required. Could this be a bug?

In [266]: pd.__version__
Out[266]: '0.11.0.dev-eb07c5a'

Answer

The link you provided worked just fine to store the frame. Column by column just means specify data_columns=True. It will process the columns individually and raise on the offending one.

To diagnose:

store = pd.HDFStore('test0.h5','w')
In [31]: for chunk in pd.read_csv('Train.csv', chunksize=10000):
   ....:     store.append('df', chunk, index=False, data_columns=True)

In production, you probably want to restrict data_columns to the columns that you want to query (could be None as well, in which case you can query only on the index/columns)
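
A minimal sketch of that production setup, assuming (purely for illustration) that SalesID and YearMade are the only fields you will query on, and a pandas version where select accepts a plain string as the where clause:

store = pd.HDFStore('train.h5', 'w')
for chunk in pd.read_csv('Train.csv', chunksize=10000):
    # only SalesID and YearMade are indexed as queryable data columns;
    # the remaining columns are packed into shared values blocks
    store.append('df', chunk, index=False, data_columns=['SalesID', 'YearMade'])

recent = store.select('df', 'YearMade > 2000')   # pull back only the matching rows
store.close()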

Update:

You might run into another issue. read_csv infers dtypes based on what it sees in each chunk, so with a chunksize of 10,000 the append operations failed because chunks 1 and 2 had integer-looking data in some columns, and then chunk 3 had some NaN, so those columns became floats. Either specify the dtypes upfront, use a larger chunksize, or run your operations twice to guarantee consistent dtypes between chunks.
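
A sketch of the "specify upfront" option, forcing the two columns that the info() output above shows contain missing values (auctioneerID and MachineHoursCurrentMeter) to float64 so every chunk agrees on the dtype (an illustration under that assumption, not the exact code used here):

dtypes = {'auctioneerID': 'float64', 'MachineHoursCurrentMeter': 'float64'}

store = pd.HDFStore('test0.h5', 'w')
for chunk in pd.read_csv('Train.csv', dtype=dtypes, chunksize=10000):
    store.append('df', chunk, index=False, data_columns=True)
store.close()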

I have updated pytables.py to raise a more helpful exception in this case (as well as telling you if a column has incompatible data).

Thanks for the report!
