Pandas seems to ignore first column name when reading tab-delimited data, gives KeyError


Problem description

I am using pandas 0.12.0 in ipython3 on Ubuntu 13.10, in order to wrangle large tab-delimited datasets in txt files. Using read_table to create a DataFrame from the txt appears to work, and the first row is read as a header, but attempting to access the first column using its name as an index throws a KeyError. I don't understand why this happens, given that the column names all appear to have been read correctly, and every other column can be indexed in this way.

The data looks like this:

RECORDING_SESSION_LABEL LEFT_GAZE_X LEFT_GAZE_Y RIGHT_GAZE_X    RIGHT_GAZE_Y    VIDEO_FRAME_INDEX   VIDEO_NAME
73_1    .   .   395.1   302 .   .
73_1    .   .   395 301.9   .   .
73_1    .   .   394.9   301.7   .   .
73_1    .   .   394.8   301.5   .   .
73_1    .   .   394.6   301.3   .   .
73_1    .   .   394.7   300.9   .   .
73_1    .   .   394.9   301.3   .   .
73_1    .   .   395.2   302 1   1_1_just_act.avi
73_1    .   .   395.3   302.3   1   1_1_just_act.avi
73_1    .   .   395.4   301.9   1   1_1_just_act.avi
73_1    .   .   395.7   301.5   1   1_1_just_act.avi
73_1    .   .   395.9   301.5   1   1_1_just_act.avi
73_1    .   .   396 301.5   1   1_1_just_act.avi
73_1    .   .   395.9   301.5   1   1_1_just_act.avi
15_1    395.4   301.7   .   .   .   .

The delimiter is definitely tabs, and there is no trailing or leading whitespace.

The error occurs with this minimal program:

import pandas as pd

samples = pd.read_table('~/datafile.txt')

print(samples['RECORDING_SESSION_LABEL'])

This gives the error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-65-137d3c16b931> in <module>()
----> 1 print(samples['RECORDING_SESSION_LABEL'])

/usr/lib/python3/dist-packages/pandas/core/frame.py in __getitem__(self, key)
   2001             # get column
   2002             if self.columns.is_unique:
-> 2003                 return self._get_item_cache(key)
   2004 
   2005             # duplicate columns

/usr/lib/python3/dist-packages/pandas/core/generic.py in _get_item_cache(self, item)
    665             return cache[item]
    666         except Exception:
--> 667             values = self._data.get(item)
    668             res = self._box_item_values(item, values)
    669             cache[item] = res

/usr/lib/python3/dist-packages/pandas/core/internals.py in get(self, item)
   1654     def get(self, item):
   1655         if self.items.is_unique:
-> 1656             _, block = self._find_block(item)
   1657             return block.get(item)
   1658         else:

/usr/lib/python3/dist-packages/pandas/core/internals.py in _find_block(self, item)
   1934 
   1935     def _find_block(self, item):
-> 1936         self._check_have(item)
   1937         for i, block in enumerate(self.blocks):
   1938             if item in block:

/usr/lib/python3/dist-packages/pandas/core/internals.py in _check_have(self, item)
   1941     def _check_have(self, item):
   1942         if item not in self.items:
-> 1943             raise KeyError('no item named %s' % com.pprint_thing(item))
   1944 
   1945     def reindex_axis(self, new_axis, method=None, axis=0, copy=True):

KeyError: 'no item named RECORDING_SESSION_LABEL'

Simply doing print(samples) gives the expected output of printing the whole table, complete with the first column and its header. Trying to print any other column (i.e. the exact same code, but with 'RECORDING_SESSION_LABEL' replaced with 'LEFT_GAZE_X') works as it should. Furthermore, the header seems to have been read in correctly, and pandas recognizes 'RECORDING_SESSION_LABEL' as a column name. This is evidenced by using the .info() method and viewing the .columns attribute of samples after it has been read in:

>samples.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28 entries, 0 to 27
Data columns (total 7 columns):
RECORDING_SESSION_LABEL    28  non-null values
LEFT_GAZE_X                 28  non-null values
LEFT_GAZE_Y                 28  non-null values
RIGHT_GAZE_X                28  non-null values
RIGHT_GAZE_Y                28  non-null values
VIDEO_FRAME_INDEX           28  non-null values
VIDEO_NAME                  28  non-null values
dtypes: object(7)

>print(samples.columns)

Index(['RECORDING_SESSION_LABEL', 'LEFT_GAZE_X', 'LEFT_GAZE_Y', 'RIGHT_GAZE_X', 'RIGHT_GAZE_Y', 'VIDEO_FRAME_INDEX', 'VIDEO_NAME'], dtype=object)

Another erroneous behaviour that I feel is related occurs when using ipython's tab completion, which allows me to access the columns of samples as if they were attributes. It works for every column except the first, i.e. hitting the tab key with >samples.R only suggests samples.RIGHT_GAZE_X and samples.RIGHT_GAZE_Y.

So why is it behaving normally when looking at the whole dataframe, but failing when trying to access the first column by its name, even though it appears to have correctly read in that name?

Recommended answer

Sounds like you just need to conditionally remove the BOM from the start of your files. You can do this with a wrapper around the file like so:

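import codecs
import os

import pandas as pd

def remove_bom(filename):
    # open() does not expand '~', so expand it explicitly
    fp = open(os.path.expanduser(filename), 'rb')
    # Assumes a UTF-8 file: skip the 3-byte BOM if present, else rewind
    if fp.read(len(codecs.BOM_UTF8)) != codecs.BOM_UTF8:
        fp.seek(0, 0)
    return fp

# read_table also accepts a file object, so we can remove the BOM first
samples = pd.read_table(remove_bom('~/datafile.txt'))

print(samples['RECORDING_SESSION_LABEL'])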
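
To see why a BOM would produce exactly this symptom, here is a minimal sketch using only the standard library. It assumes the file is UTF-8 with a leading byte order mark; the `raw` bytes and the `header_names` helper are hypothetical stand-ins for illustration, not part of the original question or of pandas.

```python
import codecs
import csv
import io

# Hypothetical in-memory stand-in for the start of datafile.txt,
# assumed here to be UTF-8 with a leading byte order mark.
raw = codecs.BOM_UTF8 + b"RECORDING_SESSION_LABEL\tLEFT_GAZE_X\n73_1\t.\n"


def header_names(data, encoding):
    """Decode the bytes and return the column names from the first row."""
    text = io.StringIO(data.decode(encoding), newline="")
    return next(csv.reader(text, delimiter="\t"))


with_bom = header_names(raw, "utf-8")         # BOM survives as U+FEFF
without_bom = header_names(raw, "utf-8-sig")  # BOM stripped while decoding

# ascii() exposes the invisible character that a plain print() hides.
print(ascii(with_bom[0]))
print(ascii(without_bom[0]))

# Only the BOM-free name matches the key used in the question.
print(with_bom[0] == "RECORDING_SESSION_LABEL")
print(without_bom[0] == "RECORDING_SESSION_LABEL")
```

The first name looks identical when printed but carries an invisible U+FEFF prefix, which would explain both the KeyError and the missing tab completion. If the file really is UTF-8 with a BOM, passing encoding='utf-8-sig' to read_table should have the same effect as the wrapper above.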
