Pandas seems to ignore first column name when reading tab-delimited data, gives KeyError
Problem description
I am using pandas 0.12.0 in ipython3 on Ubuntu 13.10, in order to wrangle large tab-delimited datasets in txt files. Using read_table to create a DataFrame from the txt appears to work, and the first row is read as a header, but attempting to access the first column using its name as an index throws a KeyError. I don't understand why this happens, given that the column names all appear to have been read correctly, and every other column can be indexed in this way.
The data looks like this:
RECORDING_SESSION_LABEL LEFT_GAZE_X LEFT_GAZE_Y RIGHT_GAZE_X RIGHT_GAZE_Y VIDEO_FRAME_INDEX VIDEO_NAME
73_1 . . 395.1 302 . .
73_1 . . 395 301.9 . .
73_1 . . 394.9 301.7 . .
73_1 . . 394.8 301.5 . .
73_1 . . 394.6 301.3 . .
73_1 . . 394.7 300.9 . .
73_1 . . 394.9 301.3 . .
73_1 . . 395.2 302 1 1_1_just_act.avi
73_1 . . 395.3 302.3 1 1_1_just_act.avi
73_1 . . 395.4 301.9 1 1_1_just_act.avi
73_1 . . 395.7 301.5 1 1_1_just_act.avi
73_1 . . 395.9 301.5 1 1_1_just_act.avi
73_1 . . 396 301.5 1 1_1_just_act.avi
73_1 . . 395.9 301.5 1 1_1_just_act.avi
15_1 395.4 301.7 . . . .
The delimiter is definitely tabs, and there is no trailing or leading whitespace.
The error occurs with this minimal program:
import pandas as pd
samples = pd.read_table('~/datafile.txt')
print(samples['RECORDING_SESSION_LABEL'])
This gives the error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-65-137d3c16b931> in <module>()
----> 1 print(samples['RECORDING_SESSION_LABEL'])
/usr/lib/python3/dist-packages/pandas/core/frame.py in __getitem__(self, key)
2001 # get column
2002 if self.columns.is_unique:
-> 2003 return self._get_item_cache(key)
2004
2005 # duplicate columns
/usr/lib/python3/dist-packages/pandas/core/generic.py in _get_item_cache(self, item)
665 return cache[item]
666 except Exception:
--> 667 values = self._data.get(item)
668 res = self._box_item_values(item, values)
669 cache[item] = res
/usr/lib/python3/dist-packages/pandas/core/internals.py in get(self, item)
1654 def get(self, item):
1655 if self.items.is_unique:
-> 1656 _, block = self._find_block(item)
1657 return block.get(item)
1658 else:
/usr/lib/python3/dist-packages/pandas/core/internals.py in _find_block(self, item)
1934
1935 def _find_block(self, item):
-> 1936 self._check_have(item)
1937 for i, block in enumerate(self.blocks):
1938 if item in block:
/usr/lib/python3/dist-packages/pandas/core/internals.py in _check_have(self, item)
1941 def _check_have(self, item):
1942 if item not in self.items:
-> 1943 raise KeyError('no item named %s' % com.pprint_thing(item))
1944
1945 def reindex_axis(self, new_axis, method=None, axis=0, copy=True):
KeyError: 'no item named RECORDING_SESSION_LABEL'
Simply doing print(samples) gives the expected output of printing the whole table, complete with the first column and its header. Trying to print any other column (i.e. the exact same code, but with 'RECORDING_SESSION_LABEL' replaced with 'LEFT_GAZE_X') works as it should. Furthermore, the header seems to have been read in correctly, and pandas recognizes 'RECORDING_SESSION_LABEL' as a column name. This is evidenced by using the .info() method and viewing the .columns attribute of samples after it has been read in:
>samples.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 28 entries, 0 to 27
Data columns (total 7 columns):
RECORDING_SESSION_LABEL 28 non-null values
LEFT_GAZE_X 28 non-null values
LEFT_GAZE_Y 28 non-null values
RIGHT_GAZE_X 28 non-null values
RIGHT_GAZE_Y 28 non-null values
VIDEO_FRAME_INDEX 28 non-null values
VIDEO_NAME 28 non-null values
dtypes: object(7)
>print(samples.columns)
Index(['RECORDING_SESSION_LABEL', 'LEFT_GAZE_X', 'LEFT_GAZE_Y', 'RIGHT_GAZE_X', 'RIGHT_GAZE_Y', 'VIDEO_FRAME_INDEX', 'VIDEO_NAME'], dtype=object)
Another error behaviour that I feel is related occurs when using ipython's tab completion, which allows me to access the columns of samples as if they were attributes. It works for every column except the first, i.e. hitting the tab key with >samples.R only suggests samples.RIGHT_GAZE_X and samples.RIGHT_GAZE_Y.
So why is it behaving normally when looking at the whole dataframe, but failing when trying to access the first column by its name, even though it appears to have correctly read in that name?
Recommended answer
Sounds like you just need to conditionally remove the BOM from the start of your files. You can do this with a wrapper around the file like so:
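import os
import pandas as pd

def remove_bom(filename):
    # open() does not expand '~', so expand it explicitly
    fp = open(os.path.expanduser(filename), 'rb')
    # b'\xfe\xff' is the UTF-16 big-endian BOM (a UTF-8 BOM would be
    # the three bytes b'\xef\xbb\xbf'); rewind if no BOM was found
    if fp.read(2) != b'\xfe\xff':
        fp.seek(0, 0)
    return fp

# read_table also accepts a file pointer, so we can remove the BOM first
samples = pd.read_table(remove_bom('~/datafile.txt'))
print(samples['RECORDING_SESSION_LABEL'])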
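To see why a BOM would produce exactly these symptoms, here is a minimal standard-library sketch. It assumes the file starts with a UTF-8 BOM, which the question does not confirm. The demo file and the helper header_names are hypothetical and exist only for illustration. Decoding with 'utf-8' keeps the BOM as an invisible U+FEFF character glued to the first column name, so the name looks right when printed but does not match the plain string, while 'utf-8-sig' drops the BOM during decoding.

```python
import os
import tempfile


def header_names(path, encoding):
    """Return the tab-separated names found on the first line of the file."""
    with open(path, encoding=encoding, newline="") as fp:
        return fp.readline().rstrip("\r\n").split("\t")


# Build a small hypothetical file that starts with a UTF-8 BOM (assumption).
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "datafile.txt")
with open(demo_path, "wb") as fp:
    fp.write(b"\xef\xbb\xbfRECORDING_SESSION_LABEL\tLEFT_GAZE_X\n73_1\t.\n")

plain = header_names(demo_path, "utf-8")      # BOM survives as U+FEFF
clean = header_names(demo_path, "utf-8-sig")  # BOM is stripped while decoding

# repr() exposes the hidden character that print() would not show.
print(repr(plain[0]))
print(repr(clean[0]))

# Membership tests mirror the failing and working column lookups.
print("RECORDING_SESSION_LABEL" in plain)
print("RECORDING_SESSION_LABEL" in clean)
```

If this is what is happening, printing repr(samples.columns[0]) on the real DataFrame should reveal the extra character. In recent pandas versions, passing encoding='utf-8-sig' to pd.read_table is probably the simplest way to drop a UTF-8 BOM without a wrapper; whether that also works in pandas 0.12.0 is not something the original thread establishes.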