pandas read_csv index_col=None not working with delimiters at the end of each line


Question

I am going through the 'Python for Data Analysis' book and having trouble reading the data into a DataFrame in the 'Example: 2012 Federal Election Commission Database' section. The trouble is that one of the columns of data is always being set as the index column, even when the index_col argument is set to None.

Here is the link to the data: http://www.fec.gov/disclosurep/PDownload.do.

Here is the loading code (to save time when checking, I set nrows=10):

import pandas as pd
fec = pd.read_csv('P00000001-ALL.csv',nrows=10,index_col=None)

To keep it short I am excluding the data column outputs, but here is my output (please note the Index values):

In [20]: fec

Out[20]:
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, C00410118 to C00410118
Data columns:
...
dtypes: float64(4), int64(3), object(11)

And here is the book's output (again with data columns excluded):

In [13]: fec = read_csv('P00000001-ALL.csv')
In [14]: fec
Out[14]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1001731 entries, 0 to 1001730
...
dtypes: float64(1), int64(1), object(14)

The Index values in my output are actually the first column of data in the file, which shifts all the rest of the data to the left by one column. Would anyone know how to prevent this column of data from being used as the index? I would like the index to be plain integers increasing by 1.

I am fairly new to Python and pandas, so I apologize for any inconvenience. Thanks.

Recommended answer

Quick Answer

When each line ends with a delimiter, use index_col=False instead of index_col=None. This turns off index column inference and discards the last (empty) column.

Looking at the data, there is a comma at the end of each line. The following quote from the documentation (which has been edited since this post was written) explains the result:

index_col: column number, column name, or list of column numbers/names, to use as the index (row labels) of the resulting DataFrame. By default, it will number the rows without using any column, unless there is one more data column than there are headers, in which case the first column is taken as the index.

In other words, pandas believes you have n headers and n+1 data columns and is treating the first column as the index.
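As a rough illustration of that inference, here is a minimal sketch that assumes a small made-up sample in place of the real FEC file. The column names and values are hypothetical; the only thing that matters is that the header has three fields while every data row ends with an extra comma.

```python
import io

import pandas as pd

# Hypothetical sample: three header fields, but each data row has a
# trailing comma, so the parser sees four fields per row.
raw = "cmte_id,cand_nm,amount\nC001,Smith,250,\nC002,Jones,100,\n"

# index_col=None leaves index inference switched on, so the first
# field of every row is taken as the index.
shifted = pd.read_csv(io.StringIO(raw), index_col=None)

print(shifted.index.tolist())    # the committee IDs end up here
print(shifted.columns.tolist())  # header names are unchanged
print(shifted)                   # values sit one column too far left
```

The last column comes out empty because the field after the trailing comma is blank, which matches the left shift described in the question.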

EDIT 10/20/2014 - More information

I found another valuable entry in the documentation that is specifically about trailing delimiters and how to simply ignore them:

If a file has one more column of data than the number of column names, the first column will be used as the DataFrame’s row names: ...

Ordinarily, you can achieve this behavior using the index_col option.

There are some exception cases when a file has been prepared with delimiters at the end of each data line, confusing the parser. To explicitly disable the index column inference and discard the last column, pass index_col=False: ...
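A short sketch of the fix, again on a made-up three-column sample rather than the real P00000001-ALL.csv (the names and values are assumptions for illustration only):

```python
import io

import pandas as pd

# Hypothetical sample with a trailing comma on every data row.
raw = "cmte_id,cand_nm,amount\nC001,Smith,250,\nC002,Jones,100,\n"

# index_col=False disables index inference; the empty field created by
# the trailing delimiter is dropped and rows get a default integer index.
fixed = pd.read_csv(io.StringIO(raw), index_col=False)

print(fixed.index.tolist())
print(fixed)
```

Applied to the question's call, the same idea would presumably be pd.read_csv('P00000001-ALL.csv', nrows=10, index_col=False), which should line the values up with their headers.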
