Pandas: How to work around "error tokenizing data"?


Question

A lot of questions have already been asked about this topic on SO (and many others). Among the numerous answers, none has really helped me so far. If I missed a useful one, please let me know.

I simply want to read a CSV file into a dataframe with pandas. Sounds like a simple task.

My file Test.csv:

1,2,3,4,5
1,2,3,4,5,6
,,3,4,5
1,2,3,4,5,6,7
,2,,4

My code:

import pandas as pd
df = pd.read_csv('Test.csv',header=None)

My error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 6

My guess is that pandas looks at the first line and expects the same number of tokens in all following rows. If this is not the case, it stops with an error.

The numerous answers suggest options such as error_bad_lines=False, header=None, or skiprows=3, along with other suggestions that did not help.

However, I don't want to ignore or skip any lines. And I don't know in advance how many columns and rows the data file has.

So it basically boils down to how to find the maximum number of columns in the data file. Is this the way to go? I had hoped there was an easy way to read a CSV file whose first line does not have the maximum number of columns. Thank you for any hints. I'm using Python 3.6.3 and pandas 0.24.1 on Win7.
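One way to approach the "maximum number of columns" idea directly is a two-pass read: count the fields per row with Python's csv module, then hand pandas an explicit list of column names of that width. This is only a rough sketch, not taken from the answers referred to above; the variable raw is an in-memory stand-in for Test.csv, and max_cols is a name chosen here for illustration.

```python
import csv
import io

import pandas as pd

# In-memory stand-in for the contents of Test.csv (assumption for this sketch).
raw = "1,2,3,4,5\n1,2,3,4,5,6\n,,3,4,5\n1,2,3,4,5,6,7\n,2,,4\n"

# First pass: count the fields in every row and keep the largest count.
max_cols = max(len(row) for row in csv.reader(io.StringIO(raw)))

# Second pass: explicit column names tell the parser the full width up front,
# so shorter rows are padded with NaN instead of raising a ParserError.
df = pd.read_csv(io.StringIO(raw), header=None, names=list(range(max_cols)))
print(df)
```

With a real file, both passes would open 'Test.csv' instead of wrapping a string in io.StringIO. Note that missing values come back as NaN here, so the numeric columns that contain gaps are read as floats.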

Recommended answer

Thank you @ALollz for the "very fresh" link (lucky coincidence) and @Rich Andrews for pointing out that my example is actually not "strictly correct" CSV data.

So, the approach that works for me for the time being is adapted from @ALollz's compact solution (https://stackoverflow.com/a/55129746/7295599):

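# Read an "incorrect" CSV with a variable number of columns/tokens into a dataframe
import pandas as pd

# Read every line as one single string field (this worked with pandas 0.24.1)
df = pd.read_csv('Test.csv', header=None, sep='\n')
# Split each line on commas; shorter rows are padded with None on the right
df = df[0].str.split(',', expand=True)
# ... do some modifications with df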

df contains the empty string '' for missing entries at the beginning and in the middle of a row, and None for missing tokens at the end.

   0  1  2  3     4     5     6
0  1  2  3  4     5  None  None
1  1  2  3  4     5     6  None
2        3  4     5  None  None
3  1  2  3  4     5     6     7
4     2     4  None  None  None
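As far as I know, newer pandas releases refuse sep='\n' in read_csv and raise a ValueError, so the snippet above may be tied to older versions such as 0.24.1. A variant that avoids read_csv entirely could look like the sketch below. It reads the lines with plain Python and then applies the same str.split step; raw is again an in-memory stand-in for Test.csv, and this is an adaptation, not the original solution.

```python
import io

import pandas as pd

# In-memory stand-in for the contents of Test.csv (assumption for this sketch).
raw = "1,2,3,4,5\n1,2,3,4,5,6\n,,3,4,5\n1,2,3,4,5,6,7\n,2,,4\n"

# Read the text line by line without any CSV parsing.
lines = pd.Series(io.StringIO(raw).read().splitlines())

# Same splitting step as before: one column per token, short rows padded on the right.
df = lines.str.split(',', expand=True)
print(df)
```

For a real file, the lines could come from open('Test.csv').read().splitlines() instead. All values stay strings, exactly as in the read_csv version.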

If you write this back to a file via:

df.to_csv("Test.tab", sep="\t", header=False, index=False)

1   2   3   4   5       
1   2   3   4   5   6   
        3   4   5       
1   2   3   4   5   6   7
    2       4           

None is converted to the empty string '' and everything is fine.
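The None-to-empty-string behaviour can be checked without touching the disk by writing to an in-memory buffer. The small frame below is made up for illustration and only mimics the mix of '' and None described above.

```python
import io

import pandas as pd

# Tiny made-up frame with an empty string and a None, as in the split result.
df = pd.DataFrame([["1", "2", "3"], ["", "2", None]])

# Write tab-separated text to an in-memory buffer instead of a file.
buf = io.StringIO()
df.to_csv(buf, sep="\t", header=False, index=False)

# Both '' and None end up as empty fields between the tabs.
written = buf.getvalue().splitlines()
print(written)
```

This relies on the default na_rep='' of to_csv; passing a different na_rep would make the None entries visible in the output.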

The next level would be to account for quoted data strings that contain the separator, but that's another topic.

1,2,3,4,5
,,3,"Hello, World!",5,6
1,2,3,4,5,6,7
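For data like this, a plain str.split(',') would cut "Hello, World!" in two. One possible direction, sketched here under the assumption that the file follows the usual double-quote convention, is to let Python's csv module do the tokenizing and build the dataframe from the resulting rows. The variable raw is an in-memory stand-in for such a file.

```python
import csv
import io

import pandas as pd

# In-memory stand-in for a file with a quoted field containing the separator.
raw = '1,2,3,4,5\n,,3,"Hello, World!",5,6\n1,2,3,4,5,6,7\n'

# csv.reader respects the quotes, so "Hello, World!" stays a single token.
rows = list(csv.reader(io.StringIO(raw)))

# Rows of unequal length are padded on the right when building the frame.
df = pd.DataFrame(rows)
print(df)
```

As with the split-based approach, every value is a string, empty fields are '' and the padding at the end of short rows is a missing value.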
