numpy genfromtxt/pandas read_csv; ignore commas within quote marks


Problem description

Consider a file, a.dat, with contents:

address 1, address 2, address 3, num1, num2, num3
address 1, address 2, address 3, 1.0, 2.0, 3
address 1, address 2, "address 3, address4", 1.0, 2.0, 3

I am trying to import it with numpy.genfromtxt. However, the function sees an additional column in row 3. I get a similar error with pandas.read_csv:

np.genfromtxt('a.dat',delimiter=',',dtype=None,skiprows=1)

ValueError: Some errors were detected !
    Line #3 (got 7 columns instead of 6)

and pandas.read_csv sort of works, but it gives me an unaligned data structure:

pd.read_csv('a.dat')

pandas.parser.CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 7

I'm trying to find an input parameter to compensate for this. I don't mind if I end up with a numpy ndarray or pandas dataframe.

Is there a parameter that I can set within genfromtxt and/or read_csv that will let me ignore the comma within the speech marks?

I note that read_csv includes a quotechar='"' parameter, defined thus:

quotechar : string (length 1) The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.

This reads to me like read_csv should work for my case by default - yet it doesn't.

I can see that I could pre-process the file to strip out the commas - I'd like to avoid that if possible but would welcome suggestions if this is the only way.

Solution

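I just managed to find the answer.

The key parameter that I was missing is skipinitialspace=True, which "deals with the spaces after the comma-delimiter". Without it, the space after each comma becomes the first character of the field, so the quote mark that follows is not treated as the start of a quoted item and the comma inside it still splits the field.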

import pandas as pd

a = pd.read_csv('a.dat', quotechar='"', skipinitialspace=True)

   address 1  address 2            address 3  num1  num2  num3
0  address 1  address 2            address 3     1     2     3
1  address 1  address 2  address 3, address4     1     2     3
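
For anyone who wants to reproduce this without creating a.dat, here is a minimal sketch that feeds the same three lines to read_csv from memory. The DATA constant and the read_addresses helper are names made up for this illustration, and the final to_numpy() call is only there because the question said an ndarray would be acceptable too.

```python
import io

import pandas as pd

# Same contents as a.dat in the question, held in memory for the demo.
DATA = (
    'address 1, address 2, address 3, num1, num2, num3\n'
    'address 1, address 2, address 3, 1.0, 2.0, 3\n'
    'address 1, address 2, "address 3, address4", 1.0, 2.0, 3\n'
)


def read_addresses(text):
    """Hypothetical helper: parse the text with the fix from the answer."""
    return pd.read_csv(
        io.StringIO(text),
        quotechar='"',          # this is already the default, shown for clarity
        skipinitialspace=True,  # drop the space after each comma
    )


df = read_addresses(DATA)
print(df)

# If an ndarray is preferred, the frame can be converted afterwards.
arr = df.to_numpy()
print(arr.shape)
```

The quoted field comes back as a single value, comma included, and the numeric columns line up under num1, num2 and num3.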

This works :-)
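
The answer above only covers pandas. As far as I know, genfromtxt has no quote-handling option, so if you want to stay with numpy, one possible workaround (a swapped-in technique, not something from the answer) is to let the standard library csv module do the splitting and build the arrays afterwards. This is a sketch under that assumption; DATA and load_rows are hypothetical names for the example.

```python
import csv
import io

import numpy as np

# Same contents as a.dat in the question, held in memory for the demo.
DATA = (
    'address 1, address 2, address 3, num1, num2, num3\n'
    'address 1, address 2, address 3, 1.0, 2.0, 3\n'
    'address 1, address 2, "address 3, address4", 1.0, 2.0, 3\n'
)


def load_rows(text):
    """Hypothetical helper: split the text into a header and data rows."""
    reader = csv.reader(
        io.StringIO(text),
        quotechar='"',
        skipinitialspace=True,  # same idea as the pandas parameter
    )
    header = next(reader)
    rows = [row for row in reader if row]  # skip any blank lines
    return header, rows


header, rows = load_rows(DATA)

# First three columns are text, the last three are numbers.
text_cols = np.array([row[:3] for row in rows])
nums = np.array([row[3:] for row in rows], dtype=float)

print(header)
print(text_cols)
print(nums)
```

Splitting the text and numeric columns into two arrays avoids a mixed-type array; a structured dtype would be another option if everything has to live in one ndarray.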
