numpy genfromtxt/pandas read_csv; ignore commas within quote marks
Consider a file, a.dat, with contents:
address 1, address 2, address 3, num1, num2, num3
address 1, address 2, address 3, 1.0, 2.0, 3
address 1, address 2, "address 3, address4", 1.0, 2.0, 3
I am trying to import it with numpy.genfromtxt. However, the function sees an additional column in row 3. I get a similar error with pandas.read_csv:
np.genfromtxt('a.dat',delimiter=',',dtype=None,skiprows=1)
ValueError: Some errors were detected !
Line #3 (got 7 columns instead of 6)
and pandas read_csv sort of works, but it gives me an unaligned data structure:
pd.read_csv('a.dat')
pandas.parser.CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 7
I'm trying to find an input parameter to compensate for this. I don't mind if I end up with a numpy ndarray or pandas dataframe.
Is there a parameter that I can set within genfromtxt and/or read_csv that will let me ignore the comma within the speech marks?
I note that read_csv includes a quotechar='"' parameter, defined thus:
quotechar : string (length 1) The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
This reads to me like read_csv should work for my case by default - yet it doesn't.
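A minimal sketch of why the quoting may not take effect here, using Python's standard csv module as a stand-in (an assumption: pandas has its own parser, though it appears to treat this case the same way). The space after each comma means the field does not begin with the quote character, so the quote is read as ordinary text.

```python
import csv
import io

# Third data row from a.dat: note the space before the opening quote.
line = 'address 1, address 2, "address 3, address4", 1.0, 2.0, 3\n'

# Default settings: the field starts with a space, so the quote is not
# recognised as opening a quoted item and the inner comma splits the field.
default_fields = next(csv.reader(io.StringIO(line)))

# skipinitialspace=True drops whitespace right after each delimiter, so the
# quote becomes the first character of the field and is honoured.
fixed_fields = next(csv.reader(io.StringIO(line), skipinitialspace=True))

print(len(default_fields), default_fields)
print(len(fixed_fields), fixed_fields)
```

The first call yields seven fields and the second yields six, which matches the "Expected 6 fields in line 3, saw 7" error above.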
I can see that I could pre-process the file to strip out the commas - I'd like to avoid that if possible but would welcome suggestions if this is the only way.
Just managed to find this:
The key parameter that I was missing is skipinitialspace=True, which "deals with the spaces after the comma-delimiter".
a=pd.read_csv('a.dat',quotechar='"',skipinitialspace=True)
   address 1  address 2            address 3  num1  num2  num3
0  address 1  address 2            address 3     1     2     3
1  address 1  address 2  address 3, address4     1     2     3
This works :-)
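For anyone who wants to reproduce this without creating a.dat, here is a small self-contained sketch. It feeds the same three lines to read_csv from an in-memory buffer (the io.StringIO buffer and the variable names are illustrative assumptions, not part of the original post), then converts the result to a numpy array, since either structure was acceptable.

```python
import io

import pandas as pd

# Same contents as a.dat, held in memory instead of on disk.
data = (
    "address 1, address 2, address 3, num1, num2, num3\n"
    "address 1, address 2, address 3, 1.0, 2.0, 3\n"
    'address 1, address 2, "address 3, address4", 1.0, 2.0, 3\n'
)

# skipinitialspace=True lets the parser see the quote at the start of the
# field; quotechar='"' is the default and is only spelled out for clarity.
df = pd.read_csv(io.StringIO(data), quotechar='"', skipinitialspace=True)

# Convert to a numpy ndarray if that is the preferred structure.
arr = df.to_numpy()

print(df.shape)
print(df.iloc[1, 2])
```

Because the columns mix text and numbers, the resulting array has object dtype; selecting only the numeric columns before converting would give a float array instead.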