Double-quoted elements in CSV can't be read with pandas
Problem description
I have an input CSV file in which every value is stored as a string, with each entry enclosed in double quotes.
Example file:
"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"
There are only six columns. What options do I need to pass to pandas read_csv to read this correctly?
I am currently trying:
import pandas as pd
df = pd.read_csv(file, quotechar='"')
but this gives me the error message:
CParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 14
This apparently means that it is ignoring the '"' and treating every comma as a field separator. However, for line 3, columns 3 through 6 should be strings with commas in them. ("1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD")
How do I get pandas.read_csv to parse this correctly?
Thanks.
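The count of 14 in the error message can be reproduced with Python's standard csv module, which appears to follow the same rule as the pandas parser here: a quote only opens a quoted field when it is the first character of that field. The following is a minimal sketch, not the pandas code path itself. It uses line 3 of the example file, and the variable names are purely illustrative.

```python
import csv

# Line 3 of the example file, copied verbatim (note the space after each comma).
line = '"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"'

# Default dialect: the leading space means ' "08"' does not start with a quote,
# so it is treated as an unquoted field and every comma splits.
default_fields = next(csv.reader([line]))

# skipinitialspace=True drops the space after the delimiter, so the quote is
# again the first character of the field and the embedded commas are protected.
skipped_fields = next(csv.reader([line], skipinitialspace=True))

print(len(default_fields))   # 14
print(len(skipped_fields))   # 6
print(skipped_fields[2])     # 1,2,3
```

Only the first field, "AM", is recognised as quoted in the default case, because it is the only one with no space before its opening quote.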
Recommended answer
This will work. It falls back to the Python parser because the separators are not regular (a comma, sometimes followed by a space), so sep has to be a regex. If the separators were plain commas, it would use the C parser and be much faster.
In [1]: import csv
   ...: import pandas as pd
In [2]: !cat test.csv
"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"
In [3]: pd.read_csv('test.csv', sep=r',\s+', quoting=csv.QUOTE_ALL)
pandas/io/parsers.py:637: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
ParserWarning)
Out[3]:
      "column1","column2"  "column3"   "column4"   "column5"   "column6"
"AM"                 "07"        "1"        "SD"        "SD"        "CR"
"AM"                 "08"    "1,2,3"  "PR,SD,SD"  "PR,SD,SD"  "PR,SD,SD"
"AM"                 "01"        "2"        "SD"        "SD"        "SD"
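Note that in the output of the answer above the quote characters remain part of the values, and the first two header names are fused because no space follows the first comma in the header line. An alternative worth trying, offered as a sketch and not as the answerer's method, is the skipinitialspace parameter of read_csv. It skips spaces after the delimiter, so the quotes are recognised and the default C engine can be kept. The sample below inlines the example file via io.StringIO. dtype=str is an added assumption so that values such as "07" stay strings, as the question describes.

```python
import io

import pandas as pd

# The example file from the question, inlined so the snippet is self-contained.
data = '''"column1","column2", "column3", "column4", "column5", "column6"
"AM", "07", "1", "SD", "SD", "CR"
"AM", "08", "1,2,3", "PR,SD,SD", "PR,SD,SD", "PR,SD,SD"
"AM", "01", "2", "SD", "SD", "SD"
'''

# skipinitialspace=True ignores the space after each comma, so each quote is
# seen at the start of its field and commas inside quotes are not split on.
# dtype=str (an assumption, see above) keeps every value as a string.
df = pd.read_csv(io.StringIO(data), skipinitialspace=True, dtype=str)

print(df.shape)
print(df.loc[1, "column3"])
```

With a file on disk, the io.StringIO(data) argument would be replaced by the file path.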