CSV file with quoted comma can't be correctly split by Python
Question
import csv

def csv_split():
    raw = [
        '"1,2,3" , "4,5,6" , "456,789"',
        '"text":"a,b,c,d", "gate":"456,789"'
    ]
    # skipinitialspace=True ignores whitespace right after each delimiter
    cr = csv.reader(raw, skipinitialspace=True)
    for row in cr:
        print(len(row), row)
This function outputs the following:
3 ['1,2,3 ', '4,5,6 ', '456,789']
6 ['text:"a', 'b', 'c', 'd"', 'gate:"456', '789"']
As you can see, the first line is correctly split into 3 entries, but the second line is not. I would expect the csv reader to split it into two fields; instead we get 6. I have also thought about regex approaches, but they assume a specific quoting dialect.
Basically, what I want is to split the string at every "," that is not enclosed in a pair of double quotes.
Is there a quick and general way to do this? I have seen some regex hacks that assume every field is ALWAYS quoted, and so on. I think I could write a small loop that does this very inefficiently, but I would definitely appreciate some expert advice. Thanks a lot!
Recommended answer
CSV isn't a standardized format, but a common convention is to escape a quotation mark that appears inside a quoted field by doubling it (""), e.g. "text"":""a,b,c,d". Python's csv reader is doing the right thing here, because it assumes this convention. I'm not quite sure what you expect as output, but here is my attempt at a very simple CSV reader that might suit your format. Feel free to adapt it accordingly.
raw = [
    '"1,2,3" , "4,5,6" , "456,789"',
    '"text":"a,b,c,d", "gate":"456,789"',
    '1,2, 3,'
]

for line in raw:
    # i marks the start of the current field; quoted tracks whether we are inside "..."
    i, quoted, row = 0, False, []
    for j, c in enumerate(line):
        if c == ',' and not quoted:
            # an unquoted comma ends the current field
            row.append(line[i:j].strip())
            i = j + 1
        elif c == '"':
            quoted = not quoted
    # the last field runs from i to the end of the line
    row.append(line[i:].strip())
    for k in range(len(row)):
        if len(row[k]) >= 2 and row[k][0] == '"' and row[k][-1] == '"':
            row[k] = row[k][1:-1]  # remove the enclosing quotation marks
    print(row)
Output:
['1,2,3', '4,5,6', '456,789']
['text":"a,b,c,d', 'gate":"456,789']
['1', '2', '3', '']
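For comparison, if you control how the file is written and can use the doubled-quote convention mentioned above, the standard csv module should split the second line into two fields without any custom code. A minimal sketch, assuming the input has been rewritten so that each inner quote is doubled; the helper name parse_lines is made up for illustration:

```python
import csv

def parse_lines(lines):
    # Hypothetical helper: parse an iterable of CSV lines into a list of rows.
    # skipinitialspace=True drops the blank that follows each comma.
    return list(csv.reader(lines, skipinitialspace=True))

# Assumption: the second sample line, rewritten so that every quote
# inside a quoted field is doubled ("" stands for one literal ").
raw = ['"text"":""a,b,c,d", "gate"":""456,789"']

for row in parse_lines(raw):
    print(len(row), row)
```

With the input in this form, the hand-written loop is only needed for files that use the nonstandard "key":"value" quoting from the question.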