CSV file with quoted comma can't be correctly split by Python


Question

import csv

def csv_split():
    raw = [
        '"1,2,3" , "4,5,6" , "456,789"',
        '"text":"a,b,c,d", "gate":"456,789"',
    ]
    # skipinitialspace=True ignores spaces directly after a delimiter
    cr = csv.reader(raw, skipinitialspace=True)
    for l in cr:
        print(len(l), l)

csv_split()

This function outputs the following:

3 ['1,2,3 ', '4,5,6 ', '456,789']
6 ['text:"a', 'b', 'c', 'd"', 'gate:"456', '789"']

As you can see, the first line is correctly split into 3 entries, but the second line is not. I would expect the csv reader to split it into two fields; instead we get 6. I have also considered regex approaches, but they assume a specific quoting dialect.

Basically what I want is: split the string at every "," that is not enclosed in a pair of "".

Is there a quick and general way to do this? I have seen some regex hacks that assume every field is ALWAYS quoted, etc. I think I could write a small loop that does this very inefficiently, but I would definitely appreciate some expert advice. Thanks a lot!

Recommended answer

CSV isn't a standardized format, but it's common to escape quotation marks by doubling them ("") when they appear inside quoted text (e.g. "text"":""a,b,c,d"). Python's CSV reader is doing the right thing here, because it assumes this convention. I'm not quite sure what you expect as output, but here is my attempt at a very simple CSV reader that might suit your format. Feel free to adapt it accordingly.

raw = [
    '"1,2,3" , "4,5,6" , "456,789"',
    '"text":"a,b,c,d", "gate":"456,789"',
    '1,2,  3,'
]

for line in raw:
    i, quoted, row = 0, False, []
    for j, c in enumerate(line):
        if c == ',' and not quoted:
            # unquoted comma: close the current field
            row.append(line[i:j].strip())
            i = j + 1
        elif c == '"':
            # entering or leaving a quoted section
            quoted = not quoted
    row.append(line[i:].strip())  # last field (also safe for an empty line)
    for k, field in enumerate(row):
        if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
            row[k] = field[1:-1]  # remove surrounding quotation marks
    print(row)

Output:

['1,2,3', '4,5,6', '456,789']
['text":"a,b,c,d', 'gate":"456,789']
['1', '2', '3', '']
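
For comparison, if the input can be produced with the doubled-quote convention in the first place, the standard csv module should split it into two fields without any custom loop. A minimal sketch, assuming the second sample line is rewritten with its inner quotes doubled (the `escaped` list is a made-up variant of the question's data, not something from the original post):

```python
import csv

# Hypothetical rewrite of the question's second line:
# every quote inside a field is doubled ("").
escaped = ['"text"":""a,b,c,d", "gate"":""456,789"']

# skipinitialspace=True drops the space that follows the delimiter.
rows = list(csv.reader(escaped, skipinitialspace=True))
for row in rows:
    print(len(row), row)
```

Each row then holds two fields, with the doubled quotes collapsed back into single ones, which matches what the hand-written loop produces for the unescaped line.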
