Removing BOM from gzip'ed CSV in Python

Problem description

I'm using the following code to unzip and save a CSV file:

with gzip.open(filename_gz) as f:
    file = open(filename, "w");
    output = csv.writer(file, delimiter = ',')
    output.writerows(csv.reader(f, dialect='excel', delimiter = ';'))

Everything seems to work, except for the fact that the first characters in the file are unexpected. Googling around seems to indicate that it is due to BOM in the file.

I've read that encoding the content in utf-8-sig should fix the issue. However, adding:

.read().encode('utf-8-sig')

to f in csv.reader fails with:

File "ckan_gz_datastore.py", line 16, in <module>
    output.writerows(csv.reader(f.read().encode('utf-8-sig'), dialect='excel', delimiter = ';'))
File "/usr/lib/python2.7/encodings/utf_8_sig.py", line 15, in encode
    return (codecs.BOM_UTF8 + codecs.utf_8_encode(input, errors)[0], len(input))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

How can I remove the BOM and just save the content in correct utf-8?

Solution

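First, you need to decode the file contents, not encode them. In Python 2, calling encode on a byte string first decodes it implicitly as ASCII, which is what raises the UnicodeDecodeError on the BOM's first byte, 0xef.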

Second, the csv module doesn't like unicode strings in Python 2.7, so having decoded your data you need to convert back to utf-8.

Finally, csv.reader expects an iterable of lines, not one big string with line breaks in it.
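The difference between decoding and encoding with utf-8-sig can be checked on a small byte string. This is only a sketch: the sample bytes are made up for illustration, and the names raw, text, clean and with_bom are not from the question.

```python
import codecs

# Hypothetical sample: the raw bytes of a small CSV that starts with a UTF-8 BOM.
raw = codecs.BOM_UTF8 + b"name;city\nZo\xc3\xab;Gen\xc3\xa8ve\n"

# Decoding with 'utf-8-sig' consumes the leading BOM, if one is present.
text = raw.decode("utf-8-sig")

# Re-encoding as plain 'utf-8' gives BOM-free bytes.
clean = text.encode("utf-8")

# Encoding with 'utf-8-sig' does the opposite: it prepends a BOM.
with_bom = text.encode("utf-8-sig")
```

Here clean is the original data minus its first three bytes, while with_bom is byte-for-byte identical to raw.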

So:

csv.reader(f.read().decode('utf-8-sig').encode('utf-8').splitlines())
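The one-liner above targets Python 2.7, where csv works on byte strings. On Python 3 the csv module works on text instead, so the round trip through encode is not needed. A minimal sketch, assuming Python 3 and a UTF-8 encoded input file; convert_gz_csv and its parameter names are hypothetical, not part of the original code:

```python
import csv
import gzip


def convert_gz_csv(src_gz, dst_csv):
    """Copy a gzip'ed semicolon-delimited CSV to a plain comma-delimited
    CSV, dropping a leading BOM if there is one (Python 3 only)."""
    # "rt" opens the archive in text mode; "utf-8-sig" skips a leading BOM.
    # newline="" leaves line-ending handling to the csv module.
    with gzip.open(src_gz, "rt", encoding="utf-8-sig", newline="") as src, \
            open(dst_csv, "w", encoding="utf-8", newline="") as dst:
        reader = csv.reader(src, dialect="excel", delimiter=";")
        writer = csv.writer(dst, delimiter=",")
        writer.writerows(reader)
```

Because the file is decoded as it is read, rows are streamed one at a time and the whole archive never has to be held in memory.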

However, you might consider it simpler / more efficient just to remove the BOM manually:

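import codecs

def remove_bom(line):
    # Drop a leading UTF-8 BOM (the bytes EF BB BF) if the line starts with one.
    return line[len(codecs.BOM_UTF8):] if line.startswith(codecs.BOM_UTF8) else line

csv.reader((remove_bom(line) for line in f), dialect='excel', delimiter=';')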

That is subtly different, since it removes a BOM from every line that starts with one, not just the first line. If you don't need to keep BOMs on later lines, that's fine; otherwise you can strip it from the first line only:

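def remove_bom_from_first(iterable):
    f = iter(iterable)
    firstline = next(f, None)
    if firstline is not None:
        # Only the first line can carry the file-level BOM.
        yield remove_bom(firstline)
        for line in f:
            # Pass the remaining lines through unchanged.
            yield line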
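To see how the two manual strategies differ, both can be run on a few byte-string lines. This is a self-contained sketch: the sample lines are invented for illustration, and the generator's inner loop yields line, the loop variable.

```python
import codecs


def remove_bom(line):
    """Return line without a leading UTF-8 BOM (byte strings only)."""
    if line.startswith(codecs.BOM_UTF8):
        return line[len(codecs.BOM_UTF8):]
    return line


def remove_bom_from_first(iterable):
    """Yield the lines of iterable, stripping a BOM from the first one only."""
    lines = iter(iterable)
    first = next(lines, None)
    if first is not None:
        yield remove_bom(first)
        for line in lines:
            yield line


# Hypothetical input: a BOM at the start of the file and another on line 2.
sample = [codecs.BOM_UTF8 + b"a;b\n", codecs.BOM_UTF8 + b"c;d\n", b"e;f\n"]

every_line = [remove_bom(line) for line in sample]
first_only = list(remove_bom_from_first(sample))
```

With this input, every_line has no BOM anywhere, while first_only still carries the BOM at the start of its second line. An empty input yields nothing from either approach.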
