Write utf-8 through python csv? (prev answer not working)


Problem description



In "Writing utf-8 formated Python lists to CSV", @abamert suggests some sample code from the csv documentation to handle this case.

I am unable to fix the problem with that code, and I wonder what I am doing wrong.

Here is my test code:

# -*- coding: UTF-8 -*-
import csv
import codecs
import csvutf8  # sample code from csv documentation.
x = u'owner’s'
with codecs.open('simpleout.txt', 'wb', 'UTF_8') as of:
    spamwriter = csvutf8.UnicodeWriter(of)
    spamwriter.writerow([x])

and csvutf8.py, the file into which I copied and pasted the code from the documentation, is at the end of this message.

The error message from codecs.py in the library is:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 5: ordinal not in range(128)

What can I do to make this work?

csvutf8.py

"""Helper classes to output UTF_8 through CSV in Python 2.x"""

import csv, codecs, cStringIO

class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

Solution

The UnicodeWriter sample code is meant to be used with a plain bytes file like you get from open, not a Unicode file like you get from codecs.open (or io.open). The simplest fix is to just use open instead of codecs.open in your main script:

with open('simpleout.txt', 'wb') as of:
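
Put together, a minimal sketch of the corrected test script (same names and csvutf8 module as in the question, with open in place of codecs.open):

# -*- coding: UTF-8 -*-
import csvutf8  # the sample code from the csv documentation, saved locally

x = u'owner’s'

# A plain bytes file: UnicodeWriter does the UTF-8 encoding itself.
with open('simpleout.txt', 'wb') as of:
    spamwriter = csvutf8.UnicodeWriter(of)
    spamwriter.writerow([x])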


If you're going to be using csvutf8 in a project you'll be coming back to a year from now, or working on with other colleagues, you may want to consider adding a test like this in the __init__ methods, so the next time you make this mistake (which you will) it'll show up immediately, and with a more obvious error:

if isinstance(f, (
     codecs.StreamReader, codecs.StreamWriter,
     codecs.StreamReaderWriter, io.TextIOBase)):
    raise TypeError(
        'Need plain bytes files, not {}'.format(f.__class__))
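
For instance, a sketch of where that guard might sit in UnicodeWriter.__init__ (only the isinstance check is new; the rest is the __init__ from the sample code above):

import csv, codecs, io, cStringIO

class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Fail fast if handed a decoding/encoding wrapper instead of a raw bytes file.
        if isinstance(f, (
                codecs.StreamReader, codecs.StreamWriter,
                codecs.StreamReaderWriter, io.TextIOBase)):
            raise TypeError(
                'Need plain bytes files, not {}'.format(f.__class__))
        # Unchanged from the sample code: buffer rows in a queue, then reencode.
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()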


But if you're going to stick with Python 2,* these errors are hard to find until you get the hang of it, so you should learn how to spot them now. Here's some simpler code with the same error:

data1 = u'[owner’s]'
data2 = data1.encode('utf-8')
data3 = data2.encode('utf-8')

Test this in the interactive interpreter, and look at the repr, type, etc. of each intermediate step. You'll see that data2 is a str, not a unicode. That means it's just a bunch of bytes. What does it mean to encode a bunch of bytes to UTF-8? The only thing that makes sense** is to decode those bytes using your default encoding (which is ASCII because you haven't set anything else) into Unicode so that it can then be encoded back to bytes.
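
For example, a Python 2 interactive session would look roughly like this (the exact byte position in the error depends on the string):

>>> data1 = u'[owner’s]'
>>> data2 = data1.encode('utf-8')
>>> type(data2)
<type 'str'>
>>> data2
'[owner\xe2\x80\x99s]'
>>> data3 = data2.encode('utf-8')  # implicitly decodes data2 with ASCII first
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 6: ordinal not in range(128)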

So, when you see one of those UnicodeDecodeErrors about ASCII (and you're pretty sure you were calling encode rather than decode), it's usually this problem. Check the type you're calling it on, and it's probably a str rather than a unicode.***


* I assume you have a good reason beyond your control for still using Python 2 in 2018. If not, the answer is a lot easier: just use Python 3 and this whole problem is impossible (and the code is simpler, and runs faster).
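
For reference, a sketch of the same task in Python 3, where the csv module works with Unicode text directly (same filename as above, assuming UTF-8 output is what you want):

# -*- coding: UTF-8 -*-
import csv

x = 'owner’s'  # str is already Unicode in Python 3

# Open in text mode with an explicit encoding; newline='' is what the csv docs recommend.
with open('simpleout.txt', 'w', newline='', encoding='utf-8') as of:
    spamwriter = csv.writer(of)
    spamwriter.writerow([x])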

** If you think it would actually make a lot more sense for Python to just not try to guess what you meant, and make this an error… you're right, and that's one of the main reasons Python 3 exists.

*** Of course you still need to figure out why you have bytes where you expected Unicode. Sometimes it's really silly, like you did u = s.decode('latin1') but then you kept using s instead of u. Sometimes it's a little trickier, like this case, where you're using a library that's automatically encoding for you, but you didn't realize it. Sometimes it's even worse, like you've forgotten to decode some text off a website, and it runs all day, silently creating mojibake for thousands of pages, before it hits the first page with a Slavic name and finally raises an error.
