UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 42: ordinal not in range(128)


Problem description

import csv


def main():
    client = ##client_here
    db = client.brazil
    rio_bus = client.tweets
    result_cursor = db.tweets.find()
    first = result_cursor[0]
    ordered_fieldnames = first.keys()
    with open('brazil_tweets.csv','wb') as csvfile:

        csvwriter = csv.DictWriter(csvfile,fieldnames = ordered_fieldnames,extrasaction='ignore')
        csvwriter.writeheader()
        for x in result_cursor:
            print x
            csvwriter.writerow( {k: str(x[k]).encode('utf-8') for k in x})

        #[ csvwriter.writerow(x.encode('utf-8')) for x in result_cursor ]


if __name__ == '__main__':
    main()

Basically, the issue is that the tweets contain a bunch of Portuguese characters. I tried to correct for this by encoding every value as UTF-8 before putting it in the dictionary that is written as the row. However, this doesn't work. Any other ideas for formatting these values so that csv reader and DictReader can read them?

Recommended answer

str(x[k]).encode('utf-8') is the problem.

str(x[k]) converts a Unicode string to a byte string using the default ascii codec in Python 2:

>>> x = u'résumé'
>>> str(x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)
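
The implicit conversion can be written out explicitly. As a minimal sketch, assuming the interpreter's default encoding is ascii (the Python 2 default), calling str() on a Unicode string behaves like calling .encode('ascii') on it, and the explicit call fails in the same way. The variable names are illustrative only, and the snippet runs on both Python 2 and Python 3:

```python
# What str(x) does implicitly to a Unicode string in Python 2,
# written out as an explicit call.
text = u'r\xe9sum\xe9'  # the same string as u'résumé'

try:
    text.encode('ascii')
    error_name = None
except UnicodeEncodeError as exc:
    # Same error class as in the traceback above.
    error_name = type(exc).__name__

# Encoding with a codec that covers the character succeeds.
utf8_bytes = text.encode('utf-8')
```

Naming the codec at the call site is what removes the surprise: the ascii step never runs unless it is asked for.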

Non-Unicode values, such as booleans, are converted to byte strings by str(), but Python 2 then implicitly decodes the byte string back to a Unicode string (again with the ascii codec) before calling .encode(), because only Unicode strings can be encoded. This usually does not cause an error, because most non-Unicode objects have an ASCII representation. Here is an example where a custom object returns a non-ASCII str() representation:

>>> class Test(object):
...  def __str__(self):
...    return 'r\xc3\xa9sum\xc3\xa9'
...
>>> x=Test()
>>> str(x)
'r\xc3\xa9sum\xc3\xa9'
>>> str(x).encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

Note that the above is a decode error, not an encode error.
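
Spelled out, the failing expression is roughly a decode with the ascii codec followed by the requested encode. The following is a minimal sketch of that hidden step, assuming the ascii default; the byte string is decoded explicitly, so it runs on both Python 2 and Python 3:

```python
raw = b'r\xc3\xa9sum\xc3\xa9'  # UTF-8 bytes, like the __str__ result above

try:
    raw.decode('ascii')  # the hidden step Python 2 performs before .encode()
    error_name = None
except UnicodeDecodeError as exc:
    error_name = type(exc).__name__

# Decoding with the codec the bytes were actually written in works,
# and the result can then be encoded again without loss.
decoded = raw.decode('utf-8')
round_trip = decoded.encode('utf-8')
```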

If str() is only there to coerce non-string values such as booleans to a string, coerce them to a Unicode string instead:

unicode(x[k]).encode('utf-8')

Non-Unicode values are converted to Unicode strings, which can then be encoded correctly. Values that are already Unicode strings pass through unicode() unchanged, so they are encoded correctly as well:

>>> x = True
>>> unicode(x)
u'True'
>>> unicode(x).encode('utf8')
'True'
>>> x = u'résumé'
>>> unicode(x).encode('utf8')
'r\xc3\xa9sum\xc3\xa9'    
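
Applied to a whole row, the same coercion might look like the sketch below. The helper name encode_row and the field names text and retweeted are hypothetical, chosen only for illustration. The try/except picks unicode on Python 2 and str on Python 3, so the snippet runs on either.

```python
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str  # Python 3, where str is already the Unicode type


def encode_row(row):
    """Return a copy of row with each value coerced to text, then UTF-8 encoded."""
    return {key: text_type(value).encode('utf-8') for key, value in row.items()}


# Hypothetical document with one Unicode value and one boolean value.
row = {'text': u'r\xe9sum\xe9', 'retweeted': True}
encoded = encode_row(row)
```

One caveat: on Python 2, a value that is already a byte string containing non-ASCII bytes would still fail inside unicode() and would need an explicit .decode() with the right codec first.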

P.S. Python 3 does not implicitly encode or decode between byte strings and Unicode strings, which makes these errors easier to spot.
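
For comparison, here is a hedged sketch of how the export from the question might look on Python 3. It is not part of the original answer. It assumes the rows are plain dicts (a list stands in for the MongoDB cursor) and the helper name write_rows is hypothetical. The file object is opened in text mode, so the encoding is chosen once at open() and no per-value .encode() is needed.

```python
import csv
import io


def write_rows(rows, fileobj):
    """Write dict rows as CSV to a text-mode file object."""
    rows = list(rows)
    if not rows:
        return
    # Use the first row's keys as the header, as the question does.
    fieldnames = list(rows[0].keys())
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# In-memory stand-in for
# open('brazil_tweets.csv', 'w', newline='', encoding='utf-8').
buffer = io.StringIO(newline='')
write_rows([{'text': 'r\xe9sum\xe9', 'retweeted': True}], buffer)
csv_text = buffer.getvalue()
```

Passing newline='' follows the csv module's recommendation for file objects handed to a writer, so the writer controls the line endings itself.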
