UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)


Problem description


I am trying to write data in a StringIO object using Python and then ultimately load this data into a postgres database using psycopg2's copy_from() function.

When I first did this, copy_from() threw an error: ERROR: invalid byte sequence for encoding "UTF8": 0xc92. So I followed this question.

I figured out that my Postgres database has UTF8 encoding.

The file/StringIO object I am writing my data into shows its encoding as the following: setgid Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators

I tried to encode every string that I am writing to the intermediate file/StringIO object into UTF8 format. To do this, I used .encode(encoding='UTF-8', errors='strict') for every string.

This is the error I got now: UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)

What does it mean? How do I fix it?

EDIT: I am using Python 2.7. Some pieces of my code:

I read from a MySQL database that has data encoded in UTF-8 as per MySQL Workbench. These are a few lines of code for writing my data (obtained from the MySQL db) to a StringIO object:

# Populate the table_data variable with rows delimited by \n and columns delimited by \t
row_num=0
for row in cursor.fetchall() :

    # Separate rows in a table by new line delimiter
    if(row_num!=0):
        table_data.write("\n")

    col_num=0
    for cell in row:    
        # Separate cells in a row by tab delimiter
        if(col_num!=0):
            table_data.write("\t") 

        table_data.write(cell.encode(encoding='UTF-8',errors='strict'))
        col_num = col_num+1

    row_num = row_num+1   

This is the code that writes to Postgres database from my StringIO object table_data:

cursor = db_connection.cursor()
cursor.copy_from(table_data, <postgres_table_name>)

Solution

The problem is that you're calling encode on a str object.

In Python 2, a str is a byte string, usually representing text encoded in some way like UTF-8. When you call encode on that, it first has to be decoded back to text, so the text can be re-encoded. By default, Python does that by calling s.decode(sys.getdefaultencoding()), and getdefaultencoding() usually returns 'ascii'.
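To make the hidden step visible, here is a minimal sketch that spells out the decode-then-encode sequence by hand. The sample bytes and the helper name encode_like_python2 are made up for illustration; the helper hard-codes 'ascii' where Python 2 would consult sys.getdefaultencoding().

```python
# Hypothetical sample: 47 ASCII bytes followed by one non-ASCII byte (0x92).
sample = b"a" * 47 + b"\x92"


def encode_like_python2(byte_string):
    """Spell out what calling .encode('utf-8') on a Python 2 str does."""
    # Step 1: the implicit decode, using the default codec (assumed 'ascii').
    text = byte_string.decode("ascii")
    # Step 2: the encode that was actually requested.
    return text.encode("utf-8")


try:
    encode_like_python2(sample)
    message = None
except UnicodeDecodeError as exc:
    # The failure comes from step 1, which is why an *encode* call
    # reports a *decode* error.
    message = str(exc)

print(message)
```

The error is raised before any encoding happens, which is why asking for UTF-8 output produces a complaint about the 'ascii' codec.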

So, you're taking UTF-8 encoded text, decoding it as if it were ASCII, then re-encoding it in UTF-8.

The general solution is to explicitly call decode with the right encoding, instead of letting Python use the default, and then encode the result.
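As a rough illustration of the explicit route, the sketch below decodes with a named source encoding and then encodes to UTF-8. The source encoding 'cp1252' and the sample bytes are assumptions chosen for the example; the right codec depends on where the data actually came from.

```python
# Hypothetical input: bytes that were produced with Windows-1252 (an assumption).
cp1252_bytes = b"it\x92s"

# Decode with the encoding the bytes really use, instead of the ASCII default...
text = cp1252_bytes.decode("cp1252")

# ...and only then encode to the target encoding.
utf8_bytes = text.encode("utf-8")

print(repr(utf8_bytes))
```

In cp1252, byte 0x92 is the right single quotation mark (U+2019), which takes three bytes in UTF-8.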

But when the right encoding is already the one you want, the easier solution is to just skip the .decode('utf-8').encode('utf-8') and just use the UTF-8 str as the UTF-8 str that it already is.
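A sketch of that shortcut, assuming every cell already arrives as UTF-8 bytes (real rows may also hold numbers or NULLs, which would need separate handling). The rows are invented, and io.BytesIO stands in for the question's StringIO object.

```python
import io

# Hypothetical rows whose cells are already UTF-8 encoded byte strings.
rows = [
    (b"1", b"caf\xc3\xa9"),
    (b"2", b"na\xc3\xafve"),
]

table_data = io.BytesIO()

# Join cells with tabs and rows with newlines; no decode or encode is needed
# because the bytes are already in the target encoding.
table_data.write(b"\n".join(b"\t".join(row) for row in rows))

# Rewind so that whatever reads the buffer next starts at the beginning.
table_data.seek(0)

print(repr(table_data.getvalue()))
```

Joining with the separators also removes the need for the row_num and col_num counters in the original loop.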

Or, alternatively, if your MySQL wrapper has a feature to let you specify an encoding and get back unicode values for CHAR/VARCHAR/TEXT columns instead of str values (e.g., in MySQLdb, you pass use_unicode=True to the connect call, or charset='utf8', which is MySQL's name for the encoding, if your database is too old to auto-detect it), just do that. Then you'll have unicode objects, and you can call .encode('utf-8') on them.

In general, the best way to deal with Unicode problems is the last one: decode everything as early as possible, do all the processing in Unicode, and then encode as late as possible. But either way, you have to be consistent. Don't call str on something that might be a unicode; don't concatenate a str literal to a unicode or pass one to its replace method; etc. Any time you mix and match, Python is going to implicitly convert for you, using your default encoding, which is almost never what you want.
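The decode-early, encode-late pattern might look like the sketch below. The rows and the helper name rows_to_tsv are hypothetical, and the explicit decode step stands in for a driver that already returns unicode values.

```python
import io


def rows_to_tsv(text_rows):
    """Build tab/newline-delimited text from rows of Unicode strings."""
    return u"\n".join(u"\t".join(row) for row in text_rows)


# Hypothetical raw rows as they might arrive from a driver returning bytes.
raw_rows = [
    (b"1", b"caf\xc3\xa9"),
    (b"2", b"na\xc3\xafve"),
]

# Decode early: convert every cell to Unicode text at the boundary.
text_rows = [tuple(cell.decode("utf-8") for cell in row) for row in raw_rows]

# Process in Unicode: all joins use Unicode literals, never byte strings.
tsv_text = rows_to_tsv(text_rows)

# Encode late: produce bytes exactly once, just before handing the data off.
payload = tsv_text.encode("utf-8")
table_data = io.BytesIO(payload)

print(repr(payload))
```

Because every intermediate value is Unicode text, there is no point in the middle where an implicit conversion with the default encoding can occur.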

As a side note, this is one of the many things that Python 3.x's Unicode changes help with. First, str is now Unicode text, not encoded bytes. More importantly, if you have encoded bytes, e.g., in a bytes object, calling encode will give you an AttributeError instead of trying to silently decode so it can re-encode. And, similarly, trying to mix and match Unicode and bytes will give you an obvious TypeError, instead of an implicit conversion that succeeds in some cases and gives a cryptic message about an encode or decode you didn't ask for in others.
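A small sketch of the Python 3 behaviour described above; it is meant to be run under Python 3 only, and the sample bytes are made up.

```python
raw = b"caf\xc3\xa9"  # hypothetical UTF-8 bytes

# bytes has no encode method in Python 3, so there is no hidden decode step.
try:
    raw.encode("utf-8")
    encode_error = None
except AttributeError as exc:
    encode_error = type(exc).__name__

# Mixing bytes and text fails immediately instead of converting implicitly.
try:
    raw + u" au lait"
    concat_error = None
except TypeError as exc:
    concat_error = type(exc).__name__

print(encode_error, concat_error)
```

Both failures point directly at the line that mixes the two types, rather than surfacing later as an error about a codec that was never mentioned in the code.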
