Python + PostgreSQL + strange ascii = UTF8 encoding error


Problem description

I have ascii strings which contain the character "\x80" to represent the euro symbol:

>>> print "\x80"
€

When inserting string data containing this character into my database, I get:

psycopg2.DataError: invalid byte sequence for encoding "UTF8": 0x80
HINT:  This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".

I'm a unicode newbie. How can I convert my strings containing "\x80" to valid UTF-8 containing that same euro symbol? I've tried calling .encode and .decode on various strings, but run into errors:

>>> "\x80".encode("utf-8")
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    "\x80".encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)

Solution

The question starts with a false premise:

I have ascii strings which contain the character "\x80" to represent the euro symbol.

ASCII characters are in the range "\x00" to "\x7F" inclusive.
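This also accounts for the traceback in the question. In Python 2, calling .encode() on a byte string first decodes it implicitly with the default 'ascii' codec, and that hidden step is what fails on 0x80. A minimal sketch, assuming Python 3 (the session in the question is Python 2), where the same decode has to be written out explicitly. The helper names is_ascii and decode_as_ascii are made up for illustration:

```python
def is_ascii(raw):
    """Return True if every byte in raw is in the ASCII range 0x00-0x7F."""
    return all(b <= 0x7F for b in raw)


def decode_as_ascii(raw):
    """Try the decode that Python 2 performs implicitly inside str.encode()."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that is not ASCII.
        return "not ASCII: byte 0x%02x at position %d" % (raw[exc.start], exc.start)


print(is_ascii(b"price: 10"))       # only bytes <= 0x7F
print(is_ascii(b"price: \x80 10"))  # contains 0x80
print(decode_as_ascii(b"\x80"))
```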

The previously-accepted now-deleted answer operated under two gross misapprehensions: (1) that locale == encoding, and (2) that the latin1 encoding maps "\x80" to a Euro character.

In fact, all of the ISO-8859-x encodings map "\x80" to U+0080 which is one of the C1 control characters, not a Euro character. Only 3 of those encodings (x in (7, 15, 16)) provide the Euro character, as "\xA4". See this Wikipedia article.
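The claim is easy to check against the codecs that ship with Python. A small sketch, assuming Python 3 and its standard iso8859_N codec names; the function iso8859_view is a hypothetical helper, not part of any library:

```python
EURO = "\u20ac"


def iso8859_view(n):
    """Decode bytes 0x80 and 0xA4 with the ISO-8859-n codec."""
    codec = "iso8859_%d" % n
    return b"\x80".decode(codec), b"\xa4".decode(codec)


for n in (1, 7, 15, 16):
    at_80, at_a4 = iso8859_view(n)
    # ascii() keeps the output printable on any terminal encoding.
    print(n, ascii(at_80), ascii(at_a4), "euro at 0xA4:", at_a4 == EURO)
```

For latin1 (n = 1), byte 0xA4 decodes to U+00A4, the generic currency sign, so neither byte yields a euro there.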

You need to know what encoding your data is in. What machine was it created on? How? The locale it was created in (not necessarily yours) may give you a clue.

Note that "My data is encoded in latin1" is up there with "The cheque's in the mail" and "Of course I'll love you in the morning". Your data is probably encoded in one of the cp125x encodings found on Windows platforms. Note that all of them except cp1251 (Windows Cyrillic) map "\x80" to the euro character:

>>> ['\x80'.decode('cp125' + str(x), 'replace') for x in range(9)]
[u'\u20ac', u'\u0402', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac', u'\u20ac']
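For readers on Python 3, here is a rough equivalent of the session above, followed by the conversion the question actually asks for: decode the bytes with the encoding they were written in, then encode the result as UTF-8. This is a sketch, and it assumes the data really is cp1252. The function names are invented for the example:

```python
def euro_byte_by_codepage():
    """Decode byte 0x80 with each of cp1250 .. cp1258."""
    return {
        "cp125%d" % x: b"\x80".decode("cp125%d" % x, "replace")
        for x in range(9)
    }


def to_utf8(raw, source_encoding="cp1252"):
    """Decode raw bytes from source_encoding, then re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


for name, char in sorted(euro_byte_by_codepage().items()):
    print(name, ascii(char))

print(to_utf8(b"\x80"))  # b'\xe2\x82\xac', the UTF-8 form of U+20AC
```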

Update in response to the OP's comment:

I'm reading this data from a file, e.g. open(fname).read(). It contains strings with \x80 in them that represents the euro character. it's just a plain text file. it is generated by another program, but I don't know how it goes about generating the text. what would be a good solution? I'm thinking I can assume that it outputs "\x80" for a euro character, meaning I can assume it's encoded with a cp125x that has that char as the euro.

This is a bit confusing: First you say

It contains strings with \x80 in them that represents the euro character

But later you say

I'm thinking I can assume that it outputs "\x80" for a euro character

Please explain.

Selecting an appropriate cp125x encoding: Where (geographical location) was the file created? In what language(s) is the text written? Any characters other than the presumed euro with values > "\x7f"? If so, which ones and what context are they used in?
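One way to answer the last two questions is to list every byte above 0x7F together with a little surrounding context. A minimal sketch, assuming Python 3 and that the file has been read in binary mode. The name non_ascii_bytes and the sample data are hypothetical:

```python
def non_ascii_bytes(data, context=8):
    """List (offset, byte value, surrounding bytes) for each byte above 0x7F."""
    hits = []
    for offset, value in enumerate(data):
        if value > 0x7F:
            start = max(0, offset - context)
            hits.append((offset, value, data[start:offset + context + 1]))
    return hits


# Made-up sample; in practice this would be open(fname, "rb").read().
sample = b"Total: \x80 12,50 at the caf\xe9"
for offset, value, around in non_ascii_bytes(sample):
    print("offset %d: 0x%02x in %r" % (offset, value, around))
```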

Update 2: If you don't "know how the program is written", neither you nor we can form an opinion on whether it always uses "\x80" for the euro character. Although doing otherwise would be monumental silliness, it can't be ruled out.

If the text is written in English, and/or it was written in the USA, and/or it was written on a Windows platform, then it's reasonably certain that cp1252 is the way to go ... until you get evidence to the contrary, in which case you'd need to guess an encoding yourself or answer the (what language, what locality) questions.
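"Evidence to the contrary" can show up mechanically: a few byte values are unassigned in Python's cp1252 codec, so a strict decode fails if the file contains one of them. A sketch, assuming Python 3. The helpers decode_or_none and read_text are illustrative names, not an existing API:

```python
def decode_or_none(raw, encoding="cp1252"):
    """Strictly decode raw; return None if some byte is undefined in encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


def read_text(path, encoding="cp1252"):
    """Read a whole file as bytes and decode it with the assumed encoding."""
    with open(path, "rb") as handle:
        return decode_or_none(handle.read(), encoding)


print(ascii(decode_or_none(b"cost: \x80 5")))  # 0x80 is defined in cp1252
print(decode_or_none(b"\x81"))                 # 0x81 is unassigned in cp1252
```

A successful decode does not prove the guess was right; it only means no byte ruled it out. The decoded text, rather than the raw bytes, is what would then be handed to the database driver.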
