Writing utf-8 string inside my python files


Question

This line in my .py file is giving me a "UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-13: unsupported Unicode code range":

if line.startswith(u"Fußnote"):

The file is saved in UTF-8 and has the encoding declaration at the top: # -*- coding: utf-8 -*-

I've got a lot of other .py files with UTF-8 encoded Chinese text in them, both in comments and in arrays, for example: arr = [u"chinese text",], so I'm wondering why this case in particular doesn't work for me.

Recommended answer

Let's examine that error message very closely:

"UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-13: unsupported Unicode code range"

Note carefully that it says "bytes in position 8-13": that's a 6-byte UTF-8 sequence. That might have been valid in the dark ages, but since Unicode was frozen at 21 bits, the maximum is FOUR bytes. UTF-8 validation and error reporting were tightened up recently; as a matter of interest, exactly what version of Python are you running?

With 2.7.1 and 2.6.6 at least, that error becomes the more useful "... can't decode byte XXXX in position 8: invalid start byte", where XXXX can only be 0xfc or 0xfd if the old message suggested a 6-byte sequence. In ISO-8859-1 or cp1252, 0xfc represents U+00FC LATIN SMALL LETTER U WITH DIAERESIS (aka u-umlaut, a likely suspect); 0xfd represents U+00FD LATIN SMALL LETTER Y WITH ACUTE (less likely).
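The byte-level reasoning above can be checked with a small sketch. It is written for Python 3 (where bytes plays the role of Python 2's str), and the sample value `raw` is purely hypothetical: eight ASCII bytes followed by a cp1252-encoded u-umlaut, chosen only so that the bad byte lands at position 8.

```python
import unicodedata

# Hypothetical input (an assumption, not the asker's real data):
# eight ASCII bytes, then 0xfc, which is u-umlaut in cp1252.
raw = b"12345678\xfcber"

error = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    # exc.start is the offset of the offending byte; exc.reason is the message tail.
    print(exc.start, exc.reason)

# The same bytes are ordinary characters in ISO-8859-1 / cp1252.
for byte in (b"\xfc", b"\xfd"):
    char = byte.decode("cp1252")
    print(hex(byte[0]), "U+%04X" % ord(char), unicodedata.name(char))
```

If the reported position and byte value match what this prints, the data was most likely written in a single-byte Western European encoding rather than UTF-8.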

The problem is NOT with the if line.startswith(u"Fußnote"): statement in your source file. You would have got a message at COMPILE time if it wasn't proper UTF-8, and the message would have started with "SyntaxError", not "UnicodeDecodeError". In any case the UTF-8 encoding of that string is only 8 bytes long, not 14.
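The byte count is easy to confirm. This is a minimal Python 3 sketch; the variable names are illustrative only, and the literal is spelled with an escape so the snippet does not depend on how it is saved.

```python
# "Fu\u00dfnote" is the same text as u"Fußnote": 7 characters.
literal = "Fu\u00dfnote"
encoded = literal.encode("utf-8")

# The sharp s takes two bytes in UTF-8, every other character takes one.
print(len(literal), len(encoded), encoded)
```

With only 8 bytes, the highest valid offset in the encoded literal is 7, so an error at positions 8-13 cannot be pointing into it.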

The problem is (as @Mark Tolonen has pointed out) in whatever "line" is referring to. It can only be a str object.

To get further you need to answer Mark's questions: (1) the result of print repr(line) and (2) the site.py change.

At this stage it's a good idea to clear the air about mixing str and unicode objects (in many operations, not just a.startswith(b)).

Unless the operation is defined to produce a str object, Python will NOT coerce the unicode object to str. a.startswith(b) is not such an operation, so it will instead attempt to decode the str object using the default (usually 'ascii') encoding.

Examples:

>>> "\xff".startswith(u"\xab")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

>>> u"\xff".startswith("\xab")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 0: ordinal not in range(128)

Furthermore, it is NOT correct to say "Mix and you get UnicodeDecodeError". It is quite possible that the str object is validly encoded in the default encoding (usually 'ascii'), in which case no exception is raised.

Examples:

>>> "abc".startswith(u"\xff")
False
>>> u"\xff".startswith("abc")
False
>>>
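The interpreter sessions above are Python 2. As a rough illustration of the rule they demonstrate, here is a Python 3 sketch in which bytes stands in for Python 2's str. The helper `py2_startswith` is hypothetical and only imitates the implicit decode. The final lines show one possible way out, decoding explicitly, under the assumption that the data is cp1252; the real encoding of "line" is not known from the question.

```python
def py2_startswith(a, b, default_encoding="ascii"):
    """Hypothetical helper imitating Python 2's mixed str/unicode startswith."""
    # Whichever operand is a byte string gets decoded with the default encoding.
    if isinstance(a, bytes) and isinstance(b, str):
        a = a.decode(default_encoding)
    elif isinstance(a, str) and isinstance(b, bytes):
        b = b.decode(default_encoding)
    return a.startswith(b)

# Pure-ASCII bytes decode without complaint, so mixing raises nothing.
print(py2_startswith(b"abc", "\xff"))

# A non-ASCII byte makes the implicit decode fail.
try:
    py2_startswith(b"\xff", "\xab")
except UnicodeDecodeError as exc:
    print(exc)

# Decoding explicitly with the actual encoding (ASSUMED to be cp1252 here)
# sidesteps the implicit ASCII decode entirely.
raw_line = b"Fu\xdfnote 1"
print(raw_line.decode("cp1252").startswith("Fu\u00dfnote"))
```

The design point is that both operands should be text before they are compared, so the decode happens once, at the boundary where the bytes enter the program, with an encoding chosen deliberately rather than by default.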
