Problem with sets and Unicode strings


Problem description

Hi!

The following program in a UTF-8 encoded file:

# -*- coding: UTF-8 -*-

FIELDS = ("Fächer", )
FROZEN_FIELDS = frozenset(FIELDS)
FIELDS_SET = set(FIELDS)

print u"Fächer" in FROZEN_FIELDS
print u"Fächer" in FIELDS_SET
print u"Fächer" in FIELDS

gives this output:

False
False
Traceback (most recent call last):
  File "test.py", line 9, in ?
    print u"Fächer" in FIELDS
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

Why do the first two print statements succeed while the third one fails
with an exception?

Why does the use of set/frozenset remove the exception?

Thanks,
Dennis

Solution

On 6/27/06, Dennis Benzinger <De**************@gmx.net> wrote:

> Why do the first two print statements succeed and the third one fails
> with an exception?

Actually, all three statements fail to produce the correct result.

> Why does the use of set/frozenset remove the exception?

Because sets use a hash algorithm to find matches, whereas the last
statement directly compares a unicode string with a byte string. Byte
strings can only contain ASCII characters, which is why Python raises an
exception. The problem is very easy to fix: use unicode strings for
all non-ASCII strings.
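
To illustrate the suggested fix, here is a minimal sketch written for Python 3, not the Python 2 the thread is about, so it is a loose translation and not the original poster's code. It assumes that Python 3 `bytes` stands in for the Python 2 byte string and `str` for `unicode`. The variable names are made up for the example. One difference matters: Python 3 does not try to decode the bytes implicitly, so the tuple lookup simply returns False instead of raising.

```python
# Python 3 sketch: str is text, bytes is raw data
# (roughly Python 2's unicode and str, respectively).
text = "F\u00e4cher"          # the text string "Fächer"
raw = text.encode("utf-8")    # b'F\xc3\xa4cher', the bytes a UTF-8 file stores

fields_bytes = (raw,)                   # like FIELDS in the question
fields_text = (raw.decode("utf-8"),)    # the suggested fix: text everywhere

# Looking up text among byte strings never matches.
in_bytes_set = text in frozenset(fields_bytes)
in_bytes_tuple = text in fields_bytes

# Looking up text among text strings works for both containers.
in_text_set = text in frozenset(fields_text)
in_text_tuple = text in fields_text

print(in_bytes_set, in_bytes_tuple)  # False False
print(in_text_set, in_text_tuple)    # True True
```

The point carried over from the reply is the last two lines: once every non-ASCII literal is a text string, set and tuple lookups agree.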


Serge Orlov wrote:

On 6/27/06, Dennis Benzinger <De**************@gmx.net> wrote:

>> Why do the first two print statements succeed and the third one fails
>> with an exception?
>
> Actually, all three statements fail to produce the correct result.

So this is a bug in Python?

>> Why does the use of set/frozenset remove the exception?
>
> Because sets use a hash algorithm to find matches, whereas the last
> statement directly compares a unicode string with a byte string. Byte
> strings can only contain ASCII characters, which is why Python raises an
> exception. The problem is very easy to fix: use unicode strings for
> all non-ASCII strings.

No, byte strings contain characters which are at least 8 bits wide
<http://docs.python.org/ref/types.html>. But I don't understand what
Python is trying to decode and why the exception says something about
the ASCII codec, because my file is encoded with UTF-8.

Dennis


On 6/27/06, Dennis Benzinger <De**************@gmx.net> wrote:

Serge Orlov wrote:

On 6/27/06, Dennis Benzinger <De**************@gmx.net> wrote:

>>> Why do the first two print statements succeed and the third one fails
>>> with an exception?
>>
>> Actually, all three statements fail to produce the correct result.
>
> So this is a bug in Python?

No.

>>> Why does the use of set/frozenset remove the exception?
>>
>> Because sets use a hash algorithm to find matches, whereas the last
>> statement directly compares a unicode string with a byte string. Byte
>> strings can only contain ASCII characters, which is why Python raises an
>> exception. The problem is very easy to fix: use unicode strings for
>> all non-ASCII strings.
>
> No, byte strings contain characters which are at least 8 bits wide
> <http://docs.python.org/ref/types.html>.

Yes, but later it is written that non-ASCII characters do not have a
universal meaning assigned to them. In other words, if you put byte
0xE4 into a byte string, all Python knows about it is that it is *some*
character. If you put character U+00E4 into a unicode string, Python
knows it is a "latin small letter a with diaeresis". Trying to compare
*some* character with a specific character is obviously undefined.
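
The point about byte 0xE4 can be made concrete with a small sketch. It is written for Python 3 and uses two encodings picked only for illustration (Latin-1 and Windows-1251); neither is mentioned in the thread. The same byte becomes a different character depending on which encoding one assumes, which is why a bare byte cannot be compared meaningfully with U+00E4.

```python
import unicodedata

raw = b"\xe4"  # a single byte with no encoding attached to it

# The same byte read under two different (assumed) encodings.
as_latin1 = raw.decode("latin-1")
as_cp1251 = raw.decode("cp1251")

print(unicodedata.name(as_latin1))  # LATIN SMALL LETTER A WITH DIAERESIS
print(unicodedata.name(as_cp1251))  # CYRILLIC SMALL LETTER DE

# Only one of the two readings equals the character U+00E4.
print(as_latin1 == "\u00e4")  # True
print(as_cp1251 == "\u00e4")  # False
```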
> But I don't understand what
> Python is trying to decode and why the exception says something about
> the ASCII codec, because my file is encoded with UTF-8.

Because byte strings can come from different sources (network, files,
etc.), not only from the source of your program, Python cannot assume
that all of them are UTF-8. It assumes they are ASCII, because most
widespread text encodings are ASCII-based. Actually, it is a guess,
since there are UTF-16, UTF-32 and other non-ASCII encodings. If you
want to experience life without guesses, put
sys.setdefaultencoding("undefined") into site.py.
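
The decode step described above can be reproduced explicitly. This is a sketch for Python 3, where the decode has to be requested by hand; the assumption is that Python 2 performed the equivalent ASCII decode implicitly when it compared the byte string with the unicode string. The name `raw` is invented for the example.

```python
raw = b"F\xc3\xa4cher"  # the UTF-8 bytes of "Fächer", as stored in the source file

# Decoding with the ASCII codec fails at the first non-ASCII byte.
try:
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Naming the real encoding removes the guess and the error.
decoded = raw.decode("utf-8")
print(decoded == "F\u00e4cher")  # True
```

The failing byte is 0xc3 at position 1, the first of the two bytes that encode "ä" in UTF-8, which matches the traceback in the question.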

