Unicode encoding usability problem


Problem description


I have long found Python's default encoding of strict ASCII frustrating.
For one thing, I would rather get garbage characters than an exception. But
the biggest issue is that Unicode exceptions often pop up in unexpected
places, and only when a non-ASCII or unicode character first finds its way
into the system.

Below is an example. The program may run fine at the beginning. But as
soon as a unicode string such as u'b' is introduced, the program blows up
unexpectedly.

    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'
    >>> a = '\xe5'
    >>> # can print, you think you're ok
    ... print a
    å
    >>> b = u'b'
    >>> a == b
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)


One may suggest that the correct way to do it is to use decode, such as

    a.decode('latin-1') == b

This brings up another issue. Most references and books focus exclusively
on entering unicode literals and using the encode/decode methods. The
fallacy is that the string is such a basic data type, used throughout the
program, that you really don't want to make an individual decision every
time you use one (and take a penalty for any negligence). Java has a much
more usable model, with unicode used internally and an encoding/decoding
decision needed only twice, when dealing with input and output.

I am sure these errors are a nuisance to those who are only half aware of
unicode. Even for those who choose to use unicode, it is almost impossible
to ensure that their programs work correctly.

Solution

anonymous coward <au******@gmail.com> wrote:

> This brings up another issue. Most references and books focus exclusively
> on entering unicode literals and using the encode/decode methods. The
> fallacy is that the string is such a basic data type, used throughout the
> program, that you really don't want to make an individual decision every
> time you use one (and take a penalty for any negligence). Java has a much
> more usable model, with unicode used internally and an encoding/decoding
> decision needed only twice, when dealing with input and output.

that's how you should do things in Python too, of course. a unicode string
uses unicode internally. decode on the way in, encode on the way out, and
things just work.

the fact that you can mess things up by mixing unicode strings with binary
strings doesn't mean that you have to mix unicode strings with binary strings
in your program.

> Even for those who choose to use unicode, it is almost impossible to
> ensure that their programs work correctly.

well, if you use unicode the way it was intended to, it just works.

</F>
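
The "decode on the way in, encode on the way out" advice can be sketched roughly as follows. This is only an illustration: the helper names `read_text` and `write_text` are hypothetical, and the choice of Latin-1 for input and UTF-8 for output is an assumption, not something stated in the thread.

```python
# Sketch of the boundary pattern: bytes exist only at the edges of the
# program, and everything in between works on unicode text.
# read_text / write_text are hypothetical names; the encodings are assumed.


def read_text(raw_bytes, encoding="latin-1"):
    """Decode incoming bytes into a unicode string (the way in)."""
    return raw_bytes.decode(encoding)


def write_text(text, encoding="utf-8"):
    """Encode a unicode string into bytes for output (the way out)."""
    return text.encode(encoding)


raw = b"\xe5"               # byte string as it might arrive from a file or socket
name = read_text(raw)       # unicode from here on
same = (name == u"\xe5")    # unicode-to-unicode comparison, no implicit decoding
out = write_text(name)      # b"\xc3\xa5", the UTF-8 encoding of U+00E5
```

With this arrangement the comparison from the original post never sees a byte string, so the implicit ASCII decoding step does not come into play.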


On Fri, 18 Feb 2005 19:24:10 +0100, Fredrik Lundh <fr*****@pythonware.com> wrote:

> that's how you should do things in Python too, of course. a unicode string
> uses unicode internally. decode on the way in, encode on the way out, and
> things just work.
>
> the fact that you can mess things up by mixing unicode strings with binary
> strings doesn't mean that you have to mix unicode strings with binary
> strings in your program.



I don't want to mix them. But how can I find where they get mixed? How do I
know that this statement is a potential problem

    if a == b:

when a and b may each be instantiated far away from the line of code that
puts them together?

In Java they are distinct data types, and the compiler catches all
incorrect usage. In Python, the interpreter seems to 'help' us by promoting
binary strings to unicode. Things work fine and unit tests pass, until the
first non-ASCII character comes in and the program breaks.

Is there a scheme Python developers can use so that they are safe from
incorrect mixing?


aurora wrote:

> [...]
> In Java they are distinct data types, and the compiler catches all
> incorrect usage. In Python, the interpreter seems to 'help' us by
> promoting binary strings to unicode. Things work fine and unit tests
> pass, until the first non-ASCII character comes in and the program
> breaks.
>
> Is there a scheme Python developers can use so that they are safe from
> incorrect mixing?



Put the following:

    import sys
    sys.setdefaultencoding("undefined")

in a file named sitecustomize.py somewhere in your Python path and
Python will complain whenever there's an implicit conversion between
str and unicode.

HTH,
Walter Dörwald
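
Where changing the interpreter-wide default encoding is not an option, one possible alternative is an explicit type check at the point where two strings meet. The sketch below is an illustration under stated assumptions, not part of the suggestion above: `require_text` and `text_equal` are hypothetical helper names.

```python
# Sketch of an explicit guard against mixing byte strings and unicode text.
# require_text / text_equal are hypothetical helpers, not a library API.

text_type = type(u"")  # unicode on Python 2, str on Python 3


def require_text(value, name="value"):
    """Return value unchanged, or raise TypeError if it is not unicode text."""
    if not isinstance(value, text_type):
        raise TypeError(
            "%s must be a unicode string, got %s" % (name, type(value).__name__)
        )
    return value


def text_equal(a, b):
    """Compare two strings, refusing to compare anything but unicode text."""
    return require_text(a, "a") == require_text(b, "b")


ok = text_equal(u"b", u"b")  # both operands are unicode, so this is True

try:
    text_equal(b"\xe5", u"b")  # a byte string slipped in
except TypeError as exc:
    message = str(exc)  # fails here, even if the bytes had been pure ASCII
```

Unlike the implicit promotion described in the thread, this check fails on the first mixed comparison rather than on the first non-ASCII byte, so a unit test with plain ASCII data would already expose the problem.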

