Python's handling of unicode surrogates


Problem description



As was seen in another thread[1], there's a great deal of confusion
with regard to surrogates. Most programmers assume Python's unicode
type exposes only complete characters. Even CPython's own functions
do this on occasion. This leads to different behaviour across
platforms and makes it unnecessarily difficult to properly support all
languages.

To solve this I propose Python's unicode type using UTF-16 should have
gaps in its index, allowing it to only expose complete unicode scalar
values. Iteration would produce surrogate pairs rather than
individual surrogates, indexing to the first half of a surrogate pair
would produce the entire pair (indexing to the second half would raise
IndexError), and slicing would be required to not separate a surrogate
pair (IndexError otherwise).
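As a concrete reading of those rules, indexing can be sketched over a raw
sequence of UTF-16 code units (a hypothetical helper for illustration, not
actual CPython behaviour):

```python
# Sketch of the proposed indexing rules over a list of UTF-16 code
# units. Hypothetical helper, not actual CPython behaviour.
def char_at(units, i):
    u = units[i]
    if 0xDC00 <= u <= 0xDFFF:
        # Second half of a surrogate pair: an inaccessible index.
        raise IndexError("index points into a surrogate pair")
    if 0xD800 <= u <= 0xDBFF:
        # First half of a surrogate pair: yield the whole pair.
        return units[i:i + 2]
    return units[i:i + 1]
```

With the code units of u'\U00100000' followed by u'\uFFFF', index 0 yields
both units of the pair, while index 1 (the low half) raises IndexError.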

Note that this would not harm performance, nor would it affect
programs that already handle UTF-16 and UTF-32 correctly.

To show how things currently differ across platforms, here's an
example using UTF-32:

>>> a, b = u'\U00100000', u'\uFFFF'
>>> a, b
(u'\U00100000', u'\uffff')
>>> list(a), list(b)
([u'\U00100000'], [u'\uffff'])
>>> sorted([a, b])
[u'\uffff', u'\U00100000']

Now contrast the output of sorted() with what you get when using UTF-16:

>>> a, b = u'\U00100000', u'\uFFFF'
>>> a, b
(u'\U00100000', u'\uffff')
>>> list(a), list(b)
([u'\udbc0', u'\udc00'], [u'\uffff'])
>>> sorted([a, b])
[u'\U00100000', u'\uffff']

As you can see, the order has been reversed, because the sort operates
on code units rather than scalar values.
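The two orderings come purely from the comparison key. Python 3 (used here
only for illustration, since its str type is already scalar-based) can
reproduce both by sorting on encoded bytes:

```python
a, b = '\U00100000', '\uffff'

# UTF-16 code-unit order: \U00100000 becomes the surrogate pair
# 0xDBC0 0xDC00, whose units compare below 0xFFFF.
utf16_order = sorted([a, b], key=lambda s: s.encode('utf-16-be'))

# UTF-32 order matches scalar-value order: 0xFFFF < 0x100000.
utf32_order = sorted([a, b], key=lambda s: s.encode('utf-32-be'))
```

The UTF-16 key ranks u'\U00100000' first because its surrogates compare
below 0xFFFF; the UTF-32 key restores scalar order.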

Reasons to treat surrogates as indivisible:
* \U escapes and repr() already do this
* unichr(0x10000) would work on all unicode scalar values
* "There is no separate character type; a character is represented by
a string of one item."
* iteration would be identical on all platforms
* sorting would be identical on all platforms
* UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated
surrogates, are ill-formed[2].
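The platform-identical iteration listed above can be sketched by pairing
surrogates on the fly (a hypothetical helper over raw code units, assuming
well-formed input):

```python
def iter_scalars(units):
    # Combine each surrogate pair into a single scalar value;
    # assumes the UTF-16 input is well-formed.
    it = iter(units)
    for u in it:
        if 0xD800 <= u <= 0xDBFF:
            low = next(it)
            yield 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
        else:
            yield u
```

Iterating the code units of u'\U00100000' + u'\uFFFF' then yields exactly
two scalar values on any platform.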

Reasons against such a change:
* Breaks code which does range(len(s)) or enumerate(s). This can be
worked around by using s = list(s) first.
* Breaks code which does s[s.find(sub)], where sub is a single
surrogate, expecting half a surrogate (why?). Does NOT break code
where sub is always a single code unit, nor does it break code that
assumes a longer sub using s[s.find(sub):s.find(sub) + len(sub)]
* Alters the sort order on UTF-16 platforms (to match that of UTF-32
platforms, not to mention UTF-8 encoded byte strings)
* Current Python is fairly tolerant of ill-formed unicode data.
Changing this may break some code. However, if you do have a need to
twiddle low-level UTF encodings, wouldn't the bytes type be better?
* "Nobody is forcing you to use characters above 0xFFFF". This is a
strawman. Unicode goes beyond 0xFFFF because real languages need it.
Software should not break just because the user speaks a different
language than the programmer.

Thoughts, from all you readers out there? For/against? If there's
enough support I'll post the idea on python-3000.

[1] http://groups.google.com/group/comp....76e191831da6de
[2] Pages 23-24 of http://unicode.org/versions/Unicode4.0.0/ch03.pdf

--
Adam Olsen, aka Rhamphoryncus

Solutions

Adam Olsen:

> To solve this I propose Python's unicode type using UTF-16 should have
> gaps in its index, allowing it to only expose complete unicode scalar
> values. Iteration would produce surrogate pairs rather than
> individual surrogates, indexing to the first half of a surrogate pair
> would produce the entire pair (indexing to the second half would raise
> IndexError), and slicing would be required to not separate a surrogate
> pair (IndexError otherwise).

I expect having sequences with inaccessible indices will prove
overly surprising. They will behave quite similarly to existing Python
sequences, except that, very rarely, code which works perfectly well
against other sequences will throw exceptions.

> Reasons to treat surrogates as indivisible:
> * \U escapes and repr() already do this
> * unichr(0x10000) would work on all unicode scalar values

unichr could return a 2-code-unit string without forcing surrogate
indivisibility.

> * "There is no separate character type; a character is represented by
> a string of one item."

Could amend this to "a string of one or two items".

> * iteration would be identical on all platforms

There could be a secondary iterator that iterates over characters
rather than code units.

> * sorting would be identical on all platforms

This should be fixable in the current scheme.

> * UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated
> surrogates, are ill-formed[2].

It would be interesting to see how far specifying (and enforcing)
UTF-16 over the current implementation would take us. That is, the
16-bit Unicode implementation would raise an exception if an operation
would produce an unpaired surrogate or other error. Single-element
indexing is a problem, although it could yield a non-string type.
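Enforcement of this kind does exist at the codec boundary in later Python
versions (mentioned here only as an aside): in Python 3, encoding an
unpaired surrogate raises an error.

```python
# Python 3 codecs reject ill-formed data such as a lone high surrogate.
try:
    '\ud800'.encode('utf-8')
except UnicodeEncodeError as exc:
    print('rejected:', exc.reason)
```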

> Reasons against such a change:
> * Breaks code which does range(len(s)) or enumerate(s). This can be
> worked around by using s = list(s) first.

The code will work happily for the implementor and then break when
exposed to a surrogate.

> * "Nobody is forcing you to use characters above 0xFFFF". This is a
> strawman. Unicode goes beyond 0xFFFF because real languages need it.
> Software should not break just because the user speaks a different
> language than the programmer.

Characters over 0xFFFF are *very* rare. Most of the Supplementary
Multilingual Plane is for historical languages and I don't think there
are any surviving Phoenician speakers. Maybe the extra mathematical
signs or musical symbols will prove useful once software and fonts are
implemented for these ranges. The Supplementary Ideographic Plane is
historic Chinese and may have more users.

I think that effort would be better spent on an implementation that
appears to be UTF-32 but uses UTF-16 internally. The vast majority of
the time, no surrogates will be present, so operations can be simple and
fast. When a string contains a surrogate, a flag is flipped and all
operations go through more complex and slower code paths. This way,
consumers of the type see a simple, consistent interface which will not
report strange errors when used.
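That flagged design could look roughly like this (an assumed sketch with
invented names, not any real implementation):

```python
class ScalarString:
    """Assumed sketch: a UTF-16-backed string that reports
    scalar-value length, with a flag enabling a fast path when
    no surrogates are present."""

    def __init__(self, units):
        self.units = list(units)
        self.has_surrogates = any(0xD800 <= u <= 0xDFFF for u in self.units)

    def __len__(self):
        if not self.has_surrogates:
            # Fast path: one code unit per scalar value.
            return len(self.units)
        # Slow path: don't count the low halves of surrogate pairs.
        return sum(1 for u in self.units if not (0xDC00 <= u <= 0xDFFF))
```

A string holding u'\U00100000' + u'\uFFFF' (three code units) would then
report a length of 2, matching a UTF-32 build.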

BTW, I just implemented support for supplemental planes (surrogates,
4 byte UTF-8 sequences) for Scintilla, a text editing component.

Neil


> Thoughts, from all you readers out there? For/against?

See PEP 261. These things were all discussed at the time, and an
explicit decision was taken against what I think (*) your proposal
is. If you want to, you can try to revert that decision, but you
would need to write a PEP.

Regards,
Martin

(*) I don't fully understand your proposal. You say that you
want "gaps in [the string's] index", but I'm not sure what
that means. If you have a surrogate pair on index 4, would
it mean that s[5] does not exist, or would it mean that
s[5] is the character following the surrogate pair? Is
there any impact on the length of the string? Could it be
that len(s[k]) is 2 for some values of s and k?


On Apr 19, 11:02 pm, Neil Hodgson <nyamatongwe+thun...@gmail.com>
wrote:

> Adam Olsen:

> > To solve this I propose Python's unicode type using UTF-16 should have
> > gaps in its index, allowing it to only expose complete unicode scalar
> > values. Iteration would produce surrogate pairs rather than
> > individual surrogates, indexing to the first half of a surrogate pair
> > would produce the entire pair (indexing to the second half would raise
> > IndexError), and slicing would be required to not separate a surrogate
> > pair (IndexError otherwise).


> I expect having sequences with inaccessible indices will prove
> overly surprising. They will behave quite similarly to existing Python
> sequences, except that, very rarely, code which works perfectly well
> against other sequences will throw exceptions.

"Errors should never pass silently."

The only way I can think of to make surrogates unsurprising would be
to use UTF-8, thereby bombarding programmers with variable-length
characters.

> > Reasons to treat surrogates as indivisible:
> > * \U escapes and repr() already do this
> > * unichr(0x10000) would work on all unicode scalar values


> unichr could return a 2-code-unit string without forcing surrogate
> indivisibility.

Indeed. I was actually surprised that it didn't; originally I had it
listed with \U and repr().

> > * "There is no separate character type; a character is represented by
> > a string of one item."


> Could amend this to "a string of one or two items".

> > * iteration would be identical on all platforms


> There could be a secondary iterator that iterates over characters
> rather than code units.

But since you should use that iterator 90%+ of the time, why not make
it the default?

> > * sorting would be identical on all platforms


> This should be fixable in the current scheme.

True.

> > * UTF-8 or UTF-32 containing surrogates, or UTF-16 containing isolated
> > surrogates, are ill-formed[2].


> It would be interesting to see how far specifying (and enforcing)
> UTF-16 over the current implementation would take us. That is, the
> 16-bit Unicode implementation would raise an exception if an operation
> would produce an unpaired surrogate or other error. Single-element
> indexing is a problem, although it could yield a non-string type.

Err, what would be the point in having a non-string type when you
could just as easily produce a string containing the surrogate pair?
That's half my proposal.

> > Reasons against such a change:
> > * Breaks code which does range(len(s)) or enumerate(s). This can be
> > worked around by using s = list(s) first.


> The code will work happily for the implementor and then break when
> exposed to a surrogate.

The code may well break already. I just make it explicit.

> > * "Nobody is forcing you to use characters above 0xFFFF". This is a
> > strawman. Unicode goes beyond 0xFFFF because real languages need it.
> > Software should not break just because the user speaks a different
> > language than the programmer.


> Characters over 0xFFFF are *very* rare. Most of the Supplementary
> Multilingual Plane is for historical languages and I don't think there
> are any surviving Phoenician speakers. Maybe the extra mathematical
> signs or musical symbols will prove useful once software and fonts are
> implemented for these ranges. The Supplementary Ideographic Plane is
> historic Chinese and may have more users.

Yet Unicode has deemed them worth including anyway. I see no reason
to make them more painful than they have to be.

A program written to use them today would most likely a) avoid
iteration, and b) replace indexes with slices (s[i] -> s[i:i+len(sub)]).
If they need iteration they'll have to reimplement it,
providing the exact behaviour I propose. Or they can recompile Python
to use UTF-32, but why shouldn't such features be available by
default?

> I think that effort would be better spent on an implementation that
> appears to be UTF-32 but uses UTF-16 internally. The vast majority of
> the time, no surrogates will be present, so operations can be simple and
> fast. When a string contains a surrogate, a flag is flipped and all
> operations go through more complex and slower code paths. This way,
> consumers of the type see a simple, consistent interface which will not
> report strange errors when used.

Your solution would require code duplication and would be slower. My
solution would have no duplication and would not be slower. I like
mine. ;)

> BTW, I just implemented support for supplemental planes (surrogates,
> 4 byte UTF-8 sequences) for Scintilla, a text editing component.

I dream of a day when complete unicode support is universal. With
enough effort we may get there some day. :)

--
Adam Olsen, aka Rhamphoryncus

