unicode mystery


Problem description


I recently found out that unicode("\347", "iso-8859-1") is the
lowercase c-with-cedilla, so I set out to round up the unicode numbers
of the extra characters you need for French, and I found them all just
fine EXCEPT for the o-e ligature (oeuvre, etc). I examined the unicode
characters from 0 to 900 without finding it; then I looked at
www.unicode.org but the numbers I got there (0152 and 0153) didn't
work. Can anybody put a help on me wrt this? (Do I need to give a
different value for the second parameter, maybe?)

Peace,
STM

PS: I'm considering looking into pyscript as a means of making
diagrams for inclusion in LaTeX documents. If anyone can share an
opinion about pyscript, I'm interested to hear it.

Peace

Recommended answer

On Mon, Jan 10, 2005 at 07:48:44PM -0800, Sean McIlroy wrote:
I recently found out that unicode("\347", "iso-8859-1") is the
lowercase c-with-cedilla, so I set out to round up the unicode numbers
of the extra characters you need for French, and I found them all just
fine EXCEPT for the o-e ligature (oeuvre, etc). I examined the unicode
characters from 0 to 900 without finding it; then I looked at
www.unicode.org but the numbers I got there (0152 and 0153) didn't
work. Can anybody put a help on me wrt this? (Do I need to give a
different value for the second parameter, maybe?)





œ isn't part of ISO 8859-1, so you can't get it that way. You can do
one of

u'\u0153'

or, if you must,

unicode("\305\223", "utf-8")

--
John Lenton (jo**@grulic.org.ar) -- Random fortune:
Lisp, Lisp, Lisp Machine,
Lisp Machine is Fun.
Lisp, Lisp, Lisp Machine,
Fun for everyone.
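
For readers on Python 3, where the unicode() built-in no longer exists and every str is already Unicode, the two spellings in the reply above might be sketched as follows. The variable names are illustrative assumptions and not part of the original reply.

```python
# Python 3 sketch; all variable names here are illustrative only.
oe_direct = "\u0153"                          # LATIN SMALL LIGATURE OE, by code point
oe_from_utf8 = b"\305\223".decode("utf-8")    # same character, from its UTF-8 bytes

# Octal 305 223 is hex C5 93, the two-byte UTF-8 encoding of U+0153.
same = oe_direct == oe_from_utf8

# ISO 8859-1 has no slot for this character, so encoding to it fails.
try:
    oe_direct.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

Writing the code point directly is usually the simpler choice; going through the UTF-8 bytes only makes sense when the data already arrives as bytes.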





Sean McIlroy wrote:
I recently found out that unicode("\347", "iso-8859-1") is the
lowercase c-with-cedilla, so I set out to round up the unicode numbers of the extra characters you need for French, and I found them all just fine EXCEPT for the o-e ligature (oeuvre, etc). I examined the unicode characters from 0 to 900 without finding it; then I looked at
www.unicode.org but the numbers I got there (0152 and 0153) didn't
work. Can anybody put a help on me wrt this? (Do I need to give a
different value for the second parameter, maybe?)




Characters that are in iso-8859-1 are mapped directly into Unicode.
That is, the first 256 characters of Unicode are identical to
iso-8859-1.

Consider this:

>>> c_cedilla = unicode("\347", "iso-8859-1")
>>> c_cedilla
u'\xe7'
>>> ord(c_cedilla)
231
>>> ord("\347")
231

What you did with c_cedilla "worked" because it was effectively doing
nothing. However if you do unicode(char, encoding) where char is not in
encoding, it won't "work".

As John Lenton has pointed out, if you find a character in the Unicode
tables, you can just use it directly. There is no need in this
circumstance to use unicode().

HTH,
John
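
The claim that ISO 8859-1 coincides with the first 256 Unicode code points can be checked directly. What follows is a Python 3 sketch (the thread itself uses Python 2), and names such as latin1_is_identity are made up for illustration. It also treats the numbers 0152 and 0153 as hexadecimal, which is how the Unicode charts list code points.

```python
# Every single byte decoded as ISO 8859-1 lands on the code point with the same number.
latin1_is_identity = all(
    ord(bytes([i]).decode("iso-8859-1")) == i for i in range(256)
)

# The chart numbers are hexadecimal code points, so they can be used directly.
oe_upper = chr(0x0152)   # LATIN CAPITAL LIGATURE OE
oe_lower = chr(0x0153)   # LATIN SMALL LIGATURE OE

# 0x153 is 339 in decimal, which is outside the 0..255 range that ISO 8859-1 covers.
oe_lower_decimal = ord(oe_lower)
```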





Some poster wrote (in connexion with another topic):
... unicode("\347", "iso-8859-1") ...




Well, I haven't had a good rant for quite a while, so here goes:

I'm a bit of a retro specimen, being able (inter alia) to recall octal
opcodes from the ICT 1900 series (070=call, 072=exit, 074=branch, ...)
but nowadays I regard continued usage of octal as a pox and a
pestilence.

1. Octal notation is of use to systems programmers on computers where
the number of bits in a word is a multiple of 3. Are there any still in
production use? AFAIK word sizes were 12, 24, 36, 48, and 60 bits --
all multiples of 4, so hexadecimal could be used.

2. Consider the effect on the newbie who's never even heard of "octal":

>>> import datetime
>>> datetime.date(2005,01,01)
datetime.date(2005, 1, 1)
>>> datetime.date(2005,09,09)
  File "<stdin>", line 1
    datetime.date(2005,09,09)
                        ^
SyntaxError: invalid token

[straight out of the "BOFH Manual of Po-faced Error Messages"]
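
For comparison, Python 3 later changed this behaviour: a decimal literal with a leading zero is rejected outright, and octal has to be spelled with an explicit 0o prefix. The following is a small Python 3 sketch of that difference; is_valid_literal is a hypothetical helper written for this illustration, not a standard function.

```python
def is_valid_literal(source):
    """Return True if `source` compiles as a Python 3 expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False

results = {
    "09": is_valid_literal("09"),      # leading zero on a decimal literal
    "010": is_valid_literal("010"),    # old-style implicit octal
    "0o10": is_valid_literal("0o10"),  # explicit octal prefix
    "9": is_valid_literal("9"),
}

explicit_octal = 0o10          # octal 10, that is, decimal 8
parsed_from_text = int("09")   # int() on a string reads the digits as decimal
```

So date fields read from text with leading zeros are best converted with int() rather than pasted into source code as literals.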

3. Consider this extract from the docs for the re module:
"""
\number
Matches the contents of the group of the same number. Groups are
numbered starting from 1. For example, (.+) \1 matches 'the the' or '55
55', but not 'the end' (note the space after the group). This special
sequence can only be used to match one of the first 99 groups. If the
first digit of number is 0, or number is 3 octal digits long, it will
not be interpreted as a group match, but as the character with octal
value number. Inside the "[" and "]" of a character class, all numeric
escapes are treated as characters.
"""

I helped to straighten out this description a few years ago, but I fear
it's still not 100% accurate. Worse, take a peek at the code necessary
to implement this.
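
The ambiguity described in that extract can be seen with a few patterns. This is a Python 3 sketch using the standard re module; the variable names are illustrative, and it covers only the cases named in the extract rather than every corner of the rule.

```python
import re

# \1 after a group is a backreference: the same text must appear twice.
backref = re.compile(r"(.+) \1")
matches_repeat = backref.match("the the") is not None
matches_other = backref.match("the end") is not None

# Three octal digits are read as a character value instead: octal 101 is "A".
octal_escape = re.match(r"\101", "A") is not None

# Inside a character class a numeric escape is always a character, never a group.
in_class = re.match(r"[\1]", "\x01") is not None
```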

===

We (un-Pythonically) implicitly take a leading zero (or even just
\[0-7]) as meaning octal, instead of requiring something explicit as
with hexadecimal. The variable length idea in strings doesn't help:
"\18", "\128" and "\1238" are all strings of length 2.

I don't see any mention of octal in GvR's "Python Regrets" or AMK's
"PEP 3000". Why not? Is it not regretted?

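The point about variable-length octal escapes in string literals is easy to verify, and it still holds in Python 3. In this sketch the list names are illustrative only: an escape consumes up to three octal digits and stops at the first character that is not one, so the trailing 8 is always a separate character.

```python
# Each literal is one octal escape followed by the ordinary character "8".
examples = ["\18", "\128", "\1238"]
lengths = [len(s) for s in examples]

# The same strings written with fixed-width hexadecimal escapes:
# \1 is \x01, \12 is \x0a (newline), \123 is \x53 ("S").
explicit = ["\x018", "\x0a8", "\x538"]
same_strings = examples == explicit
```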
