Identifier normalization: Why is the micro sign converted into the Greek letter mu?


Question

I just stumbled upon the following odd situation:

>>> class Test:
        µ = 'foo'

>>> Test.µ
'foo'
>>> getattr(Test, 'µ')
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    getattr(Test, 'µ')
AttributeError: type object 'Test' has no attribute 'µ'
>>> 'µ'.encode(), dir(Test)[-1].encode()
(b'\xc2\xb5', b'\xce\xbc')

The character I entered is always the µ sign on the keyboard, but for some reason it gets converted. Why does this happen?

Solution

There are two different characters involved here. One is the MICRO SIGN, which is the one on the keyboard, and the other is GREEK SMALL LETTER MU.

To understand what’s going on, we should take a look at how Python defines identifiers in the language reference:

identifier   ::=  xid_start xid_continue*
id_start     ::=  <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>
id_continue  ::=  <all characters in id_start, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>
xid_start    ::=  <all characters in id_start whose NFKC normalization is in "id_start xid_continue*">
xid_continue ::=  <all characters in id_continue whose NFKC normalization is in "id_continue*">

Both our characters, MICRO SIGN and GREEK SMALL LETTER MU, are part of the Ll Unicode category (lowercase letters), so both of them can be used at any position in an identifier. Now note that the definition of identifier actually refers to xid_start and xid_continue, and those are defined as all characters in the respective non-x definition whose NFKC normalization results in a valid character sequence for an identifier.
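
This is easy to check directly with the unicodedata module; a minimal sketch, using the code points of the two characters discussed above:

```python
import unicodedata

micro = '\u00b5'  # MICRO SIGN, the keyboard character
mu = '\u03bc'     # GREEK SMALL LETTER MU

# Both characters fall in general category Ll (lowercase letter),
# which is why either one is valid anywhere in an identifier.
print(unicodedata.name(micro), unicodedata.category(micro))
print(unicodedata.name(mu), unicodedata.category(mu))
```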

Python apparently only cares about the normalized form of identifiers. This is confirmed a bit below:

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.

NFKC is a Unicode normalization that decomposes characters into individual parts. The MICRO SIGN decomposes into GREEK SMALL LETTER MU, and that’s exactly what’s going on there.
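That decomposition can be observed directly with unicodedata.normalize; a minimal sketch:

```python
import unicodedata

micro = '\u00b5'  # MICRO SIGN
mu = '\u03bc'     # GREEK SMALL LETTER MU

# NFKC replaces the compatibility character MICRO SIGN
# with its compatibility equivalent, GREEK SMALL LETTER MU.
print(unicodedata.normalize('NFKC', micro) == mu)  # True
```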

There are a lot of other characters that are also affected by this normalization. Another example is OHM SIGN, which decomposes into GREEK CAPITAL LETTER OMEGA. Using that as an identifier gives a similar result, here shown using locals:

>>> Ω = 'bar'
>>> locals()['Ω']
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    locals()['Ω']
KeyError: 'Ω'
>>> [k for k, v in locals().items() if v == 'bar'][0].encode()
b'\xce\xa9'
>>> 'Ω'.encode()
b'\xe2\x84\xa6'
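
The OHM SIGN case can be verified the same way; a minimal sketch, with the code points taken from the character names above:

```python
import unicodedata

ohm = '\u2126'    # OHM SIGN
omega = '\u03a9'  # GREEK CAPITAL LETTER OMEGA

print(unicodedata.name(ohm))                        # OHM SIGN
print(unicodedata.normalize('NFKC', ohm) == omega)  # True
```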

So in the end, this is just something that Python does. Unfortunately, there isn't really a good way to detect this behavior, which leads to errors such as the one shown. Usually, when the identifier is only used as an identifier, i.e. like a real variable or attribute, everything will be fine: the normalization runs every time, and the identifier is found.

The only problem is with string-based access. Strings are just strings; no normalization happens on them (that would be a bad idea). And the two ways shown here, getattr and locals, both operate on dictionaries. getattr() accesses an object's attribute via the object's __dict__, and locals() returns a dictionary. In dictionaries, keys can be any string, so it's perfectly fine to have a MICRO SIGN or an OHM SIGN in there.
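
The parse-time normalization, as opposed to the raw dictionary keys, can be made visible by compiling a small piece of source with exec and inspecting the resulting namespace; a minimal sketch (exec goes through the same parser as regular code):

```python
ns = {}
# The assignment target is written with MICRO SIGN (U+00B5) in the source;
# the parser normalizes the identifier to GREEK SMALL LETTER MU (U+03BC).
exec('\u00b5 = "foo"', ns)

print('\u03bc' in ns)  # True  - the normalized key is what got stored
print('\u00b5' in ns)  # False - the raw MICRO SIGN key does not exist
```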

In those cases, you need to remember to perform a normalization yourself. We can utilize unicodedata.normalize for this, which then also allows us to correctly get our value from inside locals() (or using getattr):

>>> import unicodedata
>>> normalized_ohm = unicodedata.normalize('NFKC', 'Ω')
>>> locals()[normalized_ohm]
'bar'
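
The same normalize-then-look-up pattern works for getattr; a minimal sketch, assuming a Test class like the one from the question:

```python
import unicodedata

class Test:
    µ = 'foo'  # parsed as an identifier, so stored under U+03BC

# Normalize the raw keyboard character before the dictionary-based lookup,
# mirroring what the parser does for identifiers.
key = unicodedata.normalize('NFKC', '\u00b5')
print(getattr(Test, key))  # foo
```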

