Identifier normalization: Why is the micro sign converted into the Greek letter mu?

Question

I just stumbled upon the following odd situation:

>>> class Test:
        µ = 'foo'

>>> Test.µ
'foo'
>>> getattr(Test, 'µ')
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    getattr(Test, 'µ')
AttributeError: type object 'Test' has no attribute 'µ'
>>> 'µ'.encode(), dir(Test)[-1].encode()
(b'\xc2\xb5', b'\xce\xbc')

The character I entered is always the µ sign on the keyboard, but for some reason it gets converted. Why does this happen?

Recommended answer

There are two different characters involved here. One is the MICRO SIGN, which is the one on the keyboard, and the other is GREEK SMALL LETTER MU.

To understand what’s going on, we should take a look at how Python defines identifiers in the language reference:

identifier   ::=  xid_start xid_continue*
id_start     ::=  <all characters in general categories Lu, Ll, Lt, Lm, Lo, Nl, the underscore, and characters with the Other_ID_Start property>
id_continue  ::=  <all characters in id_start, plus characters in the categories Mn, Mc, Nd, Pc and others with the Other_ID_Continue property>
xid_start    ::=  <all characters in id_start whose NFKC normalization is in "id_start xid_continue*">
xid_continue ::=  <all characters in id_continue whose NFKC normalization is in "id_continue*">

Both our characters, MICRO SIGN and GREEK SMALL LETTER MU, belong to the Unicode general category Ll (lowercase letters), so both of them can be used at any position in an identifier. Now note that the definition of identifier actually refers to xid_start and xid_continue, and those are defined as all characters in the respective non-x definition whose NFKC normalization results in a valid character sequence for an identifier.
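As a quick check of the claim above, the following sketch prints the general category of both characters and asks Python whether each one is a valid identifier on its own. The constant names MICRO and MU are only illustrative, and the characters are written as escapes so that no editor can silently swap one for the other.

```python
import unicodedata

MICRO = "\u00b5"  # MICRO SIGN (the key on the keyboard)
MU = "\u03bc"     # GREEK SMALL LETTER MU

for ch in (MICRO, MU):
    # code point, official name, general category, valid identifier?
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          unicodedata.category(ch), ch.isidentifier())

# U+00B5 MICRO SIGN Ll True
# U+03BC GREEK SMALL LETTER MU Ll True
```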

Python apparently only cares about the normalized form of identifiers. This is confirmed a bit below:

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.
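One way to observe this without a REPL is to compile a small piece of source text that contains the MICRO SIGN and then inspect the resulting namespace. This is only a sketch; the name namespace is an arbitrary choice, and exec is used purely so that the source text can be built from an escape sequence.

```python
MICRO = "\u00b5"  # MICRO SIGN
MU = "\u03bc"     # GREEK SMALL LETTER MU

namespace = {}
# The source text contains MICRO SIGN as the variable name.
exec(f"{MICRO} = 'foo'", namespace)

# The parser stored the name in its NFKC form, i.e. as GREEK SMALL LETTER MU.
print(MU in namespace)     # True
print(MICRO in namespace)  # False
```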

NFKC is a Unicode normalization form that applies a compatibility decomposition, followed by canonical composition. The MICRO SIGN has a compatibility decomposition to GREEK SMALL LETTER MU, and that’s exactly what’s going on there.
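The mapping can be inspected directly with the standard unicodedata module. The sketch below also shows that the canonical form NFC leaves the MICRO SIGN alone, since its mapping to mu is a compatibility mapping; the variable names are illustrative only.

```python
import unicodedata

MICRO = "\u00b5"  # MICRO SIGN

nfkc = unicodedata.normalize("NFKC", MICRO)
nfc = unicodedata.normalize("NFC", MICRO)

print(unicodedata.decomposition(MICRO))  # <compat> 03BC
print(unicodedata.name(nfkc))            # GREEK SMALL LETTER MU
print(nfc == MICRO)                      # True: NFC does not touch it
```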

There are a lot of other characters that are also affected by this normalization. One other example is OHM SIGN, which decomposes into GREEK CAPITAL LETTER OMEGA. Using that as an identifier gives a similar result, here shown using locals:

>>> Ω = 'bar'
>>> locals()['Ω']
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    locals()['Ω']
KeyError: 'Ω'
>>> [k for k, v in locals().items() if v == 'bar'][0].encode()
b'\xce\xa9'
>>> 'Ω'.encode()
b'\xe2\x84\xa6'
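To get a rough idea of how many characters behave like this, one can scan all code points for single characters that Python accepts as an identifier but that change under NFKC. This is an exploratory sketch, not an official list: the dictionary name affected is made up here, and the exact count depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata

# Map each affected character to its NFKC form.
affected = {}
for code_point in range(sys.maxunicode + 1):
    ch = chr(code_point)
    if not ch.isidentifier():
        continue  # not usable as a one-character identifier
    normalized = unicodedata.normalize("NFKC", ch)
    if normalized != ch:
        affected[ch] = normalized

print(len(affected), "single-character identifiers change under NFKC")
print(unicodedata.name(affected["\u2126"]))  # OHM SIGN maps to GREEK CAPITAL LETTER OMEGA
```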

So in the end, this is just something that Python does. Unfortunately, there isn’t really a good way to detect this behavior, which leads to errors such as the one shown. Usually, when the identifier is only referred to as an identifier, i.e. it’s used like a real variable or attribute, then everything will be fine: The normalization runs every time, and the identifier is found.

The only problem is with string-based access. Strings are just strings, so of course no normalization happens there (that would be a bad idea). And the two ways shown here, getattr and locals, both operate on dictionaries: getattr() looks up an object’s attribute by its string name, typically in a __dict__, and locals() returns a dictionary. In dictionaries, keys can be any string, so it’s perfectly fine to have a MICRO SIGN or an OHM SIGN in there.
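The following sketch illustrates that string-based access stores and looks up names exactly as given. The class Test and the stored values are placeholders; the point is only that both spellings can coexist as separate keys.

```python
MICRO = "\u00b5"  # MICRO SIGN
MU = "\u03bc"     # GREEK SMALL LETTER MU

class Test:
    pass

# setattr takes a plain string, so neither name is normalized.
setattr(Test, MICRO, "set via micro sign")
setattr(Test, MU, "set via greek mu")

print(MICRO in vars(Test), MU in vars(Test))  # True True
print(getattr(Test, MICRO))                   # set via micro sign
print(getattr(Test, MU))                      # set via greek mu
```

Only the second attribute is reachable with ordinary attribute syntax, because the parser normalizes whatever is typed after the dot.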

In those cases, you need to remember to perform a normalization yourself. We can utilize unicodedata.normalize for this, which then also allows us to correctly get our value from inside locals() (or using getattr):

>>> import unicodedata
>>> normalized_ohm = unicodedata.normalize('NFKC', 'Ω')
>>> locals()[normalized_ohm]
'bar'
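If string-based attribute access happens in several places, the normalization step could be wrapped in a small helper. The function getattr_normalized below is a hypothetical name, not part of the standard library; it simply applies NFKC to the name before delegating to getattr. The class is created through exec here only so that its source can contain the MICRO SIGN as an escape.

```python
import unicodedata

def getattr_normalized(obj, name, *default):
    """Like getattr, but NFKC-normalize the name first (hypothetical helper)."""
    return getattr(obj, unicodedata.normalize("NFKC", name), *default)

# Define a class whose source text uses MICRO SIGN as the attribute name.
namespace = {}
exec("class Test:\n    \u00b5 = 'foo'", namespace)
Test = namespace["Test"]

print(getattr_normalized(Test, "\u00b5"))           # foo
print(getattr_normalized(Test, "missing", "n/a"))   # n/a
```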
