Unicode subscripts and superscripts in identifiers, why does Python consider Xu == Xᵘ == Xᵤ?
Question
Python allows Unicode identifiers. I defined Xᵘ = 42, expecting Xu and Xᵤ to result in a NameError. But in reality, when I define Xᵘ, Python (silently?) turns Xᵘ into Xu, which strikes me as somewhat unpythonic. Why is this happening?
>>> Xᵘ = 42
>>> print((Xu, Xᵘ, Xᵤ))
(42, 42, 42)
Recommended answer
Python converts all identifiers to their NFKC normal form; from the Identifiers section of the reference documentation:
"All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC."
The NFKC form of both the superscript and subscript characters is the lowercase u:
>>> import unicodedata
>>> unicodedata.normalize('NFKC', 'Xᵘ Xᵤ')
'Xu Xu'
So in the end, all you have is a single identifier, Xu:
>>> import dis
>>> dis.dis(compile('Xᵘ = 42\nprint((Xu, Xᵘ, Xᵤ))', '', 'exec'))
  1           0 LOAD_CONST               0 (42)
              2 STORE_NAME               0 (Xu)

  2           4 LOAD_NAME                1 (print)
              6 LOAD_NAME                0 (Xu)
              8 LOAD_NAME                0 (Xu)
             10 LOAD_NAME                0 (Xu)
             12 BUILD_TUPLE              3
             14 CALL_FUNCTION            1
             16 POP_TOP
             18 LOAD_CONST               1 (None)
             20 RETURN_VALUE
The above disassembly of the compiled bytecode shows that the identifiers were normalised during compilation. This happens during parsing: every identifier is normalised when the AST (Abstract Syntax Tree) is created, and the compiler then uses that AST to produce bytecode.
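One way to check the parsing step directly is to inspect the AST rather than the bytecode. The following is a minimal sketch, assuming a CPython 3 interpreter; the variable names `source`, `tree` and `names` are made up for illustration.

```python
import ast

# The same two-line program as in the disassembly example.
source = "Xᵘ = 42\nprint((Xu, Xᵘ, Xᵤ))"

# Parse only; nothing is compiled to bytecode or executed here.
tree = ast.parse(source)

# Collect the identifier stored on every Name node in the tree.
names = [node.id for node in ast.walk(tree) if isinstance(node, ast.Name)]

# Every spelling of the variable has already collapsed to one name.
print(sorted(set(names)))
```

If the normalisation really happens while the AST is built, the only names left should be `Xu` and `print`, with no trace of the superscript or subscript spellings.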
Identifiers are normalised to avoid many potential 'look-alike' bugs, where you could otherwise end up using both ﬁnd() (using the U+FB01 LATIN SMALL LIGATURE FI character followed by the ASCII characters nd) and find(), and wonder why your code has a bug.
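The ligature case can be reproduced with a short script. This is a sketch under the assumption of a CPython 3 interpreter; `ligature_name`, `ascii_name` and `namespace` are illustrative names, and the function body is invented for the demonstration.

```python
import unicodedata

# "\ufb01" is U+FB01 LATIN SMALL LIGATURE FI, so this is the ligature followed by "nd".
ligature_name = "\ufb01nd"
ascii_name = "find"

# As plain strings the two spellings are different...
strings_differ = ligature_name != ascii_name

# ...but NFKC maps the ligature to the two ASCII letters "fi".
normalized = unicodedata.normalize("NFKC", ligature_name)

# Define a function using the ligature spelling, then look at the
# name that was actually bound in the namespace.
namespace = {}
exec("def " + ligature_name + "():\n    return 'found'\n", namespace)

print(strings_differ, normalized == ascii_name, ascii_name in namespace)
```

Note that only identifiers in source code are normalised. String keys, such as those passed to `getattr()` or used to index `globals()`, are compared as-is, so looking up the ligature spelling as a string would not find the function.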