Unicode subscripts and superscripts in identifiers: why does Python consider Xu == Xᵘ == Xᵤ?

Question

Python allows Unicode identifiers. I defined Xᵘ = 42, expecting Xu and Xᵤ to raise a NameError. In reality, when I define Xᵘ, Python (silently?) turns Xᵘ into Xu, which strikes me as somewhat unpythonic. Why does this happen?

>>> Xᵘ = 42
>>> print((Xu, Xᵘ, Xᵤ))
(42, 42, 42)

Recommended answer

Python converts all identifiers to their NFKC normal form. From the Identifiers section of the reference documentation:

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.

The NFKC form of both the superscript and the subscript character is the lowercase letter u:

>>> import unicodedata
>>> unicodedata.normalize('NFKC', 'Xᵘ Xᵤ')
'Xu Xu'
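
To see why it is specifically the compatibility form (NFKC) that folds these characters, it can help to compare it with the canonical form (NFC). The following is a minimal sketch using only the standard unicodedata module; the constant and variable names are illustrative and not part of the original answer:

```python
import unicodedata

SUPERSCRIPT_U = "\u1d58"  # ᵘ MODIFIER LETTER SMALL U
SUBSCRIPT_U = "\u1d64"    # ᵤ LATIN SUBSCRIPT SMALL LETTER U

forms = {}
for char in (SUPERSCRIPT_U, SUBSCRIPT_U):
    forms[char] = {
        # Canonical composition: leaves compatibility characters alone.
        "NFC": unicodedata.normalize("NFC", "X" + char),
        # Compatibility composition: the form Python applies to identifiers.
        "NFKC": unicodedata.normalize("NFKC", "X" + char),
        # Raw decomposition mapping from the Unicode character database.
        "decomposition": unicodedata.decomposition(char),
    }

for char, info in forms.items():
    print(f"U+{ord(char):04X}", info)
```

Both characters carry only a compatibility decomposition, so NFC leaves them untouched while NFKC replaces each with a plain u.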

So in the end, all you have is a single identifier, Xu:

>>> import dis
>>> dis.dis(compile('Xᵘ = 42\nprint((Xu, Xᵘ, Xᵤ))', '', 'exec'))
  1           0 LOAD_CONST               0 (42)
              2 STORE_NAME               0 (Xu)

  2           4 LOAD_NAME                1 (print)
              6 LOAD_NAME                0 (Xu)
              8 LOAD_NAME                0 (Xu)
             10 LOAD_NAME                0 (Xu)
             12 BUILD_TUPLE              3
             14 CALL_FUNCTION            1
             16 POP_TOP
             18 LOAD_CONST               1 (None)
             20 RETURN_VALUE

The disassembly above shows that the identifiers had already been normalised by the time bytecode was generated. The normalisation happens during parsing: every identifier is normalised when the AST (Abstract Syntax Tree) is created, and the compiler produces bytecode from that AST. The exact opcodes in the listing vary between Python versions, but the name they refer to is Xu in each case.
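
One way to check this without reading bytecode is to look at the AST and at the resulting namespace directly. This is a minimal sketch, assuming CPython 3; the variable names are illustrative and not taken from the original answer:

```python
import ast

# The same program as above, written with escapes so the file encoding
# does not matter: \u1d58 is the superscript u, \u1d64 the subscript u.
source = "X\u1d58 = 42\nprint((Xu, X\u1d58, X\u1d64))"

# Collect every identifier that appears as a Name node in the parsed tree.
tree = ast.parse(source)
names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
print(names)

# Run only the assignment and inspect which key ends up in the namespace.
namespace = {}
exec("X\u1d58 = 42", namespace)
stored_keys = [key for key in namespace if key != "__builtins__"]
print(stored_keys)
```

The AST contains only the names Xu and print, and the assignment stores its value under the key Xu, so the superscript spelling never reaches the compiler or the namespace.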

Identifiers are normalised to avoid many potential 'look-alike' bugs. Otherwise you could end up using both ﬁnd() (spelled with the U+FB01 LATIN SMALL LIGATURE FI character followed by the ASCII characters nd) and find(), and wonder why your code has a bug.
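
The ligature case can be reproduced in a few lines. The sketch below is only an illustration: changed_by_nfkc is a hypothetical helper, not a standard library function, and the function body is a placeholder.

```python
import unicodedata

LIGATURE_NAME = "\ufb01nd"  # U+FB01 LATIN SMALL LIGATURE FI, then ASCII "nd"
ASCII_NAME = "find"

def changed_by_nfkc(name):
    """Hypothetical helper: True if Python would rewrite this identifier."""
    return unicodedata.normalize("NFKC", name) != name

# Define the function under the ligature spelling...
namespace = {}
exec(f"def {LIGATURE_NAME}():\n    return 'called'", namespace)

# ...and call it under the plain ASCII spelling.
result = eval(f"{ASCII_NAME}()", namespace)
print(changed_by_nfkc(LIGATURE_NAME), changed_by_nfkc(ASCII_NAME), result)
```

Because both spellings normalise to the same name, the definition and the call refer to one function rather than two that merely look alike.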
