How is Unicode represented internally in Python?

Question

How is a Unicode string literally represented in Python's memory?

For example, I can visualize 'abc' as its equivalent ASCII bytes in memory, and an integer can be thought of as its two's complement representation. However, u'\u2049' is 3 bytes long in UTF-8 ('\xe2\x81\x89'), so how do I visualize the literal u'\u2049' code point in memory?

Is there a specific way it is stored in memory? Do Python 2 and Python 3 treat it differently?

A few related questions for anyone curious:

1) How are these strings represented internally in the Python interpreter? I don't understand

2) What is the internal representation of a string in Python 3.x

Recommended answer

I'm assuming you want to know about CPython, the standard implementation. Python 2 and Python 3.0-3.2 use either UCS2* or UCS4 for Unicode characters, meaning every character takes either 2 bytes or 4 bytes. Which one is used is a compile-time option.
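As a rough way to see which kind of build an interpreter is, one can inspect sys.maxunicode. This is only a sketch: the helper name unicode_build_kind is made up for illustration, and on Python 3.3 and later the value is always 0x10FFFF, so the narrow branch can only trigger on older interpreters.

```python
import sys


def unicode_build_kind():
    """Return a label describing how this interpreter sizes code points.

    Hypothetical helper for illustration only.
    """
    if sys.maxunicode == 0xFFFF:
        # Narrow build (Python 2 / 3.0-3.2 compiled with 2-byte units).
        return "narrow"
    # Wide build on old interpreters, or any Python 3.3+ interpreter.
    return "wide-or-flexible"


print(hex(sys.maxunicode), unicode_build_kind())
```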

\u2049 is then represented as \x49\x20 (UCS2, little-endian), \x20\x49 (UCS2, big-endian), \x49\x20\x00\x00 (UCS4, little-endian) or \x00\x00\x20\x49 (UCS4, big-endian), depending on the native byte order of your system and whether UCS2 or UCS4 was picked. ASCII characters in a unicode string also use 2 or 4 bytes per character.
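The four byte layouts can be reproduced with Python's fixed-endianness UTF-16 and UTF-32 codecs. This is a sketch that assumes Python 3 and only models the layout of the character data; it does not read the interpreter's actual memory, and the variable names are illustrative.

```python
CODE_POINT = "\u2049"

# For a BMP character, UTF-16 matches the 2-byte UCS2 layout
# and UTF-32 matches the 4-byte UCS4 layout.
layouts = {
    "UCS2 little-endian": CODE_POINT.encode("utf-16-le"),
    "UCS2 big-endian": CODE_POINT.encode("utf-16-be"),
    "UCS4 little-endian": CODE_POINT.encode("utf-32-le"),
    "UCS4 big-endian": CODE_POINT.encode("utf-32-be"),
}

for name, raw in layouts.items():
    print(name, raw.hex())

# ASCII text is widened as well: 2 bytes per character in the UCS2 layout.
ascii_as_ucs2 = "abc".encode("utf-16-le")
print(len(ascii_as_ucs2))
```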

Python 3.3 switched to a new internal representation that uses the most compact form able to represent all characters in a given string: 1 byte, 2 bytes or 4 bytes per character. ASCII and Latin-1 text uses just 1 byte per character, the rest of the BMP characters require 2 bytes, and anything beyond the BMP uses 4 bytes.
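One way to observe the per-string width is to compare sys.getsizeof for two strings of different lengths built from the same character, which cancels out the fixed object header. This is a minimal sketch that assumes CPython 3.3 or later; bytes_per_char is a hypothetical helper, and the exact header sizes vary between versions.

```python
import sys


def bytes_per_char(ch, n=100):
    """Estimate storage per character from the size difference
    between a string of 2*n characters and one of n characters."""
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) // n


for label, ch in [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),
    ("BMP", "\u2049"),
    ("non-BMP", "\U0001F600"),
]:
    print(label, bytes_per_char(ch))
```

Because the width is chosen per string, a single non-BMP character in an otherwise ASCII string makes every character in that string take 4 bytes.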

See PEP-393: Flexible String Representation for the full low-down on these representations.

* Technically speaking, the UCS-2 build uses UTF-16: non-BMP characters are encoded as UTF-16 surrogate pairs, taking 4 bytes (2 UTF-16 code units) each. However, the Python documentation still refers to this as UCS2.

This does lead to unexpected behaviour, such as len() on a unicode string containing non-BMP characters returning more than the number of characters it contains.
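The discrepancy can be imitated on a modern interpreter by counting UTF-16 code units, which is what len() effectively reported on a narrow build. This sketch assumes Python 3.3 or later, where len() counts code points; utf16_code_units is a hypothetical helper name.

```python
def utf16_code_units(text):
    """Count 2-byte UTF-16 code units, as a narrow build's len() would."""
    return len(text.encode("utf-16-le")) // 2


non_bmp = "\U0001F600"  # one character outside the BMP

print(len(non_bmp))               # code points on Python 3.3+
print(utf16_code_units(non_bmp))  # code units, including the surrogate pair
```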
