Unicode vs UTF-8 confusion in Python / Django?

Question

I stumbled over this passage in the Django tutorial:


Django models have a default __str__() method that calls __unicode__() and converts the result to a UTF-8 bytestring. This means that unicode(p) will return a Unicode string, and str(p) will return a normal string, with characters encoded as UTF-8.

Now, I'm confused because afaik Unicode is not any particular representation, so what is a "Unicode string" in Python? Does that mean UCS-2? Googling turned up this "Python Unicode Tutorial" which boldly states


Unicode is a two-byte encoding which covers all of the world's common writing systems.

which is plain wrong, or is it? I have been confused many times by character set and encoding issues, but here I'm quite sure that the documentation I'm reading is confused. Does anybody know what's going on in Python when it gives me a "Unicode string"?

Recommended answer


what is a "Unicode string" in Python? Does that mean UCS-2?

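Unicode strings in Python are stored internally either as UCS-2 (fixed-length 16-bit representation, almost the same as UTF-16) or UCS-4/UTF-32 (fixed-length 32-bit representation). It's a compile-time option; on Windows it's always UTF-16 whilst many Linux distributions set UTF-32 ('wide mode') for their versions of Python. (This applies to Python 2 and to Python 3 before 3.3; from 3.3 onwards the interpreter chooses the storage width per string and the narrow/wide build distinction is gone.)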
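To make the difference between a Unicode string and its encoded forms concrete, here is a minimal sketch. It is written for Python 3 (an assumption; the question is about Python 2), where `str` plays the role of the old `unicode` type and `bytes` the role of the old bytestring. The sample text and variable names are purely illustrative.

```python
import sys

# A Unicode string is a sequence of code points, not bytes.
text = "h\u00e9llo"  # 5 code points; U+00E9 is e with acute accent
assert len(text) == 5

# Encoding chooses a concrete byte representation for those code points.
utf8 = text.encode("utf-8")        # U+00E9 takes 2 bytes, ASCII letters 1 each
utf16 = text.encode("utf-16-le")   # 2 bytes per code point in the BMP
utf32 = text.encode("utf-32-le")   # always 4 bytes per code point

sizes = {"utf-8": len(utf8), "utf-16-le": len(utf16), "utf-32-le": len(utf32)}
print(sizes)  # {'utf-8': 6, 'utf-16-le': 10, 'utf-32-le': 20}

# Decoding with the matching codec recovers the same Unicode string.
assert utf8.decode("utf-8") == text

# Historically 0xFFFF on narrow builds and 0x10FFFF on wide builds;
# on Python 3.3 and later it is always 0x10FFFF.
print(hex(sys.maxunicode))
```

The same five code points occupy 6, 10, or 20 bytes depending on the encoding, which is why "Unicode" on its own does not name a byte layout.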

You are generally not supposed to care: you will see Unicode code-points as single elements in your strings and you won't know whether they're stored as two or four bytes. If you're in a UTF-16 build and you need to handle characters outside the Basic Multilingual Plane you'll be Doing It Wrong, but that's still very rare, and users who really need the extra characters should be compiling wide builds.
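The caveat about characters outside the Basic Multilingual Plane comes down to UTF-16 surrogate pairs. The sketch below, again assuming Python 3.3 or later, splits such a code point into its two 16-bit code units by hand and checks the result against the built-in codec. The helper `to_surrogates` is a hypothetical name introduced only for this illustration.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

high, low = to_surrogates(ord(ch))
print(hex(high), hex(low))  # 0xd834 0xdd1e

# The hand-built pair matches what the UTF-16 codec produces.
manual = high.to_bytes(2, "big") + low.to_bytes(2, "big")
assert manual == ch.encode("utf-16-be")

# On Python 3.3+ this is one element; an old narrow build would report 2,
# because it exposed the two surrogates as separate elements.
print(len(ch))
```

On a narrow build, indexing and slicing could cut such a pair in half, which is the situation the answer warns about.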


plain wrong, or is it?



Yes, it's quite wrong. To be fair I think that tutorial is rather old; it probably pre-dates wide Unicode strings, if not Unicode 3.1 (the version that introduced characters outside the Basic Multilingual Plane).

There is an additional source of confusion stemming from Windows's habit of using the term "Unicode" to mean, specifically, the UTF-16LE encoding that NT uses internally. People from Microsoftland may often copy this somewhat misleading habit.
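As a small illustration of why that naming habit causes trouble, the sketch below (Python 3 assumed, sample text arbitrary) shows what "Unicode" text in the Windows sense looks like as bytes, and what happens when it is read back with the wrong codec.

```python
import codecs

text = "Hi"

# What Windows tooling often calls "Unicode" is specifically UTF-16LE.
raw = text.encode("utf-16-le")
print(raw)  # b'H\x00i\x00'

# Files written that way frequently start with a byte order mark (BOM);
# Python's generic "utf-16" codec detects and strips it when decoding.
with_bom = codecs.BOM_UTF16_LE + raw
decoded = with_bom.decode("utf-16")
assert decoded == text

# Reading the same bytes as UTF-8 does not fail here, but the result is
# littered with NUL characters: a typical symptom of mixing the two up.
misread = raw.decode("utf-8")
print(repr(misread))  # 'H\x00i\x00'
```

Stating the encoding by name (UTF-8, UTF-16LE, and so on) rather than saying "Unicode" avoids this kind of mismatch.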
