Why does encoding in UTF-8 still result in ASCII?
Question
Given this code:
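# coding=utf-8
# Python 2 script: str is a byte string, unicode is text.
import sys
import chardet

print(sys.getdefaultencoding())

# Byte string containing only ASCII characters
a = 'abc'
print(type(a))
print(chardet.detect(a))

# Decode the bytes into a unicode object
b = a.decode('ascii')
print(type(b))

# Byte string with non-ASCII content, stored in the source file's encoding
c = '中文'
print(type(c))
print(chardet.detect(c))

# Encode the unicode object back to bytes as UTF-8
m = b.encode('utf-8')
print(type(m))
print(chardet.detect(m))

# Unicode literal, encoded as UTF-8
n = u'abc'
print(type(n))
x = n.encode(encoding='utf-8')
print(type(x))
print(chardet.detect(x))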
I use utf-8 to encode n, but chardet still reports the result as ascii.
So I want to know: what is the relationship between utf-8, ascii, and unicode?
I am running this with Python 2.
Recommended answer
UTF-8 is a variable-width encoding, and it is designed so that every ASCII character maps to the same single byte in UTF-8.
Since your UTF-8 string contains only ASCII characters, the encoded bytes are valid as both ASCII and UTF-8, which is why chardet reports ascii.
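One way to check this claim is to encode the same ASCII-only text with both codecs and compare the bytes. This is a minimal sketch that assumes Python 3 (not the asker's Python 2), and the variable names are illustrative rather than taken from the question.

```python
# Assumes Python 3, where str holds Unicode text and encode() returns bytes.
text = "abc"

as_utf8 = text.encode("utf-8")
as_ascii = text.encode("ascii")

# For ASCII-only text the two encodings produce identical bytes,
# so a detector cannot tell them apart from the bytes alone.
same_bytes = as_utf8 == as_ascii
print(as_utf8, as_ascii, same_bytes)

# The bytes also decode cleanly with either codec.
round_trip = as_utf8.decode("ascii") == as_utf8.decode("utf-8") == text
print(round_trip)
```

Both comparisons come out true for ASCII-only input, and that is all a byte-level detector has to go on.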
This example (run in Python 3, where str is Unicode text) might help:
>>> c = '中文abc中文'
>>>
>>>
>>> c
'中文abc中文'
>>> c.encode(encoding="UTF-8")
b'\xe4\xb8\xad\xe6\x96\x87abc\xe4\xb8\xad\xe6\x96\x87'
Notice how each character of "abc" in the UTF-8 output is a single byte? They are the same bytes as their ASCII counterparts!
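To make the variable width visible, the sketch below counts how many bytes each character occupies in UTF-8. It assumes Python 3, and utf8_widths is a hypothetical helper named only for this illustration.

```python
# Assumes Python 3. utf8_widths is a hypothetical helper for illustration.
def utf8_widths(text):
    """Return (character, UTF-8 byte count) pairs for each character."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


widths = utf8_widths("中文abc")
for ch, n in widths:
    print(ch, n)

# ASCII characters take one byte each; these two CJK characters take three each.
```

The ASCII characters keep their one-byte form, while the Chinese characters expand to multi-byte sequences, matching the escaped bytes in the session above.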