Decode-encode UTF-8 doesn't lead to the original unicode

Question

When I try to separate two Unicode characters by decoding a UTF-8 string and then encoding each element again, I do not get the original bytes back; I get different ones.

The session below shows the output I get when I try this.

>>> s ='\xf0\x9f\x93\xb1\xf0\x9f\x9a\xac'
>>> u = s.decode("utf-8")
>>> u
u'\U0001f4f1\U0001f6ac'
>>> u[0].encode("utf-8")
'\xed\xa0\xbd'
>>> u[1].encode("utf-8")
'\xed\xb3\xb1'
>>> u[0]
u'\ud83d'
>>> u[1]
u'\udcf1'

Recommended answer

Your Python build uses UCS-2 (16 bits per code unit), but these particular Unicode characters lie outside the Basic Multilingual Plane and do not fit in 16 bits. Each one is therefore stored as a surrogate pair, so each element of u represents only "half" of a character. u.encode('utf-8') works properly because the encoder understands the encoding and recombines the two halves.
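The "half of a character" behavior can be reproduced on a current interpreter. The following is a minimal sketch, assuming Python 3.3 or later (where strings are no longer stored in 16-bit units); the helper name `utf16_code_units` is hypothetical and exists only for illustration. It splits the decoded string into UTF-16 code units to expose the surrogate pairs, then encodes the lone surrogates to show where the bytes in the question come from.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit UTF-16 code units that make up `text` (illustrative helper)."""
    raw = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))


s = b"\xf0\x9f\x93\xb1\xf0\x9f\x9a\xac"
u = s.decode("utf-8")

# Two code points, but four 16-bit code units (two surrogate pairs).
units = utf16_code_units(u)
print([hex(x) for x in units])

# Encoding a lone surrogate reproduces the bytes seen in the question.
# The 'surrogatepass' handler is required because Python 3 refuses to
# encode lone surrogates by default.
high = chr(units[0]).encode("utf-8", "surrogatepass")
low = chr(units[1]).encode("utf-8", "surrogatepass")
print(high, low)

# On Python 3.3+ each element of u is already a whole character,
# so per-character encoding round-trips to the original bytes.
parts = [ch.encode("utf-8") for ch in u]
print(parts)
```

On such an interpreter `len(u)` is 2 and joining `parts` gives back `s`, which is the result the question was expecting.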

Your UTF-8 string encodes these two characters:

U+1F4F1 MOBILE PHONE character (📱)

U+1F6AC SMOKING SYMBOL character (🚬)

(via this decoder: http://software.hixie.ch/utilities/cgi/unicode-decoder/utf8-decoder )
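The same identification can be done locally instead of through an online decoder. This is a small sketch, assuming Python 3.3 or later and the standard `unicodedata` module; the variable name `described` is only illustrative.

```python
import unicodedata

s = b"\xf0\x9f\x93\xb1\xf0\x9f\x9a\xac"

# Pair each decoded code point with its official Unicode character name.
described = [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in s.decode("utf-8")]

for code_point, name in described:
    print(code_point, name)
```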
