String Encodings IDNA -> UTF-8 (Python)

Question

String encodings and formats always throw me.

This is what I have:

'ไทย'

which I believe is UTF-8, and

'xn--o3cw4h'

which should be the same thing in IDNA encoding. However, I can't figure out how to get Python to convert from one to the other.

I just tried:

a = u'xn--o3cw4h'
b = a.encode('idna')
b.decode('utf-8')

but I get the exact same string back ('xn--o3cw4h', although no longer unicode). I am currently using Python 3.5.

Recommended answer

To convert from one encoding to another, first decode the data to a Unicode string, then encode that string in the target encoding.

For example:

idna_encoded_bytes = b'xn--o3cw4h'
unicode_string = idna_encoded_bytes.decode('idna')
utf8_encoded_bytes = unicode_string.encode('utf-8')

print(repr(idna_encoded_bytes))
print(repr(utf8_encoded_bytes))
print(repr(unicode_string))

Python 2 result:

'xn--o3cw4h'
'\xe0\xb9\x84\xe0\xb8\x97\xe0\xb8\xa2'
u'\u0e44\u0e17\u0e22'

As you can see, the first line is the IDNA encoding of ไทย, the second line is the UTF-8 encoding, and the final line is the unencoded sequence of Unicode code points U+0E44, U+0E17, and U+0E22.
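To make the relationship between the three representations concrete, here is a small sketch that lists the code points of the decoded string and compares lengths. It assumes Python 3 and uses only built-in codecs; the variable `code_points` is introduced here for illustration.

```python
idna_encoded_bytes = b'xn--o3cw4h'
unicode_string = idna_encoded_bytes.decode('idna')

# List each character's code point in the usual U+XXXX notation.
code_points = ['U+{:04X}'.format(ord(ch)) for ch in unicode_string]
print(code_points)

# In UTF-8, each of these three characters is stored as three bytes.
utf8_encoded_bytes = unicode_string.encode('utf-8')
print(len(unicode_string), len(utf8_encoded_bytes))
```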

To do the conversion in one step, just chain the operations:

utf8_encoded_bytes = idna_encoded_bytes.decode('idna').encode('utf8')
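The same chaining should work in the opposite direction, from UTF-8 bytes to IDNA. The following is a minimal sketch assuming Python 3; the name `round_trip` is only illustrative.

```python
utf8_encoded_bytes = b'\xe0\xb9\x84\xe0\xb8\x97\xe0\xb8\xa2'

# Decode UTF-8 to a Unicode string, then encode that string as IDNA.
idna_encoded_bytes = utf8_encoded_bytes.decode('utf-8').encode('idna')
print(repr(idna_encoded_bytes))

# Converting back again should reproduce the original UTF-8 bytes.
round_trip = idna_encoded_bytes.decode('idna').encode('utf-8')
print(round_trip == utf8_encoded_bytes)
```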


Responding to a comment:

What I'm starting with isn't b'xn--o3cw4h' but just the string 'xn--o3cw4h' [in Python 3].

You have an odd duck there. You have apparently-encoded data stored in a Unicode string. We'll need to convert that to a bytes object somehow. An easy way to do that is to use (confusingly) the ASCII encoding:

improperly_encoded_idna = 'xn--o3cw4h'
idna_encoded_bytes = improperly_encoded_idna.encode('ascii')
unicode_string = idna_encoded_bytes.decode('idna')
utf8_encoded_bytes = unicode_string.encode('utf-8')

print(repr(idna_encoded_bytes))
print(repr(utf8_encoded_bytes))
print(repr(unicode_string))
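If the input may arrive as either a str or a bytes object, the steps above could be wrapped in one function. This is a sketch assuming Python 3; `idna_to_utf8` is a hypothetical helper name, not something from the standard library.

```python
def idna_to_utf8(value):
    """Convert an IDNA-encoded name (str or bytes) to UTF-8 bytes.

    Hypothetical helper for illustration; not part of the standard library.
    """
    if isinstance(value, str):
        # IDNA-encoded names are plain ASCII, so this step loses nothing.
        value = value.encode('ascii')
    return value.decode('idna').encode('utf-8')

print(repr(idna_to_utf8('xn--o3cw4h')))
print(repr(idna_to_utf8(b'xn--o3cw4h')))

# Encoding an already-ASCII str with 'idna' leaves the text unchanged,
# which is why the attempt in the question returned the same string.
print(repr('xn--o3cw4h'.encode('idna')))
```

The last line shows the pitfall from the question: calling `.encode('idna')` on text that is already in its ASCII form only turns it into bytes, so the conversion has to go through `.decode('idna')`.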
