Python 3.6, UTF-8 to Unicode conversion, string with double backslashes

Question

There are many questions about UTF-8 to Unicode conversion, but I still haven't found an answer to my issue.

Consider a string like this:

a = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"

Python 3.6 interprets this string as Je-li pro za\xc5\x99azov\xc3\xa1n\xc3\xad. I need to convert this UTF-8-like string to its Unicode representation. The final result should be Je-li pro zařazování.

With a.decode("utf-8") I get AttributeError: 'str' object has no attribute 'decode', because Python treats the object as already decoded.

If I first convert it to bytes with bytes(a, "utf-8"), the backslashes are merely doubled, and .decode("utf-8") gives me back my current a.
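To make the two failed attempts concrete, here is a small sketch of what the string actually contains. The variable names other than a are illustrative only, and the slice indices assume the exact sample string from the question.

```python
a = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"

# In Python 3, str is already text, so it has no .decode() method.
has_decode = hasattr(a, "decode")
print(has_decode)

# The string holds a literal backslash followed by "x", "c", "5":
# four separate characters, not the single byte 0xC5.
first_escape = a[12:16]
print(list(first_escape))

# Encoding to UTF-8 keeps those ASCII characters unchanged; the doubled
# backslashes are only how repr() displays a single backslash in bytes.
as_bytes = bytes(a, "utf-8")
print(as_bytes)

# Decoding again simply reverses the encoding, so nothing is unescaped.
round_trip = as_bytes.decode("utf-8")
print(round_trip == a)
```

The round trip never interprets the \x escapes, which is why a separate unescaping step is needed.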

How can I get the Unicode string Je-li pro zařazování from this a?

Recommended answer

You need four encode/decode steps to get the desired result:

print(
  "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"

  # str -> bytes; any encoding that covers printable ASCII would work, for example utf-8
  .encode('ascii')

  # unescape the string: each literal \xNN becomes one code point in the range 0-255
  # source: https://stackoverflow.com/a/1885197
  .decode('unicode-escape')

  # map those code points back to single bytes ('latin-1' names the same codec),
  # see https://stackoverflow.com/q/7048745
  .encode('iso-8859-1')

  # finally, decode the raw bytes as UTF-8
  .decode('utf-8')
)
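The same chain can be wrapped in a small helper with each intermediate value named, which may make the four steps easier to follow. This is a minimal sketch: the function name unescape_utf8 is hypothetical, and it assumes the input contains only ASCII characters plus \xNN escapes that together spell valid UTF-8.

```python
def unescape_utf8(text: str) -> str:
    """Turn literal \\xNN escapes that spell UTF-8 bytes into real text."""
    # Step 1: str -> bytes. Raises UnicodeEncodeError on non-ASCII input.
    raw = text.encode("ascii")
    # Step 2: interpret the escapes; each \xNN becomes one code point 0-255.
    unescaped = raw.decode("unicode-escape")
    # Step 3: code points 0-255 -> bytes with the same numeric values.
    as_bytes = unescaped.encode("latin-1")
    # Step 4: read those bytes as UTF-8.
    return as_bytes.decode("utf-8")


a = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"
result = unescape_utf8(a)
print(result)
```

If the input can already contain non-ASCII characters, step 1 fails, so this helper is only suitable for data that is known to be fully escaped.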

Besides, if you can, consider asking the program that produces this data (the data source) to use a different output format, such as a byte array or base64.
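As an illustration of the base64 suggestion, the sketch below shows a hypothetical producer and consumer. The names wire and received are made up for this example, and nothing here reflects how the actual data source in the question works.

```python
import base64

text = "Je-li pro zařazování"

# Producer side (hypothetical): encode the UTF-8 bytes as ASCII-only base64.
wire = base64.b64encode(text.encode("utf-8")).decode("ascii")
print(wire)

# Consumer side: one b64decode plus one UTF-8 decode, with no escape handling.
received = base64.b64decode(wire).decode("utf-8")
print(received)
```

Because the transported value is plain ASCII, it avoids the ambiguity between escaped and unescaped backslashes entirely.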

An unsafe but shorter way:

st = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"
print(eval("b'"+st+"'").decode('utf-8'))

There is also ast.literal_eval, but it may not be worth using here.
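For comparison, a minimal sketch of the ast.literal_eval variant follows. It only evaluates literals, so it does not execute arbitrary code the way eval can, but it still builds a bytes literal by string concatenation and will raise an error if st contains a single quote or a newline.

```python
import ast

st = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"

# Build the source text of a bytes literal, b'...', and let the parser
# interpret the \xNN escapes; literal_eval accepts only literal syntax.
as_bytes = ast.literal_eval("b'" + st + "'")
decoded = as_bytes.decode("utf-8")
print(decoded)
```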
