Python - Unicode to ASCII conversion


Problem description

I am unable to convert the following Unicode to ASCII without losing data:

u'ABRA\xc3O JOS\xc9'

I tried encode and decode and they won’t do it.

Does anyone have a suggestion?

Solution

The Unicode characters u'\xc3' and u'\xc9' do not have any corresponding ASCII values. So, if you don't want to lose data, you have to encode that data in some way that's valid as ASCII. Options include:

>>> print s.encode('ascii', errors='backslashreplace')
ABRA\xc3O JOS\xc9
>>> print s.encode('ascii', errors='xmlcharrefreplace')
ABRA&#195;O JOS&#201;
>>> print s.encode('unicode-escape')
ABRA\xc3O JOS\xc9
>>> print s.encode('punycode')
ABRAO JOS-jta5e

All of these are ASCII strings, and contain all of the information from your original Unicode string (so they can all be reversed without loss of data), but none of them are all that pretty for an end-user (and none of them can be reversed just by decode('ascii')).
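
To make the "can be reversed" claim concrete, here is a minimal sketch of each round trip. It assumes Python 3 (the session above uses Python 2 syntax), where `encode` returns `bytes`; the `round_trips` dictionary and the choice of `html.unescape` as the reverse step for character references are illustrative choices, not part of the original answer.

```python
import html

s = 'ABRA\xc3O JOS\xc9'  # the string from the question, as a Python 3 str

# Each entry maps a name to (ASCII-only bytes, function that reverses the encoding).
round_trips = {
    'backslashreplace': (
        s.encode('ascii', errors='backslashreplace'),
        lambda b: b.decode('unicode-escape'),
    ),
    'xmlcharrefreplace': (
        s.encode('ascii', errors='xmlcharrefreplace'),
        lambda b: html.unescape(b.decode('ascii')),
    ),
    'unicode-escape': (
        s.encode('unicode-escape'),
        lambda b: b.decode('unicode-escape'),
    ),
    'punycode': (
        s.encode('punycode'),
        lambda b: b.decode('punycode'),
    ),
}

for name, (encoded, reverse) in round_trips.items():
    # A plain decode('ascii') succeeds but does not give back the original text...
    assert encoded.decode('ascii') != s
    # ...while the matching reverse step does.
    assert reverse(encoded) == s
    print(name, encoded)
```

One caveat with the `backslashreplace` route: if the original text already contains literal backslash sequences, the output is ambiguous and cannot be reversed reliably, whereas `unicode-escape` also escapes the backslashes themselves.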

See str.encode, Python Specific Encodings, and Unicode HOWTO for more info.


As a side note, when some people say "ASCII", they really don't mean "ASCII" but rather "any 8-bit character set that's a superset of ASCII" or "some particular 8-bit character set that I have in mind". If that's what you meant, the solution is to encode to the right 8-bit character set:

>>> s.encode('utf-8')
'ABRA\xc3\x83O JOS\xc3\x89'
>>> s.encode('cp1252')
'ABRA\xc3O JOS\xc9'
>>> s.encode('iso-8859-15')
'ABRA\xc3O JOS\xc9'
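
For readers on Python 3, the same calls might look like the sketch below. It assumes Python 3 semantics, where `encode` returns a `bytes` object and decoding with the same character set restores the original string; the loop and variable names are illustrative.

```python
s = 'ABRA\xc3O JOS\xc9'  # the string from the question

encoded = {}
for charset in ('utf-8', 'cp1252', 'iso-8859-15'):
    data = s.encode(charset)          # bytes in Python 3
    assert data.decode(charset) == s  # decoding with the same charset is lossless
    encoded[charset] = data
    print(charset, data)

# UTF-8 needs two bytes for each accented letter; the other two need one.
print(len(encoded['utf-8']), len(encoded['cp1252']), len(encoded['iso-8859-15']))
```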

The hard part is knowing which character set you meant. If you're writing both the code that produces the 8-bit strings and the code that consumes them, and you don't know any better, you meant UTF-8. If the code that consumes the 8-bit strings is, say, the open function or a web browser that you're serving a page to or something else, things are more complicated, and there's no easy answer without a lot more information.
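
A short sketch of why the choice of character set matters, assuming Python 3: if the producer and the consumer disagree, the bytes either decode to the wrong text or fail to decode at all. The variable names here are illustrative.

```python
s = 'ABRA\xc3O JOS\xc9'  # the string from the question

# Producer writes UTF-8, consumer wrongly assumes cp1252: no error, wrong text.
utf8_bytes = s.encode('utf-8')
misread = utf8_bytes.decode('cp1252')
print(misread == s)  # False: each accented letter became two characters

# Producer writes cp1252, consumer wrongly assumes UTF-8: decoding fails outright.
cp1252_bytes = s.encode('cp1252')
try:
    cp1252_bytes.decode('utf-8')
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)
```

The first case is the more dangerous one in practice, since nothing raises and the corrupted text can be stored or displayed unnoticed.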
