Python: Sanitize a string for unicode?


Question


Possible Duplicate:
Python UnicodeDecodeError - Am I misunderstanding encode?

I have a string that I'm trying to make safe for the unicode() function:

>>> s = " foo “bar bar ” weasel"
>>> s.encode('utf-8', 'ignore')

Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    s.encode('utf-8', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)
>>> unicode(s)

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    unicode(s)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)

I'm mostly flailing around here. What do I need to do to remove the unsafe characters from the string?

Somewhat related to this question, although I was unable to solve my problem from it.

This also fails:

>>> s
' foo \x93bar bar \x94 weasel'
>>> s.decode('utf-8')

Traceback (most recent call last):
  File "<pyshell#13>", line 1, in <module>
    s.decode('utf-8')
  File "C:\Python25254\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 5: unexpected code byte

Solution

Good question. Encoding issues are tricky. Let's start with "I have a string." Strings in Python 2 aren't really "strings"; they're byte arrays. So where did your string come from, and what encoding is it in? Your example shows curly quotes in the literal, and I'm not even sure how you did that. When I try to paste it into a Python interpreter, or type it on OS X with Option-[, it doesn't come through.

Looking at your second example though, you have a byte with hex value 93. That can't be UTF-8, because in UTF-8 any byte higher than 127 is part of a multibyte sequence, and \x93 on its own (a continuation byte with no lead byte before it) is not a valid one. So I'm guessing it's supposed to be Latin-1. The problem is, \x93 isn't a printable character in the Latin-1 character set. The range from \x80 to \x9f in Latin-1 holds no printable characters; it is set aside for control codes. However, Microsoft saw that unused range and decided to put "curly quotes" (among other things) in there. In doing so they created a similar encoding called "windows-1252", which is like Latin-1 with printable characters in that range.
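
As a rough illustration of the reasoning above, the sketch below decodes the same bytes with three codecs. It is written for Python 3, where the byte string from the question has to be spelled as a bytes literal; the variable names are made up for this example and are not part of the original question.

```python
# Python 3 sketch: the bytes from the question, written as a bytes literal.
raw = b" foo \x93bar bar \x94 weasel"

# 1. UTF-8: 0x93 is a continuation byte with no lead byte before it,
#    so strict decoding raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# 2. Latin-1: every byte maps to some code point, so decoding never fails,
#    but 0x93 becomes the invisible control character U+0093.
as_latin1 = raw.decode("latin-1")

# 3. windows-1252 (alias "cp1252"): 0x93 and 0x94 are the curly quotes.
as_cp1252 = raw.decode("windows-1252")

print(utf8_ok)            # False
print(ascii(as_latin1))   # ' foo \x93bar bar \x94 weasel'
print(ascii(as_cp1252))   # ' foo \u201cbar bar \u201d weasel'
```

Note that Latin-1 "succeeding" is not evidence that the data is Latin-1. It only means the codec has no invalid bytes, which is why the result has to be checked by eye.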

So, let's assume it is windows-1252. What now? str.decode converts bytes into Unicode, so that's the one you want. Your second example was on the right track, but it failed because the string wasn't UTF-8. Try:

>>> uni = 'foo \x93bar bar\x94 weasel'.decode("windows-1252")
u'foo \u201cbar bar\u201d weasel'
>>> print uni
foo “bar bar” weasel
>>> type(uni)
<type 'unicode'>

That's correct, because the opening curly quote is Unicode U+201C. Now that you have Unicode, you can serialize it to bytes in any encoding you choose (if you need to pass it across the wire) or just keep it as Unicode if it's staying within Python. If you want to convert to UTF-8, use the opposite method, encode.

>>> uni.encode("utf-8")
'foo \xe2\x80\x9cbar bar\xe2\x80\x9d weasel'

Curly quotes take 3 bytes each to encode in UTF-8. You could use UTF-16 and they'd only be two bytes each. You can't encode as ASCII or Latin-1 though, because those don't have curly quotes.
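
To make the byte counts concrete, and to get back to the original question about dropping characters that a target encoding cannot hold, here is a small Python 3 sketch. The 'ignore' and 'replace' error handlers are the standard codec options (the question already tried 'ignore'); the variable names are illustrative only.

```python
# Python 3 sketch: text that has already been decoded correctly.
uni = "foo \u201cbar bar\u201d weasel"
quote = "\u201c"  # left curly quote

utf8_len = len(quote.encode("utf-8"))       # 3 bytes
utf16_len = len(quote.encode("utf-16-le"))  # 2 bytes ("-le" writes no byte order mark)

# Strict ASCII encoding fails because ASCII has no curly quotes.
try:
    uni.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Error handlers trade correctness for robustness:
stripped = uni.encode("ascii", "ignore")    # drops unencodable characters
replaced = uni.encode("ascii", "replace")   # substitutes "?" for them

print(utf8_len, utf16_len, ascii_ok)  # 3 2 False
print(stripped)                       # b'foo bar bar weasel'
print(replaced)                       # b'foo ?bar bar? weasel'
```

Dropping characters this way loses information, so it probably belongs at the very end of the pipeline, after the bytes have been decoded with the right codec, and only when the destination really is limited to ASCII.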
