Python: Sanitize a string for unicode?

Question


Possible duplicate:

Python UnicodeDecodeError - Am I misunderstanding encode?

I have a string that I'm trying to make safe for the unicode() function:

>>> s = " foo “bar bar ” weasel"
>>> s.encode('utf-8', 'ignore')

Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    s.encode('utf-8', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)
>>> unicode(s)

Traceback (most recent call last):
  File "<pyshell#9>", line 1, in <module>
    unicode(s)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 5: ordinal not in range(128)

I'm mostly flailing around here. What do I need to do to remove the unsafe characters from the string?

Somewhat related to this question, although I was unable to solve my problem from it.



This also fails:

>>> s
' foo \x93bar bar \x94 weasel'
>>> s.decode('utf-8')

Traceback (most recent call last):
  File "<pyshell#13>", line 1, in <module>
    s.decode('utf-8')
  File "C:\Python25\254\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 5: unexpected code byte


Recommended answer

Good question. Encoding issues are tricky. Let's start with "I have a string." Strings in Python 2 aren't really "strings," they're byte arrays. So your string, where did it come from and what encoding is it in? Your example shows curly quotes in the literal, and I'm not even sure how you did that. I tried pasting it into a Python interpreter, and typing it on OS X with Option-[, and it didn't come through.

Looking at your second example though, you have a byte of hex 93. That can't be UTF-8, because in UTF-8 any byte higher than 127 is part of a multibyte sequence, and a lone x93 is not a valid one. So I'm guessing it's supposed to be Latin-1. The problem is, x93 isn't a printable character in the Latin-1 character set. The range from x7f to x9f in Latin-1 holds no printable characters (it is reserved for control codes). However, Microsoft saw that unused range and decided to put "curly quotes" in there. In doing so they created a similar encoding called "windows-1252", which is like Latin-1 with printable characters in that range.
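To make the comparison concrete, here is a minimal sketch that decodes the bytes from the question with each of the three codecs mentioned above. It uses a bytes literal so that it behaves the same on Python 2.6+ and Python 3; the variable names are only illustrative. Note that Python's own latin-1 codec does not reject x93: it maps it to the control code U+0093, which is rarely what you want.

```python
# The byte string from the question, written as a bytes literal so the
# sketch behaves the same on Python 2.6+ and Python 3.
raw = b' foo \x93bar bar \x94 weasel'

# 1. UTF-8: 0x93 is a continuation byte with no lead byte before it,
#    so decoding raises UnicodeDecodeError.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# 2. Latin-1: decoding never fails, but 0x93 becomes the invisible
#    control code U+0093 rather than a quote character.
as_latin1 = raw.decode('latin-1')

# 3. windows-1252: 0x93 and 0x94 map to the curly quotes U+201C and U+201D.
as_cp1252 = raw.decode('windows-1252')
```

Only the third result contains real quotation marks, which is why windows-1252 is the most plausible guess for this data.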

So, let's assume it is windows-1252. What now? In Python 2, str.decode converts bytes into Unicode, so that's the one you want. Your second example was on the right track, but it failed because the string wasn't UTF-8. Try:

>>> uni = 'foo \x93bar bar\x94 weasel'.decode("windows-1252")
>>> uni
u'foo \u201cbar bar\u201d weasel'
>>> print uni
foo “bar bar” weasel
>>> type(uni)
<type 'unicode'>

That's correct, because the opening curly quote is Unicode U+201C (and the closing one is U+201D). Now that you have Unicode, you can serialize it to bytes in any encoding you choose (if you need to pass it across the wire) or just keep it as Unicode if it's staying within Python. If you want to convert to UTF-8, use the opposite method, unicode.encode.

>>> uni.encode("utf-8")
'foo \xe2\x80\x9cbar bar\xe2\x80\x9d weasel'

Curly quotes take 3 bytes to encode in UTF-8. You could use UTF-16 and they'd only be two bytes. You can't encode as ASCII or Latin-1 though, because those don't have curly quotes.
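The byte counts and the ASCII/Latin-1 limitation can be checked with a short sketch like the one below. It also shows one possible way to get the "sanitized" output the question asks for, using the standard 'ignore' and 'replace' error handlers of encode; whether dropping or replacing characters is acceptable depends on your data, so treat that part as an assumption rather than a recommendation. The u'' literals work on Python 2 and on Python 3.3+, and 'utf-16-le' is used so that no byte order mark is counted.

```python
quote = u'\u201c'  # opening curly quote

# Size of one curly quote in each encoding.
utf8_len = len(quote.encode('utf-8'))       # three bytes
utf16_len = len(quote.encode('utf-16-le'))  # two bytes, no byte order mark

# Latin-1 (and ASCII) have no curly quotes, so strict encoding fails.
try:
    quote.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

text = u'foo \u201cbar bar\u201d weasel'

# If plain ASCII bytes are required, an error handler decides what
# happens to characters that do not fit.
ascii_dropped = text.encode('ascii', 'ignore')   # quotes silently removed
ascii_marked = text.encode('ascii', 'replace')   # quotes become '?'
```

Both handlers lose information, so decoding with the right codec first and keeping Unicode for as long as possible is usually the safer route.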
