Is there an easy way to make unicode work in python?


Question

I'm trying to deal with unicode in python 2.7.2. I know there is the .encode('utf-8') thing, but half the time when I add it I get errors, and half the time when I don't add it I get errors.

Is there any way to tell python, which I thought was an up-to-date and modern language, to just use unicode for strings and not make me fart around with .encode('utf-8') stuff?

I know... python 3.0 is supposed to do this, but I can't use 3.0, and 2.7 isn't all that old anyway...

For example:

url = "http://en.wikipedia.org//w/api.php?action=query&list=search&format=json&srlimit=" + str(items) + "&srsearch=" + urllib2.quote(title.encode('utf-8'))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)

Update: If I remove all the .encode statements from my code and add # -*- coding: utf-8 -*- to the top of my file, right under the #!/usr/bin/python line, then I get the following, the same as if I hadn't added the # -*- coding: utf-8 -*- line at all.

/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py:1250: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  return ''.join(map(quoter, s))
Traceback (most recent call last):
  File "classes.py", line 583, in <module>
    wiki.getPage(title)
  File "classes.py", line 146, in getPage
    url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&format=json&rvprop=content&rvlimit=1&titles=" + urllib2.quote(title)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1250, in quote
    return ''.join(map(quoter, s))
KeyError: u'\xf1'

I'm not manually typing in any strings; I'm parsing HTML and json from websites. So the strings/bytestreams/whatever they are, are all created by python.

Update 2: I can move the error along, but it just keeps coming up in new places. I was hoping python would be a useful scripting tool, but it looks like after 3 days of no luck I'll just try a different language. It's a shame, since python is preinstalled on osx. I've marked as correct the answer that fixed the one instance of the error I posted.

Recommended answer

There is no way to make unicode "just work" apart from using unicode strings everywhere and immediately decoding any encoded string you receive. You must always keep straight whether you're dealing with encoded or unencoded data, or use tools that keep track of it for you, or you're going to have a bad time.
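
As a rough illustration of "decode immediately, use unicode everywhere", here is a minimal sketch. The helper names `decode_input` and `encode_output` and the sample JSON payload are made up for this example, and it assumes the incoming bytes really are UTF-8. It is written to run on both Python 2.7 and Python 3.

```python
import json


def decode_input(raw_bytes, encoding="utf-8"):
    """Decode bytes once, right where they enter the program."""
    return raw_bytes.decode(encoding)


def encode_output(text, encoding="utf-8"):
    """Encode text once, right where it leaves the program."""
    return text.encode(encoding)


# Hypothetical payload, standing in for bytes read from a socket or file.
raw = b'{"title": "Espa\xc3\xb1a"}'

# Boundary in: bytes -> unicode text.
text = decode_input(raw)

# Everything in the middle works on unicode text only.
data = json.loads(text)
title = data["title"]
line = u"Page: " + title

# Boundary out: unicode text -> bytes.
out = encode_output(line)
```

The point of the layout is that `.decode()` and `.encode()` each appear in exactly one place, so there is never any doubt about which kind of string the code in the middle is holding.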

Python 2 does some things that are problematic for this: it makes str the "default" rather than unicode for things like string literals, it silently coerces str to unicode when you add the two, and it lets you call .encode() on an already-encoded string to double-encode it. As a result, a lot of python coders and python libraries out there have no idea what encodings they're designed to work with, yet in practice they only work with some particular encoding, since the str type leaves it to the programmer to manage the encoding. And you have to think about the encoding each time you use these libraries, since they don't support the unicode type themselves.
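
To make the silent coercion concrete: when Python 2 mixes str and unicode, or when you call .encode() on a str, it first decodes the str with the default codec, which is ASCII. The sketch below spells that hidden step out explicitly so that it behaves the same on Python 2.7 and Python 3. `implicit_coerce` is a hypothetical helper that only mimics the coercion; it is not a real Python API.

```python
# UTF-8 encoded bytes for u"España": b'Espa\xc3\xb1a'
encoded = u"Espa\xf1a".encode("utf-8")


def implicit_coerce(byte_string):
    """Mimic Python 2's hidden str -> unicode step, which decodes as ASCII."""
    return byte_string.decode("ascii")


# Pure-ASCII bytes survive the coercion, so the bug stays hidden...
ascii_ok = implicit_coerce(b"Spain")

# ...but any non-ASCII byte makes the same step blow up.
try:
    implicit_coerce(encoded)
    coercion_failed = False
    coercion_error = ""
except UnicodeDecodeError as exc:
    coercion_failed = True
    coercion_error = str(exc)

# Decoding explicitly with the real encoding is what actually works.
decoded = encoded.decode("utf-8")
```

This is also why the errors seem to appear only some of the time: the coercion succeeds for ASCII-only data and fails only once a non-ASCII character shows up.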

In your particular case, the first error tells you you're dealing with encoded UTF-8 data and trying to double-encode it, while the second tells you you're dealing with unencoded data. It looks like you may have both. You should really find and fix the source of the problem (I suspect it has to do with the silent coercion I mentioned above), but here's a hack that should fix it in the short term:

encoded_title = title
if isinstance(encoded_title, unicode):
    encoded_title = title.encode('utf-8')
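
The same check can be folded into a small helper that is applied right before quoting, so every URL is built the same way. This is only a sketch: `quote_title` is a hypothetical name, and the try/except import is there purely so the snippet also runs on Python 3, where `quote` lives in `urllib.parse`.

```python
try:
    from urllib import quote        # Python 2
    text_type = unicode             # noqa: F821 (only defined on Python 2)
except ImportError:
    from urllib.parse import quote  # Python 3
    text_type = str


def quote_title(title):
    """Percent-encode a title that may be unicode text or UTF-8 bytes."""
    if isinstance(title, text_type):
        # Unencoded text: encode exactly once.
        title = title.encode("utf-8")
    # Already-encoded bytes pass through untouched, so nothing is double-encoded.
    return quote(title)


from_text = quote_title(u"Espa\xf1a")
from_bytes = quote_title(u"Espa\xf1a".encode("utf-8"))
```

Both calls produce the same percent-encoded string, which is what the URL in the question needs regardless of which kind of string `title` happens to be at that point.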


If this is in fact a case of silent coercion biting you, you should be able to easily track down the problem using the excellent unicode-nazi tool:

python -Werror -municodenazi myprog.py

This will give you a traceback right at the point where unicode leaks into your non-unicode strings, instead of leaving you to troubleshoot the exception way down the road from the actual problem. See my answer on this related question for details.
