How to make the Python interpreter correctly handle non-ASCII characters in string operations?
Question
I have a string that looks like this:
6Â 918Â 417Â 712
The clear-cut way to trim this string (as I understand Python), assuming the string is in a variable called s, is:
s.replace('Â ', '')
That should do the trick. But of course it complains that the non-ASCII character '\xc2' in file blabla.py is not encoded.
I never quite understood how to switch between different encodings.
Here's the code. It really is just the same as above, but now it's in context. The file is saved as UTF-8 in Notepad and has the following header:
#!/usr/bin/python2.4
# -*- coding: utf-8 -*-
Code:
f = urllib.urlopen(url)
soup = BeautifulSoup(f)
s = soup.find('div', {'id':'main_count'})
#making a print 's' here goes well. it shows 6Â 918Â 417Â 712
s.replace('Â ','')
save_main_count(s)
It gets no further than s.replace...
Recommended answer
Python 2 uses ascii as the default encoding for source files, which means you must specify another encoding at the top of the file to use non-ASCII unicode characters in literals. Python 3 uses utf-8 as the default encoding for source files, so this is less of an issue.
See: http://docs.python.org/tutorial/interpreter.html#source-code-encoding
To enable utf-8 source encoding, put this on one of the first two lines of the file:
# -*- coding: utf-8 -*-
The above is the form given in the docs, but this also works:
# coding: utf-8
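As a minimal sketch of what the declaration enables, the file below contains a non-ASCII character inside a literal. The sample value is the string from the question (typed here with an ordinary space after each Â, which is an assumption about the real data), and the name cleaned is made up for illustration. It is written to run on Python 3, where the declaration is optional; on Python 2 the same file would be rejected without it.

```python
# -*- coding: utf-8 -*-
# The declaration above has to sit on line 1 or 2 of the file.
# Python 2 needs it as soon as a literal contains a non-ASCII character;
# Python 3 already assumes UTF-8, so there it is optional.

s = u"6Â 918Â 417Â 712"           # sample text from the question
cleaned = s.replace(u"Â ", u"")   # drop every "Â " pair
print(cleaned)
```

The file itself still has to be saved as UTF-8 by the editor; the declaration only tells the interpreter how to read the bytes that are already on disk.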
Additional considerations:
The source file must also be saved with the correct encoding in your text editor.
In Python 2, a unicode literal must be prefixed with u, as in s.replace(u"Â ", u""). In Python 3, plain quotes suffice. In Python 2 you can write from __future__ import unicode_literals to get the Python 3 behavior, but be aware that this affects the entire current module.
s.replace(u"Â ", u"") will also fail if s is not a unicode string.
string.replace returns a new string and does not edit in place, so make sure you use the return value as well.
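The last two points can be sketched together. This is an illustration for Python 3, not the asker's actual program: the names raw, text, mixed_ok and cleaned are hypothetical, and raw merely stands in for undecoded data such as a downloaded page. It shows that mixing byte strings with unicode text fails, and that the result of replace has to be assigned.

```python
# -*- coding: utf-8 -*-
# Stand-in for undecoded input: the sample text encoded to UTF-8 bytes.
raw = u"6Â 918Â 417Â 712".encode("utf-8")

# Mixing a byte string with unicode arguments fails. Python 3 raises
# TypeError here; Python 2 would attempt an implicit ASCII decode and
# raise UnicodeDecodeError instead.
try:
    raw.replace(u"Â ", u"")
    mixed_ok = True
except (TypeError, UnicodeDecodeError):
    mixed_ok = False

# Decode first so that both sides of the operation are unicode text.
text = raw.decode("utf-8")

# replace() does not modify text; this call's result is simply thrown away.
text.replace(u"Â ", u"")

# Keep the return value to get the cleaned string.
cleaned = text.replace(u"Â ", u"")
print(mixed_ok, cleaned)
```

Decoding once at the point where data enters the program and working with unicode strings from then on tends to avoid this class of error.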