How to make the Python interpreter correctly handle non-ASCII characters in string operations?


Question

I have a string that looks like this:

6Â 918Â 417Â 712

The clear-cut way to trim this string (as I understand Python), assuming the string is in a variable called s, is:

s.replace('Â ', '')

That should do the trick. But of course Python complains about the non-ASCII character '\xc2' in file blabla.py, saying that no encoding is declared.

I have never quite understood how to switch between different encodings.

Here is the code. It really is the same as above, but now in context. The file is saved as UTF-8 in Notepad and has the following header:

#!/usr/bin/python2.4
# -*- coding: utf-8 -*-

Code:

f = urllib.urlopen(url)

soup = BeautifulSoup(f)

s = soup.find('div', {'id':'main_count'})

#making a print 's' here goes well. it shows 6Â 918Â 417Â 712

s.replace('Â ','')

save_main_count(s)

It gets no further than s.replace...

Recommended answer

Python 2 uses ASCII as the default encoding for source files, which means you must declare another encoding at the top of the file before you can use non-ASCII characters in literals. Python 3 uses UTF-8 as the default encoding for source files, so this is less of an issue there.

See: http://docs.python.org/tutorial/interpreter.html#source-code-encoding

To enable UTF-8 source encoding, put this on one of the first two lines of the file:

# -*- coding: utf-8 -*-

The form above is the one shown in the docs, but this also works:

# coding: utf-8
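As a rough check that both spellings are recognized, the sketch below asks Python 3's standard tokenize module which encoding it detects for a few in-memory source snippets. The helper name declared_encoding and the sample snippets are made up for illustration, and tokenize.detect_encoding exists only in Python 3, so this does not run on the Python 2.4 interpreter from the question.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python 3 detects for a source file's raw bytes."""
    # detect_encoding looks for a BOM or a coding declaration in the
    # first two lines and falls back to UTF-8 when it finds neither.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# Hypothetical source files, given as raw bytes.
emacs_style = b"#!/usr/bin/python\n# -*- coding: utf-8 -*-\nprint('ok')\n"
short_style = b"# coding: utf-8\nprint('ok')\n"
latin1_style = b"# coding: latin-1\nprint('ok')\n"

for source in (emacs_style, short_style, latin1_style):
    print(declared_encoding(source))
```

The first two snippets report utf-8; the third reports iso-8859-1, the normalized name for latin-1.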

Additional considerations:

  • The source file must also be saved with the matching encoding in your text editor.

  • In Python 2, a unicode literal must be prefixed with u, as in s.replace(u"Â ", u""); in Python 3, plain quotes are enough. In Python 2 you can use from __future__ import unicode_literals to get the Python 3 behavior, but be aware that this affects the entire current module.

  • s.replace(u"Â ", u"") will also fail if s is not a unicode string.

  • string.replace returns a new string and does not edit in place, so make sure you use the return value as well.
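The points about unicode literals, unicode input, and the return value of replace can be sketched together. This is a minimal illustration, not the asker's actual program: the byte string raw is a hypothetical stand-in for scraped data, and the assumption that it is UTF-8 encoded is mine. The non-ASCII character is written as the escape \u00c2 so the example does not depend on how the file is saved.

```python
# -*- coding: utf-8 -*-
# Works on Python 2.6+ and Python 3; the __future__ import changes
# nothing on Python 3, where string literals are already unicode.
from __future__ import unicode_literals

# Hypothetical input: UTF-8 encoded bytes standing in for scraped data.
raw = "6\u00c2 918\u00c2 417\u00c2 712".encode("utf-8")

# Decode the bytes to a unicode string before operating on them.
s = raw.decode("utf-8")

# replace() returns a new string; s itself is left unchanged,
# so the result has to be assigned to a name.
cleaned = s.replace("\u00c2 ", "")

print(cleaned)
```

This prints 6918417712, while s still holds the original text.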
