Handling non-ASCII strings in Python


Question

Handling non-ASCII characters in Python is really confusing. Can anyone explain?

I'm trying to read a plain text file and replace all non-alphabetic characters with spaces.

I have a list of characters:

ignorelist = ('!', '-', '_', '(', ')', ',', '.', ':', ';', '"', '\'', '?', '#', '@', '$', '^', '&', '*', '+', '=', '{', '}', '[', ']', '\\', '|', '<', '>', '/', u'—')

and I replace each of them with a space by calling:

    for punc in ignorelist:
        token = token.replace(punc, ' ')

Notice that there is a non-ASCII character at the end of ignorelist: u'—'

Every time my code encounters that character, it crashes with:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position

I tried to declare the encoding by adding # -*- coding: utf-8 -*- at the top of the file, but it still doesn't work. Does anyone know why? Thanks!

Recommended answer

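You are using Python 2.x. When a unicode object and a plain str are mixed in one operation, Python 2 tries to convert the str to unicode automatically, using the ASCII codec by default. That conversion fails as soon as the str contains a non-ASCII byte, which is the UnicodeDecodeError you are seeing.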
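To make the error message concrete, the sketch below shows where byte 0xe2 comes from. It is an illustration rather than part of the original answer: the helper name `ascii_decode_fails` is hypothetical, and the code is written so that it should run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Illustration only: why the traceback mentions byte 0xe2.

em_dash = u'\u2014'               # the character at the end of ignorelist
raw = em_dash.encode('utf-8')     # the three bytes it occupies in a UTF-8 file
first_byte = bytearray(raw)[0]    # bytearray gives ints on Python 2 and 3


def ascii_decode_fails(data):
    """Return True if the byte string cannot be decoded as ASCII.

    This is the same decode step that Python 2 attempts implicitly
    when a plain str is combined with a unicode object.
    (Hypothetical helper, named for this example.)
    """
    try:
        data.decode('ascii')
    except UnicodeDecodeError:
        return True
    return False


print(hex(first_byte))            # first byte of the UTF-8 encoded dash
print(ascii_decode_fails(raw))    # the dash is not valid ASCII
print(ascii_decode_fails(b'abc')) # pure ASCII bytes decode fine
```

Any token read from a UTF-8 file that contains the dash carries these bytes, so the implicit ASCII decode stops at the first one.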

You shouldn't mix unicode and str objects. You can either stick to unicode:

ignorelist = (u'!', u'-', u'_', u'(', u')', u',', u'.', u':', u';', u'"', u'\'', u'?', u'#', u'@', u'$', u'^', u'&', u'*', u'+', u'=', u'{', u'}', u'[', u']', u'\\', u'|', u'<', u'>', u'/', u'—')

if not isinstance(token, unicode):
    token = token.decode('utf-8') # assumes you are using UTF-8
for punc in ignorelist:
    token = token.replace(punc, u' ')

or use only plain str objects (note the last element):

ignorelist = ('!', '-', '_', '(', ')', ',', '.', ':', ';', '"', '\'', '?', '#', '@', '$', '^', '&', '*', '+', '=', '{', '}', '[', ']', '\\', '|', '<', '>', '/', u'—'.encode('utf-8'))
# and other parts do not need to change

Because you encode u'—' into a str yourself, Python no longer has to attempt the conversion on its own.
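One thing to keep in mind with the byte-string approach is that the encoded dash only matches input stored in the same encoding. The following is a minimal sketch of that caveat, not code from the original answer; `EM_DASH_UTF8` and `replace_dash_bytes` are hypothetical names, and cp1252 is used purely as an example of a different encoding.

```python
# -*- coding: utf-8 -*-
# Sketch: byte-level replacement depends on the input's encoding.

EM_DASH_UTF8 = u'\u2014'.encode('utf-8')   # hypothetical constant


def replace_dash_bytes(token):
    """Replace the UTF-8 encoded em dash in a byte string with a space."""
    return token.replace(EM_DASH_UTF8, b' ')


utf8_token = u'a\u2014b'.encode('utf-8')
cp1252_token = u'a\u2014b'.encode('cp1252')  # same text, different bytes

print(repr(replace_dash_bytes(utf8_token)))    # dash replaced
print(repr(replace_dash_bytes(cp1252_token)))  # dash left untouched
```

This is one reason the unicode approach tends to be more robust: the text is decoded once, and replacements then operate on characters rather than on encoding-specific bytes.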

I suggest you use unicode throughout your program to avoid this kind of error. If that would be too much work, you can use the latter method. Either way, take care when you call functions in the standard library or in third-party modules, since they may return or expect a different string type.
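A common way to keep everything in unicode is to decode at the point where the file is read. The sketch below assumes the file is UTF-8 encoded and uses `io.open`, which accepts an `encoding` argument on both Python 2.6+ and Python 3. The names `IGNORE_CHARS`, `replace_ignored` and `read_and_clean` are hypothetical and not from the original post.

```python
# -*- coding: utf-8 -*-
# Sketch: decode once at the file boundary, then work only with unicode.
import io

# Same characters as ignorelist, written as one unicode string.
IGNORE_CHARS = u'!-_(),.:;"\'?#@$^&*+={}[]\\|<>/\u2014'


def replace_ignored(text):
    """Replace every character in IGNORE_CHARS with a space."""
    for punc in IGNORE_CHARS:
        text = text.replace(punc, u' ')
    return text


def read_and_clean(path, encoding='utf-8'):
    """Read a text file as unicode and clean it.

    The default encoding is an assumption; pass the file's real encoding.
    """
    with io.open(path, 'r', encoding=encoding) as handle:
        return replace_ignored(handle.read())
```

With this shape, no str value from the file ever meets a unicode literal, so the implicit ASCII conversion is never triggered.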

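# -*- coding: utf-8 -*- only tells Python that your source file is written in UTF-8 (without it, the non-ASCII character in the source would raise a SyntaxError). It does not change how str and unicode values are converted at runtime, which is why adding it did not fix the error.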
