有效地替换坏字符 [英] efficiently replace bad characters
本文介绍了有效地替换坏字符的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!
问题描述
我经常使用包含以下字符的 utf-8 文本:
I often work with utf-8 text containing characters like:
\xc2\x99
\xc2\x95
\xc2\x85
等
这些字符会混淆我使用的其他库,因此需要替换.
These characters confuse other libraries I work with so need to be replaced.
有什么有效的方法可以做到这一点,而不是:
What is an efficient way to do this, rather than:
text.replace('\xc2\x99', ' ').replace('\xc2\x85, '...')
推荐答案
正则表达式总是存在的;只需列出方括号内的所有违规字符,如下所示:
There is always regular expressions; just list all of the offending characters inside square brackets like so:
import re
print re.sub(r'[\xc2\x99]'," ","Hello\xc2There\x99")
这会打印:'Hello There',用空格替换不需要的字符.
This prints: 'Hello There ', with the unwanted characters replaced by spaces.
或者,如果每个字符都有不同的替换字符:
Alternately, if you have a different replacement character for each:
# remove annoying characters
chars = {
'\xc2\x82' : ',', # High code comma
'\xc2\x84' : ',,', # High code double comma
'\xc2\x85' : '...', # Tripple dot
'\xc2\x88' : '^', # High carat
'\xc2\x91' : '\x27', # Forward single quote
'\xc2\x92' : '\x27', # Reverse single quote
'\xc2\x93' : '\x22', # Forward double quote
'\xc2\x94' : '\x22', # Reverse double quote
'\xc2\x95' : ' ',
'\xc2\x96' : '-', # High hyphen
'\xc2\x97' : '--', # Double hyphen
'\xc2\x99' : ' ',
'\xc2\xa0' : ' ',
'\xc2\xa6' : '|', # Split vertical bar
'\xc2\xab' : '<<', # Double less than
'\xc2\xbb' : '>>', # Double greater than
'\xc2\xbc' : '1/4', # one quarter
'\xc2\xbd' : '1/2', # one half
'\xc2\xbe' : '3/4', # three quarters
'\xca\xbf' : '\x27', # c-single quote
'\xcc\xa8' : '', # modifier - under curve
'\xcc\xb1' : '' # modifier - under line
}
def replace_chars(match):
char = match.group(0)
return chars[char]
return re.sub('(' + '|'.join(chars.keys()) + ')', replace_chars, text)
这篇关于有效地替换坏字符的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!
查看全文