Python: removing extra special Unicode characters
Problem description
I'm working with some text in Python. It's already in Unicode format internally, but I would like to get rid of some special characters and replace them with more standard versions.
I currently have a line that looks like this, but it's getting ever more complex and I can see it will eventually cause more trouble.
tmp = infile.lower().replace(u"\u2018", "'").replace(u"\u2019", "'").replace(u"\u2013", "").replace(u"\u2026", "")
For example, \u2018 and \u2019 are the left and right single quotation marks. Those are somewhat acceptable, but for this type of text processing I don't think they are needed.
Things like \u2013 (EN DASH) and \u2026 (HORIZONTAL ELLIPSIS) are definitely not needed.
Is there a way to replace those quotation marks with simple standard quotes that won't break text processing with nltk, and to remove things like EN DASH and HORIZONTAL ELLIPSIS, without making a monster call like the one starting to rear its head in the sample code above?
Recommended answer
If your text is in English and you want to clean it up in a human-readable way, use the third-party module unidecode. It replaces a wide range of characters with their nearest ASCII look-alike. Just apply unidecode.unidecode() to any string to make the substitutions:
from unidecode import unidecode
clean = unidecode(u'Some text: \u2018\u2019\u2013\u03a9')
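If you would rather not add a dependency, or you want some characters deleted outright instead of replaced by a look-alike, the chain of replace() calls from the question can be collapsed into a single translation table. The sketch below is a standard-library alternative to unidecode, not part of it, and assumes Python 3, where str.translate() accepts a dict keyed by code point. The names REPLACEMENTS and clean_text are made up for illustration, and the table only covers the four characters mentioned in the question.

```python
# Hypothetical mapping for illustration; extend it with whatever
# characters actually show up in your text.
REPLACEMENTS = {
    0x2018: "'",   # LEFT SINGLE QUOTATION MARK -> ASCII apostrophe
    0x2019: "'",   # RIGHT SINGLE QUOTATION MARK -> ASCII apostrophe
    0x2013: None,  # EN DASH -> deleted
    0x2026: None,  # HORIZONTAL ELLIPSIS -> deleted
}


def clean_text(text):
    """Lowercase the text and apply the whole table in one pass."""
    return text.lower().translate(REPLACEMENTS)


sample = "It\u2019s \u2018Quoted\u2019 \u2013 and so on\u2026"
cleaned = clean_text(sample)
print(cleaned)
```

Mapping a code point to None deletes it, so adding another character to the cleanup is one more dictionary entry rather than one more chained call.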