To clean text belonging to different languages in Python

Question

I have a collection of text in which each sentence is entirely in English, Hindi, or Marathi, with an id of 0, 1, or 2 respectively attached to each sentence to indicate its language.

The text, irrespective of language, may contain HTML tags, punctuation, etc.

I can clean the English sentences using the code below:

import HTMLParser
import re
from nltk.corpus import stopwords
from collections import Counter
import pickle
from string import punctuation

#creating html_parser object 
html_parser = HTMLParser.HTMLParser()
cachedStopWords = set(stopwords.words("english"))

def cleanText(text,lang_id): 

    if lang_id == 0:
        str1 = ''.join(text).decode('iso-8859-1')


    else:
        str1 = ''.join(text).encode('utf-8')

    str1 = html_parser.unescape(str1)    
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', str1)
    #print "cleantext before puncts removed : " + cleantext
    clean_puncts = re.compile(r'[\s{}]+'.format(re.escape(punctuation)))
    cleantext = re.sub(clean_puncts,' ',cleantext)
    #print " cleantext after puncts removed : " + cleantext
    cleanest = cleantext.lower()
    if lang_id == 0:
        cleanertext = ' '.join([word for word in cleanest.split() if word not in cachedStopWords])       
        words = re.findall(r"[\w']+", cleanertext)
        words_final = [x.encode('UTF8') for x in words]
    else:
        words_final = cleanest.split()
    return words_final

but it gives me the following error for Hindi and Marathi text:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 104: ordinal not in range(128)

Also, it removes all the words.

The Hindi text looks like this:

&lt;p&gt;भारत का इतिहास काफी समृद्ध एवं विस्तृत है। &lt;/p&gt;

How can I do the same for Hindi or Marathi text?

Answer

Without the full text file, any solution we can provide will only be a shot in the dark.

Firstly, check the type of the strings you're passing into cleanText(): are they really unicode, or are they byte strings? See "byte string vs. unicode string. Python".
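
A quick sanity check along these lines (a Python 2 sketch; ensure_unicode is a hypothetical helper, and the corpus is assumed to be UTF-8 encoded):

def ensure_unicode(text):
    # In Python 2, plain open()/str gives byte strings; calling .encode('utf-8')
    # on one of them triggers an implicit ascii decode first, which is exactly
    # what raises "UnicodeDecodeError: 'ascii' codec can't decode byte ...".
    if isinstance(text, str):          # byte string
        return text.decode('utf-8')    # assumption: the file is UTF-8 encoded
    return text                        # already unicode, nothing to do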

So if you've read your file properly and ensured that everything is unicode, there should be no problem with how you manipulate the strings (in either Python 2 or 3). The following example confirms this:

>>> from HTMLParser import HTMLParser
>>> hp = HTMLParser()
>>> text = u"&lt;p&gt;भारत का इतिहास काफी समृद्ध एवं विस्तृत है। &lt;/p&gt;"
>>> hp.unescape(text)
u'<p>\u092d\u093e\u0930\u0924 \u0915\u093e \u0907\u0924\u093f\u0939\u093e\u0938 \u0915\u093e\u092b\u0940 \u0938\u092e\u0943\u0926\u094d\u0927 \u090f\u0935\u0902 \u0935\u093f\u0938\u094d\u0924\u0943\u0924 \u0939\u0948\u0964 </p>'
>>> print hp.unescape(text)
<p>भारत का इतिहास काफी समृद्ध एवं विस्तृत है। </p>
>>> hp.unescape(text).split()
[u'<p>\u092d\u093e\u0930\u0924', u'\u0915\u093e', u'\u0907\u0924\u093f\u0939\u093e\u0938', u'\u0915\u093e\u092b\u0940', u'\u0938\u092e\u0943\u0926\u094d\u0927', u'\u090f\u0935\u0902', u'\u0935\u093f\u0938\u094d\u0924\u0943\u0924', u'\u0939\u0948\u0964', u'</p>']
>>> print " ".join(hp.unescape(text).split())
<p>भारत का इतिहास काफी समृद्ध एवं विस्तृत है। </p>

Even with regex manipulation, there's no problem:

>>> import re
>>> from string import punctuation
>>> p = re.compile(r'[\s{}]+'.format(re.escape(punctuation)))
>>> new_text = " ".join(hp.unescape(text).split())
>>> re.sub(p,' ', new_text)
u' p \u092d\u093e\u0930\u0924 \u0915\u093e \u0907\u0924\u093f\u0939\u093e\u0938 \u0915\u093e\u092b\u0940 \u0938\u092e\u0943\u0926\u094d\u0927 \u090f\u0935\u0902 \u0935\u093f\u0938\u094d\u0924\u0943\u0924 \u0939\u0948\u0964 p '
>>> print re.sub(p,' ', new_text)
 p भारत का इतिहास काफी समृद्ध एवं विस्तृत है। p 

Take a look at "How to stop the pain" and follow the best practices in that talk; they will most likely resolve your unicode problems. Slides at http://nedbatchelder.com/text/unipain.html.

Do also take a look at this: https://www.youtube.com/watch?v=Mx70n1dL534 from PyCon14 (but only applicable to Python 2.x).

Opening the file as UTF-8 like this might resolve your problem too:

import io
with io.open('myfile.txt', 'r', encoding='utf8') as fin:
    for line in fin:
        clean_text(line)
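
If you also want to write the cleaned output back to disk, the same idea applies on the way out (a sketch, assuming clean_text returns a list of unicode tokens; cleaned.txt is just an illustrative filename):

import io

with io.open('myfile.txt', 'r', encoding='utf8') as fin, \
     io.open('cleaned.txt', 'w', encoding='utf8') as fout:
    for line in fin:
        # io.open handles the unicode -> UTF-8 bytes conversion on write
        fout.write(u' '.join(clean_text(line)) + u'\n')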

If STDIN and STDOUT are giving you problems, see https://daveagp.wordpress.com/2010/10/26/what-a-character/
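
A common Python 2 workaround when piping or redirecting output (a sketch, not taken from the linked post) is to wrap sys.stdout so that unicode text is encoded as UTF-8 instead of ascii:

import sys
import codecs

# When output is piped, sys.stdout.encoding is often None, and printing
# Devanagari text then raises UnicodeEncodeError; force UTF-8 in that case.
if sys.stdout.encoding is None:
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

print u'भारत का इतिहास काफी समृद्ध एवं विस्तृत है।'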

See also:

  • What's the difference between io.open() and os.open() on Python?
  • Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?
  • Should I use "camel case" or underscores in python?
  • Setting the correct encoding when piping stdout in Python
  • https://stackoverflow.com/a/28381060/610569
