To clean text belonging to different languages in Python
Question
I have a collection of text whose sentences are entirely in English, Hindi, or Marathi, with an id of 0, 1, or 2 respectively attached to each sentence indicating the language of the text.
Regardless of language, the text may contain HTML tags, punctuation, etc.
I could clean the English sentences using my code below:
import HTMLParser
import re
from nltk.corpus import stopwords
from collections import Counter
import pickle
from string import punctuation

#creating html_parser object
html_parser = HTMLParser.HTMLParser()
cachedStopWords = set(stopwords.words("english"))

def cleanText(text, lang_id):
    if lang_id == 0:
        str1 = ''.join(text).decode('iso-8859-1')
    else:
        str1 = ''.join(text).encode('utf-8')
    str1 = html_parser.unescape(str1)
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', str1)
    #print "cleantext before puncts removed : " + cleantext
    clean_puncts = re.compile(r'[\s{}]+'.format(re.escape(punctuation)))
    cleantext = re.sub(clean_puncts, ' ', cleantext)
    #print " cleantext after puncts removed : " + cleantext
    cleanest = cleantext.lower()
    if lang_id == 0:
        cleanertext = ' '.join([word for word in cleanest.split() if word not in cachedStopWords])
        words = re.findall(r"[\w']+", cleanertext)
        words_final = [x.encode('UTF8') for x in words]
    else:
        words_final = cleanest.split()
    return words_final
but it gives me the following error for Hindi and Marathi text:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 104: ordinal not in range(128)
Also, it removes all the words.
The Hindi text looks like:
<p>भारत का इतिहास काफी समृद्ध एवं विस्तृत है। </p>
How can I do the same for Hindi or Marathi text?
Answer
Without the full text file, any solution we can provide will only be a shot in the dark.
Firstly, check the type of the strings you're reading into cleanText(): are they really unicode, or are they byte strings? See byte string vs. unicode string in Python.
So if you've read your file properly and ensured that everything is unicode, there should be no problem in how you manage the strings (in both python2 and 3). The following example confirms this:
>>> from HTMLParser import HTMLParser
>>> hp = HTMLParser()
>>> text = u"<p>भारत का इतिहास काफी समृद्ध एवं विस्तृत है। </p>"
>>> hp.unescape(text)
u'<p>\u092d\u093e\u0930\u0924 \u0915\u093e \u0907\u0924\u093f\u0939\u093e\u0938 \u0915\u093e\u092b\u0940 \u0938\u092e\u0943\u0926\u094d\u0927 \u090f\u0935\u0902 \u0935\u093f\u0938\u094d\u0924\u0943\u0924 \u0939\u0948\u0964 </p>'
>>> print hp.unescape(text)
<p>भारत का इतिहास काफी समृद्ध एवं विस्तृत है। </p>
>>> hp.unescape(text).split()
[u'<p>\u092d\u093e\u0930\u0924', u'\u0915\u093e', u'\u0907\u0924\u093f\u0939\u093e\u0938', u'\u0915\u093e\u092b\u0940', u'\u0938\u092e\u0943\u0926\u094d\u0927', u'\u090f\u0935\u0902', u'\u0935\u093f\u0938\u094d\u0924\u0943\u0924', u'\u0939\u0948\u0964', u'</p>']
>>> print " ".join(hp.unescape(text).split())
<p>भारत का इतिहास काफी समृद्ध एवं विस्तृत है। </p>
Even with regex manipulation, there's no problem:
>>> import re
>>> from string import punctuation
>>> p = re.compile(r'[\s{}]+'.format(re.escape(punctuation)))
>>> new_text = " ".join(hp.unescape(text).split())
>>> re.sub(p,' ', new_text)
u' p \u092d\u093e\u0930\u0924 \u0915\u093e \u0907\u0924\u093f\u0939\u093e\u0938 \u0915\u093e\u092b\u0940 \u0938\u092e\u0943\u0926\u094d\u0927 \u090f\u0935\u0902 \u0935\u093f\u0938\u094d\u0924\u0943\u0924 \u0939\u0948\u0964 p '
>>> print re.sub(p,' ', new_text)
p भारत का इतिहास काफी समृद्ध एवं विस्तृत है। p
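Putting these pieces together, here is a hedged Python 3 sketch of the whole cleaning step, working for English and Devanagari alike. It is not the original code: html.unescape replaces Python 2's HTMLParser.unescape, and the stopword set here is a tiny illustrative stand-in for nltk's stopwords.words("english"):

```python
import html
import re
from string import punctuation

# Tiny illustrative stopword set; in practice use nltk's English stopwords.
ENGLISH_STOPWORDS = {"the", "is", "a", "an", "of", "and"}

TAG_RE = re.compile(r"<.*?>")
PUNCT_RE = re.compile(r"[\s{}]+".format(re.escape(punctuation)))

def clean_text(text, lang_id):
    """Strip HTML, punctuation and (for English) stopwords; return a word list."""
    text = html.unescape(text)      # decode HTML entities
    text = TAG_RE.sub("", text)     # drop HTML tags
    text = PUNCT_RE.sub(" ", text)  # collapse ASCII punctuation and whitespace
    words = text.lower().split()    # lower() is a no-op on Devanagari
    if lang_id == 0:                # English only: remove stopwords
        words = [w for w in words if w not in ENGLISH_STOPWORDS]
    return words

print(clean_text("<p>भारत का इतिहास</p>", 1))
print(clean_text("<p>The history of India</p>", 0))
```

Because the text is unicode throughout, no decode/encode branching on lang_id is needed; note that string.punctuation is ASCII-only, so Devanagari punctuation such as the danda (।) is left in place, as in the session above.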
Take a look at "How to stop the pain"; following the best practices in this talk will most likely resolve your unicode problems. Slides at http://nedbatchelder.com/text/unipain.html.
Do also look at this talk: https://www.youtube.com/watch?v=Mx70n1dL534 from PyCon14 (but only applicable to python2.x).
Opening a utf8 file like this might resolve your problem too:
import io
with io.open('myfile.txt', 'r', encoding='utf8') as fin:
    for line in fin:
        clean_text(line)
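As a quick sanity check that this approach round-trips Devanagari safely, the sketch below writes a line out as UTF-8 and reads it back the same way (the file path is created in a temp directory purely for illustration; on Python 3 the built-in open also accepts encoding=, so io.open mainly matters for Python 2 compatibility):

```python
import io
import os
import tempfile

# Write a Devanagari line as UTF-8, then read it back with the same encoding.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
with io.open(path, "w", encoding="utf8") as fout:
    fout.write(u"भारत का इतिहास\n")

with io.open(path, "r", encoding="utf8") as fin:
    lines = [line.strip() for line in fin]

print(lines[0])  # the text survives the round trip intact
```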
If STDIN and STDOUT are giving you problems, see https://daveagp.wordpress.com/2010/10/26/what-a-character/
See also:
- What's the difference between io.open() and os.open() on Python?
- Why should we NOT use sys.setdefaultencoding("utf-8") in a py script?
- Should I use "camel case" or underscores in python?
- Setting the correct encoding when piping stdout in Python
- https://stackoverflow.com/a/28381060/610569