Unable to process accented words using NLTK tokeniser

Problem description

I'm trying to compute the frequencies of words in a UTF-8 encoded text file with the following code. Although the file content is tokenized successfully, my program is not able to read the accented characters correctly when looping through the words.

import csv
import operator  # used below to sort the frequency dict
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

print "computing word frequency..."
if lang == "fr":
    stop = stopwords.words("french")
    stop = [word.encode("utf-8") for word in stop]
    stop.append("les")
    stop.append("a")
elif lang == "en":
    stop = stopwords.words("english")


rb = csv.reader(open(path+file_name))
wb = csv.writer(open('results/all_words_'+file_name,'wb'))

tokenizer = RegexpTokenizer(r'\w+')

word_dict = {}

i = 0

for row in rb:
    i += 1
    if i == 5:
        break
    text = tokenizer.tokenize(row[0].lower())
    text = [j for j in text if j not in stop]
    #print text
    for doc in text:
        try:

            try:
                word_dict[doc] += 1

            except:

                word_dict[doc] = 1
        except:
            print row[0]
            print " ".join(text)




word_dict2 = sorted(word_dict.iteritems(), key=operator.itemgetter(1), reverse=True)

if lang == "English":
    for item in word_dict2:
        wb.writerow([item[0],stem(item[0]),item[1]])
else:
    for item in word_dict2:
        wb.writerow([item[0],item[1]])

print "Finished"

Input text file:

rt annesorose envie crêpes
envoyé jerrylee bonjour monde dimanche crepes dimanche
The output written in a file is destroying certain words.
bonnes crepes tour nouveau vélo
aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent

Output:

crepes,2
dimanche,2
rt,1
nouveau,1
envie,1
v�,1 
jerrylee,1
cleantext,1
lo,1
bonnes,1
tour,1
crêpes,1
monde,1
bonjour,1
annesorose,1
envoy�,1

envoy� is envoyé in the actual file.

How can I correct this problem with accented characters?

Recommended answer

If you're using Python 2.x, reset the default encoding to 'utf8':

import sys
reload(sys)
sys.setdefaultencoding('utf8')

Alternatively, you can use the ucsv module; see General Unicode/UTF-8 support for csv files in Python 2.6.
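
If you stick with the plain csv module, one common pattern is to decode each cell to unicode after csv.reader has parsed the raw bytes. A minimal sketch of that idea (the file name someutf8.csv is only illustrative):

import csv

def unicode_csv_reader(utf8_file, **kwargs):
    # csv.reader in Python 2 yields byte strings; decode every cell to unicode
    for row in csv.reader(utf8_file, **kwargs):
        yield [cell.decode('utf-8') for cell in row]

with open('someutf8.csv', 'rb') as f:  # illustrative file name
    for row in unicode_csv_reader(f):
        print u' '.join(row)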

Or use io.open():

$ echo """rt annesorose envie crêpes
> envoyé jerrylee bonjour monde dimanche crepes dimanche
> The output written in a file is destroying certain words.
> bonnes crepes tour nouveau vélo
> aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent""" > someutf8.txt
$ python
>>> import io, csv
>>> text = io.open('someutf8.txt', 'r', encoding='utf8').read().split('\n')
>>> for row in text:
...     print row
... 
rt annesorose envie crêpes
envoyé jerrylee bonjour monde dimanche crepes dimanche
The output written in a file is destroying certain words.
bonnes crepes tour nouveau vélo
aime crepe soleil ça fera bien recharger batteries vu jours hard annoncent
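
Applied to the counting loop from the question, a minimal sketch that stays in unicode end to end (assuming the same plain-text someutf8.txt and French stopwords; adapt the input handling back to csv as needed):

import io
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

stop = set(stopwords.words('french'))   # keep the stopwords as unicode, do not .encode() them
tokenizer = RegexpTokenizer(r'\w+')

word_dict = {}
with io.open('someutf8.txt', 'r', encoding='utf8') as f:
    for line in f:
        for token in tokenizer.tokenize(line.lower()):
            if token not in stop:
                word_dict[token] = word_dict.get(token, 0) + 1

for word, freq in sorted(word_dict.items(), key=lambda item: item[1], reverse=True):
    print word, freq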

Lastly, rather than writing such a complex reading-and-counting routine yourself, simply use FreqDist in NLTK; see section 3.1 of http://www.nltk.org/book/ch01.html
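
For example, a minimal FreqDist sketch under the same assumptions (NLTK 3, where FreqDist behaves like a Counter, reading the same someutf8.txt):

import io
from nltk import word_tokenize, FreqDist

text = io.open('someutf8.txt', 'r', encoding='utf8').read()
fdist = FreqDist(word_tokenize(text))
for word, freq in fdist.most_common(5):
    print word, freq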

Or, personally, I prefer collections.Counter:

$ python
>>> import io
>>> from nltk import word_tokenize
>>> text = io.open('someutf8.txt', 'r', encoding='utf8').read()
>>> from collections import Counter
>>> Counter(word_tokenize(text))
Counter({u'crepes': 2, u'dimanche': 2, u'fera': 1, u'certain': 1, u'is': 1, u'bonnes': 1, u'v\xe9lo': 1, u'batteries': 1, u'envoy\xe9': 1, u'vu': 1, u'file': 1, u'in': 1, u'The': 1, u'rt': 1, u'jerrylee': 1, u'destroying': 1, u'bien': 1, u'jours': 1, u'.': 1, u'written': 1, u'annesorose': 1, u'annoncent': 1, u'nouveau': 1, u'envie': 1, u'hard': 1, u'cr\xeapes': 1, u'\xe7a': 1, u'monde': 1, u'words': 1, u'bonjour': 1, u'a': 1, u'crepe': 1, u'soleil': 1, u'tour': 1, u'aime': 1, u'output': 1, u'recharger': 1})
>>> myFreqDist = Counter(word_tokenize(text))
>>> for word, freq in myFreqDist.items():
...     print word, freq
... 
fera 1
crepes 2
certain 1
is 1
bonnes 1
vélo 1
batteries 1
envoyé 1
vu 1
file 1
in 1
The 1
rt 1
jerrylee 1
destroying 1
bien 1
jours 1
. 1
written 1
dimanche 2
annesorose 1
annoncent 1
nouveau 1
envie 1
hard 1
crêpes 1
ça 1
monde 1
words 1
bonjour 1
a 1
crepe 1
soleil 1
tour 1
aime 1
output 1
recharger 1
