将nltk.FreqDist单词分成两个列表? [英] Separating nltk.FreqDist words into two lists?
问题描述
我有一系列文本,它们是自定义WebText类的实例.每个文本都是一个对象,具有一个与之相关的等级(-10至+10)和一个单词计数(nltk.FreqDist):
I have a series of texts that are instances of a custom WebText class. Each text is an object that has a rating (-10 to +10) and a word count (nltk.FreqDist) associated with it:
>>trainingTexts = [WebText('train1.txt'), WebText('train2.txt'), WebText('train3.txt'), WebText('train4.txt')]
>>trainingTexts[1].rating
10
>>trainingTexts[1].freq_dist
<FreqDist: 'the': 60, ',': 49, 'to': 38, 'is': 34,...>
您现在如何获得两个列表(或词典),其中包含专门用于正评分文本的每个单词(trainingText [].rating> 0),另一个列表包含专门用于负数文本的每个单词(trainingText []).rating< 0).并让每个列表包含所有正面或负面文本的总字数,以便您得到如下内容:
How can you now get two lists (or dictionaries) containing every word used exclusively in the positively rated texts (trainingText[].rating>0), and another list containing every word used exclusively in the negative texts (trainingText[].rating<0). And have each list contain the total word counts for all the positive or negative texts, so that you get something like this:
>>only_positive_words
[('sky', 10), ('good', 9), ('great', 2)...]
>>only_negative_words
[('earth', 10), ('ski', 9), ('food', 2)...]
我考虑使用集合,因为集合包含唯一的实例,但是我看不到如何使用nltk.FreqDist做到这一点,最重要的是,集合不会按字频排序.有什么想法吗?
I considered using sets, as sets contain unique instances, but i can't see how this can be done with nltk.FreqDist, and on top of that, a set wouldn't be ordered by word frequency. Any ideas?
推荐答案
好吧,假设您是出于测试目的而开始的:
Ok, let's say you start with this for the purposes of testing:
class Rated(object):
def __init__(self, rating, freq_dist):
self.rating = rating
self.freq_dist = freq_dist
a = Rated(5, nltk.FreqDist('the boy sees the dog'.split()))
b = Rated(8, nltk.FreqDist('the cat sees the mouse'.split()))
c = Rated(-3, nltk.FreqDist('some boy likes nothing'.split()))
trainingTexts = [a,b,c]
然后您的代码如下:
from collections import defaultdict
from operator import itemgetter
# dictionaries for keeping track of the counts
pos_dict = defaultdict(int)
neg_dict = defaultdict(int)
for r in trainingTexts:
rating = r.rating
freq = r.freq_dist
# choose the appropriate counts dict
if rating > 0:
partition = pos_dict
elif rating < 0:
partition = neg_dict
else:
continue
# add the information to the correct counts dict
for word,count in freq.iteritems():
partition[word] += count
# Turn the counts dictionaries into lists of descending-frequency words
def only_list(counts, filtered):
return sorted(filter(lambda (w,c): w not in filtered, counts.items()), \
key=itemgetter(1), \
reverse=True)
only_positive_words = only_list(pos_dict, neg_dict)
only_negative_words = only_list(neg_dict, pos_dict)
结果是:
>>> only_positive_words
[('the', 4), ('sees', 2), ('dog', 1), ('cat', 1), ('mouse', 1)]
>>> only_negative_words
[('nothing', 1), ('some', 1), ('likes', 1)]
这篇关于将nltk.FreqDist单词分成两个列表?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!