Counting bi-gram frequencies
Question
I've written a piece of code that counts word frequencies and inserts them into an ARFF file for use with Weka. I'd like to alter it so that it counts bi-gram frequencies, i.e. pairs of words instead of single words, but my attempts so far have been unsuccessful.
I realise there's a lot to look at, but any help on this is greatly appreciated. Here's my code:
import re
import nltk
# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')
# create list of lower case words
word_list = re.split(r'\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
word_list = [punctuation.sub("", word) for word in word_list]
word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]
# create dictionary of word:frequency pairs
freq_dic = {}
for word in word_list2:
    # form dictionary
    try:
        freq_dic[word] += 1
    except:
        freq_dic[word] = 1
print '-'*30
print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs
freq_list2 = [(val, key) for key, val in freq_dic.items()]
# sort by val or frequency
freq_list2.sort(reverse=True)
freq_list3 = list(freq_list2)
# display result as top 10 most frequent words
freq_list4 =[]
freq_list4=freq_list3[:10]
words = []
for item in freq_list4:
    a = str(item[1])
    a = a.lower()
    words.append(a)
f = open(filename)
newlist = []
for line in f:
    line = punctuation.sub("", line)
    line = line.lower()
    newlist.append(line)
f2 = open('Lines.txt','w')
newlist2= []
for line in newlist:
    line = line.split()
    newlist2.append(line)
    f2.write(str(line))
    f2.write("\n")
print newlist2
# ARFF Creation
arff = open('output.arff','w')
arff.write('@RELATION wordfrequency\n\n')
for word in words:
    arff.write('@ATTRIBUTE ')
    arff.write(str(word))
    arff.write(' numeric\n')
arff.write('@ATTRIBUTE class {endofworld, notendofworld}\n\n')
arff.write('@DATA\n')
# Counting word frequencies for each verse
for line in newlist2:
    word_occurrences = str("")
    for word in words:
        matches = int(0)
        for item in line:
            if str(item) == str(word):
                matches = matches + int(1)
            else:
                continue
        word_occurrences = word_occurrences + str(matches) + ","
    word_occurrences = word_occurrences + "endofworld"
    arff.write(word_occurrences)
    arff.write("\n")
print words
Recommended answer
This should get you started:
def bigrams(words):
    wprev = None
    for w in words:
        yield (wprev, w)
        wprev = w
Note that the first bigram is (None, w1), where w1 is the first word, so you get a special bigram that marks start-of-text. If you also want an end-of-text bigram, add yield (wprev, None) after the loop.
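As a rough sketch of how the generator could be plugged into the frequency counting from the question, the bigrams can be fed to collections.Counter. The names mark_end, top_bigrams and count_in_line, and the sample sentences, are assumptions made for illustration and do not appear in the original code. The guard that skips the end marker for empty input is also an addition.

```python
from collections import Counter


def bigrams(words, mark_end=False):
    """Yield (previous_word, word) pairs; the first pair is (None, first_word)."""
    wprev = None
    for w in words:
        yield (wprev, w)
        wprev = w
    if mark_end and wprev is not None:
        # Optional end-of-text bigram (hypothetical flag; skipped for empty input).
        yield (wprev, None)


def top_bigrams(words, n=10):
    """Return the n most frequent bigrams as a list of ((w1, w2), count)."""
    return Counter(bigrams(words)).most_common(n)


def count_in_line(line_words, wanted):
    """Count how often each wanted bigram occurs in one tokenised line."""
    counts = Counter(bigrams(line_words))
    # A Counter returns 0 for bigrams it has never seen.
    return [counts[bg] for bg in wanted]


# Hypothetical sample data standing in for word_list2 and one entry of newlist2.
sample = "the end of the world is the end".split()
top = top_bigrams(sample, n=3)
wanted = [bg for bg, _ in top]
row = count_in_line("the end is near".split(), wanted)
print(top)
print(row)
```

In the question's script, top_bigrams would take the place of the freq_dic loop, and count_in_line would produce the numbers for each @DATA row. A bigram tuple still needs to be turned into a single attribute name before it is written after @ATTRIBUTE, for example by joining the two words with an underscore, and the None start marker needs its own placeholder or should be filtered out.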