Counting bi-gram frequencies

Published: 2020/5/18 0:48:16 · Tags: python, nlp, arff

Problem description

I've written a piece of code that counts word frequencies and writes them to an ARFF file for use with Weka. I'd like to alter it so that it counts bi-gram frequencies, i.e. pairs of adjacent words instead of single words, but my attempts so far have been unsuccessful.

I realise there's a lot to look at, but any help on this is greatly appreciated. Here's my code:

    import re
    import nltk

    # Quran subset
    filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')

    # create list of lower case words
    word_list = re.split('\s+', file(filename).read().lower())
    print 'Words in text:', len(word_list)
    # punctuation and numbers to be removed
    punctuation = re.compile(r'[-.?!,":;()|0-9]')
    word_list = [punctuation.sub("", word) for word in word_list]

    word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]



    # create dictionary of word:frequency pairs
    freq_dic = {}


    for word in word_list2:

        # form dictionary
        try: 
            freq_dic[word] += 1
        except: 
            freq_dic[word] = 1


    print '-'*30

    print "sorted by highest frequency first:"
    # create list of (val, key) tuple pairs
    freq_list2 = [(val, key) for key, val in freq_dic.items()]
    # sort by val or frequency
    freq_list2.sort(reverse=True)
    freq_list3 = list(freq_list2)
    # display result as top 10 most frequent words
    freq_list4 =[]
    freq_list4=freq_list3[:10]

    words = []

    for item in freq_list4:
        a = str(item[1])
        a = a.lower()
        words.append(a)



    f = open(filename)

    newlist = []

    for line in f:
        line = punctuation.sub("", line)
        line = line.lower()
        newlist.append(line)

    f2 = open('Lines.txt','w')

    newlist2= []
    for line in newlist:
        line = line.split()
        newlist2.append(line)
        f2.write(str(line))
        f2.write("\n")


    print newlist2

    # ARFF Creation

    arff = open('output.arff','w')
    arff.write('@RELATION wordfrequency\n\n')
    for word in words:
        arff.write('@ATTRIBUTE ')
        arff.write(str(word))
        arff.write(' numeric\n')

    arff.write('@ATTRIBUTE class {endofworld, notendofworld}\n\n')
    arff.write('@DATA\n')
    # Counting word frequencies for each verse
    for line in newlist2:
        word_occurrences = str("")
        for word in words:
            matches = int(0)
            for item in line:
                if str(item) == str(word):
                    matches = matches + int(1)
                else:
                    continue
            word_occurrences = word_occurrences + str(matches) + ","
        word_occurrences = word_occurrences + "endofworld"
        arff.write(word_occurrences)
        arff.write("\n")

    print words
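
One possible way to adapt the counting step to bi-grams is sketched below. It is written for Python 3 (the code above is Python 2) and is only a sketch under stated assumptions, not a drop-in replacement. The helper names `tokenize`, `bigrams`, `top_bigrams`, `arff_attribute` and `arff_rows` are hypothetical. Bi-grams are formed within each line so that pairs never span two verses, and the two words are joined with an underscore for the attribute name, on the assumption that an unquoted ARFF attribute name should not contain a space. Stopword filtering is left as an optional argument; note that dropping stopwords before pairing makes words adjacent that were not adjacent in the text.

```python
import re
from collections import Counter

# Same punctuation/digit pattern as in the question's code.
PUNCTUATION = re.compile(r'[-.?!,":;()|0-9]')


def tokenize(text, stopwords=frozenset()):
    """Lowercase, strip punctuation and digits, split on whitespace."""
    cleaned = PUNCTUATION.sub("", text.lower())
    return [w for w in cleaned.split() if w not in stopwords]


def bigrams(tokens):
    """Pair each token with its successor: [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(tokens, tokens[1:]))


def top_bigrams(lines, n=10, stopwords=frozenset()):
    """Return the n most frequent bi-grams over all lines."""
    counts = Counter()
    for line in lines:
        counts.update(bigrams(tokenize(line, stopwords)))
    return [pair for pair, _ in counts.most_common(n)]


def arff_attribute(pair):
    """Build an @ATTRIBUTE line; the underscore join is an assumption."""
    return "@ATTRIBUTE " + "_".join(pair) + " numeric"


def arff_rows(lines, features, label="endofworld", stopwords=frozenset()):
    """One comma-separated @DATA row per line: a count per feature, then the label."""
    rows = []
    for line in lines:
        line_counts = Counter(bigrams(tokenize(line, stopwords)))
        values = [str(line_counts[pair]) for pair in features]
        rows.append(",".join(values + [label]))
    return rows
```

To use it, read the input file into a list of lines, call `top_bigrams(lines)` to pick the features, write one `arff_attribute(pair)` line per feature after the `@RELATION` header, and write the strings returned by `arff_rows(lines, features)` after `@DATA`. A set of stopwords such as `set(nltk.corpus.stopwords.words('english'))` can be passed as `stopwords` if the filtering from the original code is wanted.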
