Counting (and writing) word frequencies for each line within text file

Views: 132 | Published: 2020/5/4 5:03:08 | Tags: python, loops, nltk, filewriter

Question

First time posting on Stack Overflow; previous questions have always been enough to solve my problems. My main problem here is the logic, so even a pseudocode answer would be great.

I'm using Python to read in data from each line of a text file, in the format:

This is a tweet captured from the twitter api #hashtag http://url.com/site

Using nltk, I can tokenize by line and then use reader.sents() to iterate through the lines:

reader = TaggedCorpusReader(filecorpus, r'.*\.txt', sent_tokenizer=LineTokenizer())

reader.sents()[:10]

But I would like to count the frequency of certain 'hot words' (stored in an array or similar) per line, then write the counts back to a text file. If I used reader.words(), I could count the frequency of 'hot words' across the entire text, but I'm looking for the count per line (or per 'sentence' in this case).

Ideally, something like:

hotwords = (['tweet'], ['twitter'])

for each line
     tokenize into words.
     for each word in line 
         if word is equal to hotword[1], hotword1 count ++
         if word is equal to hotword[2], hotword2 count ++
     at end of line, for each hotword[index]
         filewrite count,
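
The pseudocode above could be sketched in Python 3 roughly as follows. This is a minimal sketch using only the standard library rather than the NLTK reader from the question. The function names `count_hotwords` and `write_counts`, the regex-based word splitting, and the comma-separated output layout are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hot words taken from the example in the question.
HOTWORDS = ("tweet", "twitter")


def count_hotwords(line, hotwords=HOTWORDS):
    """Return a {hotword: count} dict for a single line of text."""
    # Lowercase, then split into runs of word characters.
    # This is a crude stand-in for a real tokenizer (assumption).
    tokens = re.findall(r"\w+", line.lower())
    counts = Counter(tokens)
    # A Counter returns 0 for missing keys, so every hot word gets a value.
    return {word: counts[word] for word in hotwords}


def write_counts(in_path, out_path, hotwords=HOTWORDS):
    """Write one comma-separated row of hot word counts per input line."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            counts = count_hotwords(line, hotwords)
            dst.write(",".join(str(counts[w]) for w in hotwords) + "\n")
```

Calling `write_counts("tweets.txt", "counts.txt")` (both file names hypothetical) would produce one row per tweet, with the columns in the same order as `HOTWORDS`.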

Also, I'm not too worried about URLs becoming broken (using WordPunctTokenizer would remove the punctuation, and that's not an issue).
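
To see what happens to a URL, the sketch below applies the regular expression that the NLTK documentation gives for WordPunctTokenizer (`\w+|[^\w\s]+`) using plain `re`, so it runs without NLTK installed. With that pattern, punctuation is split off into separate tokens rather than dropped, so a filtering step would still be needed if punctuation tokens are unwanted. The helper name `word_punct_tokens` is hypothetical.

```python
import re

# Pattern documented for NLTK's WordPunctTokenizer: runs of word
# characters, or runs of non-space, non-word characters (punctuation).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokens(text):
    """Split text into word tokens and punctuation tokens."""
    return WORD_PUNCT.findall(text)


print(word_punct_tokens("#hashtag http://url.com/site"))
# ['#', 'hashtag', 'http', '://', 'url', '.', 'com', '/', 'site']
```

Because the hot words are plain words, the extra punctuation tokens do not affect the per-line counts.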

Any useful pointers (including pseudocode or links to other similar code) would be great.

---- EDIT ------------------

Ended up doing something like this:

import nltk
from nltk.corpus.reader import TaggedCorpusReader
from nltk.tokenize import LineTokenizer
#from nltk.tokenize import WordPunctTokenizer
from collections import defaultdict

# Create reader and generate corpus from all csv files in dir.
filecorpus = 'Twitter/FINAL_RESULTS/tweetcorpus'
filereader = TaggedCorpusReader(filecorpus, r'.*\.csv', sent_tokenizer=LineTokenizer())
print("Reader accessible.")
print(filereader.fileids())

# Define hotwords
hotwords = ('cool','foo','bar')

# One defaultdict of hotword counts per line
tweetdict = []

for line in filereader.sents():
    # Fresh counter for each line, so counts do not carry over
    wordcounts = defaultdict(int)
    for word in line:
        if word in hotwords:
            wordcounts[word] += 1
    tweetdict.append(wordcounts)

The output is:

print(tweetdict)

[defaultdict(<type 'dict'>, {}),
 defaultdict(<type 'int'>, {'foo': 2, 'bar': 1, 'cool': 2}),
 defaultdict(<type 'int'>, {'cool': 1})]
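
The snippet above collects the counts but does not yet write them back to a file, which was part of the original goal. One possible way to do that, sketched in Python 3 with the standard `csv` module, is shown below. The function names `counts_to_rows` and `write_hotword_counts`, the header row, and the output path are assumptions for illustration, not part of the original post.

```python
import csv

hotwords = ("cool", "foo", "bar")


def counts_to_rows(tweet_counts, hotwords):
    """Turn a list of per-line count dicts into rows in hotword order."""
    # .get(word, 0) avoids inserting missing keys into a defaultdict.
    return [[counts.get(word, 0) for word in hotwords]
            for counts in tweet_counts]


def write_hotword_counts(tweet_counts, hotwords, out_path):
    """Write a header row, then one row of counts per line."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(hotwords)
        writer.writerows(counts_to_rows(tweet_counts, hotwords))


# Sample data shaped like the output shown above.
sample = [{}, {"foo": 2, "bar": 1, "cool": 2}, {"cool": 1}]
print(counts_to_rows(sample, hotwords))
# [[0, 0, 0], [2, 2, 1], [1, 0, 0]]
```

Lines with no hot words are written as rows of zeros, so row numbers in the output file stay aligned with line numbers in the input.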
