Python - Finding word frequencies of list of words in text file


Problem description

I am trying to speed up my project to count word frequencies. I have 360+ text files, and I need to get the total number of words and the number of times each word from another list of words appears. I know how to do this with a single text file.

>>> import nltk
>>> import os
>>> import re
>>> os.chdir(r"C:\Users\Cameron\Desktop\PDF-to-txt")
>>> filename = "1976.03.txt"
>>> textfile = open(filename, "r")
>>> inputString = textfile.read()
>>> word_list = re.split(r'\s+', inputString.lower())
>>> print('Words in text:', len(word_list))
# spits out number of words in the textfile
>>> word_list.count('inflation')
# spits out number of times 'inflation' occurs in the textfile
>>> word_list.count('jobs')
>>> word_list.count('output')

It's too tedious to get the frequencies of 'inflation', 'jobs', and 'output' individually. Can I put these words into a list and find the frequency of all the words in the list at the same time? Basically, I want to do this with Python.

Example: Instead of this:

>>> word_list.count('inflation')
3
>>> word_list.count('jobs')
5
>>> word_list.count('output')
1

I want to do this (I know this isn't real code, this is what I'm asking for help on):

>>> list1 = 'inflation', 'jobs', 'output'
>>> word_list.count(list1)
'inflation', 'jobs', 'output'
3, 5, 1

My list of words is going to have 10-20 terms, so I need to be able to just point Python toward a list of words and get their counts. It would also be nice if the output could be copied and pasted into an Excel spreadsheet, with the words as columns and the frequencies as rows.

Example:

inflation, jobs, output
3, 5, 1

And finally, can anyone help automate this for all of the text files? I figure I just point Python toward the folder and it can do the above word counting from the new list for each of the 360+ text files. Seems easy enough, but I'm a bit stuck. Any help?

An output like this would be fantastic:

Filename1
inflation, jobs, output
3, 5, 1

Filename2
inflation, jobs, output
7, 2, 4

Filename3
inflation, jobs, output
9, 3, 5

Thanks!

Solution

collections.Counter() has this covered if I understand your problem.

The example from the docs would seem to match your problem.

from collections import Counter

# Tally occurrences of words in a list
cnt = Counter()
for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    cnt[word] += 1
print(cnt)


# Find the ten most common words in Hamlet
import re
# r'\w+' matches runs of word characters, so punctuation is dropped
words = re.findall(r'\w+', open('hamlet.txt').read().lower())
Counter(words).most_common(10)

From the example above you should be able to do:

import re
import collections
# Lowercase the text, then split it into words on non-word characters
words = re.findall(r'\w+', open('1976.03.txt').read().lower())
print(collections.Counter(words))
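
Once the Counter exists, looking up a fixed list of terms is just indexing, because a Counter returns 0 for keys it has never seen. The following is a minimal sketch rather than a definitive implementation: the helper name count_terms and the inline sample string (standing in for the contents of a file) are assumptions made for illustration.

```python
import re
from collections import Counter

def count_terms(text, terms):
    """Return (total word count, [count of each term]) for one text."""
    words = re.findall(r'\w+', text.lower())
    counts = Counter(words)
    # A Counter returns 0 for missing keys, so absent terms need no special case.
    return len(words), [counts[term] for term in terms]

# Hypothetical inputs: the three terms from the question and a made-up sample text.
terms = ['inflation', 'jobs', 'output']
sample = "Inflation rose. Jobs fell; inflation and output held. Jobs, jobs!"
total, freqs = count_terms(sample, terms)

print('Words in text:', total)
print(', '.join(terms))
print(', '.join(str(n) for n in freqs))
```

The last two printed lines have the "words on one line, counts on the next" shape asked for in the question, and the comma-separated form pastes into a spreadsheet fairly directly.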

EDIT: a naive approach to show one way.

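import re
from collections import Counter

# Split into a list so that `in` tests whole words, not substrings of one string
wanted = "fish chips steak".split()
cnt = Counter()
words = re.findall(r'\w+', open('1976.03.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)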
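
For the last part of the question (running the same count over every file in a folder), one possible approach is to loop over the .txt files and write one CSV row per file. This is a sketch under stated assumptions, not the answer's original code: the function name count_terms_in_folder, the column names filename and total_words, and the choice to read files as UTF-8 with undecodable bytes replaced are all illustrative.

```python
import csv
import glob
import os
import re
from collections import Counter

def count_terms_in_folder(folder, terms, out_file):
    """Write one CSV row per .txt file: filename, total words, one count per term."""
    writer = csv.writer(out_file)
    # Header row: the terms become spreadsheet columns.
    writer.writerow(['filename', 'total_words'] + list(terms))
    for path in sorted(glob.glob(os.path.join(folder, '*.txt'))):
        # Assumption: files are UTF-8; undecodable bytes are replaced, not fatal.
        with open(path, encoding='utf-8', errors='replace') as handle:
            words = re.findall(r'\w+', handle.read().lower())
        counts = Counter(words)
        # Missing terms count as 0 because Counter defaults to 0.
        writer.writerow([os.path.basename(path), len(words)]
                        + [counts[term] for term in terms])
```

To use it, pass the folder path, the list of terms, and a file opened for writing with newline='' (as the csv module recommends), for example open('word_counts.csv', 'w', newline=''). A single table with one row per file is used here instead of a separate block per file, since it opens directly in Excel with the words as columns.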
