Count most commonly used words in a txt file


Question

I'm trying to get a list of the 10 most commonly used words in a txt file, with the end goal of building a word cloud. The following code prints nothing when I run it.

>>> import collections
>>> from collections import Counter
>>> file = open('/Users/Desktop/word_cloud/98-0.txt')
>>> wordcount={}
>>> d = collections.Counter(wordcount)
>>> for word, count in d.most_common(10):
...     print(word, ": ", count)


Recommended answer

Actually, I would recommend that you continue to use Counter. It's a really useful tool for counting things, and its syntax is expressive enough that you don't need to worry about sorting anything yourself. Using it, you can do:

from collections import Counter

# Open the file; the with statement closes it automatically afterwards.
with open("input.txt") as input_file:
    # Build a counter from every whitespace-separated word in the file.
    count = Counter(word for line in input_file
                         for word in line.split())

print(count.most_common(10))

With my input.txt, the output is:

[('THE', 27643), ('AND', 26728), ('I', 20681), ('TO', 19198), ('OF', 18173), ('A', 14613), ('YOU', 13649), ('MY', 12480), ('THAT', 11121), ('IN', 10967)]

I've changed it a bit so it doesn't have to read the whole file into memory. My input.txt is my punctuationless version of the works of Shakespeare, to demonstrate that this code is fast. It takes about 0.2 seconds on my machine.
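If the input still contains punctuation or mixed case (the input.txt above has neither), a small normalization step keeps `The`, `the,` and `THE` from being counted as different words. The sketch below is one way to do that while still reading line by line. The helper name `count_words` and the regular expression are assumptions for illustration, not part of the original answer.

```python
import re
from collections import Counter

# Hypothetical helper (not from the original answer): counts words from any
# iterable of text lines, ignoring case and punctuation.
WORD_RE = re.compile(r"[a-z']+")  # assumption: words are ASCII letters and apostrophes

def count_words(lines):
    counts = Counter()
    for line in lines:
        # update() adds the words found in this line to the running totals
        counts.update(WORD_RE.findall(line.lower()))
    return counts

sample = ["The cat and the hat.", "THE END, and the beginning!"]
print(count_words(sample).most_common(2))  # [('the', 4), ('and', 2)]
```

Because iterating over an open file yields its lines, the file object can be passed directly, for example `count_words(input_file)` inside the `with` block.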

Your code was a bit haphazard: it looks like you've tried to bring together several approaches, keeping bits of each here and there. My code has been annotated with some explanatory comments. Hopefully it is relatively straightforward, but if you're still confused about anything, let me know.
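For reference, the reason the code in the question prints nothing can be reproduced in a few lines. The file is opened but never read, so `wordcount` is still an empty dict when it is passed to `Counter`, and `most_common(10)` on an empty counter returns an empty list. A minimal sketch:

```python
from collections import Counter

wordcount = {}            # never filled in, as in the question
d = Counter(wordcount)    # a Counter built from an empty dict is itself empty

top = d.most_common(10)
print(top)                # [] -> a for loop over it never runs its body
```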
