Clean .txt and count most frequent words
Question
I need to
1) Clean a .txt file using a list of stopwords, which I have in a separate .txt.
2) After that I need to count the 25 most frequent words.
This is what I came up with for the first part:
#!/usr/bin/python
# -*- coding: iso-8859-15 -*-
import re
from collections import Counter
f=open("text_to_be_cleaned.txt")
txt=f.read()
with open("stopwords.txt") as f:
    stopwords = f.readlines()
stopwords = [x.strip() for x in stopwords]
querywords = txt.split()
resultwords = [word for word in querywords if word.lower() not in stopwords]
cleantxt = ' '.join(resultwords)
For the second part, I am using this code:
words = re.findall(r'\w+', cleantxt)
lower_words = [word.lower() for word in words]
word_counts = Counter(lower_words).most_common(25)
top25 = word_counts[:25]
print top25
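Side note (my own aside, not in the original question): Counter.most_common(25) already returns at most 25 pairs, so the extra top25 = word_counts[:25] slice is a no-op, albeit a harmless one:

>>> from collections import Counter
>>> Counter('aabbbc').most_common(2)
[('b', 3), ('a', 2)]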
The source file to be cleaned looks like this:
(b)
in the second paragraph, first sentence, the words ‘and to the High Representative’ shall be inserted at the end; in the second sentence, the words ‘It shall hold an annual debate’ shall be replaced by ‘Twice a year it shall hold a debate’ and the words ‘, including the common security and defence policy’ shall be inserted at the end.
The stopword list looks like this:
this
thises
they
thee
the
then
thence
thenest
thener
them
When I run all this, somehow the output still contains words from the stopword list:
[('article', 911), ('european', 586), ('the', 586), ('council', 569), ('union', 530), ('member', 377), ('states', 282), ('parliament', 244), ('commission', 230), ('accordance', 217), ('treaty', 187), ('in', 174), ('procedure', 161), ('policy', 137), ('cooperation', 136), ('legislative', 136), ('acting', 130), ('act', 125), ('amended', 125), ('state', 123), ('provisions', 115), ('security', 113), ('measures', 111), ('adopt', 109), ('common', 108)]
As you can probably tell, I just started learning Python, so I would be very thankful for easy explanations! :)
Files used can be found here:
EDIT: Added examples for the source file, the stopword file, and the output. Provided the source files.
Answer

Your code is almost there. The major error is that you run the regex \w+ to extract words only after you have "cleaned" the words produced by str.split. This doesn't work because punctuation is still attached to each str.split result, so a token such as 'the,' never matches the bare stopword 'the'.
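A quick illustration of the failure mode (my own hypothetical snippet with a made-up stopword list, not part of the original answer):

>>> 'Hold a debate twice a year, the Council said.'.split()
['Hold', 'a', 'debate', 'twice', 'a', 'year,', 'the', 'Council', 'said.']
>>> 'year,'.lower() in ['a', 'the', 'year']
False

Because the comma rides along with 'year,', the token slips past the stopword check; the later re.findall(r'\w+', ...) then strips the comma off, and the bare word 'year' gets counted after all. The fix is to extract words with the regex first and filter afterwards. Try the following code instead: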
import re
from collections import Counter

# Read the whole document to be analysed.
with open('treaty_of_lisbon.txt', encoding='utf8') as f:
    target_text = f.read()

# Read the stopword list, one word per line.
with open('terrier-stopwords.txt', encoding='utf8') as f:
    stop_word_lines = f.readlines()

# Tokenize first: lowercase and extract word characters (plus hyphens).
target_words = re.findall(r'[\w-]+', target_text.lower())
# Strip trailing newlines; a set gives fast membership tests.
stop_words = set(map(str.strip, stop_word_lines))

# Filter after tokenizing, then count.
interesting_words = [w for w in target_words if w not in stop_words]
interesting_word_counts = Counter(interesting_words)
print(interesting_word_counts.most_common(25))
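As a minimal follow-up check (my addition, not part of the original answer), you can assert that no stopword survived the filter; it relies only on the names defined above:

# Set intersection over the Counter's keys: empty means the filter worked.
assert not stop_words.intersection(interesting_word_counts)

Building stop_words as a set also matters for speed: membership tests against a set are constant-time, so filtering a long document like the treaty stays fast.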