Clean .txt and count most frequent words


Question


I need to

1) Clean a .txt from a list of stopwords, which I have in a separate .txt.

2) After that I need to count the 25 most frequent words.

This is what I came up with for the first part:

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-

import re
from collections import Counter

f=open("text_to_be_cleaned.txt")
txt=f.read()
with open("stopwords.txt") as f:
    stopwords = f.readlines()
stopwords = [x.strip() for x in stopwords]

querywords = txt.split()
resultwords  = [word for word in querywords if word.lower() not in stopwords]
cleantxt = ' '.join(resultwords)

For the second part, I am using this code:

words = re.findall(r'\w+', cleantxt)
lower_words = [word.lower() for word in words]
word_counts = Counter(lower_words).most_common(25)
top25 = word_counts[:25]

print top25

The source file to be cleaned looks like this:

(b)

in the second paragraph, first sentence, the words ‘and to the High Representative’ shall be inserted at the end; in the second sentence, the words ‘It shall hold an annual debate’ shall be replaced by ‘Twice a year it shall hold a debate’ and the words ‘, including the common security and defence policy’ shall be inserted at the end.

The stopword list looks like this:

this
thises
they
thee
the
then
thence
thenest
thener
them

When I run all this, somehow the output still contains words from the stopword list:
[('article', 911), ('european', 586), ('the', 586), ('council', 569), ('union', 530), ('member', 377), ('states', 282), ('parliament', 244), ('commission', 230), ('accordance', 217), ('treaty', 187), ('in', 174), ('procedure', 161), ('policy', 137), ('cooperation', 136), ('legislative', 136), ('acting', 130), ('act', 125), ('amended', 125), ('state', 123), ('provisions', 115), ('security', 113), ('measures', 111), ('adopt', 109), ('common', 108)]

As you can probably tell, I just started learning Python, so I would be very thankful for easy explanations! :)

Files used can be found here:

Stopwordlist

File to be cleaned

EDIT: Added examples for the source file, the stopword file, and the output. Provided the source files.

Solution

Your code is almost there. The main error is that you run the regex \w+ to extract words only after you have "cleaned" the tokens produced by str.split. That doesn't work because punctuation is still attached to the str.split tokens, so a stop word with a comma or quotation mark stuck to it never matches the bare entry in your stopword list and survives the filter. Try the following code instead.

import re
from collections import Counter

# Read the text to be analysed and the stop word list.
with open('treaty_of_lisbon.txt', encoding='utf8') as f:
    target_text = f.read()

with open('terrier-stopwords.txt', encoding='utf8') as f:
    stop_word_lines = f.readlines()

# Tokenise first: lower-case the text and let the regex drop the punctuation.
target_words = re.findall(r'[\w-]+', target_text.lower())
# A set gives fast membership tests; strip the newlines readlines() leaves behind.
stop_words = set(map(str.strip, stop_word_lines))

# Filter out stop words only after tokenising, so punctuation can no longer hide them.
interesting_words = [w for w in target_words if w not in stop_words]
interesting_word_counts = Counter(interesting_words)

print(interesting_word_counts.most_common(25))
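
To make the difference concrete, here is a minimal sketch that runs both orders on a made-up sentence and a tiny stop word list (both invented for illustration, not taken from the question's files):

import re
from collections import Counter

stop_words = {"the", "it", "an", "be"}
text = "the words 'It shall hold an annual debate' shall be replaced"

# Original order: split on whitespace, filter, then tokenise with the regex.
tokens = text.split()
kept = [t for t in tokens if t.lower() not in stop_words]
# "'It" still carries the leading quote, so it does not match the stop word
# "it"; the regex later strips the quote and "it" reappears in the counts.
print(Counter(re.findall(r'\w+', ' '.join(kept).lower())).most_common())
# [('shall', 2), ('words', 1), ('it', 1), ...]   <- a stop word survives

# Fixed order: tokenise (and strip punctuation) first, then filter.
words = re.findall(r'\w+', text.lower())
kept = [w for w in words if w not in stop_words]
print(Counter(kept).most_common())
# [('shall', 2), ('words', 1), ('hold', 1), ...] <- no stop words left

The same effect, at scale, is why 'the' and 'in' still show up in your top-25 output.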
