Extract most important keywords from a set of documents
Question
I have a set of 3000 text documents and I want to extract top 300 keywords (could be single word or multiple words).
I have tried the following approaches:
RAKE: It is a Python-based keyword extraction library, and it failed miserably.
Tf-Idf: It has given me good keywords per document, but it is not able to aggregate them and find keywords that represent the whole group of documents. Also, just selecting the top k words from each document based on Tf-Idf score won't help, right?
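One simple way to get corpus-level keywords out of Tf-Idf, rather than per-document lists, is to sum each term's Tf-Idf score over every document and rank by the total. The sketch below is a minimal, self-contained illustration with a toy corpus and whitespace tokenization; the smoothed IDF formula and the summing strategy are assumptions, not the only choices.

```python
# Minimal sketch: rank corpus-wide keywords by summed TF-IDF.
# Assumes documents are already loaded as plain strings.
import math
from collections import Counter

def corpus_keywords(docs, top_k=5):
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # document frequency: how many documents contain each term
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # sum each term's TF-IDF over every document
    scores = Counter()
    for tokens in tokenized:
        tf = Counter(tokens)
        for term, count in tf.items():
            idf = math.log(n / df[term]) + 1.0  # smoothed IDF
            scores[term] += (count / len(tokens)) * idf
    return [term for term, _ in scores.most_common(top_k)]

docs = ["machine learning models learn from data",
        "deep learning is a branch of machine learning",
        "data pipelines feed data to the models"]
print(corpus_keywords(docs, top_k=3))
```

On a real corpus you would plug in proper tokenization and stopword removal (or use scikit-learn's TfidfVectorizer and sum its matrix over the document axis), but the aggregation idea is the same.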
Word2vec: I was able to do some cool stuff, like finding similar words, but I am not sure how to find important keywords using it.
Can you please suggest some good approach (or elaborate on how to improve any of the above three) to solve this problem? Thanks :)
Answer
It may be better for you to choose those 300 words manually (it's not that many, and it's a one-time task). Code written in Python 3:
import os

files = os.listdir()
topWords = ["word1", "word2"]  # ... etc.
wordsCount = 0
for file in files:
    with open(file, "r") as file_opened:
        text = file_opened.read()
    for word in topWords:
        # substring check against the whole file, not an exact-line match
        if word in text and wordsCount < 300:
            print("I found %s" % word)
            wordsCount += 1
    # Check wordsCount again to exit the outer loop
    if wordsCount == 300:
        break
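If hand-picking 300 words is too coarse, the same loop structure can rank words automatically by how many documents they appear in. This is a hedged sketch, not the answer's method: the STOPWORDS set and the regex tokenization are simplifying assumptions you would tune for your corpus.

```python
# Sketch: rank keywords by document frequency across a directory of
# plain-text files. STOPWORDS and [a-z]+ tokenization are assumptions.
import os
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}

def top_keywords(directory, top_k=300):
    doc_freq = Counter()
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, "r", errors="ignore") as f:
            # each document contributes each word at most once
            words = set(re.findall(r"[a-z]+", f.read().lower()))
        doc_freq.update(words - STOPWORDS)
    return [w for w, _ in doc_freq.most_common(top_k)]
```

Counting document frequency (rather than raw term frequency) keeps one long document from dominating the list, which matches the goal of keywords that represent the whole group.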