Extract most important keywords from a set of documents


Question


I have a set of 3000 text documents and I want to extract top 300 keywords (could be single word or multiple words).

I have tried the following approaches -


RAKE: It is a Python-based keyword-extraction library, and it failed miserably.


Tf-Idf: It has given me good keywords per document, but it is not able to aggregate them and find keywords that represent the whole group of documents. Also, just selecting top k words from each document based on Tf-Idf score won't help, right?
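One way to aggregate Tf-Idf beyond individual documents is to sum each term's Tf-Idf score over the whole corpus and rank globally. The sketch below does this in plain Python on a toy three-document corpus; the documents, the whitespace tokenization, and the top-3 cutoff are all illustrative stand-ins, not part of the question:

```python
import math
from collections import Counter

# Toy corpus standing in for the 3000 documents (illustrative only).
docs = [
    "machine learning models need data",
    "deep learning is machine learning",
    "data pipelines feed learning models",
]

tokenized = [d.split() for d in docs]
# Document frequency: in how many documents each word appears.
df = Counter(w for doc in tokenized for w in set(doc))
N = len(docs)

# Sum each word's tf-idf over all documents to get a corpus-level score.
corpus_scores = Counter()
for doc in tokenized:
    tf = Counter(doc)
    for word, count in tf.items():
        idf = math.log(N / df[word])
        corpus_scores[word] += (count / len(doc)) * idf

# Globally ranked keywords (take top 300 on the real corpus).
top_keywords = [w for w, _ in corpus_scores.most_common(3)]
print(top_keywords)
```

Note that a word appearing in every document (here "learning") gets an idf of zero and drops out of the ranking, which is often the desired behavior for corpus-wide keywords.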


Word2vec: I was able to do some cool stuff like finding similar words, but I am not sure how to find important keywords using it.
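One heuristic (an assumption on my part, not something the question confirms works) for turning word vectors into keywords is to rank vocabulary words by cosine similarity to the centroid of all vectors, treating the centroid as the corpus's "average meaning". The tiny hand-written vectors below are hypothetical stand-ins for a trained Word2vec model's vectors:

```python
import math

# Hypothetical 3-dimensional embeddings standing in for trained
# Word2vec vectors (in practice these would come from a real model).
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.12],
    "apple": [0.1, 0.2, 0.9],
    "banana": [0.12, 0.18, 0.88],
}

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Centroid of all word vectors in the vocabulary.
dim = len(next(iter(vectors.values())))
centroid = [sum(v[i] for v in vectors.values()) / len(vectors) for i in range(dim)]

# Words closest to the centroid are candidate corpus-level keywords.
ranked = sorted(vectors, key=lambda w: cosine(vectors[w], centroid), reverse=True)
print(ranked)
```

This is only a sketch of one idea; weighting the centroid by term frequency, or combining the similarity score with Tf-Idf, are common refinements.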


Can you please suggest some good approach (or elaborate on how to improve any of the above 3) to solve this problem? Thanks :)

Answer


It is better for you to choose those 300 words manually (it's not that many, and it's a one-time effort) - code written in Python 3:

import os

# Scan every file in the current directory for the chosen keywords.
files = os.listdir()
topWords = ["word1", "word2.... etc"]
wordsCount = 0
for file in files:
    with open(file, "r") as file_opened:
        text = file_opened.read()
    for word in topWords:
        # Search the whole file text (not line-by-line equality) for the keyword.
        if word in text and wordsCount < 300:
            print("I found %s" % word)
            wordsCount += 1
    # Stop once 300 keywords have been found.
    if wordsCount == 300:
        break

