python nltk keyword extraction from sentence


Question

"The first thing we do, let's kill all the lawyers." - William Shakespeare

Given the quote above, I would like to pull out "kill" and "lawyers" as the two prominent keywords that describe the overall meaning of the sentence. I have extracted the following noun/verb POS tags:

[["First", "NNP"], ["thing", "NN"], ["do", "VBP"], ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]

The more general problem I am trying to solve is to distill a sentence to its "most important"* words/tags, so as to summarise the overall "meaning"* of the sentence.

*Note the scare quotes. I acknowledge this is a very hard problem and there is most likely no perfect solution at this point in time. Nonetheless, I am interested to see attempts at solving the specific problem (extracting "kill" and "lawyers") and the general problem (summarising the overall meaning of a sentence in keywords/tags).

Recommended answer

I don't think there's any perfect answer to this question, because there is no gold set of input/output mappings that everybody will agree upon. You think the most important words for that sentence are ('kill', 'lawyers'); someone else might argue the correct answer should be ('first', 'kill', 'lawyers'). If you can describe precisely and unambiguously what you want your system to do, your problem will be more than half solved.

Until then, I can suggest some additional heuristics to help you get what you want.
Build an idf (inverse document frequency) dictionary using your data, i.e. build a mapping from every word to a number that grows with how rare that word is. Bonus points for doing it for larger n-grams as well.
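As a rough illustration of the idf dictionary described above, here is a minimal sketch in plain Python. The names `ngrams` and `build_idf`, the whitespace tokenisation, the `log(N / df)` formula and the four-sentence toy corpus are all assumptions made for this example; a real corpus would be far larger and would likely use a proper tokenizer.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams in a token list, each as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_idf(documents, n=1):
    """Map each n-gram to log(N / df), where df is the number of
    documents (out of N) that contain the n-gram at least once."""
    doc_freq = Counter()
    for doc in documents:
        tokens = doc.lower().split()  # naive whitespace tokenisation (assumption)
        doc_freq.update(set(ngrams(tokens, n)))  # count once per document
    total = len(documents)
    return {gram: math.log(total / df) for gram, df in doc_freq.items()}

# Toy corpus, invented purely for illustration.
corpus = [
    "the first thing we do is read the brief",
    "lets do the first thing on the list",
    "lawyers read the brief",
    "lets not kill the messenger",
]

idf = build_idf(corpus)              # keys are 1-tuples such as ("kill",)
bigram_idf = build_idf(corpus, n=2)  # keys are 2-tuples such as ("first", "thing")
```

In this toy corpus "the" appears in every document and so gets an idf of 0, while "kill" and "lawyers" each appear in one document and get the highest value.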

By combining the idf value of each word in your input sentence with its POS tag, you can answer questions of the form "What is the rarest verb in this sentence?", "What is the rarest noun in this sentence?", etc. In any reasonable corpus, 'kill' should be rarer than 'do', and 'lawyers' rarer than 'thing', so finding the rarest noun and the rarest verb in a sentence and returning just those two may do the trick for most of your intended use cases. If not, you can always make your algorithm a little more complicated and see whether that does the job better.
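The "rarest noun, rarest verb" heuristic can be sketched as follows, using the tagged tokens from the question as input. The helper names `build_idf` and `rarest_by_pos`, the toy corpus, and the choice to score unseen words like a word seen in a single document are assumptions for this example, not part of the original answer.

```python
import math
from collections import Counter

def build_idf(documents):
    """Map each lowercased word to log(N / df) over the documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))
    total = len(documents)
    return {word: math.log(total / df) for word, df in doc_freq.items()}

def rarest_by_pos(tagged, idf, prefix, default_idf):
    """Return the highest-idf word whose POS tag starts with `prefix`,
    or None if the sentence has no such word."""
    candidates = [word for word, tag in tagged if tag.startswith(prefix)]
    if not candidates:
        return None
    return max(candidates, key=lambda w: idf.get(w.lower(), default_idf))

# Toy corpus, invented purely for illustration.
corpus = [
    "the first thing we do is read the brief",
    "lets do the first thing on the list",
    "lawyers read the brief",
    "lets not kill the messenger",
]
idf = build_idf(corpus)
unseen = math.log(len(corpus))  # assumption: unseen words score like df == 1

# POS-tagged tokens exactly as given in the question.
tagged = [["First", "NNP"], ["thing", "NN"], ["do", "VBP"],
          ["lets", "NNS"], ["kill", "VB"], ["lawyers", "NNS"]]

keywords = (rarest_by_pos(tagged, idf, "VB", unseen),
            rarest_by_pos(tagged, idf, "NN", unseen))
print(keywords)  # ('kill', 'lawyers')
```

Note that "lets" is tagged as a plural noun in the question's output, so it competes with "lawyers" here. It only loses because the toy corpus happens to contain it in two documents; with a different corpus, tagger errors like this one can surface as keywords.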

Ways to expand this include trying to identify larger phrases using n-gram idfs, or building a full parse tree of the sentence (perhaps with the Stanford parser) and looking for patterns within these trees that tell you which parts of the tree the important words tend to sit in.
