Extracting 'useful' information out of sentences?

Question

I am currently trying to understand sentences of this form:

The problem was more with the set-top box than the television. Restarting the set-top box solved the problem.

I am totally new to Natural Language Processing and started using Python's NLTK package to get my hands dirty. However, I am wondering if someone could give me an overview of the high-level steps involved in achieving this.

What I am trying to do is identify what the problem was (in this case, the set-top box) and whether the action taken resolved it (in this case, yes, because restarting fixed the problem). If all the sentences were of this form, my life would be easier, but because this is natural language, the sentences could also take the following form:

I took a look at the car and found nothing wrong with it. However, I suspect there is something wrong with the engine

So in this case, the problem was with the car. The presence of the word "suspect" indicates that the action taken did not resolve the problem, and that the potential problem could be with the engine.

I am not looking for an absolute answer, as I suspect this is very complex. What I am looking for is a high-level overview that will point me in the right direction. If there is an easier or alternative way to do this, that is welcome as well.

Recommended answer

If the sentences are well-formed, I would probably experiment with dependency parsing (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.parse.malt.MaltParser-class.html#raw_parse). That gives you a graph over the words of a sentence, from which you can read off the relations between the lexical items. You can then extract phrases from the output of the dependency parser (http://nltk.googlecode.com/svn/trunk/doc/book/ch08.html#code-cfg2). That could help you extract the direct object of a sentence, or the verb phrase in a sentence.
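As a rough illustration of reading the direct object off a dependency parse, here is a minimal sketch in plain Python. It does not call a parser: the triples below are a hand-written, hypothetical stand-in for parser output, and the relation names ("csubj", "obj", "dobj", "det", "amod") are assumed to follow common dependency label conventions rather than the labels of any particular parser. The helper direct_objects is likewise made up for this example.

```python
# Hypothetical parser output for:
#   "Restarting the set-top box solved the problem."
# Each triple is (head, relation, dependent). The labels are an assumption;
# a real parser may use different names and will index tokens by position.
triples = [
    ("solved", "csubj", "Restarting"),
    ("Restarting", "obj", "box"),
    ("box", "det", "the"),
    ("box", "amod", "set-top"),
    ("solved", "obj", "problem"),
    ("problem", "det", "the"),
]

# Labels treated as "direct object"; "dobj" is an older name for "obj".
OBJECT_RELATIONS = {"obj", "dobj"}


def direct_objects(triples, verb):
    """Return the dependents attached to `verb` by a direct-object relation."""
    return [
        dependent
        for head, relation, dependent in triples
        if head == verb and relation in OBJECT_RELATIONS
    ]


print(direct_objects(triples, "solved"))      # what was solved
print(direct_objects(triples, "Restarting"))  # what was restarted
```

With these hand-written triples, the object of "solved" is "problem" and the object of "Restarting" is "box", which matches the action/target pairing the question is after. Matching on word strings is only good enough for a toy sentence; with real parser output you would match on token positions instead.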

If you just want to get phrases or "chunks" from a sentence, you can try a chunk parser (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.chunk-module.html). You can also carry out named entity recognition (http://streamhacker.com/2009/02/23/chunk-extraction-with-nltk/). It is usually used to extract instances of places, organizations or people's names, but it could work in your case as well.
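A minimal sketch of noun-phrase chunking with NLTK's RegexpParser follows. The tokens are tagged by hand (Penn Treebank style tags, chosen for this example) so that no tagger model has to be downloaded; in practice the tags would come from a POS tagger. The grammar is an illustrative assumption, not a recommended one.

```python
import nltk

# Hand-tagged tokens for "Restarting the set-top box solved the problem."
# The tags are assigned manually for this sketch; a tagger would supply them.
tagged = [
    ("Restarting", "VBG"), ("the", "DT"), ("set-top", "JJ"), ("box", "NN"),
    ("solved", "VBD"), ("the", "DT"), ("problem", "NN"), (".", "."),
]

# NP chunk: optional determiner, any number of adjectives, one or more nouns.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# Collect the words of every NP subtree as a plain string.
noun_phrases = [
    " ".join(word for word, _tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]

print(noun_phrases)  # ['the set-top box', 'the problem']
```

The two chunks found here are the candidate "problem" phrases for this sentence; deciding which of them is the actual fault still needs the relations from a parse or some domain knowledge.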

Assuming that you solve the problem of extracting noun/verb phrases from a sentence, you may need to filter them to ease the job of your domain expert (too many phrases could overwhelm a judge). You could carry out a frequency analysis on your phrases and remove the very frequent ones that are not usually related to the problem domain, or compile a white-list and keep only the phrases that contain a word from a pre-defined set, etc.
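The two filtering ideas above can be sketched with the standard library alone. The function names, the max_count threshold and the sample phrases are all made up for illustration; a sensible threshold and white-list would have to come from your own data and domain expert.

```python
from collections import Counter


def drop_frequent(phrases, max_count):
    """Keep phrases that occur at most `max_count` times (case-insensitive)."""
    counts = Counter(phrase.lower() for phrase in phrases)
    return sorted(p for p, n in counts.items() if n <= max_count)


def keep_whitelisted(phrases, whitelist):
    """Keep phrases that contain at least one white-listed word."""
    return sorted(
        {p.lower() for p in phrases if whitelist & set(p.lower().split())}
    )


# Hypothetical phrases extracted from several problem reports.
phrases = [
    "the problem", "the set-top box", "the problem",
    "the television", "the problem", "the engine",
]

print(drop_frequent(phrases, max_count=2))
# ['the engine', 'the set-top box', 'the television']

print(keep_whitelisted(phrases, {"box", "engine", "television"}))
# ['the engine', 'the set-top box', 'the television']
```

Here "the problem" is dropped by both filters: it is too frequent to be informative, and it contains no white-listed word.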
