Extract relationships using NLTK

Question

This is a follow-up to my earlier question. I am using nltk to parse out persons, organizations, and their relationships. Using this example, I was able to create chunks of persons and organizations; however, I get an error from the nltk.sem.extract_rels call:

AttributeError: 'Tree' object has no attribute 'text'

The full code is below:

import nltk
import re

# billgatesbio from http://www.reuters.com/finance/stocks/officerProfile?symbol=MSFT.O&officerId=28066
with open('billgatesbio.txt', 'r') as f:
    sample = f.read()

sentences = nltk.sent_tokenize(sample)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
# NLTK 3 name; this function was called batch_ne_chunk in NLTK 2
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences)

# tried plain ne_chunk instead of batch_ne_chunk as given in the book
#chunked_sentences = [nltk.ne_chunk(sentence) for sentence in tagged_sentences]

# pattern to find <person> served as <title> in <org>
IN = re.compile(r'.+\s+as\s+')
for doc in chunked_sentences:
    for rel in nltk.sem.extract_rels('ORG', 'PERSON', doc, corpus='ieer', pattern=IN):
        # NLTK 3 name; this helper was called show_raw_rtuple in NLTK 2
        print(nltk.sem.rtuple(rel))

This example is very similar to the one given in the book, but the book's example uses prepared "parsed docs", which appear out of nowhere, and I don't know where to find their object type. I scoured the git libraries as well. Any help is appreciated.

My ultimate goal is to extract persons, organizations, titles (dates) for some companies; then create network maps of persons and organizations.

Recommended answer

It looks like, to be a "parsed doc", an object needs a headline member and a text member, both of which are lists of tokens, where some of the tokens are marked up as trees. For example, this (hacky) example works:

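import nltk
import re

# word boundaries (\b) as in the NLTK book's version of this pattern
IN = re.compile(r'.*\bin\b(?!\b.+ing)')

# bare class used as a stand-in "parsed doc"; attributes are set on the class itself
class doc:
    pass

doc.headline = ['foo']
doc.text = [nltk.Tree('ORGANIZATION', ['WHYY']), 'in', nltk.Tree('LOCATION', ['Philadelphia']), '.', 'Ms.', nltk.Tree('PERSON', ['Gross']), ',']

for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
    # NLTK 3 name; this helper was called show_raw_rtuple in NLTK 2
    print(nltk.sem.rtuple(rel))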

When run, this gives the output:

[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
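
The 'in' between the two entities in that output is the filler text that the pattern has to match. As a rough check of how the pattern behaves, here is a small sketch that uses only the re module. It assumes the word-boundary form of the pattern from the NLTK book and that the pattern is applied with match; the filler strings are made up for illustration.

```python
import re

# Word-boundary form of the pattern, as given in the NLTK book.
IN = re.compile(r'.*\bin\b(?!\b.+ing)')

# Hypothetical filler strings, i.e. the text found between two entities.
fillers = [
    "in",                                        # plain preposition
    "is based in",                               # "in" at the end of a longer filler
    "success in supervising the transition of",  # "in" followed by a gerund
]

# match() anchors at the start of the filler, hence the leading ".*".
matches = {filler: bool(IN.match(filler)) for filler in fillers}

for filler, matched in matches.items():
    print(f"{filler!r}: {matched}")
```

The first two fillers match. The third does not, because the negative lookahead rejects an "in" that is followed by a word ending in "ing".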

Obviously you wouldn't actually build the document by hand like this, but the hacky example above shows the data format that extract_rels expects. You just need to work out the preprocessing steps that massage your data into that format.
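
One possible shape for that preprocessing step is sketched below. ParsedDoc and to_parsed_doc are hypothetical names, not part of NLTK. The sketch assumes that iterating a chunked sentence yields (word, tag) tuples and entity subtrees, as an nltk.Tree produced by ne_chunk does. Leaves inside the entity subtrees are left tagged, so whether extract_rels accepts them unchanged would need checking against your NLTK version.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable, List


@dataclass
class ParsedDoc:
    """Hypothetical container with the two members the answer describes."""
    headline: List[Any] = field(default_factory=list)
    text: List[Any] = field(default_factory=list)


def to_parsed_doc(chunked_sentences: Iterable[Iterable[Any]],
                  headline: Iterable[Any] = ()) -> ParsedDoc:
    """Flatten chunked sentences into one token list.

    Each sentence is iterated child by child. (word, tag) tuples are
    reduced to the bare word; every other child (for example an
    entity subtree) is kept unchanged.
    """
    text: List[Any] = []
    for sentence in chunked_sentences:
        for child in sentence:
            if isinstance(child, tuple):
                text.append(child[0])   # drop the POS tag
            else:
                text.append(child)      # keep entity subtrees as they are
    return ParsedDoc(headline=list(headline), text=text)
```

The object returned by to_parsed_doc(chunked_sentences) would then take the place of doc in the extract_rels(..., corpus='ieer', ...) call shown in the answer.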
