Parsing Meaning from Text


Problem Description

I realize this is a broad topic, but I'm looking for a good primer on parsing meaning from text, ideally in Python. As an example of what I'm looking to do, if a user makes a blog post like:

"Manny Ramirez makes his return for the Dodgers today against the Houston Astros",

what's a lightweight, easy way of getting the nouns out of a sentence? To start, I think I'd limit it to proper nouns, but I wouldn't want to be limited to just that (and I don't want to rely on a simple regex that assumes anything in Title Case is a proper noun).

To make this question even worse, what are the things I'm not asking that I should be? Do I need a corpus of existing words to get started? What lexical analysis stuff do I need to know to make this work? I did come across one other question on the topic and I'm digging through those resources now.

Solution

Use NLTK, in particular chapter 7 of the NLTK book, on information extraction.

You say you want to extract meaning, and there are modules for semantic analysis, but I think IE is all you need, and honestly it is one of the few areas of NLP that computers handle reasonably well right now.

See sections 7.5 and 7.6 on the subtopics of Named Entity Recognition (to chunk and categorize Manny Ramirez as a person, the Dodgers as a sports organization, and the Houston Astros as another sports organization, or whatever suits your domain) and Relation Extraction. There is an NER chunker that you can plug in once you have NLTK installed. From their examples, extracting a geo-political entity (GPE) and a person:

>>> import nltk
>>> sent = nltk.corpus.treebank.tagged_sents()[22]
>>> print(nltk.ne_chunk(sent))
(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  ...
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ...)

Note you'll still need to know tokenization and tagging, as discussed in earlier chapters, to get your text in the right format for these IE tasks.
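To make the tagging step concrete, here is a minimal sketch of what you can do once a sentence has been tokenized and tagged. It is not NLTK's chunker: the `TAGGED` list is hand-written to stand in for what a tagger would return (in NLTK, something like `nltk.pos_tag(nltk.word_tokenize(text))`), and the helper names `proper_noun_phrases` and `nouns` are made up for illustration. It assumes Penn Treebank style tags, where noun tags start with `NN` and proper nouns are `NNP`/`NNPS`.

```python
# Hand-written (token, tag) pairs standing in for a real tagger's output.
# The tags are Penn Treebank style and are an assumption for illustration.
TAGGED = [
    ("Manny", "NNP"), ("Ramirez", "NNP"), ("makes", "VBZ"), ("his", "PRP$"),
    ("return", "NN"), ("for", "IN"), ("the", "DT"), ("Dodgers", "NNPS"),
    ("today", "NN"), ("against", "IN"), ("the", "DT"),
    ("Houston", "NNP"), ("Astros", "NNPS"),
]

PROPER_TAGS = {"NNP", "NNPS"}


def proper_noun_phrases(tagged):
    """Group runs of adjacent proper-noun tokens into phrases."""
    phrases, current = [], []
    for token, tag in tagged:
        if tag in PROPER_TAGS:
            current.append(token)
        elif current:
            # A non-proper-noun token ends the current run.
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


def nouns(tagged):
    """Return every token whose tag starts with NN (common and proper nouns)."""
    return [token for token, tag in tagged if tag.startswith("NN")]


print(proper_noun_phrases(TAGGED))  # ['Manny Ramirez', 'Dodgers', 'Houston Astros']
print(nouns(TAGGED))
```

Because this works from part-of-speech tags rather than capitalization, it avoids the Title Case regex problem, but it only finds the phrases. Labeling them as a person or an organization is the job of the NER chunker shown above.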
