How to extract Named Entity + Verb from text


Question

My aim is to extract a named entity (a person) and the verb connected to it from a text. For example, given this text:


Dumbledore turned and walked back down the street. Harry Potter rolled over inside his blankets without waking up.

Ideally, I should get this result:

Dumbledore turned walked; Harry Potter rolled

I use Stanford NER to find and mark persons, then delete every sentence that does not contain a named entity. In the end I have a 'pure' text consisting only of sentences that mention characters by name. After that I run the Stanford dependency parser, which gives me something like this (CoNLL-U output format):

1   Dumbledore  _   _   NN  _   2   nsubj   _   _
2   turned  _   _   VBD _   0   root    _   _
3   and _   _   CC  _   2   cc  _   _
4   walked  _   _   VBD _   2   conj    _   _
5   back    _   _   RB  _   4   advmod  _   _
6   down    _   _   IN  _   8   case    _   _
7   the _   _   DT  _   8   det _   _
8   street  _   _   NN  _   4   nmod    _   _
9   .   _   _   .   _   2   punct   _   _

1   Harry   _   _   NNP _   2   compound    _   _
2   Potter  _   _   NNP _   3   nsubj   _   _
3   rolled  _   _   VBD _   0   root    _   _
4   over    _   _   IN  _   3   compound:prt    _   _
5   inside  _   _   IN  _   7   case    _   _
6   his _   _   PRP$    _   7   nmod:poss   _   _
7   blankets    _   _   NNS _   3   nmod    _   _
8   without _   _   IN  _   9   mark    _   _
9   waking  _   _   VBG _   3   advcl   _   _
10  up  _   _   RP  _   9   compound:prt    _   _
11  .   _   _   .   _   3   punct   _   _

That is where my problems start. I can see the person and the verb in this output, but I have no idea how to extract them from this format. My guess is to find the NN/NNP token in the table, find its 'parent', and then extract all of that parent's 'child' words. In theory that should work.
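
The "find the subject, then its parent" idea can be tried directly against the table above. What follows is only a minimal sketch, not code from the original question: the names ConllSubjectVerbs and extract are made up for illustration, and it assumes whitespace-separated 10-column rows with the head index in column 7 and the relation in column 8, as in the output shown. It treats every nsubj token as a subject, so restricting the result to persons by NER tag is assumed to happen elsewhere.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ConllSubjectVerbs {

  // 0-based column positions in the 10-column rows (assumed layout).
  private static final int ID = 0, FORM = 1, POS = 4, HEAD = 6, DEPREL = 7;

  /** Maps each nsubj token (plus its compound parts) to its head verb and that verb's conj verbs. */
  static Map<String, List<String>> extract(List<String> sentenceRows) {
    List<String[]> rows = new ArrayList<>();
    for (String line : sentenceRows) {
      if (!line.trim().isEmpty()) {
        rows.add(line.trim().split("\\s+"));
      }
    }
    Map<String, List<String>> result = new LinkedHashMap<>();
    for (String[] subject : rows) {
      if (!subject[DEPREL].equals("nsubj")) {
        continue;
      }
      // Rebuild a multi-word name from "compound" dependents, e.g. Harry -> Potter.
      StringBuilder name = new StringBuilder();
      for (String[] row : rows) {
        if (row[HEAD].equals(subject[ID]) && row[DEPREL].equals("compound")) {
          name.append(row[FORM]).append(' ');
        }
      }
      name.append(subject[FORM]);

      // The subject's head is the verb; verbs attached to that head by "conj" share the subject.
      List<String> verbs = new ArrayList<>();
      for (String[] row : rows) {
        boolean isHead = row[ID].equals(subject[HEAD]);
        boolean isConjOfHead = row[HEAD].equals(subject[HEAD]) && row[DEPREL].equals("conj");
        if ((isHead || isConjOfHead) && row[POS].startsWith("VB")) {
          verbs.add(row[FORM]);
        }
      }
      result.put(name.toString(), verbs);
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> sentence = Arrays.asList(
        "1   Harry   _   _   NNP _   2   compound    _   _",
        "2   Potter  _   _   NNP _   3   nsubj   _   _",
        "3   rolled  _   _   VBD _   0   root    _   _",
        "4   over    _   _   IN  _   3   compound:prt    _   _");
    System.out.println(extract(sentence));
  }
}
```

The sketch reads the "children" step narrowly: it keeps only the parent verb and verbs coordinated with it through conj, because collecting every child of the verb would also pull in words such as "back" or "blankets".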

Can anyone suggest another way to get a person and their action from a text, or a more rational way to do it?

I'll be very grateful for any help!

Recommended answer

Here is some sample code to help with your problem:

import java.io.*;
import java.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.util.*;



public class NERAndVerbExample {

  public static void main(String[] args) throws IOException {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,depparse,entitymentions");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    String text = "John Smith went to the store.";
    Annotation annotation = new Annotation(text);
    pipeline.annotate(annotation);
    System.out.println("---");
    System.out.println("text: " + text);
    System.out.println("");
    System.out.println("dependency edges:");
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
      SemanticGraph sg = sentence.get(SemanticGraphCoreAnnotations.CollapsedDependenciesAnnotation.class);
      for (SemanticGraphEdge sge : sg.edgeListSorted()) {
        System.out.println(
                sge.getGovernor().word() + "," + sge.getGovernor().index() + "," + sge.getGovernor().tag() + "," +
                        sge.getGovernor().ner()
                        + " - " + sge.getRelation().getLongName()
                        + " -> "
                        + sge.getDependent().word() + ","
                        + sge.getDependent().index() + "," + sge.getDependent().tag() + "," + sge.getDependent().ner());
      }
      System.out.println();
      System.out.println("entity mentions:");
      for (CoreMap entityMention : sentence.get(CoreAnnotations.MentionsAnnotation.class)) {
        int lastTokenIndex = entityMention.get(CoreAnnotations.TokensAnnotation.class).size()-1;
        System.out.println(entityMention.get(CoreAnnotations.TextAnnotation.class) +
                "\t" +
                entityMention.get(CoreAnnotations.TokensAnnotation.class)
                        .get(lastTokenIndex).get(CoreAnnotations.IndexAnnotation.class) + "\t" +
                entityMention.get(CoreAnnotations.NamedEntityTagAnnotation.class));
      }
    }
  }
}

I'm hoping to add some syntactic sugar to Stanford CoreNLP 3.8.0 to make working with entity mentions easier.

To explain this code a bit: the entitymentions annotator goes through the tokens and groups adjacent tokens with the same NER tag together, so "John Smith" gets marked as a single entity mention.
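
The grouping idea can be illustrated without CoreNLP at all. The following is a simplified sketch of the idea only, not the annotator's actual implementation, which may handle more cases: MentionGrouping and groupMentions are hypothetical names, and the sketch assumes one NER tag per token with "O" marking tokens outside any entity.

```java
import java.util.ArrayList;
import java.util.List;

class MentionGrouping {

  /** Joins runs of adjacent tokens that share a non-"O" tag into "text<TAB>TAG" strings. */
  static List<String> groupMentions(String[] words, String[] nerTags) {
    List<String> mentions = new ArrayList<>();
    int i = 0;
    while (i < words.length) {
      String tag = nerTags[i];
      if (tag.equals("O")) {
        i++; // not part of any entity
        continue;
      }
      // Extend the mention while the following tokens carry the same tag.
      StringBuilder text = new StringBuilder(words[i]);
      int j = i + 1;
      while (j < words.length && nerTags[j].equals(tag)) {
        text.append(' ').append(words[j]);
        j++;
      }
      mentions.add(text + "\t" + tag);
      i = j;
    }
    return mentions;
  }

  public static void main(String[] args) {
    String[] words = {"John", "Smith", "went", "to", "the", "store", "."};
    String[] tags = {"PERSON", "PERSON", "O", "O", "O", "O", "O"};
    System.out.println(groupMentions(words, tags));
  }
}
```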

If you go through the dependency graph, you can get the index of each word.

Likewise, if you access the list of tokens for an entity mention, you can find the index of each word in that mention.

With a little more code you can link those indices together and form entity mention/verb pairs, as you requested.
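
One possible way to do that linking is sketched below. It is an assumption-laden illustration rather than the answer author's code: Edge and Mention are hypothetical stand-ins for the values the sample program prints (governor word and tag, relation short name, dependent index, and the mention's text, NER tag and last token index), so the sketch runs without CoreNLP. It pairs a PERSON mention with a verb when an nsubj edge points at the mention's last token, on the assumption that the last token of a name is the one attached to the verb.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MentionVerbLinker {

  /** Hypothetical stand-in for the fields read from one dependency edge. */
  static class Edge {
    final String governorWord, governorTag, relation;
    final int dependentIndex;
    Edge(String governorWord, String governorTag, String relation, int dependentIndex) {
      this.governorWord = governorWord;
      this.governorTag = governorTag;
      this.relation = relation;
      this.dependentIndex = dependentIndex;
    }
  }

  /** Hypothetical stand-in for one entity mention. */
  static class Mention {
    final String text, nerTag;
    final int lastTokenIndex;
    Mention(String text, String nerTag, int lastTokenIndex) {
      this.text = text;
      this.nerTag = nerTag;
      this.lastTokenIndex = lastTokenIndex;
    }
  }

  /** Returns "mention verb" strings for PERSON mentions that are the subject of a verb. */
  static List<String> link(List<Mention> mentions, List<Edge> edges) {
    List<String> pairs = new ArrayList<>();
    for (Mention mention : mentions) {
      if (!mention.nerTag.equals("PERSON")) {
        continue;
      }
      for (Edge edge : edges) {
        boolean pointsAtMention = edge.dependentIndex == mention.lastTokenIndex;
        if (edge.relation.equals("nsubj") && pointsAtMention && edge.governorTag.startsWith("VB")) {
          pairs.add(mention.text + " " + edge.governorWord);
        }
      }
    }
    return pairs;
  }

  public static void main(String[] args) {
    // Hand-written edges for "John Smith went to the store." (illustrative, not parser output).
    List<Edge> edges = Arrays.asList(
        new Edge("Smith", "NNP", "compound", 1),
        new Edge("went", "VBD", "nsubj", 2),
        new Edge("went", "VBD", "nmod", 6));
    List<Mention> mentions = Arrays.asList(new Mention("John Smith", "PERSON", 2));
    System.out.println(link(mentions, edges));
  }
}
```

With real CoreNLP objects, the same comparison would use the indices the sample program already prints for each edge's dependent and for each mention's last token.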

As you can see, accessing the information for an entity mention is currently quite cumbersome, so I am going to try to improve that in 3.8.0.
