Problem extracting NER subject + verb with spaCy and Matcher

Problem description

I am working on an NLP project in which I have to use spaCy and the spaCy Matcher to extract every named entity that is an nsubj (subject), together with the verb it relates to: the governor verb of the NE nsubj. Example:

Georges and his friends live in Mexico City
"Hello !", says Mary

I need to extract "Georges" and "live" from the first sentence and "Mary" and "says" from the second, but I don't know how many words will sit between the named entity and the verb it relates to, so I decided to explore the spaCy Matcher further. I am struggling to write a Matcher pattern that extracts these two words. When the NE subject comes before the verb I get good results, but I don't know how to write a pattern that matches an NE subject coming after the verb it relates to. According to the guidelines I could also do this with "regular spaCy", but I don't know how. The problem with Matcher is that I can't constrain the dependency relation between the NE and the VERB, so I can't be sure of grabbing the right VERB. I'm new to spaCy; I have always worked with NLTK or Jieba (for Chinese). I don't even know how to split a text into sentences with spaCy, so I split the whole text into sentences with NLTK to avoid bad matches spanning two sentences. Here is my code:

import spacy
from nltk import sent_tokenize
from spacy.matcher import Matcher

nlp = spacy.load('fr_core_news_md')

matcher = Matcher(nlp.vocab)

def get_entities_verbs():

    try:

        # subject before verb
        pattern_subj_verb = [{'ENT_TYPE': 'PER', 'DEP': 'nsubj'}, {"POS": {'NOT_IN':['VERB']}, "DEP": {'NOT_IN':['nsubj']}, 'OP':'*'}, {'POS':'VERB'}]
        # subject after verb
        # this pattern is not good

        matcher.add('ent-verb', [pattern_subj_verb])

        for sent in sent_tokenize(open('Le_Ventre_de_Paris-short.txt').read()):
            sent = nlp(sent)
            matches = matcher(sent)
            for match_id, start, end in matches:
                span = sent[start:end]
                print(span)

    except Exception as error:
        print(error)


def main():

    get_entities_verbs()

if __name__ == '__main__':
    main()

Even though the text is in French, I can assure you that I get good results:

Florent regardait
Lacaille reparut
Florent baissait
Claude regardait
Florent resta
Florent, soulagé
Claude s’était arrêté
Claude en riait
Saget est matinale, dit
Florent allait
Murillo peignait
Florent accablé
Claude entra
Claude l’appelait
Florent regardait
Florent but son verre de punch ; il le sentit
Alexandre, dit
Florent levait
Claude était ravi
Claude et Florent revinrent
Claude, les mains dans les poches, sifflant

Some results are wrong, but 90% are good. I just need to grab the first and last word of each line to get my NE/verb pair. So my question is: how can I extract an NE that is a subject, together with the verb it relates to, using Matcher, or simply with spaCy (without Matcher)? There are too many factors to take into account. Do you have a method that gives the best possible results, even if 100% is not achievable? I need a pattern that matches the governor VERB followed by the NER subject, starting from this pattern:

pattern = [
        {
            "RIGHT_ID": "person",
            "RIGHT_ATTRS": {"ENT_TYPE": "PERSON", "DEP": "nsubj"},
        },
        {
            "LEFT_ID": "person",
            "REL_OP": "<",
            "RIGHT_ID": "verb",
            "RIGHT_ATTRS": {"POS": "VERB"},
        }
        ]

All credit to polm23 for this pattern

Recommended answer

This is a perfect use case for the Dependency Matcher. It also makes things easier if you merge entities to single tokens before running it. This code should do what you need:

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")

# merge entities to simplify this
nlp.add_pipe("merge_entities")


pattern = [
        {
            "RIGHT_ID": "person",
            "RIGHT_ATTRS": {"ENT_TYPE": "PERSON", "DEP": "nsubj"},
        },
        {
            "LEFT_ID": "person",
            "REL_OP": "<",
            "RIGHT_ID": "verb",
            "RIGHT_ATTRS": {"POS": "VERB"},
        }
        ]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("PERVERB", [pattern])

texts = [
        "John Smith and some other guy live there",
        '"Hello!", says Mary.',
        ]

for text in texts:
    doc = nlp(text)
    matches = matcher(doc)

    for match in matches:
        # each match is (match_id, token_ids); token_ids follow the order of
        # the pattern nodes, so the nsubj entity comes first, then the verb
        match_id, (person, verb) = match
        print(doc[person], "::", doc[verb])
    print()
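
The question uses a French pipeline, where person entities are labelled PER rather than PERSON, so the ENT_TYPE value in the pattern would need to change accordingly. As a rough check that the same pattern also covers a subject placed after its verb, here is a minimal sketch that runs the matcher on a hand-annotated Doc. The sentence, tags, heads and entity labels below are written by hand as an assumption for illustration; they are not the output of a trained parser.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc

# Blank French pipeline: vocab and tokenizer only, no trained model required.
nlp = spacy.blank("fr")

# Hand-written annotations (illustrative assumption, not parser output).
doc = Doc(
    nlp.vocab,
    words=["Bonjour", ",", "dit", "Marie"],
    pos=["INTJ", "PUNCT", "VERB", "PROPN"],
    heads=[2, 2, 2, 2],                      # every token attaches to "dit"
    deps=["obj", "punct", "ROOT", "nsubj"],
    ents=["O", "O", "O", "B-PER"],           # French label for persons
)

pattern = [
    # anchor node: a PER entity token that is a nominal subject
    {"RIGHT_ID": "person", "RIGHT_ATTRS": {"ENT_TYPE": "PER", "DEP": "nsubj"}},
    # "<" means: "person" is an immediate dependent of "verb"
    {"LEFT_ID": "person", "REL_OP": "<", "RIGHT_ID": "verb",
     "RIGHT_ATTRS": {"POS": "VERB"}},
]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("PERVERB", [pattern])

# token_ids come back in pattern order: person first, verb second
pairs = [(doc[person].text, doc[verb].text) for _, (person, verb) in matcher(doc)]
print(pairs)
```

Because the pattern only constrains the dependency arc and not the word order, the subject is found whether it precedes or follows the verb.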

Check out the docs for the DependencyMatcher.
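
The question also asks whether this can be done with plain spaCy, without any Matcher. One possible sketch, not taken from the answer above, walks over `doc.ents` and follows the dependency arc from each entity's root token to its head. The helper name `subject_verb_pairs` is hypothetical, and the Doc is annotated by hand here so that the example runs without downloading a model; with a real pipeline you would pass `nlp(text)` instead.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def subject_verb_pairs(doc, label="PERSON"):
    """Return (entity text, verb text) for entities of `label` that are the nsubj of a verb."""
    pairs = []
    for ent in doc.ents:
        root = ent.root  # token of the entity that attaches to the rest of the parse
        if ent.label_ == label and root.dep_ == "nsubj" and root.head.pos_ == "VERB":
            pairs.append((ent.text, root.head.text))
    return pairs


# Hand-annotated stand-in for parser output (illustrative assumption).
doc = Doc(
    Vocab(),
    words=["John", "Smith", "and", "some", "other", "guy", "live", "there"],
    pos=["PROPN", "PROPN", "CCONJ", "DET", "ADJ", "NOUN", "VERB", "ADV"],
    heads=[1, 6, 5, 5, 5, 1, 6, 6],
    deps=["compound", "nsubj", "cc", "det", "amod", "conj", "ROOT", "advmod"],
    ents=["B-PERSON", "I-PERSON", "O", "O", "O", "O", "O", "O"],
)

print(subject_verb_pairs(doc))
```

This version needs no entity merging, since `ent.root` already picks the token that links the whole entity to the verb. With a French pipeline such as the one in the question, the label argument would be "PER". Like the pattern above, it only catches subjects attached directly to a token tagged VERB.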
