Spacy中的自定义句子切分 [英] Custom sentence segmentation in Spacy

查看:0
本文介绍了Spacy中的自定义句子切分的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我希望spaCy使用我提供的句子分割边界,而不是它自己的处理。

例如:

get_sentences("Bob meets Alice. @SentBoundary@ They play together.")
# => ["Bob meets Alice.", "They play together."]  # two sents

get_sentences("Bob meets Alice. They play together.")
# => ["Bob meets Alice. They play together."]  # ONE sent

get_sentences("Bob meets Alice, @SentBoundary@ they play together.")
# => ["Bob meets Alice,", "they play together."] # two sents

这就是我到目前为止所拥有的(从文档中借用东西here):

import spacy
nlp = spacy.load('en_core_web_sm')

def mark_sentence_boundaries(doc):
    for i, token in enumerate(doc):
        if token.text == '@SentBoundary@':
            doc[i+1].sent_start = True
    return doc

nlp.add_pipe(mark_sentence_boundaries, before='parser')

def get_sentences(text):
    doc = nlp(text)
    return (list(doc.sents))

但我得到的结果如下:

# Ex1
get_sentences("Bob meets Alice. @SentBoundary@ They play together.")
#=> ["Bob meets Alice.", "@SentBoundary@", "They play together."]

# Ex2
get_sentences("Bob meets Alice. They play together.")
#=> ["Bob meets Alice.", "They play together."]

# Ex3
get_sentences("Bob meets Alice, @SentBoundary@ they play together.")
#=> ["Bob meets Alice, @SentBoundary@", "they play together."]

以下是我面临的主要问题:

  1. 当发现断句时,如何清除@SentBoundary@标记。
  2. 如果@SentBoundary@不存在,如何禁止spaCy拆分。

推荐答案

以下代码有效:

import spacy
nlp = spacy.load('en_core_web_sm')

def split_on_breaks(doc):
    start = 0
    seen_break = False
    for word in doc:
        if seen_break:
            yield doc[start:word.i-1]
            start = word.i
            seen_break = False
        elif word.text == '@SentBoundary@':
            seen_break = True
    if start < len(doc):
        yield doc[start:len(doc)]

sbd = SentenceSegmenter(nlp.vocab, strategy=split_on_breaks)
nlp.add_pipe(sbd, first=True)

def get_sentences(text):
    doc = nlp(text)
    return (list(doc.sents)) # convert to string if required.

# Ex1
get_sentences("Bob meets Alice. @SentBoundary@ They play together.")
# => ["Bob meets Alice.", "They play together."]  # two sentences

# Ex2
get_sentences("Bob meets Alice. They play together.")
# => ["Bob meets Alice. They play together."]  # ONE sentence

# Ex3
get_sentences("Bob meets Alice, @SentBoundary@ they play together.")
# => ["Bob meets Alice,", "they play together."] # two sentences
正确的做法是检查SentenceSegmenter,而不是手动设置边界(例如here)。ThisGitHub问题也有帮助。

这篇关于Spacy中的自定义句子切分的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆