Sentence Segmentation using Spacy

Problem Description

I am new to spaCy and NLP, and I am facing the following issue while doing sentence segmentation with spaCy.

The text I am trying to tokenise into sentences contains numbered lists (with a space between the numbering and the actual text), like below.

import spacy

nlp = spacy.load('en_core_web_sm')
text = "This is first sentence.\nNext is numbered list.\n1. Hello World!\n2. Hello World2!\n3. Hello World!"
text_sentences = nlp(text)
# Print each sentence found by the pipeline's sentence segmenter.
for sentence in text_sentences.sents:
    print(sentence.text)

The output (1., 2., 3. are treated as separate sentences) is:

This is first sentence.

Next is numbered list.

1.
Hello World!

2.
Hello World2!

3.
Hello World!

But if there is no space between the numbering and the actual text, then the sentence tokenisation is fine, like below:

import spacy
nlp = spacy.load('en_core_web_sm')
text = "This is first sentence.\nNext is numbered list.\n1.Hello World!\n2.Hello World2!\n3.Hello World!"
text_sentences = nlp(text)
for sentence in text_sentences.sents:
    print(sentence.text)

The expected output is:

This is first sentence.

Next is numbered list.

1.Hello World!

2.Hello World2!

3.Hello World!

Please suggest whether we can customise the sentence detector to handle this.

Answer

When you use a pretrained model with spaCy, sentences get split based on the training data that was provided during the training procedure of the model.

Of course, there are cases like yours where somebody may want to use custom sentence segmentation logic. This is possible by adding a component to the spaCy pipeline.

For your case, you can add a rule that prevents sentence splitting when there is a {number}. pattern.

A solution for your problem:

import re

import spacy

nlp = spacy.load('en_core_web_sm')
# Matches a token made up entirely of digits, e.g. the "1" in "1. Hello World!"
boundary = re.compile('^[0-9]+$')

def custom_seg(doc):
    if len(doc) == 0:
        return doc
    prev = doc[0].text
    length = len(doc)
    for index, token in enumerate(doc):
        # When a "." follows a list number, keep the next token inside the
        # same sentence instead of letting it start a new one.
        if token.text == '.' and boundary.match(prev) and index != (length - 1):
            doc[index + 1].is_sent_start = False
        prev = token.text
    return doc

# Run the custom rule before the parser; the parser respects sentence
# boundaries that were set by earlier components (spaCy v2 API).
nlp.add_pipe(custom_seg, before='parser')
text = 'This is first sentence.\nNext is numbered list.\n1. Hello World!\n2. Hello World2!\n3. Hello World!'
doc = nlp(text)
for sentence in doc.sents:
    print(sentence.text)
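
The snippet above uses the spaCy v2 `add_pipe` API, which takes the function itself. On spaCy v3 or later, components must be registered by name before they can be added, so the same idea would look roughly like the sketch below (the component name 'prevent_numbered_list_split' is just an illustrative label I chose, and `en_core_web_sm` is assumed to be installed):

import re

import spacy
from spacy.language import Language

boundary = re.compile('^[0-9]+$')

# Register the function as a named pipeline component (spaCy v3 API).
@Language.component('prevent_numbered_list_split')
def custom_seg(doc):
    if len(doc) == 0:
        return doc
    prev = doc[0].text
    for index, token in enumerate(doc):
        # Keep "1." attached to the text that follows it.
        if token.text == '.' and boundary.match(prev) and index != (len(doc) - 1):
            doc[index + 1].is_sent_start = False
        prev = token.text
    return doc

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('prevent_numbered_list_split', before='parser')

doc = nlp('This is first sentence.\nNext is numbered list.\n1. Hello World!\n2. Hello World2!\n3. Hello World!')
for sentence in doc.sents:
    print(sentence.text)

Either way, the boundary rule runs before the parser, so the parser treats the pre-set `is_sent_start` values as fixed and only decides the remaining sentence boundaries.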

Hope it helps!
