Split a text into sentences

Question

How can I split a text into an array of sentences?

Example text:

Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End

Should output:

0 => Fry me a Beaver.
1 => Fry me a Beaver!
2 => Fry me a Beaver?
3 => Fry me Beaver no. 4?!
4 => Fry me many Beavers...
5 => End

I tried some solutions that I found on SO through search, but they all fail, especially at the 4th sentence.

/(?<=[!?.])./

/\.|\?|!/

/((?<=[a-z0-9)][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])/

/(?<=[.!?]|[.!?][\'"])\s+/    // <- closest one
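To make the failure concrete, here is a minimal check of the last (closest) pattern against the example text. It is only a sketch of what that attempt produces; the variable names are chosen for illustration.

```php
<?php
// The example text from the question.
$str = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';

// The "closest" pattern: split on whitespace that is preceded by ., ! or ?
// (optionally followed by a quote character).
$parts = preg_split('/(?<=[.!?]|[.!?][\'"])\s+/', $str);

print_r($parts);
// "no." ends with a period, so the whitespace before "4?!" is also
// treated as a boundary and the 4th sentence is cut in two.
```

The result has seven elements instead of six, because "Fry me Beaver no." and "4?!" end up as separate entries.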

Recommended answer

Since you want to "split" sentences, why are you trying to match them?

For this case, let's use preg_split().

Code:

$str = 'Fry me a Beaver. Fry me a Beaver! Fry me a Beaver? Fry me Beaver no. 4?! Fry me many Beavers... End';
$sentences = preg_split('/(?<=[.?!])\s+(?=[a-z])/i', $str);
print_r($sentences);

Output:

Array
(
    [0] => Fry me a Beaver.
    [1] => Fry me a Beaver!
    [2] => Fry me a Beaver?
    [3] => Fry me Beaver no. 4?!
    [4] => Fry me many Beavers...
    [5] => End
)

Explanation:

Well, to put it simply, we are splitting by grouped space(s) \s+ and doing two things:

  1. (?<=[.?!]) Positive lookbehind assertion: we check that there is a period, question mark, or exclamation mark immediately before the space.

  2. (?=[a-z]) Positive lookahead assertion: we check that there is a letter after the space (the i modifier makes this match uppercase letters too). This is kind of a workaround for the "no. 4" problem.
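Since the lookahead is a workaround rather than real abbreviation handling, it helps to see what it does on a few other inputs. The sketch below reuses the pattern from the answer; the extra test strings are made up for illustration and are not from the original question.

```php
<?php
// Pattern from the answer: whitespace preceded by . ? or ! and followed by a letter.
$pattern = '/(?<=[.?!])\s+(?=[a-z])/i';

// A digit after the space blocks the split, which keeps "no. 4" together.
$a = preg_split($pattern, 'Fry me Beaver no. 4?! End');

// The same rule also blocks a real boundary when the next sentence starts with a digit.
$b = preg_split($pattern, 'It costs 5 dollars. 4 people paid.');

// An abbreviation followed by a letter is still split.
$c = preg_split($pattern, 'Dr. Smith arrived. He sat down.');

print_r([$a, $b, $c]);
```

Here $a splits into two sentences as intended, $b stays as a single element, and $c is split after "Dr.". So the pattern fits the example text well, but text with abbreviations or sentences that begin with numbers would likely need additional rules.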
