Splitting paragraphs into sentences with regexp and PHP


Question

I'm a regexp noob trying to split paragraphs into sentences. In my language we use quite a few abbreviations (like: bl.a.) in the middle of sentences, so I have concluded that I need to look for punctuation that is followed by a single space and then a word that starts with a capital letter, like:

[sentence1]...anymore. However...[sentence2]

A paragraph like this:

Der er en lang og bevæget forhistorie bag lov om varsling m.v. i forbindelse med afskedigelser af større omfang. Det er ikke en bureaukratisk lovtekst blandt så mange andre.

should end up in this output:

[0] => Der er en lang og bevæget forhistorie bag lov om varsling m.v. i forbindelse med afskedigelser af større omfang.
[1] => Det er ikke en bureaukratisk lovtekst blandt så mange andre.

and not this:

[0] => Der er en lang og bevæget forhistorie bag lov om varsling m.v. 
[1] => i forbindelse med afskedigelser af større omfang.
[2] => Det er ikke en bureaukratisk lovtekst blandt så mange andre.

I have found a solution that does the first part of this with the positive lookbehind feature:

$regexp = '/(?<=[.!?] | [.!?][\'"])/';

and then

$sentences = preg_split($regexp, $paragraph, -1, PREG_SPLIT_NO_EMPTY);

which is a great starting point, but it splits far too often because of the many abbreviations.

I tried this:

(?<=[.!?]\s[A-Z] | [.!?][\'"])

to target every occurrence of either

. or ! or ?

followed by a space and a capital letter, but that did not work.

Does anyone know if there is a way to accomplish what I am trying to do?

Recommended answer

Unicode RegExp for splitting sentences: (?<=[.?!;])\s+(?=\p{Lu})

An example is demonstrated here: http://regex101.com/r/iR7cC8
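To make the answer concrete, here is a minimal sketch of how that pattern might be used with preg_split(). The function name split_sentences is hypothetical, and the / delimiters and the u modifier are additions that the answer does not spell out; u is assumed here so that \p{Lu} also recognises non-ASCII capitals such as Æ, Ø and Å in UTF-8 input. The key difference from the attempt in the question is that the capital letter is checked in a lookahead rather than inside the lookbehind, so the split happens on the whitespace before the capital and nothing is consumed from the next sentence.

```php
<?php
// Hypothetical helper; the pattern itself is the one given in the answer.
function split_sentences(string $paragraph): array
{
    // Lookbehind: the whitespace must come right after ., ?, ! or ;
    // Lookahead: the next character must be an uppercase letter (\p{Lu}).
    // The u modifier (an assumption) treats pattern and subject as UTF-8.
    $regexp = '/(?<=[.?!;])\s+(?=\p{Lu})/u';

    $sentences = preg_split($regexp, $paragraph, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, e.g. for invalid UTF-8 input;
    // in that case fall back to returning the paragraph unsplit.
    return $sentences === false ? [$paragraph] : $sentences;
}

$paragraph = 'Der er en lang og bevæget forhistorie bag lov om varsling m.v. '
    . 'i forbindelse med afskedigelser af større omfang. '
    . 'Det er ikke en bureaukratisk lovtekst blandt så mange andre.';

print_r(split_sentences($paragraph));
```

With the paragraph from the question this gives the two sentences the asker wanted, because "m.v." is followed by a lowercase "i". Note that an abbreviation followed by a capitalised word (a proper noun, for instance) would still trigger a split, so a list of known abbreviations may be needed for stricter results.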
