PHP sentence boundary detection

Question

I would like to divide a text into sentences in PHP. I'm currently using a regex, which brings ~95% accuracy and would like to improve by using a better approach. I've seen NLP tools that do that in Perl, Java, and C but didn't see anything that fits PHP. Do you know of such a tool?

Recommended answer

An enhanced regex solution

Assuming you do care about handling abbreviations such as Mr. and Mrs., the following single-regex solution works pretty well:

<?php // test.php Rev:20160820_1800
$split_sentences = '%(?#!php/i split_sentences Rev:20160820_1800)
    # Split sentences on whitespace between them.
    # See: http://stackoverflow.com/a/5844564/433790
    (?<=          # Sentence split location preceded by
      [.!?]       # either an end of sentence punct,
    | [.!?][\'"]  # or end of sentence punct and quote.
    )             # End positive lookbehind.
    (?<!          # But don\'t split after these:
      Mr\.        # Either "Mr."
    | Mrs\.       # Or "Mrs."
    | Ms\.        # Or "Ms."
    | Jr\.        # Or "Jr."
    | Dr\.        # Or "Dr."
    | Prof\.      # Or "Prof."
    | Sr\.        # Or "Sr."
    | T\.V\.A\.   # Or "T.V.A."
                 # Or... (you get the idea).
    )             # End negative lookbehind.
    \s+           # Split on whitespace between sentences,
    (?=\S)        # (but not at end of string).
    %xi';  // End $split_sentences.

$text = 'This is sentence one. Sentence two! Sentence thr'.
        'ee? Sentence "four". Sentence "five"! Sentence "'.
        'six"? Sentence "seven." Sentence \'eight!\' Dr. '.
        'Jones said: "Mrs. Smith you have a lovely daught'.
        'er!" The T.V.A. is a big project! '; // Note ws at end.

$sentences = preg_split($split_sentences, $text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
    printf("Sentence[%d] = [%s]\n", $i + 1, $sentences[$i]);
}
?>

Note that you can easily add or take away abbreviations from the expression. Given the following test paragraph:

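This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!" The T.V.A. is a big project!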

Here is the output from the script:

Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "four".]
Sentence[5] = [Sentence "five"!]
Sentence[6] = [Sentence "six"?]
Sentence[7] = [Sentence "seven."]
Sentence[8] = [Sentence 'eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The T.V.A. is a big project! ]
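
Since abbreviations can be added to or removed from the expression, one option is to generate the negative lookbehind from an array instead of editing the regex by hand. The following is a minimal sketch, not part of the original answer: the function name build_sentence_splitter and the sample text are hypothetical. The \b word-boundary prefix is also an added assumption, so that ordinary words that merely end in the same letters (for example "items." versus "Ms." under the i flag) still end a sentence.

```php
<?php
// Hypothetical helper (not from the original answer): builds the split
// pattern from a list of abbreviations that must not end a sentence.
function build_sentence_splitter(array $abbreviations): string
{
    // Quote each abbreviation so "." and the "%" delimiter match literally,
    // and require a word boundary in front of it (an added assumption).
    $alternatives = array_map(
        function (string $abbr): string {
            return '\b' . preg_quote($abbr, '%');
        },
        $abbreviations
    );

    // An empty "(?<!)" would never allow a split, so omit it for an empty list.
    $lookbehind = $alternatives ? '(?<!' . implode('|', $alternatives) . ')' : '';

    return '%(?<=[.!?]|[.!?][\'"])'  // after end punctuation, optionally quoted
         . $lookbehind               // but not right after a listed abbreviation
         . '\s+(?=\S)%i';            // split on whitespace, not at end of string
}

$pattern = build_sentence_splitter(['Mr.', 'Mrs.', 'Ms.', 'Jr.', 'Dr.', 'Prof.', 'Sr.', 'T.V.A.']);
$sentences = preg_split(
    $pattern,
    'He sold the items. Ms. Lee bought them. Done!',
    -1,
    PREG_SPLIT_NO_EMPTY
);
print_r($sentences); // three elements: "He sold the items.", "Ms. Lee bought them.", "Done!"
```

Keeping the list in an array makes it straightforward to load abbreviations from a file or configuration. Each entry still has to be a fixed-length literal, because PCRE lookbehind alternatives must each have a fixed length.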

The author of the question commented that the above solution "overlooks many options" and is not generic enough. I'm not sure what that means, but the essence of the above expression is about as clean and simple as you can get. Here it is:

$re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

Note that both solutions correctly identify sentences ending with a quotation mark after the ending punctuation. If you don't care about matching sentences ending in a quotation mark the regex can be simplified to just: /(?<=[.!?])\s+(?=\S)/.
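
To make the difference between the quote-aware regex and the simplified one concrete, here is a small sketch that runs both on the same input. The sample text and variable names are illustrative assumptions, not taken from the original answer.

```php
<?php
$text = 'He said "stop." Then he left. Bye!';

// Quote-aware: a split is allowed after [.!?] or after [.!?] followed by a quote.
$with_quotes = preg_split('/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/', $text, -1, PREG_SPLIT_NO_EMPTY);

// Simplified: a split is only allowed directly after [.!?].
$simplified = preg_split('/(?<=[.!?])\s+(?=\S)/', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($with_quotes); // 'He said "stop."', 'Then he left.', 'Bye!'
print_r($simplified);  // 'He said "stop." Then he left.', 'Bye!'
```

With the simplified pattern, the whitespace after the closing quote is not a split point, so the quoted sentence is merged with the one that follows it.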

20130820_1000 Added T.V.A. (another punctuated word to be ignored) to regex and test string. (to answer PapyRef's comment question)

20130820_1800 Tidied and renamed regex and added shebang. Also fixed regexes to prevent splitting text on trailing whitespace.
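
The effect of the trailing-whitespace fix can be seen by comparing a pattern with and without the (?=\S) lookahead. This is a minimal sketch with an assumed two-sentence input. PREG_SPLIT_NO_EMPTY is deliberately left out so the empty element is visible.

```php
<?php
$text = 'One. Two. '; // note the trailing space

// Without the lookahead, the trailing whitespace is itself a split point,
// which yields an empty final element unless PREG_SPLIT_NO_EMPTY is set.
$without = preg_split('/(?<=[.!?])\s+/', $text);

// With (?=\S), whitespace at the end of the string is never a split point,
// so it stays attached to the last sentence.
$with = preg_split('/(?<=[.!?])\s+(?=\S)/', $text);

var_dump($without); // 'One.', 'Two.', ''
var_dump($with);    // 'One.', 'Two. '

// Optional clean-up: strip the whitespace left on the last sentence.
$trimmed = array_map('trim', $with);
```

Because the lookahead leaves the trailing whitespace on the final element, trimming each sentence afterwards is a reasonable follow-up if exact strings matter.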
