How to split a paragraph into sentences separated by periods (.), except when the period is part of an abbreviation?

Problem description

Consider this paragraph:

Conservation groups call the 20-year ban a crucial protection for an American icon. The mining industry and some Republican members of Congress say it is detrimental to Arizona's economy and the nation's energy independence. "Despite significant pressure from the mining industry, the president and Secretary Salazar did not back down," said Jane Danowitz, U.S. public lands director for the Pew Environment Group.

In the above, it's easy to split sentences on the period (.), but that gives incorrect results when it reaches the periods in an abbreviation such as U.S. or U.S.A. Assume I have a list of abbreviations such as the one below, and this is my attempt so far:

// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
// sx is the String holding the paragraph to split.
String[] abbrev = {"u.s.a", "u.a.e", "u.k", "p.r.c", "u.s.s.r"};
String regex = "\\.";
Pattern pattern = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(sx);
int beginIndex = 0;

// Check all occurrences of a period
while (matcher.find()) {
    System.out.print("Start index: " + matcher.start());
    System.out.print(" End index: " + matcher.end() + " ");

    String group = matcher.group();
    System.out.println("group: " + group);
    int dotIndex = group.indexOf(".");
    // Keep the period with the sentence it ends; the next sentence starts after it
    String sub = sx.substring(beginIndex, matcher.start() + dotIndex + 1);
    beginIndex = matcher.start() + dotIndex + 1;

    System.out.println(sub);
}

I could do a brute-force match against all the abbreviations around dotIndex. Is there a better approach?

Recommended answer

My best guess would be something like (?<!\.[a-zA-Z])\.(?![a-zA-Z]\.) which translates to:

(?<!\.[a-zA-Z])    # can't be preceded by a period followed by a single letter
\.
(?![a-zA-Z]\.)     # nor can it be followed by a letter and another period

Then you can perform the replace (or the split) from there.
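As a rough sketch of how that pattern could be applied in Java: the class name SentenceSplitter and the split helper below are made up for illustration, and the sample text is a shortened version of the paragraph from the question. It walks the matches with Matcher.find() and cuts after each period that the pattern accepts as a sentence end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
class SentenceSplitter {
    // The pattern from the answer: a period that is not part of a
    // dotted run of single letters such as "U.S." or "u.s.s.r".
    private static final Pattern SENTENCE_END =
            Pattern.compile("(?<!\\.[a-zA-Z])\\.(?![a-zA-Z]\\.)");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = SENTENCE_END.matcher(text);
        int begin = 0;
        while (m.find()) {
            // Cut just after the matched period so it stays with its sentence.
            String sentence = text.substring(begin, m.end()).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            begin = m.end();
        }
        // Keep any trailing text that has no closing period.
        String tail = text.substring(begin).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "The ban protects an American icon. "
                + "Jane Danowitz is U.S. public lands director for the Pew Environment Group.";
        for (String sentence : split(text)) {
            System.out.println(sentence);
        }
    }
}
```

Note that this only recognises abbreviations made of single letters separated by periods. Something like "Dr." would still be treated as a sentence end, and a sentence that ends with "U.S." would not be split there.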

This would take a lot more effort if you also needed to handle periods inside quotes, though, which the above pattern does not account for.
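If the abbreviation list from the question is available, another option is to let a single regex match either a known abbreviation or a bare period, and split only on the latter. This is a sketch under assumptions, not something the pattern above does: the class AbbreviationAwareSplitter and its helpers are hypothetical names, and the list is passed in by the caller. The list as posted has no "u.s" entry, so it would need one to protect the "U.S." in the sample paragraph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical class name, used only for this illustration.
class AbbreviationAwareSplitter {

    // Builds "known abbreviation | bare period" from the caller's list.
    static Pattern buildPattern(String[] abbrev) {
        String alternatives = Arrays.stream(abbrev)
                // Longest first, so "u.s.s.r" is tried before shorter entries.
                .sorted((a, b) -> Integer.compare(b.length(), a.length()))
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // Group 1 captures an abbreviation (its optional trailing period is
        // consumed too); the second branch is any other period.
        return Pattern.compile(
                "\\b(" + alternatives + ")(?![a-zA-Z])\\.?|\\.",
                Pattern.CASE_INSENSITIVE);
    }

    static List<String> split(String text, String[] abbrev) {
        List<String> sentences = new ArrayList<>();
        Matcher m = buildPattern(abbrev).matcher(text);
        int begin = 0;
        while (m.find()) {
            if (m.group(1) != null) {
                continue; // periods inside an abbreviation are not boundaries
            }
            String sentence = text.substring(begin, m.end()).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            begin = m.end();
        }
        String tail = text.substring(begin).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String[] abbrev = {"u.s.a", "u.a.e", "u.k", "p.r.c", "u.s.s.r"};
        String text = "He left the U.K. in May. She stayed in the U.S.S.R. for a year.";
        for (String sentence : split(text, abbrev)) {
            System.out.println(sentence);
        }
    }
}
```

The trade-off is the same as with the lookaround pattern: a sentence that ends with a listed abbreviation is not split at that point, because the final period is taken to belong to the abbreviation.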
