Split string into sentences

Question

I have written this piece of code, which splits a string and stores the pieces in a string array:

String[] sSentence = sResult.split("[a-z]\\.\\s+");

I added the [a-z] because I wanted to deal with part of the abbreviation problem. But then my result looks like this:


Furthermore when Everett tried to instruct them in basic mathematics they proved unresponsiv

I see that I lose whatever matches the pattern passed to split. Losing the period is fine, but losing the last letter of the word changes its meaning.

Could someone help me with this? In addition, could someone help me deal with abbreviations? Because I split the string on periods, I do not want abbreviations to be broken apart.

Recommended answer

Parsing sentences is far from a trivial task, even for languages written in the Latin script, such as English. A naive approach like the one you outline in your question will fail often enough to be useless in practice.
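To make that failure mode concrete, here is a minimal sketch of the naive rule "a period followed by whitespace ends a sentence". The class name NaiveSplit and its split helper are hypothetical, introduced only for illustration. The sample text is the one used in the BreakIterator example in this answer.

```java
class NaiveSplit {
    // Hypothetical helper: treats every period followed by whitespace as a sentence end.
    static String[] split(String text) {
        return text.split("\\.\\s+");
    }

    public static void main(String[] args) {
        String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
        // Prints five pieces instead of three sentences:
        // "T.L.A. test" and "Dr. in it." are each cut in two.
        for (String piece : split(source)) {
            System.out.println(piece);
        }
    }
}
```

Both abbreviations are broken because the rule cannot tell an abbreviation's period from a sentence-ending one.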

A better approach is to use java.text.BreakIterator:

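import java.text.BreakIterator;
import java.util.Locale;

// Sentence-boundary iterator for US English
BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
iterator.setText(source);
int start = iterator.first();
// Each next() returns the following boundary; DONE marks the end of the text
for (int end = iterator.next();
    end != BreakIterator.DONE;
    start = end, end = iterator.next()) {
  System.out.println(source.substring(start, end));
}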

This yields the following result:

  1. This is a test.

  2. This is a T.L.A. test.

  3. Now with a Dr. in it.
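On the narrower problem of the lost last letter: String.split discards everything the pattern matches, and in the question's regex the [a-z] is part of the match. A lookbehind, (?<=[a-z]), checks that a lowercase letter precedes the period without consuming it. The following is a minimal sketch, not a full sentence splitter. The class name LookbehindSplit, the splitSentences helper, and the sample text are assumptions made for illustration.

```java
class LookbehindSplit {
    // Hypothetical helper: splits on a period plus whitespace, but only when a
    // lowercase letter comes right before the period. The lookbehind asserts
    // the letter is there without making it part of the discarded match.
    static String[] splitSentences(String text) {
        return text.split("(?<=[a-z])\\.\\s+");
    }

    public static void main(String[] args) {
        String text = "They proved unresponsive. Everett tried again.";
        // The first piece keeps its final letter: "They proved unresponsive"
        for (String sentence : splitSentences(text)) {
            System.out.println(sentence);
        }
    }
}
```

This only fixes the missing letter. It still splits after abbreviations that end in a lowercase letter, such as "Dr.", so BreakIterator remains the more robust choice.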
