getSentenceInstance and whitespace

Question

I am taking a text and breaking it into sentences, creating an array in which each item contains one complete sentence. I decided the best way to do this was to use the BreakIterator class. Here is the code I am using:

import java.text.BreakIterator;
import java.util.ArrayList;

// theSentences is a List<String> declared elsewhere
theSentences = new ArrayList<String>();
String myText = aString; // the text is produced through a text box
BreakIterator boundary = BreakIterator.getSentenceInstance();
boundary.setText(myText);
int start = boundary.first();
// Walk consecutive boundaries; each [start, end) span is one sentence
for (int end = boundary.next();
         end != BreakIterator.DONE;
         start = end, end = boundary.next())
{
    String temp = myText.substring(start, end);
    theSentences.add(temp.trim());
}

This works absolutely fine when the user remembers to include a space at the end of a sentence (which most people do). However, people do make mistakes when typing, and if they fail to put a space after the full stop, the code does not seem to realise that the end of the sentence has been reached. What can I do about this?

I do realise that I could use a regex instead, but it seems best to use BreakIterator, as this is what it was made for. Also, writing a regex that distinguishes full stops from all the other possible uses of a period makes my head hurt :-)

Recommended answer

Very little. Sentence splitting is not a task that can be done with 100% accuracy. I use Stanford CoreNLP myself; its ssplit annotator, which is part of the pipeline, does my sentence splitting. For simple tasks it is a huge jar that you probably do not want to download, but it shows how complicated this task is.

For a lightweight implementation of sentence splitting, it is best to implement a rule-based regular expression method.
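
As a rough illustration of what such a rule-based approach could look like, here is a minimal sketch. It is not taken from the answer: the class name RuleBasedSentenceSplitter, the tiny abbreviation list and the rule "a terminator followed by optional whitespace and an uppercase letter ends a sentence" are all assumptions chosen for the example. It handles a missing space after a full stop, but it will still get many real-world cases wrong (initials, quotations, sentences that start with a digit).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical rule-based splitter; a sketch, not a complete solution. */
final class RuleBasedSentenceSplitter {

    // Assumed, deliberately incomplete list of abbreviations that end in a period.
    private static final Set<String> ABBREVIATIONS =
            Set.of("mr", "mrs", "ms", "dr", "prof", "etc", "vs");

    // Candidate boundary: one or more terminators, optional whitespace,
    // and an uppercase letter coming next (checked by the lookahead only).
    private static final Pattern CANDIDATE =
            Pattern.compile("[.!?]+\\s*(?=\\p{Lu})");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        int start = 0;
        while (m.find()) {
            // Rule: do not break after a known abbreviation such as "Dr."
            if (endsWithAbbreviation(text, start, m.start())) {
                continue;
            }
            String sentence = text.substring(start, m.end()).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
            start = m.end();
        }
        // Whatever is left after the last boundary is the final sentence.
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    // True if the word directly before the terminator is a listed abbreviation.
    private static boolean endsWithAbbreviation(String text, int sentenceStart, int terminatorIndex) {
        int wordStart = terminatorIndex;
        while (wordStart > sentenceStart && Character.isLetter(text.charAt(wordStart - 1))) {
            wordStart--;
        }
        String word = text.substring(wordStart, terminatorIndex).toLowerCase(Locale.ROOT);
        return ABBREVIATIONS.contains(word);
    }
}
```

Calling RuleBasedSentenceSplitter.split("This is one.This is two. And three!") yields three items even though the first full stop has no space after it. Decimal numbers such as 3.50 are left alone because the character after the period is not an uppercase letter. Each extra rule tends to trade one kind of error for another, which is the point the answer makes about the task being hard.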
