Java library that finds sentence boundaries

Question

Does anyone know of a Java library that handles finding sentence boundaries? I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.

Here's my experience with BreakIterator:

Using the example here: I have the following Japanese:

今日はパソコンを買った。高性能のマックは早い!とても快適です。

In ASCII, written as Unicode escapes, it looks like this:

\ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002

Here's the part of that sample that I changed:

static void sentenceExamples() {

  Locale currentLocale = new Locale("ja", "JP");
  BreakIterator sentenceIterator =
     BreakIterator.getSentenceInstance(currentLocale);
  String someText = "今日はパソコンを買った。高性能のマックは早い!とても快適です。";
  // ... rest of the sample not shown
}

When I look at the boundary indices, I see this:

0|13|24|32

But those indices don't correspond to any sentence terminators.
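
For reference, BreakIterator reports boundaries as offsets between sentences, not as the indices of the terminator characters themselves, so each boundary falls just past a terminator. The escaped form of the text above begins with U+FEFF, which counts as one character; under that assumption, 13, 24 and 32 would be the positions immediately after 。, ! and 。. The following is a minimal sketch, not the original sample: the class name SentenceBoundaries and the helper split are made up for illustration, and the full-width exclamation mark is written as \uff01 to match the escaped form.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceBoundaries {

    // Hypothetical helper: returns the substrings that lie between
    // consecutive boundaries reported by the sentence BreakIterator.
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // [start, end) is one segment; 'end' is the offset just past it
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "今日はパソコンを買った。高性能のマックは早い\uff01とても快適です。";
        for (String s : split(text, Locale.JAPAN)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

Taking substring(start, end) for each pair of adjacent boundaries is usually the easiest way to check whether the iterator found the sentences you expect.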

Recommended answer

You wrote:

I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use.

A basic problem here is that sentence terminators depend on context. Consider:

How did Dr. Jones compute 5! without recursion?

This should be recognized as a single sentence, but if you just split on possible sentence terminators you will get three sentences.
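
To make the failure concrete, here is a small sketch of the naive approach. The class name NaiveSplit, the helper splitOnTerminators and the terminator set [.!?] are assumptions chosen for illustration only.

```java
import java.util.Arrays;
import java.util.List;

public class NaiveSplit {

    // Naive approach: break after every '.', '!' or '?' regardless of context.
    static List<String> splitOnTerminators(String text) {
        // The lookbehind keeps the terminator with the preceding piece;
        // \s* swallows the whitespace that follows it.
        return Arrays.asList(text.split("(?<=[.!?])\\s*"));
    }

    public static void main(String[] args) {
        String text = "How did Dr. Jones compute 5! without recursion?";
        for (String piece : splitOnTerminators(text)) {
            System.out.println(piece);
        }
        // prints:
        // How did Dr.
        // Jones compute 5!
        // without recursion?
    }
}
```

The abbreviation "Dr." and the factorial "5!" are both treated as sentence ends, which is exactly the context problem described above.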

So this is a more complex problem than it first appears. It can be approached using machine learning techniques. You could, for instance, look into the OpenNLP project, in particular the SentenceDetectorME class.
