Java simple sentence parser


Question

Is there a simple way to create a sentence parser in plain Java, without adding any libraries or JARs?

The parser should not just handle the blanks between words. It should be smarter: parse ". ! ?", recognize where a sentence ends, and so on.

After parsing, only the real words should be stored in a database or file, without any special characters.

Thanks a lot in advance :)

Recommended answer

You might want to start by looking at the BreakIterator class (java.text.BreakIterator), which is part of the standard library.

From the JavaDoc:


The BreakIterator class implements methods for finding the location of boundaries in text. Instances of BreakIterator maintain a current position and scan over text, returning the index of characters where boundaries occur. Internally, BreakIterator scans text using a CharacterIterator, and is thus able to scan text held by any object implementing that protocol. A StringCharacterIterator is used to scan String objects passed to setText.

You use the factory methods provided by this class to create instances of various types of break iterators. In particular, use getWordInstance, getLineInstance, getSentenceInstance, and getCharacterInstance to create BreakIterators that perform word, line, sentence, and character boundary analysis respectively. A single BreakIterator can work only on one unit (word, line, sentence, and so on). You must use a different iterator for each unit boundary analysis you wish to perform.

Line boundary analysis determines where a text string can be broken when line-wrapping. The mechanism correctly handles punctuation and hyphenated words.

Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks such as quotation marks and parentheses.
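As a rough illustration of sentence boundary analysis, the sketch below splits a string into sentences using only the standard library. The class name SentenceSplitter and the method split are made up for this example, and Locale.ENGLISH is an assumption about the input text. The boundaries come from the JDK's built-in rules, so unusual abbreviations may still be split in places you do not expect.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the JDK.
class SentenceSplitter {

    // Returns the sentences found in the text, with surrounding whitespace trimmed.
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH); // locale is an assumption
        it.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = it.first();
        // Each call to next() returns the index of the following boundary, or DONE at the end.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("Is it simple? Yes! It uses no extra jars.")) {
            System.out.println(s);
        }
    }
}
```

Each sentence returned this way still contains its punctuation; stripping it is a separate step that a word iterator can handle.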

Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double click. Word selection provides correct interpretation of punctuation marks within and following words. Characters that are not part of a word, such as symbols or punctuation marks, have word-breaks on both sides.
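Because punctuation and whitespace get word-breaks on both sides, the word iterator returns them as separate tokens, which makes it possible to keep only the "real words" the question asks for. The following is a minimal sketch under that idea. WordExtractor and words are hypothetical names, and treating a token as a word when its first character is a letter or digit is a simplifying assumption, not something the JavaDoc prescribes.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the JDK.
class WordExtractor {

    // Returns only the tokens that start with a letter or digit,
    // dropping whitespace and punctuation tokens.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH); // locale is an assumption
        it.setText(text);

        List<String> words = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            // Assumption: a "real word" begins with a letter or digit.
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! Is it done?"));
    }
}
```

The resulting list can then be written to a file or database as needed; that part is outside what BreakIterator does.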

Character boundary analysis allows users to interact with characters as they expect to, for example, when moving the cursor through a text string. Character boundary analysis provides correct navigation through character strings, regardless of how the character is stored. For example, an accented character might be stored as a base character and a diacritical mark. What users consider to be a character can differ between languages.
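To make the accented-character case concrete, the sketch below counts user-perceived characters with a character iterator and compares the result with String.length(). UserCharCounter and count are hypothetical names chosen for this example.

```java
import java.text.BreakIterator;

// Hypothetical helper class, not part of the JDK.
class UserCharCounter {

    // Counts the boundaries crossed by a character iterator,
    // i.e. the number of user-perceived characters.
    static int count(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int count = 0;
        it.first();
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "cafe" followed by U+0301 (combining acute accent): 5 chars, 4 user-perceived characters.
        String text = "cafe\u0301";
        System.out.println(text.length() + " chars, " + count(text) + " user characters");
    }
}
```

Here the base letter and its combining accent are stored as two char values but are treated as a single unit by the iterator.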

BreakIterator is intended for use with natural languages only. Do not use this class to tokenize a programming language.

See the demo: BreakIteratorDemo.java
