Splitting text into sentences and sentences into words: BreakIterator vs regular expressions


Question

I accidentally answered a question where the original problem involved splitting a sentence into separate words.

The author suggested using BreakIterator to tokenize input strings, and some people liked the idea.

I just don't get that madness: how can 25 lines of complicated code be better than a simple regex one-liner?

Please explain the advantages of using BreakIterator and the real cases where it should be used.

If it's really so cool and proper, then I wonder: do you actually use the BreakIterator approach in your projects?

Recommended answer

Judging from the code posted in that answer, BreakIterator takes the language and locale of the text into account. Getting that level of support with a regex would surely be a considerable pain. Perhaps that is the main reason it is preferred over a simple regex?
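
To make the comparison concrete, here is a minimal sketch of both approaches using `java.text.BreakIterator` from the standard library. It is not the code from the answer mentioned above, which is not reproduced here; the class name `BreakIteratorSketch` and the helper names `sentences`, `words`, and `wordsByRegex` are made up for illustration, and the rule "keep a segment only if it starts with a letter or digit" is one assumed way of discarding punctuation and whitespace segments.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class BreakIteratorSketch {

    // Split text into sentences using the sentence-boundary rules for the locale.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    // Split text into words using the word-boundary rules for the locale.
    // The iterator also reports punctuation and whitespace as segments,
    // so (as an assumption of this sketch) only segments that start with
    // a letter or digit are kept.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            if (Character.isLetterOrDigit(token.codePointAt(0))) {
                out.add(token);
            }
        }
        return out;
    }

    // The regex one-liner from the question's point of view.
    // By default \W in Java matches anything outside [a-zA-Z_0-9].
    static String[] wordsByRegex(String text) {
        return text.split("\\W+");
    }

    public static void main(String[] args) {
        String english = "Hello, world! How are you?";
        String french = "caf\u00e9 au lait"; // "café au lait"

        System.out.println(sentences(english, Locale.ENGLISH));
        System.out.println(words(english, Locale.ENGLISH));
        System.out.println(words(french, Locale.FRENCH));
        System.out.println(Arrays.toString(wordsByRegex(french)));
    }
}
```

The regex version is shorter, but its default `\W` class is ASCII-only, so an accented letter such as "é" is treated as a separator and the word is cut apart. A regex can be made Unicode-aware (for example with `Pattern.UNICODE_CHARACTER_CLASS`), but locale-specific boundary rules still have to be written by hand, which is the trade-off the answer points at.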
