Unicode character regular expression, capture groups

Views: 180 | Published: 2018/12/19 22:32:58 | Tags: java regex unicode


Question

I have a regular expression, \p{L}\p{M}*, which I use to split words into characters. This is particularly needed for Hindi or Thai words, where a single character can itself be made up of multiple 'characters', such as मछली. If I split it the regular way in Java I get

[म][छ][ल][ी]

whereas I want

[म][छ][ली]
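To make the difference concrete, here is a minimal sketch that splits the sample word once per code point and once with the \p{L}\p{M}* pattern. The class name SplitDemo and its method names are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the original post.
class SplitDemo {

    // Splits text into one string per Unicode code point.
    static List<String> byCodePoint(String text) {
        List<String> parts = new ArrayList<>();
        text.codePoints().forEach(cp -> parts.add(new String(Character.toChars(cp))));
        return parts;
    }

    // Splits text into a letter plus any combining marks that follow it.
    static List<String> byLetterAndMarks(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile("\\p{L}\\p{M}*").matcher(text);
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        String word = "मछली";
        System.out.println(byCodePoint(word));
        System.out.println(byLetterAndMarks(word));
    }
}
```

For the sample word, the first method yields four pieces and the second yields three, with the vowel sign ी kept attached to ल.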

I have been trying to improve this regular expression to include space characters as well, so that when I split फार्म पशु I get the following groups:

[फा][र्][म][ ][प][शु]

But I haven't had any luck. Would anyone be able to help me out?

Also, if anyone has another way of doing this in Java, that would be a welcome alternative solution. My current Java code is:

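import java.util.regex.Matcher;
import java.util.regex.Pattern;

// word is the input String and characters is a List<String>, both declared elsewhere.
// Match one letter followed by any number of combining marks.
Pattern pat = Pattern.compile("\\p{L}\\p{M}*");
Matcher matcher = pat.matcher(word);
while (matcher.find()) {
    characters.add(matcher.group());
}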

Recommended answer

Consider using BreakIterator:

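import java.text.BreakIterator;
import java.util.Locale;

String text = "मछली";
Locale hindi = new Locale("hi", "IN");
// A character instance reports boundaries between user-perceived characters.
BreakIterator breaker = BreakIterator.getCharacterInstance(hindi);
breaker.setText(text);
int start = breaker.first();
// Walk from boundary to boundary, printing the text between each pair.
for (int end = breaker.next();
     end != BreakIterator.DONE;
     start = end, end = breaker.next()) {
    System.out.println(text.substring(start, end));
}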
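The loop above can also be wrapped in a small helper that collects the pieces into a list, which is the shape the code in the question works with. This is only a sketch: the class name CharacterSplitter and the method split are hypothetical, and Locale.forLanguageTag is used here as an alternative way of building the same Hindi locale.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; the names are illustrative only.
class CharacterSplitter {

    // Returns the substrings between consecutive character boundaries
    // reported by BreakIterator for the given locale.
    static List<String> split(String text, Locale locale) {
        List<String> characters = new ArrayList<>();
        BreakIterator breaker = BreakIterator.getCharacterInstance(locale);
        breaker.setText(text);
        int start = breaker.first();
        for (int end = breaker.next();
             end != BreakIterator.DONE;
             start = end, end = breaker.next()) {
            characters.add(text.substring(start, end));
        }
        return characters;
    }

    public static void main(String[] args) {
        // Where the boundaries fall depends on the BreakIterator implementation in use.
        System.out.println(split("फार्म पशु", Locale.forLanguageTag("hi-IN")));
    }
}
```

Because boundary rules can differ between Java versions and between the JDK and ICU4J, it is worth checking the output on the target runtime rather than assuming a particular grouping.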

I tested the sample string using the Oracle Java 8 implementation. If required, also consider the ICU4J version of BreakIterator.
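For the regular-expression route the question asks about, one possible approach (a sketch, not part of the original answer) is to add whitespace as a second alternative to the pattern, so that each whitespace character becomes a group of its own. The class name WhitespaceAwareSplit and its method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class WhitespaceAwareSplit {

    // Either a letter followed by any combining marks, or a single whitespace character.
    // Assumption: \s without extra flags is enough, which covers ASCII whitespace only.
    private static final Pattern GROUPS = Pattern.compile("\\p{L}\\p{M}*|\\s");

    static List<String> split(String text) {
        List<String> characters = new ArrayList<>();
        Matcher matcher = GROUPS.matcher(text);
        while (matcher.find()) {
            characters.add(matcher.group());
        }
        return characters;
    }

    public static void main(String[] args) {
        System.out.println(split("फार्म पशु"));
    }
}
```

For फार्म पशु this produces the six groups listed in the question: [फा][र्][म][ ][प][शु]. Anything that is neither a letter with its marks nor whitespace (digits or punctuation, for example) is skipped by this pattern, so further alternatives would be needed if such characters matter.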
