BreakIterator not working correctly with Chinese text

Question

I used BreakIterator.getWordInstance to split Chinese text into words. Here is my example:

import java.text.BreakIterator;
import java.util.Locale;

public class Sample {
    public static void main(String[] args) {
        String stringToExamine = "I like to eat apples. 我喜欢吃苹果。";

        //print each word in order
        BreakIterator boundary = BreakIterator.getWordInstance(new Locale("zh", "CN"));
        boundary.setText(stringToExamine);

        printEachForward(boundary, stringToExamine);
    }

    public static void printEachForward(BreakIterator boundary, String source) {
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
            System.out.println(start + ": " + source.substring(start, end));
        }
    }
}

My sample text is taken from https://stackoverflow.com/a/42219474/954439

The output I get is:

0: I
1:  
2: like
6:  
7: to
9:  
10: eat
13:  
14: apples
20: .
21:  
22: 我喜欢吃苹果
28: 。

whereas the expected output is:

0: I
1:  
2: like
6:  
7: to
9:  
10: eat
13:  
14: apples
20: .
21:  
22: 我
23: 喜欢
25: 吃
26: 苹果
28: 。

I even tried pure Chinese text, but the text is still split only at whitespace and punctuation characters.

I am programming for a server, so the jar file size is not a big concern. I am trying to find the number of words that differ between a given content and a sample content, using Longest Common Subsequence (but on words rather than characters).
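
For reference, the word-level comparison described above could look roughly like the following. This is only a sketch under assumptions: the class and method names (WordDiff, lcsLength, differingWords) are hypothetical, the word lists are assumed to come from some segmenter, and "number of differing words" is taken to mean the words of the given content that are not part of the longest common subsequence with the sample.

```java
import java.util.List;

class WordDiff {
    // Length of the longest common subsequence of two word lists,
    // computed with the classic O(n*m) dynamic-programming table.
    public static int lcsLength(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.size()][b.size()];
    }

    // Assumed definition: words in `given` that do not belong to the
    // common subsequence shared with `sample`.
    public static int differingWords(List<String> sample, List<String> given) {
        return given.size() - lcsLength(sample, given);
    }

    public static void main(String[] args) {
        // Word lists written out by hand; in practice they would come from a segmenter.
        List<String> sample = List.of("我", "喜欢", "吃", "苹果");
        List<String> given = List.of("我", "喜欢", "吃", "香蕉");
        System.out.println(differingWords(sample, given)); // prints 1
    }
}
```

The result depends entirely on how the text is segmented: if a whole Chinese sentence comes back as a single "word", any change inside it counts as one differing word.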

What am I doing wrong?

Recommended answer

The standard BreakIterator (java.text.BreakIterator) does not detect "word" boundaries within unbroken strings of CJK ideographs. There is a bug report on this subject, but it was closed in 2006 as "Won't Fix".

Instead, you'll need to use the ICU implementation. If you're developing on Android, you already have it as android.icu.text.BreakIterator. Otherwise, you'll need to download the ICU4J library from http://site.icu-project.org/download, which provides it as com.ibm.icu.text.BreakIterator.
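
As a rough sketch of that switch, assuming the ICU4J jar is on the classpath, the question's loop can be kept almost unchanged and only the import replaced. The class name IcuWordSplit and the helper segments are hypothetical names chosen for illustration, and the exact segmentation depends on the dictionary shipped with the ICU version in use.

```java
import com.ibm.icu.text.BreakIterator;

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class IcuWordSplit {
    // Collects every segment reported by ICU's word BreakIterator,
    // including whitespace and punctuation segments.
    public static List<String> segments(String text) {
        BreakIterator boundary = BreakIterator.getWordInstance(Locale.SIMPLIFIED_CHINESE);
        boundary.setText(text);

        List<String> result = new ArrayList<>();
        int start = boundary.first();
        for (int end = boundary.next(); end != BreakIterator.DONE; start = end, end = boundary.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        // Same sample string as in the question.
        for (String segment : segments("I like to eat apples. 我喜欢吃苹果。")) {
            System.out.println("[" + segment + "]");
        }
    }
}
```

On Android, the same code should work with the import changed to android.icu.text.BreakIterator.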
