Tokenize a Thai sentence with ICUTokenizer in Java


Question

I am trying the code below to get all the tokens from a Thai sentence, but it throws an exception. Can anyone point me to a way to tokenize Thai in Java?

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer.TokenStreamComponents;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.ICUNormalizer2Filter;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class Tokenizer{

    public static void main(String[] args) throws IOException {
        ICUTokenizer tokenizer = new ICUTokenizer(new StringReader("การที่ได้ต้องแสดงว่างานดี"));
        TokenFilter filter = new ICUNormalizer2Filter(tokenizer);
        TokenStreamComponents tt = new TokenStreamComponents(tokenizer, filter);
        TokenStream ts = tt.getTokenStream();
        CharTermAttribute cattr  = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while(ts.incrementToken()){
            System.out.println(cattr.toString()+"-----");
        }
    }
}

The exception is as follows:

Exception in thread "main" java.lang.ExceptionInInitializerError
    at org.apache.lucene.analysis.icu.segmentation.ICUTokenizer.<init>(ICUTokenizer.java:72)
    at com.tokenizer.tt.main(tt.java:22)
Caused by: java.lang.RuntimeException: java.io.IOException: ICU data file error: Not an ICU data file
    at org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig.readBreakIterator(DefaultICUTokenizerConfig.java:128)
    at org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig.<clinit>(DefaultICUTokenizerConfig.java:66)
    ... 2 more
Caused by: java.io.IOException: ICU data file error: Not an ICU data file
    at com.ibm.icu.impl.ICUBinary.readHeader(ICUBinary.java:577)
    at com.ibm.icu.text.RBBIDataWrapper.get(RBBIDataWrapper.java:173)
    at com.ibm.icu.text.RuleBasedBreakIterator.getInstanceFromCompiledRules(RuleBasedBreakIterator.java:71)
    at org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig.readBreakIterator(DefaultICUTokenizerConfig.java:123)
    ... 3 more
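
Note: the trace shows the failure happening while `DefaultICUTokenizerConfig` loads its precompiled break-iterator rules through ICU4J. One possible cause, which is an assumption here and not confirmed anywhere in this thread, is an ICU4J jar on the classpath that does not match the version the Lucene ICU analysis module expects. A minimal sketch for checking which jars the two classes are actually loaded from, using only the JDK (the class name `ClasspathProbe` and the method `locate` are hypothetical helpers, not part of Lucene or ICU4J):

```java
import java.security.CodeSource;

public class ClasspathProbe {

    // Returns the jar or directory a class is loaded from, or a short
    // placeholder when the class is missing or has no code source.
    public static String locate(String className) {
        try {
            // initialize=false: only load the class, do not run static initializers
            Class<?> c = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            return src == null ? "(bootstrap or unknown location)" : src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Class names taken from the stack trace above.
        System.out.println("ICU4J:  " + locate("com.ibm.icu.text.RuleBasedBreakIterator"));
        System.out.println("Lucene: " + locate("org.apache.lucene.analysis.icu.segmentation.ICUTokenizer"));
    }
}
```

If the printed jar names show versions that were not meant to be used together, aligning them through the build tool's dependency management would be the first thing to try.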

Recommended answer

Finally figured out how to use ICU4J in a Java program:

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

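public class icuEstes {

    public static void main(String[] args) throws IOException {
        Reader reader = new StringReader("การที่ได้ต้องแสดงว่างานดี  This is a test ກວ່າດອກ");
        // The no-argument constructor uses the default ICU tokenizer configuration.
        ICUTokenizer icut = new ICUTokenizer();
        icut.setReader(reader);
        CharTermAttribute term = icut.addAttribute(CharTermAttribute.class);
        icut.reset();  // must be called before the first incrementToken()
        while (icut.incrementToken()) {
            System.out.println(icut.toString());  // debug dump of the current attribute state
            System.out.println(term);             // the token text
        }
        icut.end();    // finish the stream before closing it
        icut.close();
    }
}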
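
For comparison, here is a sketch of the same idea using only the JDK's `java.text.BreakIterator` instead of Lucene and ICU4J. This is a different technique from the answer above, not a drop-in replacement: how Thai text is split depends on the locale data shipped with the JDK in use, and the results may differ from what `ICUTokenizer` produces. The class name `JdkThaiTokens` and the method `tokenize` are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class JdkThaiTokens {

    // Splits text at the word boundaries reported by the JDK's BreakIterator
    // and drops segments that consist only of whitespace.
    public static List<String> tokenize(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Locale thai = Locale.forLanguageTag("th");
        for (String token : tokenize("การที่ได้ต้องแสดงว่างานดี  This is a test", thai)) {
            System.out.println(token + "-----");
        }
    }
}
```

Because the segments are taken between consecutive boundaries, concatenating them for a string without whitespace gives back the original text, which makes for a quick sanity check regardless of where the boundaries fall.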
