How to split a Thai sentence, which does not use spaces, into words?

Question

How do I split a Thai sentence into words? In English, we can split words on spaces.

Example: I go to school becomes ['I', 'go', 'to', 'school'], splitting on spaces alone.

But Thai does not put spaces between words, so I don't know how to do this. For example, I want to read ฉันจะไปโรงเรียน from a txt file, split it into ['ฉัน', 'จะ', 'ไป', 'โรง', 'เรียน'], and write the result to another txt file.

Are there any programs or libraries that identify Thai word boundaries and split on them?

Recommended answer

In 2006, someone contributed code to the Apache Lucene project to make this work.

Their approach (written in Java) was to use the BreakIterator class, calling getWordInstance() with a Thai locale to get a dictionary-based word iterator for the Thai language. Note also that there is a stated dependency on the ICU4J project. I have pasted the relevant section of their code below:

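import java.io.IOException;
import java.lang.Character.UnicodeBlock;
import java.text.BreakIterator;
import java.util.Locale;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Written against the early Token-based Lucene API (Token.termText(), TokenStream.next());
// later Lucene releases expose a different TokenStream API.
public class ThaiWordFilter extends TokenFilter {

  private BreakIterator breaker = null;
  // Thai token currently being split into words; null when none is pending.
  private Token thaiToken = null;

  public ThaiWordFilter(TokenStream input) {
    super(input);
    // Word iterator for the Thai locale.
    breaker = BreakIterator.getWordInstance(new Locale("th"));
  }

  public Token next() throws IOException {
    // Still inside a Thai token: emit its next word, if any.
    if (thaiToken != null) {
      String text = thaiToken.termText();
      int start = breaker.current();
      int end = breaker.next();
      if (end != BreakIterator.DONE) {
        return new Token(text.substring(start, end), 
            thaiToken.startOffset()+start,
            thaiToken.startOffset()+end, thaiToken.type());
      }
      thaiToken = null;
    }
    // Fetch the next token from the wrapped stream.
    Token tk = input.next();
    if (tk == null) {
      return null;
    }
    String text = tk.termText();
    // Non-Thai tokens pass through unchanged apart from lowercasing.
    if (UnicodeBlock.of(text.charAt(0)) != UnicodeBlock.THAI) {
      return new Token(text.toLowerCase(), 
                       tk.startOffset(), 
                       tk.endOffset(), 
                       tk.type());
    }
    // Thai token: start iterating over it and return the first word.
    thaiToken = tk;
    breaker.setText(text);
    int end = breaker.next();
    if (end != BreakIterator.DONE) {
      return new Token(text.substring(0, end), 
          thaiToken.startOffset(), 
          thaiToken.startOffset()+end,
          thaiToken.type());
    }
    return null;
  }
}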
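
If you only need the segmentation and not a Lucene filter, the same BreakIterator idea can be used on its own. The following is a minimal sketch, not part of the Lucene code: the class name ThaiSegmenter and its two methods are hypothetical, and it assumes UTF-8 text files. Whether Thai text is actually cut into words depends on Thai break data being available in your Java runtime, and the boundaries you get come from that runtime's dictionary, so they may differ from the split shown in the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class for illustration; not part of Lucene or the JDK.
final class ThaiSegmenter {

    // Splits text at the word boundaries reported by BreakIterator,
    // dropping pieces that consist only of whitespace.
    static List<String> segment(String text, Locale locale) {
        BreakIterator breaker = BreakIterator.getWordInstance(locale);
        breaker.setText(text);
        List<String> words = new ArrayList<>();
        int start = breaker.first();
        for (int end = breaker.next(); end != BreakIterator.DONE;
                start = end, end = breaker.next()) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) {
                words.add(piece);
            }
        }
        return words;
    }

    // Reads a UTF-8 text file (assumed encoding) and writes each line back out
    // with its words separated by single spaces.
    static void segmentFile(Path in, Path out, Locale locale) throws IOException {
        List<String> result = new ArrayList<>();
        for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
            result.add(String.join(" ", segment(line, locale)));
        }
        Files.write(out, result, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Locale thai = Locale.forLanguageTag("th");
        // Prints the pieces found in the sentence from the question.
        System.out.println(segment("ฉันจะไปโรงเรียน", thai));
        // Optional: java ThaiSegmenter input.txt output.txt
        if (args.length == 2) {
            segmentFile(Paths.get(args[0]), Paths.get(args[1]), thai);
        }
    }
}
```

Punctuation marks are kept as separate pieces in this sketch; filter them out in segment() if you only want words.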
