How to have ngram tokenizer in lucene 4.0?


Problem Description


I am indexing a large text file whose text contains no spaces. Currently I have an ngram method that generates strings of length 12, which I then index. Searching works the same way: I take the string from the user, generate 12-grams from it, and use those to build the query. While researching this, I read about the ngram tokenizer present in Lucene, but couldn't find any examples.
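The manual approach the question describes (sliding a fixed 12-character window across the text) can be sketched in plain Java. The class and method names below are illustrative, not from the question's actual code, and a shorter gram length is used so the example is easy to follow:

```java
import java.util.ArrayList;
import java.util.List;

public class NGramDemo {
    // Return every substring of length n, stepping one character at a time.
    // This is the hand-rolled character n-gram generation the question describes.
    static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints the 3-grams of a short sample string.
        System.out.println(ngrams("abcde", 3));
    }
}
```

Applying the same method to both the indexed text and the user's query string (with n = 12, as in the question) is what makes the substring matches line up at search time.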


How do I implement an ngram tokenizer in Lucene 4.0?

Recommended Answer


Probably the simplest way to use NGramTokenizer is with the constructor that just takes a Reader plus the minimum and maximum gram sizes. You can incorporate it into an Analyzer, similar to the example in the Analyzer docs. Note that the filter class is LowerCaseFilter, and in Lucene 4.0 its constructor takes the match version as its first argument. Something like:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.util.Version;

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new NGramTokenizer(reader, 12, 12); // fixed-length 12-grams
    TokenStream filter = new LowerCaseFilter(Version.LUCENE_40, source);
    return new TokenStreamComponents(source, filter);
  }
};

