Fieldable.tokenStreamValue() returns null for tokenized field

Problem description

I use Lucene for N-gram matching. I set a field to be analyzed using an N-gram analyzer, and I want to see what the tokens resulting from the analysis look like, to make sure the n-grams are being computed correctly.

If I call the method Fieldable.tokenStreamValue() on the analyzed field of a document, I get null, while calling Fieldable.isTokenized() returns true.

I should add that the query results are consistent with the n-grams being generated correctly.

Is there any explanation for this? I am essentially trying to do what is mentioned here: How can I read a Lucene document field tokens after they are analyzed?

Here is the complete code:

// Uses Lucene 3.x core plus the contrib analyzers jar (for NGramTokenizer).
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TestLuceneNgram {

// Builds a BooleanQuery with one SHOULD clause per n-gram of the query term.
public static class NGramQuery extends BooleanQuery {

    public NGramQuery(final String queryTerm) throws IOException {

        StringReader strReader = new StringReader(queryTerm);
        TokenStream tokens = new NGramTokenizer(strReader);

        CharTermAttribute termAtt = tokens.addAttribute(CharTermAttribute.class);

        while (tokens.incrementToken()) {
            System.out.println(termAtt);
            Term t = new Term("NGRAM_FIELD", termAtt.toString());
            add(new TermQuery(t), BooleanClause.Occur.SHOULD);

        }

    }
}

public static class NGramSearcher extends IndexSearcher {

    public NGramSearcher(final Directory directory)
            throws CorruptIndexException, IOException {
        super(IndexReader.open(directory));
    }

    public TopDocs search(final String term) {
        try {
            return search(new NGramQuery(term), 10);
        } catch (IOException e) {
            e.printStackTrace();
        }

        return null;
    }
}

// Analyzer that breaks every field value into n-grams.
public static class SubWordAnalyzer extends Analyzer {

    @Override
    public TokenStream tokenStream(final String fieldName,
            final Reader reader) {
        return new NGramTokenizer(reader);
    }

}

public static Directory index(final String[] terms) {

    Directory indexDirectory = new RAMDirectory();

    IndexWriter indexWriter = null;
    try {
        indexWriter = new IndexWriter(indexDirectory,
                new IndexWriterConfig(Version.LUCENE_32,
                        new SubWordAnalyzer()));
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (LockObtainFailedException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

    for (int i = 0; i < terms.length; ++i) {
        Document doc = new Document();
        doc.add(new Field("NGRAM_FIELD", terms[i], Field.Store.YES,
                Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        doc.add(new Field("ORIGINAL_FIELD", terms[i], Field.Store.YES,
                Field.Index.NOT_ANALYZED));

        try {
            indexWriter.addDocument(doc);
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    try {
        indexWriter.optimize();
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    try {
        indexWriter.close();
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

    return indexDirectory;
}

/**
 * @param args
 */
public static void main(final String[] args) {

    String[] terms = new String[] { "the first string", "the second one" };

    Directory dir = index(terms);

    NGramSearcher ngs = null;
    try {
        ngs = new NGramSearcher(dir);
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

    TopDocs td = ngs.search("second");
    System.out.println(td.totalHits);

    for (ScoreDoc sd : td.scoreDocs) {
        System.out.println(sd.doc + "---" + sd.score);
        try {
            // This prints null, even though isTokenized() returns true.
            System.out.println(ngs.doc(sd.doc).getFieldable("NGRAM_FIELD")
                    .tokenStreamValue());

        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

}

Recommended answer

The first thing to check is whether you are actually storing this field at index time. If you're just indexing it, this is the expected result.
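
As far as I can tell, tokenStreamValue() is only non-null on a Field object that was constructed directly from a TokenStream; a Document fetched back from a searcher is rebuilt from the stored string values, so its fields never carry a token stream. Since the code above indexes NGRAM_FIELD with Field.TermVector.WITH_POSITIONS_OFFSETS, one way to inspect the indexed tokens is to read the term vector back instead. A minimal sketch, assuming the Directory built by index() above (the variable names here are illustrative, not part of the original code):

// Sketch: read the indexed n-grams back via term vectors (Lucene 3.x).
// Requires org.apache.lucene.index.TermFreqVector in addition to the
// imports above.
IndexReader reader = IndexReader.open(dir);
try {
    for (int docId = 0; docId < reader.maxDoc(); ++docId) {
        // Non-null because NGRAM_FIELD was indexed WITH_POSITIONS_OFFSETS.
        TermFreqVector tfv = reader.getTermFreqVector(docId, "NGRAM_FIELD");
        if (tfv == null) {
            continue; // no term vector stored for this document/field
        }
        String[] grams = tfv.getTerms();         // the indexed n-grams
        int[] freqs = tfv.getTermFrequencies();  // per-document frequencies
        for (int i = 0; i < grams.length; ++i) {
            System.out.println(docId + ": " + grams[i] + " x " + freqs[i]);
        }
    }
} finally {
    reader.close();
}

Alternatively, you can re-run SubWordAnalyzer over the stored string value and iterate the resulting TokenStream, the same way the NGramQuery constructor already does for the query term.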
