Fieldable.tokenStreamValue() returns null for tokenized field
Problem description
I use Lucene for N-gram matching. I set a field to be analyzed with an N-gram analyzer, and I want to see what the tokens produced by the analysis look like, to make sure the n-grams are being computed correctly.
If I call Fieldable.tokenStreamValue() on the analyzed field of a document, I get null, while calling Fieldable.isTokenized() returns true.
I should add that the query results are consistent with the n-grams being generated correctly.
Is there an explanation for this? I am essentially trying to do what is described here: How can I read a Lucene document field tokens after they are analyzed?
Here is the full code:
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TestLuceneNgram {

    // Builds a disjunction of term queries, one per n-gram of the query term.
    public static class NGramQuery extends BooleanQuery {
        public NGramQuery(final String queryTerm) throws IOException {
            StringReader strReader = new StringReader(queryTerm);
            TokenStream tokens = new NGramTokenizer(strReader);
            CharTermAttribute termAtt = (CharTermAttribute) tokens
                    .addAttribute(CharTermAttribute.class);
            while (tokens.incrementToken()) {
                System.out.println(termAtt);
                Term t = new Term("NGRAM_FIELD", termAtt.toString());
                add(new TermQuery(t), BooleanClause.Occur.SHOULD);
            }
        }
    }

    public static class NGramSearcher extends IndexSearcher {
        public NGramSearcher(final Directory directory)
                throws CorruptIndexException, IOException {
            super(IndexReader.open(directory));
        }

        public TopDocs search(final String term) {
            try {
                return search(new NGramQuery(term), 10);
            } catch (IOException e) {
                e.printStackTrace();
            }
            return null;
        }
    }

    // Analyzer that splits the field value into n-grams.
    public static class SubWordAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(final String fieldName,
                final Reader reader) {
            return new NGramTokenizer(reader);
        }
    }

    public static Directory index(final String[] terms) {
        Directory indexDirectory = new RAMDirectory();
        IndexWriter indexWriter = null;
        try {
            indexWriter = new IndexWriter(indexDirectory,
                    new IndexWriterConfig(Version.LUCENE_32,
                            new SubWordAnalyzer()));
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        for (int i = 0; i < terms.length; ++i) {
            Document doc = new Document();
            // Stored, analyzed into n-grams, with term vectors.
            doc.add(new Field("NGRAM_FIELD", terms[i], Field.Store.YES,
                    Field.Index.ANALYZED,
                    Field.TermVector.WITH_POSITIONS_OFFSETS));
            // Stored and indexed as a single untokenized term.
            doc.add(new Field("ORIGINAL_FIELD", terms[i], Field.Store.YES,
                    Field.Index.NOT_ANALYZED));
            try {
                indexWriter.addDocument(doc);
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        try {
            indexWriter.optimize();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        try {
            indexWriter.close();
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return indexDirectory;
    }

    /**
     * @param args
     */
    public static void main(final String[] args) {
        String[] terms = new String[] { "the first string", "the second one" };
        Directory dir = index(terms);
        NGramSearcher ngs = null;
        try {
            ngs = new NGramSearcher(dir);
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        TopDocs td = ngs.search("second");
        System.out.println(td.totalHits);
        for (ScoreDoc sd : td.scoreDocs) {
            System.out.println(sd.doc + "---" + sd.score);
            try {
                // This is the call that prints null.
                System.out.println(ngs.doc(sd.doc).getFieldable("NGRAM_FIELD").
                        tokenStreamValue());
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
Recommended answer
The first thing to check is whether you are actually storing this field at index time. If you're only indexing it, this is the expected result.
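As far as the Lucene 3.x Javadoc describes it, tokenStreamValue() holds the TokenStream that was passed in when the field was constructed, so a field built from a String, like the ones in the question, would report null there however it was indexed. To see which n-grams to expect, a minimal sketch in plain Java can list them by hand. This is not Lucene's own implementation: the class name NGramSketch is hypothetical, the gram sizes 1 and 2 are assumed to match the NGramTokenizer defaults, and the order in which Lucene emits tokens may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of Lucene: lists the character n-grams of a string.
final class NGramSketch {

    // Returns every substring of length minGram..maxGram, shortest grams first.
    static List<String> ngrams(String text, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<>();
        for (int size = minGram; size <= maxGram; size++) {
            // Slide a window of the current size across the text.
            for (int start = 0; start + size <= text.length(); start++) {
                grams.add(text.substring(start, start + size));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Gram sizes 1 and 2 are an assumption about the tokenizer's defaults.
        for (String term : new String[] { "the first string", "the second one" }) {
            System.out.println(term + " -> " + ngrams(term, 1, 2));
        }
    }
}
```

The printed lists can then be compared against the terms that the NGramQuery constructor in the question already prints for the query string.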