Lucene.Net 下划线导致令牌分裂 [英] Lucene.Net Underscores causing token split
问题描述
I've scripted a MsSqlServer databases tables,views and stored procedures into a directory structure that I am then indexing with Lucene.net. Most of my table, view and procedure names contain underscores.
I use the StandardAnalyzer. If I query for a table named tIr_InvoiceBtnWtn01, for example, I recieve hits back for tIr and for InvoiceBtnWtn01, rather than for just tIr_InvoiceBtnWtn01.
I think the issue is the tokenizer is splitting on _ (underscore) since it is punctuation.
Is there a (simple) way to remove underscores from the punctuation list or is there another analyzer that I should be using for sql and programming languages?
Yes, the StandardAnalyzer splits on underscore. WhitespaceAnalyzer does not. Note that you can use a PerFieldAnalyzerWrapper to use different analyzers for each field - you might want to keep some of the standard analyzer's functionality for everything except table/column name.
WhitespaceAnalyzer only does whitespace splitting though. It won't lowercase your tokens, for example. So you might want to make your own analyzer which combines WhitespaceTokenizer and LowercaseFilter, or look into LowercaseTokenizer.
EDIT: Simple custom analyzer (in C#, but you can translate it to Java pretty easily):
// Chains together standard tokenizer, standard filter, and lowercase filter
class MyAnalyzer : Analyzer
{
public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
{
StandardTokenizer baseTokenizer = new StandardTokenizer(Lucene.Net.Util.Version.LUCENE_29, reader);
StandardFilter standardFilter = new StandardFilter(baseTokenizer);
LowerCaseFilter lcFilter = new LowerCaseFilter(standardFilter);
return lcFilter;
}
}
这篇关于Lucene.Net 下划线导致令牌分裂的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!