How to search special characters (+ ! ? :) in Lucene
Question
I want to search for special characters in my index.
I escaped all the special characters in the query string, but when I run the query + against the index, Lucene builds the query +().
Hence it searches no fields.
How can I solve this problem? My index contains these special characters.
Recommended answer
If you are using the StandardAnalyzer, it will discard non-alphanumeric characters. Try indexing the same value with a WhitespaceAnalyzer and see whether that preserves the characters you need. It might also keep things you don't want: that is when you might consider writing your own Analyzer, which basically means building a TokenStream stack that does exactly the kind of processing you need.
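The difference can be seen without Lucene at all. The following is a plain-Java sketch, not Lucene code: `alphanumericTokens` only roughly approximates an analyzer that drops non-alphanumeric characters (the real StandardTokenizer follows more elaborate rules), while `whitespaceTokens` splits on whitespace only. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenizationSketch {

    // Rough stand-in for an analyzer that discards non-alphanumeric characters:
    // every run of letters/digits becomes a token, everything else is dropped.
    static List<String> alphanumericTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Stand-in for whitespace-only splitting: punctuation stays inside tokens.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(alphanumericTokens("+"));          // no tokens survive
        System.out.println(whitespaceTokens("C++ is fun!"));  // punctuation is kept
    }
}
```

Under the first rule a query consisting only of + yields no tokens at all, which is consistent with the empty +() query described in the question.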
For example, the SimpleAnalyzer implements the following pipeline:
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseTokenizer(reader);
}
This just splits the text at non-letters and lower-cases the tokens. The StandardAnalyzer does much more:
/** Constructs a {@link StandardTokenizer} filtered by a {@link
 *  StandardFilter}, a {@link LowerCaseFilter} and a {@link StopFilter}. */
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    StandardTokenizer tokenStream = new StandardTokenizer(matchVersion, reader);
    tokenStream.setMaxTokenLength(maxTokenLength);
    TokenStream result = new StandardFilter(tokenStream);
    result = new LowerCaseFilter(result);
    result = new StopFilter(enableStopPositionIncrements, result, stopSet);
    return result;
}
You can mix and match these and other components from org.apache.lucene.analysis, or you can write your own specialized TokenStream instances and have your custom Analyzer wrap them into a processing pipeline.
One other thing to look at is which kind of CharTokenizer you are using. CharTokenizer is an abstract class that provides the machinery for tokenizing text strings. It is used by some of the simpler Analyzers (but not by the StandardAnalyzer). Lucene comes with two subclasses: a LetterTokenizer and a WhitespaceTokenizer. You can create your own subclass that keeps the characters you need and breaks on those you don't by implementing the boolean isTokenChar(char c) method.
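To make the isTokenChar idea concrete, here is a minimal plain-Java imitation of the mechanism. It is a sketch only and does not use Lucene: `SketchCharTokenizer`, `SpecialCharTokenizer`, and the `KEPT` character set are hypothetical names chosen for this example, and a real subclass would extend Lucene's CharTokenizer and be plugged into an Analyzer instead.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java imitation of the CharTokenizer idea; this is NOT Lucene's class.
abstract class SketchCharTokenizer {

    // Subclasses decide which characters belong inside a token.
    protected abstract boolean isTokenChar(char c);

    // Collects maximal runs of token characters; every other character
    // acts as a separator and is dropped.
    final List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }
}

// Hypothetical rule: keep letters, digits, and the characters from the question.
final class SpecialCharTokenizer extends SketchCharTokenizer {
    private static final String KEPT = "+!?:";

    @Override
    protected boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || KEPT.indexOf(c) >= 0;
    }
}

final class SpecialCharDemo {
    public static void main(String[] args) {
        // Parentheses and spaces break tokens; + ! ? : stay inside them.
        System.out.println(new SpecialCharTokenizer().tokenize("C++ rocks! (really?)"));
    }
}
```

Whatever rule is chosen, the same analyzer should be used at index time and at query time so that the tokens on both sides match; escaping the query string is still needed so the query parser does not treat these characters as operators.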