How to search special characters (+ ! ? :) in Lucene


Problem Description

I want to search for special characters in my index.

I escaped all the special characters in the query string, but when I run a query for + against the Lucene index, it creates the query +().

Hence the search matches no fields.

How can I solve this problem? My index contains these special characters.
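
A minimal sketch of the setup described above, assuming a Lucene 3.x-era QueryParser and StandardAnalyzer (the field name "content" and the class name EscapeDemo are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class EscapeDemo {
    public static void main(String[] args) throws Exception {
        // Escape the query syntax characters before handing the string to the parser.
        String escaped = QueryParser.escape("+");

        QueryParser parser = new QueryParser(Version.LUCENE_30, "content",
                new StandardAnalyzer(Version.LUCENE_30));
        Query query = parser.parse(escaped);

        // StandardAnalyzer drops the "+" during analysis, so the parsed query ends up empty.
        System.out.println(query);
    }
}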

Recommended Answer

If you are using the StandardAnalyzer, it will discard non-alphanumeric characters. Try indexing the same value with a WhitespaceAnalyzer and see if that preserves the characters you need. It might also keep stuff you don't want: that's when you might consider writing your own Analyzer, which basically means creating a TokenStream stack that does exactly the kind of processing you need.
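
As a quick way to compare the two, here is a minimal sketch, assuming the Lucene 3.x-era attribute API that matches the snippets below (the field name, sample text and class name are placeholders), that prints the tokens each analyzer produces:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class TokenDump {
    static void dump(Analyzer analyzer, String text) throws Exception {
        TokenStream stream = analyzer.tokenStream("field", new StringReader(text));
        TermAttribute term = stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
            System.out.print("[" + term.term() + "] ");
        }
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        String text = "foo+bar! how? to:search";
        dump(new StandardAnalyzer(Version.LUCENE_30), text);  // punctuation stripped
        dump(new WhitespaceAnalyzer(), text);                 // +, !, ?, : preserved
    }
}

The StandardAnalyzer line should come out stripped of punctuation, while the WhitespaceAnalyzer line keeps + ! ? : attached to their tokens.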

For example, the SimpleAnalyzer implements the following pipeline:

@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
   return new LowerCaseTokenizer(reader);
}

which just lowercases the tokens.

The StandardAnalyzer does much more:

/** Constructs a {@link StandardTokenizer} filtered by a {@link StandardFilter},
 *  a {@link LowerCaseFilter} and a {@link StopFilter}. */
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
    StandardTokenizer tokenStream = new StandardTokenizer(matchVersion, reader);
    tokenStream.setMaxTokenLength(maxTokenLength);
    TokenStream result = new StandardFilter(tokenStream);
    result = new LowerCaseFilter(result);
    result = new StopFilter(enableStopPositionIncrements, result, stopSet);
    return result;
}

You can mix and match these and other components from org.apache.lucene.analysis, or you can write your own specialized TokenStream instances and have your custom Analyzer wrap them into a processing pipeline.
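
For instance, a minimal sketch of such a custom Analyzer, assuming the same Lucene 3.x-era API as the snippets above (the class name SpecialCharAnalyzer is made up): it keeps whatever the WhitespaceTokenizer emits, special characters included, and only lowercases it:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Made-up analyzer: split on whitespace (so + ! ? : survive) and lowercase.
public class SpecialCharAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new WhitespaceTokenizer(reader); // keeps punctuation intact
        result = new LowerCaseFilter(result);                 // normalizes case only
        return result;
    }
}

Remember to use the same analyzer at index time and at query time, otherwise the query terms won't match what was indexed.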

One other thing to look at is what sort of CharTokenizer you're using. CharTokenizer is an abstract class that specifies the machinery for tokenizing text strings. It's used by some simpler Analyzers (but not by the StandardAnalyzer). Lucene comes with two subclasses: a LetterTokenizer and a WhitespaceTokenizer. You can create your own tokenizer that keeps the characters you need and breaks on those you don't by implementing the boolean isTokenChar(char c) method.
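
For example, a minimal sketch of such a subclass (the class name SpecialCharTokenizer is made up), which treats letters, digits and + ! ? : as token characters and breaks on everything else:

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

// Made-up tokenizer: letters, digits and + ! ? : are part of a token;
// everything else acts as a delimiter.
public class SpecialCharTokenizer extends CharTokenizer {
    public SpecialCharTokenizer(Reader input) {
        super(input);
    }

    @Override
    protected boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c)
                || c == '+' || c == '!' || c == '?' || c == ':';
    }
}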
