How to find an analyzed term with a fuzzy (approximate) search in Lucene-3x?


Question

The query 'laser~' doesn't find 'laser'.

I'm using Lucene's GermanAnalyzer to store documents in the index. I save two documents with "title" fields "laser" and "labor" respectively. Afterwards I perform a fuzzy query laser~. Lucene only finds the document that contains "labor". What is the Lucene-3x way to implement such searches?

By taking a look at the Lucene source code, I guess that fuzzy searches are not designed to work with "analyzed" content, but I'm not sure whether this is the case.

Some background and remarks follow.

I noticed this behaviour after someone reported that our OpenCms searches were missing documents in the results page. The searches were failing on some German sites. Investigating a bit, I found that:

  • We are using OpenCms 8.5.1 to perform our searches, which uses Lucene 3.6.1 to implement the search functionality.
  • By default, OpenCms uses the org.apache.lucene.analysis.de.GermanAnalyzer for sites with German locale to parse content and queries.
  • We are storing the sites' content with Field.Index.ANALYZED.
  • For the reported failing search, we were forcing a fuzzy search by appending a tilde to the search query.

To narrow down the problem, I wrote some code that exercises Lucene 3.6.1 directly (I have also tested 3.6.2, and both behave identically). Note that Lucene 4+ has a slightly different API and a different fuzzy search, so this problem doesn't arise there. (Unfortunately, I cannot control the Lucene version that OpenCms depends on.)

// For the import clauses, see below
public static void main(String[] args) throws Exception {
    final Version VER = Version.LUCENE_36;
    // With the StandardAnalyzer or the EnglishAnalyzer
    // the search works as expected
    Analyzer analyzer = new GermanAnalyzer(VER);

    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(VER, analyzer);

    IndexWriter w = new IndexWriter(index, config);
    addDoc(w, "labor");
    addDoc(w, "laser");
    addDoc(w, "latex");
    w.close();

    String querystr = "laser~"; // Fuzzy search for 'title'
    Query q = new QueryParser(VER, "title", analyzer).parse(querystr);
    System.out.println("Querystr: " + querystr + "; Query: " + q);

    int hitsPerPage = 10;
    IndexReader reader = IndexReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(
            hitsPerPage, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;

    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println((i + 1) + ". " + d.get("title"));
    }
}

private static void addDoc(IndexWriter w, String title) throws Exception {
    Document doc = new Document();
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    w.addDocument(doc);
}

The output of this code:

Querystr: laser~; Query: title:laser~0.5
Found 2 hits.
1. labor
2. latex

I deliberately cut the imports section so as not to clutter the code. To build the project, you need lucene-core-3.6.2.jar and lucene-analyzers-3.6.2.jar (which you can download from the Apache archives) and the following imports:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

Some Lucene debugging details and remarks

  1. When debugging the Lucene code, I found that Lucene with the GermanAnalyzer stores the document titles in the index as:

  • 'laser' -> 'las'
  • 'labor' -> 'labor'
  • 'latex' -> 'latex'

I also found that with an exact search for laser, the query string is analyzed as well. The output of the previous code for the query laser is:

Querystr: laser; Query: title:las
Found 1 hits.
1. laser

(Notice the different queries in the two runs: title:laser~0.5 in the first run vs title:las in the second.)

As already mentioned, with the StandardAnalyzer or the EnglishAnalyzer the fuzzy search works as expected:

Querystr: laser~; Query: title:laser~0.5
Found 3 hits.
1. laser
2. labor
3. latex

  • Lucene calculates the similarity between two terms (in org.apache.lucene.search.FuzzyTermEnum.similarity(target: String)) relative to the length of the shortest term. Similarity returns:

    [...]
    1 - (editDistance / length)
    where length is the length of the shortest term (text or target) including a prefix that are identical and editDistance is the Levenshtein distance for the two words.

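    Note that, against the minimum similarity of 0.5 shown in the parsed query, the indexed term 'las' falls below the threshold while 'labor' clears it: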

    similarity("laser","las"  ) = 1 - (2 / 3) = 1/3
    similarity("laser","labor") = 1 - (2 / 5) = 3/5
    

  • Edit 1. Excluding "laser" explicitly from stemming (through the analyzer's stem exclusion set) also yields the expected search results:

    Analyzer analyzer = new GermanAnalyzer(VER, null, new HashSet() {
        {
            add("laser");
        }
    });
    

    Output:

    Querystr: laser~; Query: title:laser~0.5
    Found 3 hits.
    1. laser
    2. labor
    3. latex
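
The similarity figures quoted in the question can be reproduced outside Lucene. The following is a minimal sketch, not Lucene's implementation: the class and method names are made up for illustration, the prefix length is taken to be zero, and a term is assumed to match when its similarity is above the minimum of 0.5 from the parsed query. It compares the unanalyzed query term "laser" against the three terms that the GermanAnalyzer stored in the index.

```java
public class FuzzySimilaritySketch {

    // Classic dynamic-programming Levenshtein distance
    // (insertions, deletions and substitutions each cost 1).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // 1 - (editDistance / length of the shorter term), the formula quoted
    // in the question. Assumes both terms are non-empty.
    static double similarity(String query, String indexed) {
        int shortest = Math.min(query.length(), indexed.length());
        return 1.0 - (double) editDistance(query, indexed) / shortest;
    }

    public static void main(String[] args) {
        double minSimilarity = 0.5; // from the parsed query title:laser~0.5
        // Terms as stored in the index by the GermanAnalyzer.
        for (String indexed : new String[] {"las", "labor", "latex"}) {
            double s = similarity("laser", indexed);
            System.out.println(indexed + ": " + s
                    + (s > minSimilarity ? " (match)" : " (no match)"));
        }
    }
}
```

Under these assumptions "las" scores about 0.33 and is rejected, while "labor" and "latex" both score 0.6 and are accepted, which is consistent with the two hits reported for the GermanAnalyzer run.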
    

Recommended answer

    It turns out* that prior to the 3.6 branch, a fuzzy query term doesn't go through the Analyzer (the component that performs stemming and lowercasing). In the 3.6 branch, some filters have been added to the query analyzer chain (e.g. the LowerCaseFilterFactory). Finally, the GermanNormalizationFilterFactory has been added to this chain in the 4.0 branch.

    * Thanks @femtoRgon for your pointers

    An older article explains with an example why fuzzy searches were not passed through the Analyzer:

    The reason for skipping the Analyzer is that if you were searching for "dogs*" you would not want "dogs" first stemmed to "dog", since that would then match "dog*", which is not the intended query.

    The bottom line is that if staying with Lucene 3.6.2, the user has to implement the analysis of the query herself.
