How to find an analyzed term with a fuzzy (approximate) search in Lucene-3x?

Views: 104. Published: 2020/5/4 7:48:52. Tags: java search lucene fuzzy-search opencms


Problem description

The query 'laser~' doesn't find 'laser'.

I'm using Lucene's GermanAnalyzer to store documents in the index. I save two documents whose "title" fields are "laser" and "labor" respectively. Afterwards I perform the fuzzy query laser~. Lucene only finds the document that contains "labor". What is the Lucene-3x way to implement such searches?

From a look at the Lucene source code, I suspect that fuzzy searches are not designed to work with "analyzed" content, but I'm not sure whether this is the case.

Some background and remarks follow.

I came across this behaviour after someone reported that our OpenCms searches were missing documents on the results page. The searches were failing on a German site. Investigating a bit, I found that:

We are using OpenCms 8.5.1 to perform our searches, and it uses Lucene 3.6.1 to implement the search functionality.
By default, OpenCms uses org.apache.lucene.analysis.de.GermanAnalyzer for sites with a German locale to parse both content and queries.
We are storing the site content with Field.Index.ANALYZED.
For the reported failing search, we were forcing a fuzzy search by appending a tilde to the search query.

To narrow down the problem, I wrote some code that exercises Lucene 3.6.1 directly (I have also tested 3.6.2, and both behave identically). Notice that Lucene 4+ has a slightly different API and a different fuzzy search implementation; in Lucene 4+ this problem doesn't arise. (Unfortunately, I cannot control the Lucene version that OpenCms depends on.)

// For the import clauses, see below
public static void main(String[] args) throws Exception {
    final Version VER = Version.LUCENE_36;
    // With the StandardAnalyzer or the EnglishAnalyzer
    // the search works as expected
    Analyzer analyzer = new GermanAnalyzer(VER);

    Directory index = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(VER, analyzer);

    IndexWriter w = new IndexWriter(index, config);
    addDoc(w, "labor");
    addDoc(w, "laser");
    addDoc(w, "latex");
    w.close();

    String querystr = "laser~"; // Fuzzy search for 'title'
    Query q = new QueryParser(VER, "title", analyzer).parse(querystr);
    System.out.println("Querystr: " + querystr + "; Query: " + q);

    int hitsPerPage = 10;
    IndexReader reader = IndexReader.open(index);
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(
            hitsPerPage, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;

    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; ++i) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println((i + 1) + ". " + d.get("title"));
    }
}

private static void addDoc(IndexWriter w, String title) throws Exception {
    Document doc = new Document();
    doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
    w.addDocument(doc);
}

The output of this code:

Querystr: laser~; Query: title:laser~0.5
Found 2 hits.
1. labor
2. latex

I deliberately left out the imports section to avoid cluttering the code. To build the project, you need lucene-core-3.6.2.jar and lucene-analyzers-3.6.2.jar (which you can download from the Apache archives) and the following imports:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

Some Lucene debugging details and remarks

When debugging the Lucene code, I found that Lucene with the GermanAnalyzer stores the document titles in the index as:

'laser' -> 'las'
'labor' -> 'labor'
'latex' -> 'latex'

I also found that, with the exact search laser, the query string is analyzed as well. The output of the previous code for the query laser is:

Querystr: laser; Query: title:las
Found 1 hits.
1. laser

(Notice the different queries in the two runs: title:laser~0.5 in the first run vs. title:las in the second. The fuzzy query term is not analyzed, while the exact query term is.)

As already mentioned, with the StandardAnalyzer or the EnglishAnalyzer the fuzzy search works as expected:

Querystr: laser~; Query: title:laser~0.5
Found 3 hits.
1. laser
2. labor
3. latex

Lucene calculates the similarity between two terms (in org.apache.lucene.search.FuzzyTermEnum.similarity(target: String)) relative to the length of the shorter term. similarity returns:

[...]
1 - (editDistance / length)
where length is the length of the shortest term (text or target) including a prefix that are identical and editDistance is the Levenshtein distance for the two words.

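Notice that, measured against the 0.5 minimum similarity shown in the parsed query (title:laser~0.5), the stored term 'las' falls below the threshold while 'labor' passes: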

similarity("laser","las"  ) = 1 - (2 / 3) = 1/3
similarity("laser","labor") = 1 - (2 / 5) = 3/5
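
The arithmetic above can be reproduced with a small standalone sketch. This is not Lucene's code: it is a simplified reimplementation of the quoted formula using a textbook Levenshtein distance, it ignores the prefix handling and early-exit optimizations of FuzzyTermEnum, and it assumes non-empty terms. The class name FuzzySimilaritySketch and the strict "greater than 0.5" comparison are assumptions made for illustration.

```java
public class FuzzySimilaritySketch {

    // Textbook Levenshtein distance: insert, delete and substitute each cost 1.
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // 1 - (editDistance / length), with length = length of the shorter term.
    // Assumes both terms are non-empty.
    public static double similarity(String text, String target) {
        int length = Math.min(text.length(), target.length());
        return 1.0 - (double) editDistance(text, target) / length;
    }

    public static void main(String[] args) {
        double minSimilarity = 0.5; // threshold shown in title:laser~0.5
        // Terms as they are stored in the index by the GermanAnalyzer.
        for (String indexed : new String[] {"las", "labor", "latex"}) {
            double s = similarity("laser", indexed);
            System.out.println(indexed + ": " + s
                    + (s > minSimilarity ? " (match)" : " (no match)"));
        }
    }
}
```

Under these assumptions the unanalyzed query term "laser" is compared against the stemmed index term "las", which is why the document titled "laser" drops out while "labor" and "latex" are found.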

Edit 1. Explicitly excluding "laser" from stemming in the analyzer also yields the expected search results:

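// Requires: import java.util.HashSet;
// Third argument is the stem exclusion set: "laser" is indexed unstemmed.
Analyzer analyzer = new GermanAnalyzer(VER, null, new HashSet<String>() {
    {
        add("laser");
    }
});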

Output:

Querystr: laser~; Query: title:laser~0.5
Found 3 hits.
1. laser
2. labor
3. latex
