Lucene query: bla~* (match words that start with something fuzzy), how?


Question

In the Lucene query syntax I'd like to combine * and ~ in a valid query, similar to: bla~* // invalid query

Meaning: Please match words that begin with "bla" or with something similar to "bla".

Update: What I do now, which works for small input, is use the following (snippet of a Solr schema):

<fieldtype name="text_ngrams" class="solr.TextField">
  <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

In case you don't use Solr, here is what this does.

Index time: Index the data by creating a field containing all prefixes of my (short) input.

Search time: Only use the ~ operator, as the prefixes are explicitly present in the index.
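For readers without Solr at hand, the idea can be sketched in plain Python. This is only an illustration, not what Solr does internally: `edge_ngrams` mimics the front-anchored grams with the same 2..15 size limits as the schema above, and `levenshtein` is a plain edit distance standing in for the ~ operator. The function names and the `max_edits` parameter are assumptions made for this example.

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """Return the front-anchored n-grams (prefixes) of a lowercased token."""
    token = token.lower()
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def levenshtein(a, b):
    """Plain dynamic-programming edit distance (no transpositions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def fuzzy_prefix_match(query, word, max_edits=1):
    """True if any indexed prefix of `word` is within max_edits of `query`."""
    query = query.lower()
    return any(levenshtein(query, gram) <= max_edits
               for gram in edge_ngrams(word))
```

With this sketch, `fuzzy_prefix_match("bka", "blackboard")` succeeds because the indexed prefix "bla" is one substitution away from "bka". Note that shorter prefixes also count: "bl" is one deletion away from "bla", so short queries match quite liberally.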

Recommended answer

I do not believe Lucene supports anything like this, nor do I believe it has a trivial solution.

"Fuzzy" searches do not operate on a fixed number of characters. For example, bla~ may match blah, so the search must consider the entire term.

What you could do is implement a query expansion algorithm that takes the query bla~* and converts it into a series of OR queries:

bla* OR blb* OR blc* OR ... etc.

But that is really only viable if the string is very short or if you can narrow the expansion based on some rules.
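A minimal sketch of such an expansion, in Python, might look like the following. It is an illustration under narrow assumptions: it only generates single-character substitutions over a lowercase ASCII alphabet (no insertions, deletions, or transpositions), and the function name `expand_fuzzy_prefix` is made up for this example.

```python
import string


def expand_fuzzy_prefix(term, alphabet=string.ascii_lowercase):
    """Turn `term` into 'v1* OR v2* OR ...' over all one-substitution variants."""
    variants = {term}
    for i in range(len(term)):
        for ch in alphabet:
            # Replace the character at position i with ch.
            variants.add(term[:i] + ch + term[i + 1:])
    return " OR ".join(v + "*" for v in sorted(variants))
```

Even for the three-letter term "bla" this yields 76 prefix clauses (25 substitutions at each of 3 positions, plus the original), which shows why the approach only scales to very short strings.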

Alternatively, if the length of the prefix is fixed, you could add a field containing the substrings and perform the fuzzy search on that field. That would give you what you want, but it will only work if your use case is sufficiently narrow.

You don't specify exactly why you need this; perhaps doing so will elicit other solutions.

One scenario I can think of is dealing with different forms of a word, e.g. finding both car and cars.

This is easy in English, as there are word stemmers available. In other languages it can be quite difficult, if not impossible, to implement word stemmers.

In this scenario you can, however (assuming you have access to a good dictionary), look up the search term and expand the search programmatically to cover all forms of the word.

E.g. a search for cars is translated into car OR cars. This has been applied successfully for my language in at least one search engine, but it is obviously non-trivial to implement.
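The lookup step could be sketched roughly as below. The `WORD_FORMS` table is a tiny hypothetical stand-in for a real morphological dictionary, and `expand_word_forms` is an illustrative name rather than part of any library; building and maintaining the actual dictionary is the hard part.

```python
# Hypothetical word-form dictionary: base form -> all known forms.
WORD_FORMS = {
    "car": ["car", "cars"],
    "mouse": ["mouse", "mice"],
}

# Reverse index so that any known form leads back to its base form.
FORM_TO_BASE = {
    form: base for base, forms in WORD_FORMS.items() for form in forms
}


def expand_word_forms(term):
    """Expand a search term into an OR query over all its known forms."""
    base = FORM_TO_BASE.get(term.lower())
    if base is None:
        return term  # unknown word: leave the query unchanged
    return " OR ".join(WORD_FORMS[base])
```

Here `expand_word_forms("cars")` produces the query string `car OR cars`, while a term missing from the dictionary is passed through as-is.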
