Lucene query: bla~* (match words that start with something fuzzy), how?


Question

In the Lucene query syntax I'd like to combine * and ~ in a single valid query, something like: bla~* (which is not a valid query).

Meaning: match words that begin with "bla" or with something similar to "bla".

Update: What I do now, which works for small inputs, is use the following (a snippet of the SOLR schema):

<fieldtype name="text_ngrams" class="solr.TextField">
  <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

In case you don't use SOLR, this does the following.

Index time: index the data by creating a field that contains all prefixes of my (short) input.

Search time: use only the ~ operator, since the prefixes are explicitly present in the index.
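For readers without SOLR, the index-time and search-time steps can be imitated in a few lines of Python. This is only a rough sketch: `edge_ngrams` mirrors the `minGramSize`/`maxGramSize` settings from the schema, while `levenshtein`, `fuzzy_prefix_match`, and the `max_edits=1` threshold are illustrative assumptions and not how Lucene scores the ~ operator internally.

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """Front-anchored prefixes of token (what the index-time filter would emit)."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def levenshtein(a, b):
    """Plain edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def fuzzy_prefix_match(query, word, max_edits=1):
    """True if query is within max_edits of any indexed prefix of word.

    max_edits=1 is an assumed threshold chosen for illustration.
    """
    return any(levenshtein(query.lower(), prefix) <= max_edits
               for prefix in edge_ngrams(word.lower()))


print(edge_ngrams("black"))
print(fuzzy_prefix_match("blo", "Blackboard"))
print(fuzzy_prefix_match("bla", "table"))
```

Because every prefix is a separate indexed term, the index grows with word length, which is one reason the approach suits only short inputs.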

Recommended answer

I do not believe Lucene supports anything like this, nor do I believe it has a trivial solution.

"Fuzzy" searches do not operate on a fixed number of characters. bla~ may, for example, match blah, so it must consider the entire term.

What you could do is implement a query expansion algorithm that takes the query bla~* and converts it into a series of OR queries:

bla* OR blb* OR blc* OR ... etc.

But that is really only viable if the string is very short, or if you can narrow the expansion based on some rules.
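A minimal sketch of such an expansion, assuming only single-character substitutions over a lowercase ASCII alphabet (insertions and deletions are left out, so this is narrower than a true fuzzy match). The function name `expand_fuzzy_prefix` is hypothetical; the output is just a query string that could be handed to a Lucene query parser.

```python
import string


def expand_fuzzy_prefix(prefix, alphabet=string.ascii_lowercase):
    """Build an OR query of prefix wildcards, one per single-character substitution.

    Assumption: only substitutions are generated, over the given alphabet.
    """
    variants = {prefix}
    for i in range(len(prefix)):
        for ch in alphabet:
            variants.add(prefix[:i] + ch + prefix[i + 1:])
    return " OR ".join(variant + "*" for variant in sorted(variants))


query = expand_fuzzy_prefix("bla")
print(query.split(" OR ")[:3])   # first few clauses
print(len(query.split(" OR ")))  # number of OR clauses
```

With a 26-letter alphabet the clause count is 25 times the prefix length plus one, so three characters already yield 76 clauses. That growth is why the string has to be short or the expansion has to be pruned by rules.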

Alternatively, if the length of the prefix is fixed, you could add a field containing the substrings and perform the fuzzy search on that. That would give you what you want, but it will only work if your use case is sufficiently narrow.

You don't specify exactly why you need this; perhaps doing so will elicit other solutions.

One scenario I can think of is dealing with different forms of a word, e.g. finding both car and cars.

This is easy in English, as there are word stemmers available. In other languages it can be quite difficult, if not impossible, to implement word stemmers.

In this scenario you can, however (assuming you have access to a good dictionary), look up the search term and expand the search programmatically to cover all forms of the word.

E.g. a search for cars is translated into car OR cars. This has been applied successfully for my language in at least one search engine, but is obviously non-trivial to implement.
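The dictionary lookup might be sketched as follows. The `WORD_FORMS` table is a tiny made-up stand-in for a real morphological dictionary, and `expand_term` is a hypothetical helper; building and maintaining the actual dictionary is the hard part.

```python
# Hypothetical dictionary: each base form maps to its known word forms.
WORD_FORMS = {
    "car": ["car", "cars"],
    "mouse": ["mouse", "mice"],
}

# Reverse lookup from any known form back to its base form.
FORM_TO_BASE = {form: base
                for base, forms in WORD_FORMS.items()
                for form in forms}


def expand_term(term):
    """Rewrite a search term as an OR query over all known forms of the word.

    Terms missing from the dictionary are returned unchanged.
    """
    base = FORM_TO_BASE.get(term.lower())
    if base is None:
        return term
    return " OR ".join(WORD_FORMS[base])


print(expand_term("cars"))  # car OR cars
print(expand_term("bla"))   # bla
```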
