Lucene Index problems with "-" character

Problem description

I'm having trouble with a Lucene index whose indexed words contain "-" characters.

It works for some words that contain "-" but not for all, and I can't find the reason why it isn't working.

The field I'm searching in is analyzed and contains versions of the word with and without the "-" character.

I'm using the analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer

Here is an example:

If I search for "gsx-*" I get a result; the indexed field contains "SUZUKI GSX-R 1000 GSX-R1000 GSXR".

But if I search for "v-*" I get no result. The indexed field of the expected result contains: "SUZUKI DL 1000 V-STROM DL1000V-STROMVSTROM V STROM"

If I search for "v-strom" without "*" it works, but if I search for just "v-str", for example, I don't get the result. (There should be a result, because this is for a live search in a webshop.)

So, what's the difference between the two expected results? Why does it work for "gsx-" but not for "v-"?

Recommended answer

I believe StandardAnalyzer treats the hyphen as whitespace. So it turns your query "gsx-*" into "gsx*" and "v-*" into nothing, because it also eliminates single-letter tokens. What you see as the field contents in the search result is the stored value of the field, which is completely independent of the terms that were indexed for that field.

So what you want is for "v-strom" as a whole to be an indexed term. StandardAnalyzer is not suited to this kind of text. Maybe have a go with the WhitespaceAnalyzer or SimpleAnalyzer. If that still doesn't cut it, you also have the option of throwing together your own analyzer, or of starting from one of the two mentioned and composing it with further TokenFilters. A very good explanation is given in the Lucene Analysis package Javadoc.
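To make the difference concrete, here is a minimal plain-Java sketch. It does not use the Lucene API at all; the class `HyphenTokenDemo` and its methods are hypothetical stand-ins that only imitate two tokenization strategies: splitting on every non-alphanumeric character (so "-" acts like whitespace) versus splitting on whitespace only. The lowercasing is an extra assumption of the sketch. It shows why a prefix such as "v-str" can only match when "v-strom" survives as a single term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class HyphenTokenDemo {

    // Hypothetical stand-in for an analyzer that treats "-" like whitespace:
    // splits on any run of characters that are not letters or digits.
    static List<String> splitOnNonAlphanumerics(String text) {
        return tokenize(text, "[^\\p{L}\\p{N}]+");
    }

    // Hypothetical stand-in for an analyzer that splits on whitespace only,
    // so "V-STROM" stays one token.
    static List<String> splitOnWhitespace(String text) {
        return tokenize(text, "\\s+");
    }

    // Splits the text, drops empty pieces, and lowercases each token.
    private static List<String> tokenize(String text, String separatorRegex) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split(separatorRegex)) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Imitates a prefix (wildcard) search: true if any token starts with the prefix.
    static boolean anyTokenStartsWith(List<String> tokens, String prefix) {
        String normalizedPrefix = prefix.toLowerCase(Locale.ROOT);
        for (String token : tokens) {
            if (token.startsWith(normalizedPrefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String field = "SUZUKI DL 1000 V-STROM";

        List<String> hyphenSplit = splitOnNonAlphanumerics(field);
        List<String> whitespaceSplit = splitOnWhitespace(field);

        System.out.println(hyphenSplit + " matches v-str: "
                + anyTokenStartsWith(hyphenSplit, "v-str"));
        System.out.println(whitespaceSplit + " matches v-str: "
                + anyTokenStartsWith(whitespaceSplit, "v-str"));
    }
}
```

With the hyphen-splitting variant, no token begins with "v-str" because "v" and "strom" are separate terms; with the whitespace-only variant, "v-strom" is one term and the prefix matches. Whichever real analyzer you pick, it is worth printing the tokens it actually produces for your data before relying on it.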

BTW, there's no need to enter all the variants in the index, like V-strom, V-Strom, etc. The idea is for the same analyzer to normalize all these variants to the same string, both in the index and while parsing the query.
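The normalization idea can be sketched in plain Java as well. Again, this is not Lucene code: `NormalizeDemo` and its `normalize` method are hypothetical and simply stand in for whatever analyzer you end up using. The assumption is that the same function is applied to the text when indexing and to the user's input when querying.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class NormalizeDemo {

    // Hypothetical normalizer: trims and lowercases, keeping the hyphen.
    static String normalize(String term) {
        return term.trim().toLowerCase(Locale.ROOT);
    }

    // "Index" a single spelling: only the normalized form is stored.
    static Set<String> index(String... rawTerms) {
        Set<String> terms = new HashSet<>();
        for (String raw : rawTerms) {
            terms.add(normalize(raw));
        }
        return terms;
    }

    // Query side: normalize the input the same way before looking it up.
    static boolean contains(Set<String> indexedTerms, String rawQuery) {
        return indexedTerms.contains(normalize(rawQuery));
    }

    public static void main(String[] args) {
        Set<String> indexed = index("V-STROM");

        for (String query : new String[] {"V-strom", "V-Strom", "v-strom"}) {
            System.out.println(query + " found: " + contains(indexed, query));
        }
    }
}
```

Only one spelling goes into the index, yet every case variant of the query finds it, because both sides pass through the same normalization.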
