Using Elastic Search to retrieve tag contents and hyphenated words
We have Elasticsearch configured with a whitespace analyzer in our application. Words are tokenized on whitespace only, so a name like <fantastic> project is indexed as
["<fantastic>", "project"]
and ABC-123-def project is indexed as
["ABC-123-def", "project"]
When we then search for ABC-* the expected project turns up. But if we search specifically for <fantastic>, it does not show up at all. It is as though Lucene/Elasticsearch ignores any search term that includes angle brackets. However, we can search for fantastic, <*fantastic*, or *fantastic* and the project is found, even though the word is not indexed separately from the angle brackets.
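The behaviour described above can be reproduced outside Elasticsearch with a small Python sketch. It is only a simulation under stated assumptions: `whitespace_tokens` and `wildcard_hits` are hypothetical helper names, `str.split` stands in for the whitespace tokenizer, and `fnmatchcase` stands in for wildcard matching against indexed terms. It does not model how the query itself is parsed.

```python
from fnmatch import fnmatchcase


def whitespace_tokens(text):
    """Split on runs of whitespace only, so punctuation stays inside tokens."""
    return text.split()


def wildcard_hits(pattern, tokens):
    """Return the tokens that match a shell-style wildcard pattern."""
    return [token for token in tokens if fnmatchcase(token, pattern)]


tokens_a = whitespace_tokens("<fantastic> project")
tokens_b = whitespace_tokens("ABC-123-def project")

print(tokens_a)                                 # ['<fantastic>', 'project']
print(tokens_b)                                 # ['ABC-123-def', 'project']
print(wildcard_hits("ABC-*", tokens_b))         # ['ABC-123-def']
print(wildcard_hits("*fantastic*", tokens_a))   # ['<fantastic>']
```

The sketch shows why the wildcard searches succeed: the indexed term still contains the angle brackets, and the wildcards absorb them.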
The standard analyzer tokenizes on any non-alphanumeric character. <fantastic> project is indexed as
["fantastic", "project"]
and ABC-123-def project is indexed as
["ABC", "123", "def", "project"]
This breaks the ability to search successfully using ABC-123-*. However, what we do get with the standard analyzer is that someone can search specifically for <fantastic> and it returns the desired results.
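As a rough illustration of that trade-off, the following Python sketch approximates the standard analyzer by keeping runs of ASCII letters and digits. The function name `approx_standard_tokens` is hypothetical, and the real analyzer follows Unicode word-boundary rules rather than this regular expression. It also lowercases tokens, which the token lists above do not show.

```python
import re


def approx_standard_tokens(text):
    """Rough stand-in for the standard analyzer: alphanumeric runs, lowercased."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


print(approx_standard_tokens("<fantastic> project"))  # ['fantastic', 'project']
print(approx_standard_tokens("ABC-123-def project"))  # ['abc', '123', 'def', 'project']
print(approx_standard_tokens("<fantastic>"))          # ['fantastic']
```

Because the same analysis reduces a search for <fantastic> to the single term fantastic, the search matches the indexed term, while the hyphenated identifier no longer exists as one term to prefix-match.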
If, instead of the standard analyzer, we add a char_filter to the whitespace analyzer that strips the angle brackets from tags (replace <(.*)> with $1), then <fantastic> project is indexed as
["fantastic", "project"]
(no angle brackets), and ABC-123-def project is indexed as
["ABC-123-def", "project"]
It looks promising, but we end up with the same results as for the plain whitespace analyzer: when we search specifically for <fantastic>, we get nothing, but *fantastic* works fine.
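The char_filter step can be sketched in Python with `re.sub`, using the same pattern and replacement as in the question. The helper names `strip_angle_brackets` and `analyze` are hypothetical, and this only simulates the index-time analysis, not the search request.

```python
import re


def strip_angle_brackets(text):
    """Replace <(.*)> with its captured group, as the char_filter would."""
    return re.sub(r"<(.*)>", r"\1", text)


def analyze(text):
    """Apply the character filter, then split on whitespace."""
    return strip_angle_brackets(text).split()


print(analyze("<fantastic> project"))   # ['fantastic', 'project']
print(analyze("ABC-123-def project"))   # ['ABC-123-def', 'project']

# Caveat: .* is greedy, so with two tags only the outermost brackets go.
print(strip_angle_brackets("<a> and <b>"))  # a> and <b
```

If a field can hold more than one tag, a pattern such as <([^>]*)> may be closer to the intent, though that is an assumption about the data rather than something stated in the question.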
Can anyone out on Stack Overflow explain this weirdness?
You could define a custom analyzer that handles the special characters with a word_delimiter token filter; see the following example:
{
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 1
    },
    "analysis" : {
      "filter" : {
        "custom_filter" : {
          "type" : "word_delimiter",
          "type_table": ["> => ALPHA", "< => ALPHA"]
        }
      },
      "analyzer" : {
        "custom_analyzer" : {
          "type" : "custom",
          "tokenizer" : "whitespace",
          "filter" : ["lowercase", "custom_filter"]
        }
      }
    }
  },
  "mappings" : {
    "my_type" : {
      "properties" : {
        "msg" : {
          "type" : "string",
          "analyzer" : "custom_analyzer"
        }
      }
    }
  }
}
The type_table maps < and > to ALPHA, which causes the underlying word_delimiter filter to treat them as alphabetic characters instead of delimiters.
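To get a feel for what this analyzer chain might emit, here is a simplified Python approximation. It is a sketch under several assumptions: `approx_custom_analyzer` is a hypothetical name, only ASCII input is handled, and the word_delimiter filter is assumed to run with its default options, so it still splits on hyphens and at letter/digit boundaries. The actual tokens are best confirmed against a real index with the _analyze API.

```python
import re


def approx_custom_analyzer(text):
    """Approximate whitespace tokenizer -> lowercase -> word_delimiter."""
    tokens = []
    for word in text.split():        # whitespace tokenizer
        word = word.lower()          # lowercase filter
        # word_delimiter stand-in: break on anything that is neither
        # alphanumeric nor one of the characters remapped to ALPHA (< and >).
        for part in re.split(r"[^a-z0-9<>]+", word):
            # Also break at letter/digit boundaries within a part.
            tokens.extend(re.findall(r"[a-z<>]+|[0-9]+", part))
    return tokens


print(approx_custom_analyzer("<fantastic> project"))  # ['<fantastic>', 'project']
print(approx_custom_analyzer("ABC-123-def project"))  # ['abc', '123', 'def', 'project']
```

In this approximation the tag survives as a single term with its angle brackets, while the hyphenated identifier is split into parts again, so it is worth checking how ABC-123-* style searches behave with this analyzer.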