elasticsearch: how to index terms which are stopwords only?

Question

I had much success building my own little search with elasticsearch in the background. But there is one thing I couldn't find in the documentation.

I'm indexing the names of musicians and bands. There is one band called "The The" and due to the stop words list this band is never indexed.

I know I can ignore the stop words list completely but this is not what I want since the results searching for other bands like "the who" would explode.

So, is it possible to keep "The The" in the index without disabling the stop words entirely?

Answer

You can use the synonym filter to convert "The The" into a single token, e.g. "thethe", which won't be removed by the stopwords filter.

First, configure the analyzer:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "syn" : {
               "synonyms" : [
                  "the the => thethe"
               ],
               "type" : "synonym"
            }
         },
         "analyzer" : {
            "syn" : {
               "filter" : [
                  "lowercase",
                  "syn",
                  "stop"
               ],
               "type" : "custom",
               "tokenizer" : "standard"
            }
         }
      }
   }
}
'

Then test it with the string "The The The Who".

curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=The+The+The+Who&analyzer=syn' 

{
   "tokens" : [
      {
         "end_offset" : 7,
         "position" : 1,
         "start_offset" : 0,
         "type" : "SYNONYM",
         "token" : "thethe"
      },
      {
         "end_offset" : 15,
         "position" : 3,
         "start_offset" : 12,
         "type" : "<ALPHANUM>",
         "token" : "who"
      }
   ]
}

"The The" has been tokenized as "thethe", and "The Who" as "who" because the preceding "the" was removed by the stopwords filter.
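
To make the role of filter order concrete, here is a minimal Python sketch that imitates the lowercase, syn and stop chain outside of elasticsearch. It is only an illustration: the whitespace split stands in for the standard tokenizer, the stopword set is a tiny made-up list rather than the real default, and the helper names are hypothetical.

```python
# Toy model of the analyzer chain above; not elasticsearch code.
STOPWORDS = {"the", "a", "an", "of"}    # assumption: small illustrative list
SYNONYMS = {("the", "the"): "thethe"}   # mirrors the rule "the the => thethe"


def lowercase(tokens):
    return [t.lower() for t in tokens]


def syn(tokens):
    """Replace each matching two-token sequence, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in SYNONYMS:
            out.append(SYNONYMS[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def stop(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, filters):
    tokens = text.split()  # stand-in for the standard tokenizer
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("The The The Who", [lowercase, syn, stop]))  # ['thethe', 'who']
print(analyze("The The The Who", [lowercase, stop, syn]))  # ['who']
```

The second call shows why syn has to come before stop in the filter list: if the stopwords are removed first, there is nothing left for the synonym rule to match.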

To stop or not to stop

Which brings us back to the question of whether we should include stopwords at all. You said:

I know I can ignore the stop words list completely 
but this is not what I want since the results searching 
for other bands like "the who" would explode.

What do you mean by that? Explode how? Index size? Performance?

Stopwords were originally introduced to improve search engine performance by removing common words which are likely to have little effect on the relevance of a query. However, we've come a long way since then. Our servers are capable of much more than they were back in the 80s.

Indexing stopwords won't have a huge impact on index size. For instance, to index the word "the" means adding a single term to the index. You already have thousands of terms - indexing the stopwords as well won't make much difference to size or to performance.

Actually, the bigger problem is that "the" is very common and thus will have a low impact on relevance, so a search for "The The concert Madrid" will prefer "Madrid" over the other terms. This can be mitigated by using a shingle filter, which would result in these tokens:

['the the','the concert','concert madrid']

While "the" may be common, "the the" isn't and so will rank higher.
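
As a rough illustration of that point, the Python sketch below builds two-word shingles and counts how many documents contain a given term. The four documents are invented for the example, and the helpers are hypothetical. Note also that the real shingle filter emits the single words as well by default, whereas this sketch produces only the word pairs listed above.

```python
def shingles(tokens, size=2):
    """Return every run of `size` adjacent tokens joined by a space."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def doc_freq(term_lists, term):
    """Count the documents whose term list contains `term`."""
    return sum(1 for terms in term_lists if term in terms)


# Made-up documents, already lowercased.
docs = [
    "the the concert madrid",
    "the who live",
    "the concert hall",
    "madrid by night",
]
tokenized = [d.split() for d in docs]
shingled = [shingles(tokens) for tokens in tokenized]

print(shingles("the the concert madrid".split()))
# ['the the', 'the concert', 'concert madrid']
print(doc_freq(tokenized, "the"))     # 3
print(doc_freq(shingled, "the the"))  # 1
```

In this toy corpus "the" appears in three of the four documents while "the the" appears in only one, which is why a match on the shingle is a much stronger signal than a match on the single word.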

You wouldn't query the shingled field by itself, but you could combine a query against a field tokenized by the standard analyzer (without stopwords) with a query against the shingled field.

We can use a multi-field to analyze the text field in two different ways:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "mappings" : {
      "test" : {
         "properties" : {
            "text" : {
               "fields" : {
                  "shingle" : {
                     "type" : "string",
                     "analyzer" : "shingle"
                  },
                  "text" : {
                     "type" : "string",
                     "analyzer" : "no_stop"
                  }
               },
               "type" : "multi_field"
            }
         }
      }
   },
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "no_stop" : {
               "stopwords" : "",
               "type" : "standard"
            },
            "shingle" : {
               "filter" : [
                  "standard",
                  "lowercase",
                  "shingle"
               ],
               "type" : "custom",
               "tokenizer" : "standard"
            }
         }
      }
   }
}
'

Then use a multi_match query to query both versions of the field, giving the shingled version more "boost"/relevance. In this example, text.shingle^2 means that we want to boost that field by a factor of 2:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1'  -d '
{
   "query" : {
      "multi_match" : {
         "fields" : [
            "text",
            "text.shingle^2"
         ],
         "query" : "the the concert madrid"
      }
   }
}
'
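
If the request body is assembled in application code rather than typed by hand, something like the following Python sketch could be used. The multi_match_body helper is hypothetical; it only builds the same JSON structure as the curl example above and does not talk to elasticsearch.

```python
import json


def multi_match_body(query, boosts):
    """Build a multi_match body; a boost of 1 leaves the field name unchanged."""
    fields = [
        name if boost == 1 else f"{name}^{boost}"
        for name, boost in boosts.items()
    ]
    return {"query": {"multi_match": {"fields": fields, "query": query}}}


body = multi_match_body(
    "the the concert madrid",
    {"text": 1, "text.shingle": 2},
)
print(json.dumps(body, indent=3))
```

Keeping the boosts in a dictionary makes it easy to experiment with different weights for the shingled field without editing the query by hand.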
