Elasticsearch - query primary and secondary attribute with different terms

Problem description

I'm using Elasticsearch to query data that was originally exported from several relational databases containing a lot of redundancy. I now want to run queries with a primary attribute and one or more secondary attributes that should match. I tried a bool query with a must term and a should term, but that doesn't seem to work for my case, which may look like this:

Example:

I have a document with the fullname and street name of a user, and I want to search for similar users in different indices. So the best match for my query should be the best match on the fullname field combined with the best match on the street name field. But since the original data has a lot of redundancies and inconsistencies, the fullname field (which I created manually from the fields name1, name2, name3) may contain the same name multiple times, and it seems that Elasticsearch ranks a double match in a must field higher than a match in a should attribute.

That means I want to query for John Doe Back Street against the following sample data:

{
    "fullname" : "John Doe John and Jane",
    "street" : "Main Street"
}
{
    "fullname" : "John Doe",
    "street" : "Back Street"
}

Long story short, I want to query for the main attribute fullname - John Doe and the secondary attribute street - Back Street, and I want the second document to be the best match, rather than the first one, which ranks higher only because it contains John multiple times.

Recommended answer

Manipulating relevance in Elasticsearch is not the easiest task. Score calculation is based on three main parts:

  • Term frequency
  • Inverse document frequency
  • Field-length norm

In short:

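  • the more often the term occurs in the field, the MORE relevant the document is
  • the more often the term occurs in the entire index, the LESS relevant the term is
  • the shorter the field, the MORE relevant a match in it is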
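The three factors can be made concrete with a small sketch applied to the two fullname values from the question. It uses the formulas of the classic Lucene TF/IDF model; recent Elasticsearch versions default to BM25 instead, so the numbers are only illustrative. The helper names and the whitespace tokenizer are assumptions made for this example, not Elasticsearch APIs.

```python
import math

def tokenize(text):
    # Stand-in for an analyzer (assumption): lowercase and split on whitespace.
    return text.lower().split()

def term_frequency(term, tokens):
    # Classic TF: square root of how often the term occurs in the field.
    return math.sqrt(tokens.count(term))

def inverse_document_frequency(term, all_docs):
    # Classic IDF: 1 + ln(numDocs / (docFreq + 1)).
    doc_freq = sum(1 for tokens in all_docs if term in tokens)
    return 1.0 + math.log(len(all_docs) / (doc_freq + 1))

def field_length_norm(tokens):
    # Classic norm: 1 / sqrt(number of terms in the field).
    return 1.0 / math.sqrt(len(tokens))

# The two fullname values from the sample data in the question.
fullnames = [tokenize("John Doe John and Jane"), tokenize("John Doe")]

for tokens in fullnames:
    tf = term_frequency("john", tokens)
    idf = inverse_document_frequency("john", fullnames)
    norm = field_length_norm(tokens)
    print(tokens, f"tf={tf:.3f} idf={idf:.3f} norm={norm:.3f}")
```

The repeated John raises the term frequency of the first document, while its longer field lowers the field-length norm. How the two effects balance out in a real index depends on the similarity model and on how norms are stored.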

I recommend reading the following materials:

  • What Is Relevance?
  • Theory Behind Relevance Scoring
  • Controlling Relevance and subpages

If, in your case, a match on fullname is generally more important than a match on street, you can boost the importance of the former. Below is an example based on my working code:

{
  "query": {
    "multi_match": {
      "query": "john doe",
      "fields": [
        "fullname^10",
        "street"
      ]
    }
  }
}

In this example a match on fullname is ten times (^10) more important than a match on street. You can try adjusting the boost or use other ways to control relevance, but as I mentioned at the beginning, this is not easy and everything depends on your particular situation. This is mostly because the "inverse document frequency" part considers terms across the entire index: each document added to the index may change the score of the same search query.
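The multi_match query above sends one query string to both fields. When the name and the street are searched with different terms, as in the question, one possible variant is a bool query with a separate match clause and boost per field. The sketch below only builds the request body as a Python dict; it is not part of the original answer, build_query and its parameters are hypothetical names, and the default boost values are placeholders that would need tuning against real data.

```python
import json

def build_query(fullname, street, fullname_boost=10.0, street_boost=1.0):
    """Build a bool query body: fullname must match, street only adds to the score."""
    return {
        "query": {
            "bool": {
                # Primary attribute: documents that do not match are excluded.
                "must": [
                    {"match": {"fullname": {"query": fullname, "boost": fullname_boost}}}
                ],
                # Secondary attribute: optional, but a match raises the score.
                "should": [
                    {"match": {"street": {"query": street, "boost": street_boost}}}
                ],
            }
        }
    }

body = build_query("John Doe", "Back Street")
print(json.dumps(body, indent=2))
```

Raising street_boost relative to fullname_boost gives a street match more weight against a repeated name. Which ratio works is an empirical question for the index at hand.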

I know that I did not answer directly, but I hope this helps you understand how it works.
