How to make an Elasticsearch query that filters on the maximum value of a field?

Question

I would like to query for text but retrieve only the results that have the maximum value of a certain integer field in my data. I have read the docs about aggregations and filters, and I don't quite see what I am looking for.

For instance, I index some repeated data in which the documents are identical except for one integer field; let's call this field lastseen.

As an example, suppose this data is indexed into Elasticsearch:

  # these two are the same except for the "lastseen" field
  curl -XPOST localhost:9200/myindex/myobject -d '{
    "field1": "dinner carrot potato broccoli",
    "field2": "something here",
    "lastseen": 1000
  }'

  curl -XPOST localhost:9200/myindex/myobject -d '{
    "field1": "dinner carrot potato broccoli",
    "field2": "something here",
    "lastseen": 100
  }'

  # and these two the same except "lastseen" field
  curl -XPOST localhost:9200/myindex/myobject -d '{
    "field1": "fish chicken something",
    "field2": "dinner",
    "lastseen": 2000
  }'

  curl -XPOST localhost:9200/myindex/myobject -d '{
    "field1": "fish chicken something",
    "field2": "dinner",
    "lastseen": 200
  }'

If I query for "dinner":

  curl -XPOST localhost:9200/myindex/_search -d '{
    "query": {
      "query_string": {
        "query": "dinner"
      }
    }
  }'

I'll get 4 results back. I'd like a filter such that I get only two results back: for each set of duplicates, only the item with the maximum lastseen value.

This is obviously not right, but hopefully it gives you an idea of what I am after:

{
    "query": {
        "query_string": {
            "query": "dinner"
        }
    },
    "filter": {
          "max": "lastseen"
        }

}

The results would look something like:

"hits": [
      {
        ...
        "_source": {
          "field1": "dinner carrot potato broccoli",
          "field2": "something here",
          "lastseen": 1000
        }
      },
      {
        ...
        "_source": {
          "field1": "fish chicken something",
          "field2": "dinner",
          "lastseen": 2000
        }
      } 
   ]

Update 1: I tried creating a mapping that excluded lastseen from being indexed. This did not work; I still get all 4 results back.

curl -XPOST localhost:9200/myindex -d '{  
    "mappings": {
      "myobject": {
        "properties": {
          "lastseen": {
            "type": "long",
            "store": "yes",
            "include_in_all": false
          }
        }
      }
    }
}'

Update 2: I tried a deduplication with the agg scheme listed here, and it did not work. More importantly, I don't see a way to combine that with a keyword search.

Recommended answer

Not ideal, but I think it gets you what you need.

Change the mapping of your field1 field, assuming this is the one that you use to define "duplicate" documents, like this:

PUT /lastseen
{
  "mappings": {
    "test": {
      "properties": {
        "field1": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "field2": {
          "type": "string"
        },
        "lastseen": {
          "type": "long"
        }
      }
    }
  }
}

That is, you add a .raw subfield that is not_analyzed, which means the value is indexed exactly as it is, without being analyzed and split into terms. This makes it possible to spot the "duplicate" documents.
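To illustrate why the .raw subfield matters, here is a minimal Python sketch of the difference between the two kinds of indexing. It is only an approximation: the helper names are hypothetical, and the analyzed case is modeled as lowercasing plus splitting on whitespace, which is a simplification of what a real analyzer does.

```python
def analyzed_terms(value):
    """Rough stand-in for an analyzed string field (assumption: lowercase + whitespace split)."""
    return value.lower().split()


def not_analyzed_terms(value):
    """A not_analyzed field keeps the whole value as a single term."""
    return [value]


doc_value = "dinner carrot potato broccoli"

# Bucketing on the analyzed field would group by individual words.
print(analyzed_terms(doc_value))      # ['dinner', 'carrot', 'potato', 'broccoli']

# Bucketing on the raw subfield groups by the complete value,
# so two documents land in the same bucket only if field1 is identical.
print(not_analyzed_terms(doc_value))  # ['dinner carrot potato broccoli']
```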

Then, you need to use a terms aggregation on field1.raw (to group the duplicates) and a top_hits sub-aggregation to get a single document for each field1 value:

GET /lastseen/test/_search
{
  "size": 0,
  "query": {
    "query_string": {
      "query": "dinner"
    }
  },
  "aggs": {
    "field1_unique": {
      "terms": {
        "field": "field1.raw",
        "size": 2
      },
      "aggs": {
        "first_one": {
          "top_hits": {
            "size": 1,
            "sort": [{"lastseen": {"order":"desc"}}]
          }
        }
      }
    }
  }
}

The single document returned by top_hits in each bucket is the one with the highest lastseen, because of the "sort": [{"lastseen": {"order":"desc"}}] clause.
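The logic of the terms plus top_hits combination can be emulated in plain Python, which may help when reasoning about what the query returns. This is a sketch under stated assumptions, not how Elasticsearch works internally: the function name is hypothetical, the sample documents are modeled on the question with every document assumed to carry a lastseen value, and the query_string match is reduced to a simple word lookup across all fields.

```python
from collections import defaultdict

# Sample documents modeled on the question (assumption: each has "lastseen").
docs = [
    {"field1": "dinner carrot potato broccoli", "field2": "something here", "lastseen": 1000},
    {"field1": "dinner carrot potato broccoli", "field2": "something here", "lastseen": 100},
    {"field1": "fish chicken something", "field2": "dinner", "lastseen": 2000},
    {"field1": "fish chicken something", "field2": "dinner", "lastseen": 200},
]


def latest_per_group(documents, word, group_field="field1", sort_field="lastseen"):
    """Keep documents containing `word`, group them, return the top one per group."""
    # Crude stand-in for the query_string query: match the word in any field.
    matching = [
        d for d in documents
        if any(word in str(v).lower().split() for v in d.values())
    ]
    # Equivalent of the terms aggregation on field1.raw: one bucket per exact value.
    buckets = defaultdict(list)
    for d in matching:
        buckets[d[group_field]].append(d)
    # Equivalent of top_hits with size 1 sorted by lastseen descending.
    return [max(group, key=lambda d: d[sort_field]) for group in buckets.values()]


result = latest_per_group(docs, "dinner")
for d in result:
    print(d["field1"], d["lastseen"])
```

All four sample documents match "dinner", but only one document per distinct field1 value survives, namely the one with the largest lastseen.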

The results you get back are these (under aggregations, not hits):

   ...
   "aggregations": {
      "field1_unique": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "dinner carrot potato broccoli",
               "doc_count": 2,
               "first_one": {
                  "hits": {
                     "total": 2,
                     "max_score": null,
                     "hits": [
                        {
                           "_index": "lastseen",
                           "_type": "test",
                           "_id": "AU60ZObtjKWeJgeyudI-",
                           "_score": null,
                           "_source": {
                              "field1": "dinner carrot potato broccoli",
                              "field2": "something here",
                              "lastseen": 1000
                           },
                           "sort": [
                              1000
                           ]
                        }
                     ]
                  }
               }
            },
            {
               "key": "fish chicken something",
               "doc_count": 2,
               "first_one": {
                  "hits": {
                     "total": 2,
                     "max_score": null,
                     "hits": [
                        {
                           "_index": "lastseen",
                           "_type": "test",
                           "_id": "AU60ZObtjKWeJgeyudJA",
                           "_score": null,
                           "_source": {
                              "field1": "fish chicken something",
                              "field2": "dinner",
                              "lastseen": 2000
                           },
                           "sort": [
                              2000
                           ]
                        }
                     ]
                  }
               }
            }
         ]
      }
   }
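Because the documents come back nested under aggregations rather than hits, client code has to dig them out of the buckets. The following Python sketch shows one way to do that on an already parsed response; the function name is hypothetical, and the sample response is a trimmed copy of the output above that keeps only the keys the function reads.

```python
def top_documents(response, agg_name="field1_unique", hits_name="first_one"):
    """Collect the _source of every top_hits document from each terms bucket."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [
        hit["_source"]
        for bucket in buckets
        for hit in bucket[hits_name]["hits"]["hits"]
    ]


# Trimmed version of the response shown above.
sample_response = {
    "aggregations": {
        "field1_unique": {
            "buckets": [
                {
                    "key": "dinner carrot potato broccoli",
                    "doc_count": 2,
                    "first_one": {"hits": {"total": 2, "hits": [
                        {"_source": {"field1": "dinner carrot potato broccoli",
                                     "field2": "something here",
                                     "lastseen": 1000}}
                    ]}},
                },
                {
                    "key": "fish chicken something",
                    "doc_count": 2,
                    "first_one": {"hits": {"total": 2, "hits": [
                        {"_source": {"field1": "fish chicken something",
                                     "field2": "dinner",
                                     "lastseen": 2000}}
                    ]}},
                },
            ]
        }
    }
}

sources = top_documents(sample_response)
for source in sources:
    print(source["field1"], source["lastseen"])
```

The default argument values match the aggregation names used in the query above; if you rename the aggregations, pass the new names in.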
