Elasticsearch - Cardinality over Full Field Value

Problem description

I have a document that looks like this:

{
   "_id":"some_id_value",
   "_source":{
      "client":{
         "name":"x"
      },
      "project":{
         "name":"x November 2016"
      }
   }
}

I am trying to run a query that returns the number of unique project names for each client. For this, I use a cardinality aggregation on project.name. I am sure that this particular client has only 4 unique project names. However, when I run the query, I get a count of 5, which I know is wrong.

The project names all contain the name of the client. For instance, if the client is "X", the project names will be "X Testing November 2016", "X Jan 2016", etc. I don't know whether that matters.

This is the mapping for the document type:

{
   "mappings":{
      "vma_docs":{
         "properties":{
            "client":{
               "properties":{
                  "contact":{
                     "type":"string"
                  },
                  "name":{
                     "type":"string"
                  }
               }
            },
            "project":{
               "properties":{
                  "end_date":{
                     "format":"yyyy-MM-dd",
                     "type":"date"
                  },
                  "project_type":{
                     "type":"string"
                  },
                  "name":{
                     "type":"string"
                  },
                  "project_manager":{
                     "index":"not_analyzed",
                     "type":"string"
                  },
                  "start_date":{
                     "format":"yyyy-MM-dd",
                     "type":"date"
                  }
               }
            }
         }
      }
   }
}

This is my search query:

{
   "fields":[
      "client.name",
      "project.name"
   ],
   "query":{
      "bool":{
         "must":{
            "match":{
               "client.name":{
                  "operator":"and",
                  "query":"ABC systems"
               }
            }
         }
      }
   },
   "aggs":{
      "num_projects":{
         "cardinality":{
            "field":"project.name"
         }
      }
   },
   "size":5
}

These are the results I get (only 2 hits are shown for brevity). Note that the num_projects aggregation returns 5, but it should return 4, which is the total number of distinct projects.

{
   "hits":{
      "hits":[
         {
            "_score":5.8553367,
            "_type":"vma_docs",
            "_id":"AVTMIM9IBwwoAW3mzgKz",
            "fields":{
               "project.name":[
                  "ABC"
               ],
               "client.name":[
                  "ABC systems Pvt Ltd"
               ]
            },
            "_index":"vma"
         },
         {
            "_score":5.8553367,
            "_type":"vma_docs",
            "_id":"AVTMIM9YBwwoAW3mzgK2",
            "fields":{
               "project.name":[
                  "ABC"
               ],
               "client.name":[
                  "ABC systems Pvt Ltd"
               ]
            },
            "_index":"vma"
         }
      ],
      "total":18,
      "max_score":5.8553367
   },
   "_shards":{
      "successful":5,
      "failed":0,
      "total":5
   },
   "took":4,
   "aggregations":{
      "num_projects":{
         "value":5
      }
   },
   "timed_out":false
}

FYI: The project names are ABC, ABC Nov 2016, ABC retest November, ABC Mobile App

Recommended answer

You need the following mapping for your project.name field:

{
  "mappings": {
    "vma_docs": {
      "properties": {
        "client": {
          "properties": {
            "contact": {
              "type": "string"
            },
            "name": {
              "type": "string"
            }
          }
        },
        "project": {
          "properties": {
            "end_date": {
              "format": "yyyy-MM-dd",
              "type": "date"
            },
            "project_type": {
              "type": "string"
            },
            "name": {
              "type": "string",
              "fields": {
                "raw": {
                  "type": "string",
                  "index": "not_analyzed"
                }
              }
            },
            "project_manager": {
              "index": "not_analyzed",
              "type": "string"
            },
            "start_date": {
              "format": "yyyy-MM-dd",
              "type": "date"
            }
          }
        }
      }
    }
  }
}
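As a rough illustration of why the sub-field matters: a string field without "index": "not_analyzed" is analyzed, so each value is split into separate terms, and a cardinality aggregation counts those terms rather than whole values. The sketch below uses a crude stand-in for an analyzer (lowercase, split on non-alphanumeric characters). That tokenizer is an assumption for illustration, not Elasticsearch's actual implementation, and it is not meant to reproduce the exact count of 5 reported in the question, which depends on the data actually indexed.

```python
import re

# Project names quoted in the question.
project_names = ["ABC", "ABC Nov 2016", "ABC retest November", "ABC Mobile App"]


def simple_analyze(text):
    """Crude stand-in for an analyzer (an assumption, not the real one):
    lowercase the text and split it on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


# Analyzed field: every token becomes its own indexed term.
analyzed_terms = {tok for name in project_names for tok in simple_analyze(name)}

# not_analyzed field: the whole value is kept as a single term.
raw_terms = set(project_names)

print("distinct terms in the analyzed field:", len(analyzed_terms))
print("distinct terms in the raw field:", len(raw_terms))
```

Only the raw variant yields one term per project name, which is what a count of unique project names needs.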

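The mapping adds a sub-field called raw: the value put in project.name is also put in project.name.raw, but without being touched (no tokenizing or analyzing). Because the original project.name is an analyzed string, the cardinality aggregation was counting distinct tokens rather than distinct full names. Documents indexed before the mapping change need to be reindexed so that project.name.raw is populated. The query you then need to use is: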

{
  "fields": [
    "client.name",
    "project.name"
  ],
  "query": {
    "bool": {
      "must": {
        "match": {
          "client.name": {
            "operator": "and",
            "query": "ABC systems"
          }
        }
      }
    }
  },
  "aggs": {
    "num_projects": {
      "cardinality": {
        "field": "project.name.raw"
      }
    }
  },
  "size": 5
}
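The question asks for a count per client, while the query above counts projects for a single matched client. One possible way to get per-client counts is to nest the cardinality aggregation under a terms aggregation. The sketch below builds such a request body as a Python dict. Note that client.name.raw is hypothetical: the mapping above only defines a raw sub-field for project.name, so an equivalent not_analyzed sub-field would have to be added to client.name first. The aggregation name clients and the helper function are also made up for illustration.

```python
import json


def per_client_project_counts(client_field="client.name.raw",
                              project_field="project.name.raw"):
    """Build a request body that buckets documents by client and counts
    the distinct project names inside each bucket.

    client_field defaults to a hypothetical not_analyzed sub-field that
    is NOT part of the mapping shown in the answer.
    """
    return {
        "size": 0,  # return only aggregation results, no hits
        "aggs": {
            "clients": {
                "terms": {"field": client_field},
                "aggs": {
                    "num_projects": {
                        "cardinality": {"field": project_field}
                    }
                },
            }
        },
    }


body = per_client_project_counts()
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a search request. Each bucket under clients would then carry its own num_projects value.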
