Elasticsearch - Cardinality over Full Field Value
Question
I have a document that looks like this:
{
  "_id": "some_id_value",
  "_source": {
    "client": {
      "name": "x"
    },
    "project": {
      "name": "x November 2016"
    }
  }
}
I am trying to run a query that returns the count of unique project names for each client. For this, I am using a cardinality aggregation on project.name. I am sure that there are only 4 unique project names for this particular client. However, when I run my query, I get a count of 5, which I know is wrong.
The project names all contain the name of the client. For instance, if a client is "X", the project names will be "X Testing November 2016", "X Jan 2016", and so on. I don't know whether that matters.
This is the mapping for the document type:
{
  "mappings": {
    "vma_docs": {
      "properties": {
        "client": {
          "properties": {
            "contact": {
              "type": "string"
            },
            "name": {
              "type": "string"
            }
          }
        },
        "project": {
          "properties": {
            "end_date": {
              "format": "yyyy-MM-dd",
              "type": "date"
            },
            "project_type": {
              "type": "string"
            },
            "name": {
              "type": "string"
            },
            "project_manager": {
              "index": "not_analyzed",
              "type": "string"
            },
            "start_date": {
              "format": "yyyy-MM-dd",
              "type": "date"
            }
          }
        }
      }
    }
  }
}
This is my search query:
{
  "fields": [
    "client.name",
    "project.name"
  ],
  "query": {
    "bool": {
      "must": {
        "match": {
          "client.name": {
            "operator": "and",
            "query": "ABC systems"
          }
        }
      }
    }
  },
  "aggs": {
    "num_projects": {
      "cardinality": {
        "field": "project.name"
      }
    }
  },
  "size": 5
}
These are the results I get (I have posted only 2 hits for brevity). Note that the num_projects aggregation returns 5, but it should return 4, the total number of projects.
{
  "hits": {
    "hits": [
      {
        "_score": 5.8553367,
        "_type": "vma_docs",
        "_id": "AVTMIM9IBwwoAW3mzgKz",
        "fields": {
          "project.name": [
            "ABC"
          ],
          "client.name": [
            "ABC systems Pvt Ltd"
          ]
        },
        "_index": "vma"
      },
      {
        "_score": 5.8553367,
        "_type": "vma_docs",
        "_id": "AVTMIM9YBwwoAW3mzgK2",
        "fields": {
          "project.name": [
            "ABC"
          ],
          "client.name": [
            "ABC systems Pvt Ltd"
          ]
        },
        "_index": "vma"
      }
    ],
    "total": 18,
    "max_score": 5.8553367
  },
  "_shards": {
    "successful": 5,
    "failed": 0,
    "total": 5
  },
  "took": 4,
  "aggregations": {
    "num_projects": {
      "value": 5
    }
  },
  "timed_out": false
}
FYI: The project names are ABC, ABC Nov 2016, ABC retest November, and ABC Mobile App.
Recommended answer
You need the following mapping for your project.name field:
{
  "mappings": {
    "vma_docs": {
      "properties": {
        "client": {
          "properties": {
            "contact": {
              "type": "string"
            },
            "name": {
              "type": "string"
            }
          }
        },
        "project": {
          "properties": {
            "end_date": {
              "format": "yyyy-MM-dd",
              "type": "date"
            },
            "project_type": {
              "type": "string"
            },
            "name": {
              "type": "string",
              "fields": {
                "raw": {
                  "type": "string",
                  "index": "not_analyzed"
                }
              }
            },
            "project_manager": {
              "index": "not_analyzed",
              "type": "string"
            },
            "start_date": {
              "format": "yyyy-MM-dd",
              "type": "date"
            }
          }
        }
      }
    }
  }
}
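The mapping above uses the pre-5.0 string type with "index": "not_analyzed". In Elasticsearch 5.0 and later, string was split into text and keyword, so the same idea would be written differently there. As a rough sketch only, the hypothetical helper below (string_with_raw_subfield is not part of any Elasticsearch client) builds the project.name mapping fragment in either syntax:

```python
import json


def string_with_raw_subfield(legacy=True):
    """Return a mapping fragment: an analyzed field plus an untouched `raw` subfield.

    legacy=True  -> pre-5.0 syntax ("string" + "not_analyzed"), as in the answer.
    legacy=False -> 5.0+ syntax ("text" + "keyword").
    This helper is hypothetical and only builds a plain dict.
    """
    if legacy:
        raw = {"type": "string", "index": "not_analyzed"}
        return {"type": "string", "fields": {"raw": raw}}
    return {"type": "text", "fields": {"raw": {"type": "keyword"}}}


# Fragment that would sit under project -> properties -> name on 5.0+.
project_name_mapping = string_with_raw_subfield(legacy=False)
print(json.dumps(project_name_mapping, indent=2))
```

In either syntax, documents indexed before the subfield was added would most likely need to be reindexed before project.name.raw holds any values.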
The mapping adds a subfield called raw: the same value that goes into project.name is also stored in project.name.raw, but without being tokenized or analyzed. A cardinality aggregation on the analyzed project.name field counts distinct tokens rather than distinct whole names, which is why the count is off. The query you need to use is:
{
  "fields": [
    "client.name",
    "project.name"
  ],
  "query": {
    "bool": {
      "must": {
        "match": {
          "client.name": {
            "operator": "and",
            "query": "ABC systems"
          }
        }
      }
    }
  },
  "aggs": {
    "num_projects": {
      "cardinality": {
        "field": "project.name.raw"
      }
    }
  },
  "size": 5
}
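To see why the field choice changes the count, here is a small Python sketch that runs without Elasticsearch. It assumes a very rough stand-in for the analyzer (lowercase, then split on non-alphanumeric characters); the real analyzer differs in its details, and the exact number Elasticsearch reports also depends on which documents match the query, so this is not meant to reproduce the 5 from the question.

```python
import re

# The four project names listed in the question.
project_names = [
    "ABC",
    "ABC Nov 2016",
    "ABC retest November",
    "ABC Mobile App",
]


def rough_tokens(value):
    """Rough analyzer stand-in (an assumption): lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", value.lower()) if tok]


# An analyzed field exposes individual tokens to the aggregation.
analyzed_terms = {tok for name in project_names for tok in rough_tokens(name)}

# A not_analyzed subfield exposes each whole value as a single term.
raw_terms = set(project_names)

print(sorted(analyzed_terms))
print(len(analyzed_terms), len(raw_terms))
```

With this simplified tokenizer the analyzed view contains 7 distinct terms while the raw view contains the expected 4, which is the same kind of mismatch the question describes.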