Aggregation on filtered, nested inner_hits query in ElasticSearch


Question


I'm only a few days new to ElasticSearch, and as a learning exercise have implemented a rudimentary job scraper that aggregates jobs from a few job listing sites and populates an index with some data for me to play with.

My index contains a document for each website that lists jobs. A property of each of these documents is a 'jobs' array, which contains an object for each job that exists on that site. I am considering indexing each job as its own document (especially since the ElasticSearch documentation says that inner_hits is an experimental feature) but for now, I am trying to see if I can accomplish what I want to do using the inner_hits and nested features of ElasticSearch.

I am able to query, filter, and return back only matching jobs. However, I am not sure how to apply the same inner_hits constraints to an aggregation.

This is my mapping:

{
  "jobsitesIdx" : {
    "mappings" : {
      "sites" : {
        "properties" : {
          "createdAt" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "jobs" : {
            "type" : "nested",
            "properties" : {
              "company" : {
                "type" : "string"
              },
              "engagement" : {
                "type" : "string"
              },
              "link" : {
                "type" : "string",
                "index" : "not_analyzed"
              },
              "location" : {
                "type" : "string",
                "fields" : {
                  "raw" : {
                    "type" : "string",
                    "index" : "not_analyzed"
                  }
                }
              },
              "title" : {
                "type" : "string"
              }
            }
          },
          "jobscount" : {
            "type" : "long"
          },
          "sitename" : {
            "type" : "string"
          },
          "url" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

This is a query and aggregate that I am trying (from Node.js):

client.search({
  "index": 'jobsitesIdx',
  "type": 'sites',
  "body": {


    "aggs" : {
            "jobs" : {
                "nested" : {
                    "path" : "jobs"
                },
                "aggs" : {
                    "location" : { "terms" : { "field" : "jobs.location.raw", "size": 25 } },
                    "company" : { "terms" : { "field" : "jobs.company.raw", "size": 25 } }
                }
            }
        },


    "query": {
        "filtered": {
          "query": {"match_all": {}},
          "filter": {
            "nested": {
              "inner_hits" : { "size": 1000 },
              "path": "jobs",
              "query":{
                "filtered": {
                  "query": { "match_all": {}},
                  "filter": {
                    "and": [
                      {"term": {"jobs.location": "york"}},
                      {"term": {"jobs.location": "new"}}
                    ]
                  }
                }
              }
            }
          }
        }
      }
  }
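}, function (error, response) {
  if (error) {
    console.error(error);
    return;
  }

  response.hits.hits.forEach(function (jobsite) {
    // inner_hits holds only the nested jobs that matched the nested filter
    var jobs = jobsite.inner_hits.jobs.hits.hits;

    jobs.forEach(function (job) {
      console.log(job);
    });
  });

  console.log(response.aggregations.jobs.location.buckets);
});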

This gives me back all inner_hits of jobs in New York, but the aggregate is showing me counts for every location and company, not just the ones matching the inner_hits.

Any suggestions on how to get the aggregate on only the data contained in the matching inner_hits?

Edit: I am updating this to include an export of the mapping and index data, as requested. I exported this using Taskrabbit's elasticdump tool, found here: https://github.com/taskrabbit/elasticsearch-dump

The index: http://pastebin.com/WaZwBwn4
The mapping: http://pastebin.com/ZkGnYN94

The above linked data differs from the sample code in my original question in that the index is named jobsites6 in the data instead of jobsitesIdx as referred to in the question. Also, the type in the data is 'job' whereas in the code above it is 'sites'.

I've filled in the callback in the code above to display the response data. As expected, the forEach loop over the inner_hits shows only jobs in New York. However, I am seeing this aggregation for location:

[ { key: 'New York, NY', doc_count: 243 },
  { key: 'San Francisco, CA', doc_count: 92 },
  { key: 'Chicago, IL', doc_count: 43 },
  { key: 'Boston, MA', doc_count: 39 },
  { key: 'Berlin, Germany', doc_count: 22 },
  { key: 'Seattle, WA', doc_count: 22 },
  { key: 'Los Angeles, CA', doc_count: 20 },
  { key: 'Austin, TX', doc_count: 18 },
  { key: 'Anywhere', doc_count: 16 },
  { key: 'Cupertino, CA', doc_count: 15 },
  { key: 'Washington D.C.', doc_count: 14 },
  { key: 'United States', doc_count: 11 },
  { key: 'Atlanta, GA', doc_count: 10 },
  { key: 'London, UK', doc_count: 10 },
  { key: 'Ulm, Deutschland', doc_count: 10 },
  { key: 'Riverton, UT', doc_count: 9 },
  { key: 'San Diego, CA', doc_count: 9 },
  { key: 'Charlotte, NC', doc_count: 8 },
  { key: 'Irvine, CA', doc_count: 8 },
  { key: 'London', doc_count: 8 },
  { key: 'San Mateo, CA', doc_count: 8 },
  { key: 'Boulder, CO', doc_count: 7 },
  { key: 'Houston, TX', doc_count: 7 },
  { key: 'Palo Alto, CA', doc_count: 7 },
  { key: 'Sydney, Australia', doc_count: 7 } ]

Since my inner_hits are limited to those in New York, I can see that the aggregation is not on my inner_hits because it is giving me counts for all locations.
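
The mismatch can be reproduced without Elasticsearch. The following is a toy in-memory sketch, not how Elasticsearch is implemented: the site and job data are made up, and `isNewYork` and `countLocations` are hypothetical stand-ins for the nested filter and the terms aggregation. It assumes the query selects parent documents, and that a bare nested aggregation then visits every nested job of those parents, while inner_hits only affects which jobs are returned.

```javascript
// Toy in-memory model (no Elasticsearch involved) of why the counts differ.
// All data below is hypothetical.
const sites = [
  {
    sitename: "site-a",
    jobs: [{ location: "New York, NY" }, { location: "Boston, MA" }],
  },
  {
    sitename: "site-b",
    jobs: [{ location: "Chicago, IL" }],
  },
];

// Stand-in for the nested filter: does a job's location mention New York?
const isNewYork = (job) => /new york/i.test(job.location);

// The query selects parent documents that have at least one matching job.
const matchingSites = sites.filter((site) => site.jobs.some(isNewYork));

// Count locations over a list of jobs, similar to a terms aggregation.
function countLocations(jobs) {
  const counts = {};
  for (const job of jobs) {
    counts[job.location] = (counts[job.location] || 0) + 1;
  }
  return counts;
}

// A bare nested aggregation sees every job of the matching parents.
const unfiltered = countLocations(matchingSites.flatMap((s) => s.jobs));

// Applying the job-level filter again restricts the counts to matching jobs.
const filtered = countLocations(
  matchingSites.flatMap((s) => s.jobs).filter(isNewYork)
);

console.log(unfiltered);
console.log(filtered);
```

In this model the Boston job is counted in `unfiltered` because its parent site matched, even though the job itself did not.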

Solution

You can achieve this by adding the same filter to your aggregation so that it only includes New York jobs: wrap the two terms aggregations in a filter aggregation (only_loc in the request below). Also note that your second aggregation uses jobs.company.raw, but in your mapping the jobs.company field has no not_analyzed sub-field named raw, so you probably need to add one if you want to aggregate on the unanalyzed company name.
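
As a sketch of the missing sub-field, the jobs.company mapping could mirror the way jobs.location is mapped in the question. The object below is only an illustration written as a Node.js literal; the variable name `companyMapping` is hypothetical, and I am assuming that existing documents would need to be reindexed before the new sub-field contains any values.

```javascript
// Sketch of a mapping for jobs.company with a not_analyzed "raw" sub-field,
// mirroring how jobs.location is mapped in the question.
// "companyMapping" is a hypothetical name used only for this illustration.
const companyMapping = {
  company: {
    type: "string",
    fields: {
      raw: { type: "string", index: "not_analyzed" },
    },
  },
};

console.log(JSON.stringify(companyMapping, null, 2));
```

With a sub-field like this in place, the company aggregation could use jobs.company.raw instead of jobs.company. The request with the filter added to the aggregation follows.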

{
  "_source": [
    "sitename"
  ],
  "query": {
    "filtered": {
      "filter": {
        "nested": {
          "inner_hits": {
            "size": 1000
          },
          "path": "jobs",
          "query": {
            "filtered": {
              "filter": {
                "terms": {
                  "jobs.location": [
                    "new",
                    "york"
                  ]
                }
              }
            }
          }
        }
      }
    }
  },
  "aggs": {
    "jobs": {
      "nested": {
        "path": "jobs"
      },
      "aggs": {
        "only_loc": {
          "filter": {
            "terms": {
              "jobs.location": [
                "new",
                "york"
              ]
            }
          },
          "aggs": {
            "location": {
              "terms": {
                "field": "jobs.location.raw",
                "size": 25
              }
            },
            "company": {
              "terms": {
                "field": "jobs.company",
                "size": 25
              }
            }
          }
        }
      }
    }
  }
}
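
Because the same filter now appears twice, it may help to build the request body in Node.js from a single filter object so the two copies cannot drift apart. This is a minimal sketch under that assumption: `buildSearchBody` and `locationBuckets` are hypothetical helper names, not part of the Elasticsearch client, and the body simply reproduces the request above.

```javascript
// Hypothetical helper: builds the search body so that the nested query and
// the filter aggregation always share one job-level filter.
function buildSearchBody(jobFilter) {
  return {
    _source: ["sitename"],
    query: {
      filtered: {
        filter: {
          nested: {
            path: "jobs",
            inner_hits: { size: 1000 },
            query: { filtered: { filter: jobFilter } },
          },
        },
      },
    },
    aggs: {
      jobs: {
        nested: { path: "jobs" },
        aggs: {
          only_loc: {
            filter: jobFilter,
            aggs: {
              location: { terms: { field: "jobs.location.raw", size: 25 } },
              company: { terms: { field: "jobs.company", size: 25 } },
            },
          },
        },
      },
    },
  };
}

// Same filter as in the request above.
const body = buildSearchBody({ terms: { "jobs.location": ["new", "york"] } });

// Hypothetical accessor: the buckets now sit one level deeper, under only_loc.
function locationBuckets(response) {
  return response.aggregations.jobs.only_loc.location.buckets;
}

console.log(JSON.stringify(body.aggs.jobs.aggs.only_loc.filter));
```

The resulting `body` can be passed as the `body` option of the `client.search` call from the question, and the callback would then read `locationBuckets(response)` instead of `response.aggregations.jobs.location.buckets`. One caveat, as far as I know: a terms filter matches a job containing either token, whereas the `and` of two term filters in the question required both, so passing the question's `and` filter to `buildSearchBody` would keep the stricter behaviour.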
