Query the latest document of each type on Elasticsearch
Problem description
I'm trying to run what started to look like a simple query on Elasticsearch, but I just can't seem to get the result I'm looking for.
Here's a brief example of what I'm trying to do:
I have a database of news. Each piece of news contains a source, a headline, a timestamp and a user.
I want to get the last (timestamp-based) headline for each available source for a given user.
#!/bin/bash
export ELASTICSEARCH_ENDPOINT="http://localhost:9200"
# Create indexes
curl -XPUT "$ELASTICSEARCH_ENDPOINT/news" -d '{
  "mappings": {
    "news": {
      "properties": {
        "source": { "type": "string", "index": "not_analyzed" },
        "headline": { "type": "object" },
        "timestamp": { "type": "date", "format": "date_hour_minute_second_millis" },
        "user": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'
# Index documents
curl -XPOST "$ELASTICSEARCH_ENDPOINT/_bulk?refresh=true" -d '
{"index":{"_index":"news","_type":"news"}}
{"user": "John", "source": "CNN", "headline": "Great news", "timestamp": "2015-07-28T00:07:29.000"}
{"index":{"_index":"news","_type":"news"}}
{"user": "John", "source": "CNN", "headline": "More great news", "timestamp": "2015-07-28T00:08:23.000"}
{"index":{"_index":"news","_type":"news"}}
{"user": "John", "source": "ESPN", "headline": "Sports news", "timestamp": "2015-07-28T00:09:32.000"}
{"index":{"_index":"news","_type":"news"}}
{"user": "John", "source": "ESPN", "headline": "More sports news", "timestamp": "2015-07-28T00:10:35.000"}
{"index":{"_index":"news","_type":"news"}}
{"user": "Mary", "source": "Yahoo", "headline": "More news", "timestamp": "2015-07-28T00:11:54.000"}
{"index":{"_index":"news","_type":"news"}}
{"user": "Mary", "source": "Yahoo", "headline": "Crazy news", "timestamp": "2015-07-28T00:12:31.000"}
'
So how do I get the last CNN and last ESPN headlines from John for example?
I've been looking into the multi search API, but this would mean that I would need to know all the sources beforehand (in this case CNN and ESPN).
First, please note that I had to change your mapping for the headline field to string, as the headlines in your sample documents are strings and not objects.
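The corrected mapping might look like the sketch below. It keeps the field types from the question and changes only headline to string. The function names news_mapping and create_news_index are hypothetical; wrapping the JSON in a function is just a convenience so the body can be inspected before it is sent.

```shell
#!/bin/bash
# Prints the index definition with "headline" mapped as a string
# (the question mapped it as an object, which does not match the documents).
news_mapping() {
  cat <<'EOF'
{
  "mappings": {
    "news": {
      "properties": {
        "source": { "type": "string", "index": "not_analyzed" },
        "headline": { "type": "string" },
        "timestamp": { "type": "date", "format": "date_hour_minute_second_millis" },
        "user": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}
EOF
}

# Creates the index; assumes ELASTICSEARCH_ENDPOINT is exported as in the question.
create_news_index() {
  curl -XPUT "$ELASTICSEARCH_ENDPOINT/news" -d "$(news_mapping)"
}
```

Calling create_news_index before the bulk request from the question would then index the sample documents against this mapping.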
So, a query like the following one would retrieve what you expect:
# Filter on user=John, aggregate by source, and for each source keep only
# the latest hit (size 1, sorted by timestamp descending), returning just
# the headline and timestamp.
curl -XPOST "$ELASTICSEARCH_ENDPOINT/news/_search" -d '{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "user": "John"
        }
      }
    }
  },
  "aggs": {
    "sources": {
      "terms": {
        "field": "source"
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "_source": [
              "headline",
              "timestamp"
            ],
            "sort": {
              "timestamp": "desc"
            }
          }
        }
      }
    }
  }
}'
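Note that the filtered query used above was deprecated in Elasticsearch 2.0 and removed in 5.0. On newer versions, a roughly equivalent request can be sketched with a bool query and a filter clause. This variant is an assumption and not part of the original answer: it presumes that user and source are indexed as exact values (not_analyzed or keyword), and the function names are hypothetical. The Content-Type header is required from Elasticsearch 6.0 onward.

```shell
#!/bin/bash
# Prints the search body for a given user.
# Note: the user name is interpolated naively, so it must not contain
# double quotes or backslashes.
latest_per_source_query() {
  local user="$1"
  cat <<EOF
{
  "size": 0,
  "query": {
    "bool": {
      "filter": { "term": { "user": "$user" } }
    }
  },
  "aggs": {
    "sources": {
      "terms": { "field": "source" },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "_source": ["headline", "timestamp"],
            "sort": { "timestamp": "desc" }
          }
        }
      }
    }
  }
}
EOF
}

# Sends the query; assumes ELASTICSEARCH_ENDPOINT is exported as in the question.
search_latest_per_source() {
  curl -s -XPOST "$ELASTICSEARCH_ENDPOINT/news/_search" \
    -H 'Content-Type: application/json' \
    -d "$(latest_per_source_query "$1")"
}
```

For example, search_latest_per_source John would request the latest headline per source for John. The aggregation part is unchanged; only the query wrapper differs.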
That will yield something like this:
{
  ...
  "aggregations" : {
    "sources" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ {
        "key" : "CNN",
        "doc_count" : 2,
        "latest" : {
          "hits" : {
            "total" : 2,
            "max_score" : null,
            "hits" : [ {
              "_index" : "news",
              "_type" : "news",
              "_id" : "AU7Sh3VDGDddn2ZNuDVl",
              "_score" : null,
              "_source" : {
                "headline" : "More great news",
                "timestamp" : "2015-07-28T00:08:23.000"
              },
              "sort" : [ 1438042103000 ]
            } ]
          }
        }
      }, {
        "key" : "ESPN",
        "doc_count" : 2,
        "latest" : {
          "hits" : {
            "total" : 2,
            "max_score" : null,
            "hits" : [ {
              "_index" : "news",
              "_type" : "news",
              "_id" : "AU7Sh3VDGDddn2ZNuDVn",
              "_score" : null,
              "_source" : {
                "headline" : "More sports news",
                "timestamp" : "2015-07-28T00:10:35.000"
              },
              "sort" : [ 1438042235000 ]
            } ]
          }
        }
      } ]
    }
  }
}
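To pull just the latest headline per source out of a response shaped like the one above, one option is to pipe it through jq. This is a sketch that assumes jq is installed; jq is a separate tool and not part of the original answer. The function name latest_headlines and the trimmed sample_response are hypothetical, and the sample keeps only the fields the filter reads.

```shell
#!/bin/bash
# Reads a search response on stdin and prints one "source<TAB>headline"
# line per bucket of the "sources" aggregation.
latest_headlines() {
  jq -r '.aggregations.sources.buckets[]
         | "\(.key)\t\(.latest.hits.hits[0]._source.headline)"'
}

# Trimmed copy of the response above, used only for illustration.
sample_response='{"aggregations":{"sources":{"buckets":[
  {"key":"CNN","latest":{"hits":{"hits":[{"_source":{"headline":"More great news","timestamp":"2015-07-28T00:08:23.000"}}]}}},
  {"key":"ESPN","latest":{"hits":{"hits":[{"_source":{"headline":"More sports news","timestamp":"2015-07-28T00:10:35.000"}}]}}}
]}}}'

# Usage against the sample:
#   printf '%s' "$sample_response" | latest_headlines
# Usage against a live cluster (same request as in the answer):
#   curl -s -XPOST "$ELASTICSEARCH_ENDPOINT/news/_search" -d '...' | latest_headlines
```

Because each bucket holds exactly one hit (size 1), indexing hits[0] is enough; no source names need to be known in advance.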