使用python聚合elasticsearch-dsl中的字段 [英] aggregate a field in elasticsearch-dsl using python
问题描述
有人可以告诉我如何编写将汇总(汇总和计数)有关我的文档内容的Python语句吗?
Can someone tell me how to write Python statements that will aggregate (sum and count) stuff about my documents?
SCRIPT
from datetime import datetime
from elasticsearch_dsl import DocType, String, Date, Integer
from elasticsearch_dsl.connections import connections
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
# Define a default Elasticsearch client
client = connections.create_connection(hosts=['http://blahblahblah:9200'])
s = Search(using=client, index="attendance")
s = s.execute()
for tag in s.aggregations.per_tag.buckets:
print (tag.key)
输出
OUTPUT
File "/Library/Python/2.7/site-packages/elasticsearch_dsl/utils.py", line 106, in __getattr__
'%r object has no attribute %r' % (self.__class__.__name__, attr_name))
AttributeError: 'Response' object has no attribute 'aggregations'
这是什么原因? aggregations关键字是否错误?我还需要导入其他软件包吗?如果出勤索引中的文档有一个名为emailAddress的字段,我该如何计算哪些文档具有该字段的值?
What is causing this? Is the "aggregations" keyword wrong? Is there some other package I need to import? If a document in the "attendance" index has a field called emailAddress, how would I count which documents have a value for that field?
推荐答案
首先。现在我注意到,我在这里写的内容实际上没有定义聚合。对我来说,有关如何使用它的文档不是很可读。使用上面的内容,我将进行扩展。我正在更改索引名称以作为更好的示例。
First of all. I notice now that what I wrote here, actually has no aggregations defined. The documentation on how to use this is not very readable for me. Using what I wrote above, I'll expand. I'm changing the index name to make for a nicer example.
from datetime import datetime
from elasticsearch_dsl import DocType, String, Date, Integer
from elasticsearch_dsl.connections import connections
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q
# Define a default Elasticsearch client
client = connections.create_connection(hosts=['http://blahblahblah:9200'])
s = Search(using=client, index="airbnb", doc_type="sleep_overs")
s = s.execute()
# invalid! You haven't defined an aggregation.
#for tag in s.aggregations.per_tag.buckets:
# print (tag.key)
# Lets make an aggregation
# 'by_house' is a name you choose, 'terms' is a keyword for the type of aggregator
# 'field' is also a keyword, and 'house_number' is a field in our ES index
s.aggs.bucket('by_house', 'terms', field='house_number', size=0)
在上面每个门牌号码创建1个存储桶。因此,存储桶的名称将是门牌号。 ElasticSearch(ES)将始终提供适合该存储桶的文档的文档计数。 Size = 0表示要使用所有结果,因为ES的默认设置是仅返回10个结果(或您的开发人员设置为执行此操作的任何结果)。
Above we're creating 1 bucket per house number. Therefore, the name of the bucket will be the house number. ElasticSearch (ES) will always give a document count of documents fitting into that bucket. Size=0 means to give use all results, since ES has a default setting to return 10 results only (or whatever your dev set it up to do).
# This runs the query.
s = s.execute()
# let's see what's in our results
print s.aggregations.by_house.doc_count
print s.hits.total
print s.aggregations.by_house.buckets
for item in s.aggregations.by_house.buckets:
print item.doc_count
我以前的错误是在认为弹性搜索查询默认具有聚合时。您可以自己定义它们,然后执行它们。然后,您的响应可以由您提到的聚合器拆分。
My mistake before was thinking an Elastic Search query had aggregations by default. You sort of define them yourself, then execute them. Then your response can be split b the aggregators you mentioned.
上面的CURL应该如下所示:
注意:我使用SENSE一个ElasticSearch Google Chrome浏览器的插件/扩展程序/附加程序。在SENSE中,您可以使用//注释掉。
The CURL for the above should look like:
NOTE: I use SENSE an ElasticSearch plugin/extension/add-on for Google Chrome. In SENSE you can use // to comment things out.
POST /airbnb/sleep_overs/_search
{
// the size 0 here actually means to not return any hits, just the aggregation part of the result
"size": 0,
"aggs": {
"by_house": {
"terms": {
// the size 0 here means to return all results, not just the the default 10 results
"field": "house_number",
"size": 0
}
}
}
}
解决方法。 DSL的GIT上的某人告诉我忘记翻译,而只是使用这种方法。它更简单,您只需用CURL编写难懂的内容。这就是为什么我将其称为解决方法。
Work-around. Someone on the GIT of DSL told me to forget translating, and just use this method. It's simpler, and you can just write the tough stuff in CURL. That's why I call it a work-around.
# Define a default Elasticsearch client
client = connections.create_connection(hosts=['http://blahblahblah:9200'])
s = Search(using=client, index="airbnb", doc_type="sleep_overs")
# how simple we just past CURL code here
body = {
"size": 0,
"aggs": {
"by_house": {
"terms": {
"field": "house_number",
"size": 0
}
}
}
}
s = Search.from_dict(body)
s = s.index("airbnb")
s = s.doc_type("sleepovers")
body = s.to_dict()
t = s.execute()
for item in t.aggregations.by_house.buckets:
# item.key will the house number
print item.key, item.doc_count
希望这会有所帮助。现在,我在CURL中设计所有内容,然后使用Python语句剥离结果以获取所需的内容。这有助于进行多个级别的汇总(子汇总)。
Hope this helps. I now design everything in CURL, then use Python statement to peel away at the results to get what I want. This helps for aggregations with multiple levels (sub-aggregations).
这篇关于使用python聚合elasticsearch-dsl中的字段的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!