aggregate a field in elasticsearch-dsl using python


QUESTION

Can someone tell me how to write Python statements that will aggregate (sum and count) stuff about my documents?

SCRIPT

from datetime import datetime
from elasticsearch_dsl import DocType, String, Date, Integer
from elasticsearch_dsl.connections import connections

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q

# Define a default Elasticsearch client
client = connections.create_connection(hosts=['http://blahblahblah:9200'])

s = Search(using=client, index="attendance")
s = s.execute()

for tag in s.aggregations.per_tag.buckets:
    print (tag.key)

OUTPUT

File "/Library/Python/2.7/site-packages/elasticsearch_dsl/utils.py", line 106, in __getattr__
'%r object has no attribute %r' % (self.__class__.__name__, attr_name))
AttributeError: 'Response' object has no attribute 'aggregations'

What is causing this? Is the "aggregations" keyword wrong? Is there some other package I need to import? If a document in the "attendance" index has a field called emailAddress, how would I count which documents have a value for that field?

ANSWER

First of all, I notice now that what I wrote above actually has no aggregation defined. The documentation on how to do this was not very readable for me, so I'll expand on the script above. I'm also changing the index name to make for a nicer example.

from datetime import datetime
from elasticsearch_dsl import DocType, String, Date, Integer
from elasticsearch_dsl.connections import connections

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Q

# Define a default Elasticsearch client
client = connections.create_connection(hosts=['http://blahblahblah:9200'])

s = Search(using=client, index="airbnb", doc_type="sleep_overs")

# invalid! You haven't defined an aggregation.
# (Also, don't execute here: that would replace the Search with a Response,
# and you could no longer add aggregations to it.)
#for tag in s.aggregations.per_tag.buckets:
#    print(tag.key)

# Let's make an aggregation.
# 'by_house' is a name you choose, 'terms' is a keyword for the type of aggregator.
# 'field' is also a keyword, and 'house_number' is a field in our ES index.
s.aggs.bucket('by_house', 'terms', field='house_number', size=0)

Above we're creating one bucket per house number, so the name of each bucket will be the house number. ElasticSearch (ES) always gives a document count of the documents falling into each bucket. size=0 means to give us all buckets, since ES defaults to returning only the top 10 results (or whatever your dev set it up to do).

# This runs the query.
s = s.execute()

# let's see what's in our results

print(s.aggregations.by_house.doc_count)
print(s.hits.total)
print(s.aggregations.by_house.buckets)

for item in s.aggregations.by_house.buckets:
    print(item.doc_count)
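The original question also asked how to sum values and how to count only the documents that have a value for a field such as emailAddress. Both fit the same request-body pattern: an exists query restricts the count, and a sum metric aggregation adds things up. A minimal sketch of such a body, assuming a hypothetical numeric field `beds` (emailAddress comes from the question; `beds` is made up for illustration):

```python
# Request body: count only documents that have a value for "emailAddress",
# and sum a (hypothetical) numeric field "beds" across those documents.
body = {
    "size": 0,  # don't return hits themselves, just totals and aggregations
    "query": {"exists": {"field": "emailAddress"}},
    "aggs": {
        "total_beds": {"sum": {"field": "beds"}}
    }
}
```

In the response, hits.total is the count of documents with an emailAddress value, and aggregations.total_beds.value is the sum. You can pass this body to Search.from_dict() exactly like the work-around later in this answer.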

My mistake before was thinking an Elasticsearch query had aggregations by default. You define them yourself, then execute the query. Your response can then be split by the aggregators you defined.

The CURL for the above should look like the following.

NOTE: I use SENSE, an ElasticSearch plugin/extension/add-on for Google Chrome. In SENSE you can use // to comment things out.

POST /airbnb/sleep_overs/_search
{
// the size 0 here actually means to not return any hits, just the aggregation part of the result
    "size": 0,
    "aggs": {
        "by_house": {
            "terms": {
// the size 0 here means to return all buckets, not just the default 10
                "field": "house_number",
                "size": 0
            }
        }
    }
}

Work-around: someone on the GitHub repo for the DSL told me to forget translating and just use this method. It's simpler, and you can write the tough stuff in CURL form. That's why I call it a work-around.

# Define a default Elasticsearch client
client = connections.create_connection(hosts=['http://blahblahblah:9200'])
s = Search(using=client, index="airbnb", doc_type="sleep_overs")

# how simple: we just paste the CURL body here
body = {
    "size": 0,
    "aggs": {
        "by_house": {
            "terms": {
                "field": "house_number",
                "size": 0
            }
        }
    }
}

# from_dict builds a fresh Search (it uses the default connection
# registered above), so we set the index and doc_type again
s = Search.from_dict(body)
s = s.index("airbnb")
s = s.doc_type("sleep_overs")
body = s.to_dict()

t = s.execute()

for item in t.aggregations.by_house.buckets:
    # item.key will be the house number
    print(item.key, item.doc_count)

Hope this helps. I now design everything in CURL, then use Python statements to peel away at the results to get what I want. This helps with aggregations that have multiple levels (sub-aggregations).
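A multi-level body follows the same pattern: nest an "aggs" block inside a bucket aggregation. A minimal sketch of a two-level aggregation, again using the hypothetical numeric field `beds`, that averages beds within each house_number bucket:

```python
# Two-level aggregation: bucket by house_number, then average the
# (hypothetical) "beds" field within each bucket.
body = {
    "size": 0,
    "aggs": {
        "by_house": {
            "terms": {"field": "house_number"},
            "aggs": {  # sub-aggregation, computed per bucket
                "avg_beds": {"avg": {"field": "beds"}}
            }
        }
    }
}
```

Passed through Search.from_dict() as above, each bucket in the response then carries its own avg_beds.value alongside its doc_count.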
