MongoDB aggregate query, or too complex?


Problem description

I have a dataset which looks like this:

{ uid: 1000000, from: "aaa", to: "bbb": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 1000000, from: "aaa", to: "bbb": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 1000000, from: "bbb", to: "ccc": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 1000000, from: "bbb", to: "ccc": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 2000000, from: "aaa", to: "bbb": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 2000000, from: "aaa", to: "bbb": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 2000000, from: "aaa", to: "bbb": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 2000000, from: "aaa", to: "bbb": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 3000000, from: "aaa", to: "aaa": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 3000000, from: "bbb", to: "bbb": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 3000000, from: "ccc", to: "ccc": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 3000000, from: "ddd", to: "bbb": timestamp: ISODate("2016-02-02T18:42:06.336Z") },
{ uid: 3000000, from: "eee", to: "eee": timestamp: ISODate("2016-02-02T18:42:06.336Z") }

Is it possible to get the desired output with a single mongo query, or do I have to run several separate ones? Hours is the hour of day for the document.

I should also note that the collection has roughly ~1 million entries each day, with about 400 different uids.

{ _id: 1000000, count: 4, from_count: 2, to_count: 2, hours: [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0 ] },
{ _id: 2000000, count: 4, from_count: 1, to_count: 1, hours: [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0 ] },
{ _id: 3000000, count: 5, from_count: 5, to_count: 5, hours: [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0 ] },

Answer

Though it should have been made clearer in the question, your sample output suggests that you are looking for:

  • The total count of messages per "uid"
  • The count of distinct values in "to"
  • The count of distinct values in "from"
  • A summary of counts per "hour" of the day for each "uid"

This is all possible in a single aggregation statement; it just takes some careful management of the distinct lists and then some manipulation to map results for each hour in a 24-hour period.

The best approach here is aided by operators introduced in MongoDB 3.2:

db.collection.aggregate([
    // First group by hour within "uid" and keep distinct "to" and "from"
    { "$group": {
        "_id": {
            "uid": "$uid",
            "time": { "$hour": "$timestamp" }
        },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "count": { "$sum": 1 }
    }},

    // Roll-up to "uid" and keep each hour in an array
    { "$group": {
        "_id": "$_id.uid",
        "total": { "$sum": "$count" },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "temp_hours": { 
            "$push": {
                "index": "$_id.time",
                "count": "$count"
            }
        }
     }},

     // Getting distinct "to" and "from" requires a double unwind of arrays
     { "$unwind": "$to" },
     { "$unwind": "$to" },
     { "$unwind": "$from" },
     { "$unwind": "$from" },

     // And then adding back to sets for distinct
     { "$group": {
        "_id": "$_id",
        "total": { "$first": "$total" },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "temp_hours": { "$first": "$temp_hours" }
     }},

     // Map out for each hour and count size of distinct lists
     { "$project": {
        "count": "$total",
        "from_count": { "$size": "$from" },
        "to_count": { "$size": "$to" },
        "hours": {
            "$map": {
                "input": [
                     0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                     12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
                 ],
                 "as": "el",
                 "in": {
                      "$ifNull": [
                          { "$arrayElemAt": [
                              { "$map": {
                                  "input": { "$filter": {
                                     "input": "$temp_hours",
                                     "as": "tmp",
                                     "cond": {
                                         "$eq": [ "$$el", "$$tmp.index" ]
                                     }
                                  }},
                                 "as": "out",
                                 "in": "$$out.count"
                              }},
                              0
                          ]},
                          0
                      ]
                 }
            }
        }
     }},

     // Optionally sort in "uid" order
     { "$sort": { "_id": 1 } }
 ])

Prior to MongoDB 3.2 you need to get a bit more involved to map the array content for all hours in the day:

db.collection.aggregate([

    // First group by hour within "uid" and keep distinct "to" and "from"
    { "$group": {
        "_id": {
            "uid": "$uid",
            "time": { "$hour": "$timestamp" }
        },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "count": { "$sum": 1 }
    }},

    // Roll-up to "uid" and keep each hour in an array
    { "$group": {
        "_id": "$_id.uid",
        "total": { "$sum": "$count" },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "temp_hours": { 
            "$push": {
                "index": "$_id.time",
                "count": "$count"
            }
        }
     }},

     // Getting distinct "to" and "from" requires a double unwind of arrays
     { "$unwind": "$to" },
     { "$unwind": "$to" },
     { "$unwind": "$from" },
     { "$unwind": "$from" },

     // And then adding back to sets for distinct, also adding the indexes array
     { "$group": {
        "_id": "$_id",
        "total": { "$first": "$total" },
        "from": { "$addToSet": "$from" },
        "to": { "$addToSet": "$to" },
        "temp_hours": { "$first": "$temp_hours" },
        "indexes": { "$first": { "$literal": [
                     0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
                     12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23
        ] } }
     }},

     // Denormalize both arrays
     { "$unwind": "$temp_hours" },
     { "$unwind": "$indexes" },

     // Marry up the index entries and keep either the value or 0
     // Note you are normalizing the double unwind to distinct index
     { "$group": {
         "_id": {
             "_id": "$_id",
             "index": "$indexes"
         },
         "total": { "$first": "$total" }, 
         "from": { "$first": "$from" },
         "to": { "$first": "$to" },
         "count": {
             "$max": {
                 "$cond": [
                     { "$eq": [ "$indexes", "$temp_hours.index" ] },
                     "$temp_hours.count",
                     0
                 ]
             }
         }
     }},

     // Sort to keep index order - !!Important!!         
     { "$sort": { "_id": 1 } },

     // Put the hours into the array and get sizes for other results
     { "$group": {
         "_id": "$_id._id",
         "count": { "$first": "$total" },
         "from_count": { "$first": { "$size": "$from" } },
         "to_count": { "$first": { "$size": "$to" } },
         "hours": { "$push": "$count" }
     }},

     // Optionally sort in "uid" order
     { "$sort": { "_id": 1 } }
])

To break that down, both approaches follow the same basic steps; the only real difference occurs in the mapping of "hours" for the 24-hour period.

In the first aggregation $group stage, the objective is to get results per hour present in the data, for each "uid" value. The simple date aggregation operator $hour helps obtain this value as part of the grouping key.
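As a way to see what that first stage produces, here is a minimal client-side sketch in plain Node.js, using hypothetical sample documents: one bucket per { uid, hour } composite key, with distinct "to"/"from" sets and a count.

```javascript
// Hypothetical sample documents standing in for the collection
const docs = [
  { uid: 1000000, from: "aaa", to: "bbb", timestamp: new Date("2016-02-02T18:42:06.336Z") },
  { uid: 1000000, from: "bbb", to: "ccc", timestamp: new Date("2016-02-02T18:42:06.336Z") },
];

const buckets = new Map();
for (const doc of docs) {
  // $hour extracts the UTC hour of the timestamp
  const key = JSON.stringify({ uid: doc.uid, time: doc.timestamp.getUTCHours() });
  if (!buckets.has(key)) buckets.set(key, { from: new Set(), to: new Set(), count: 0 });
  const b = buckets.get(key);
  b.from.add(doc.from); // $addToSet: only distinct values are kept
  b.to.add(doc.to);
  b.count += 1;         // { "$sum": 1 }: one per document in the bucket
}

const bucket = buckets.get(JSON.stringify({ uid: 1000000, time: 18 }));
console.log(bucket.count, bucket.from.size, bucket.to.size); // 2 2 2
```

Both sample documents fall in hour 18, so they land in the same bucket; the sets already hold only distinct values, which is exactly the "mini-group" behaviour described next.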

The $addToSet operations are a sort of "mini-group" in themselves, which allows keeping the "distinct sets" for each of the "to" and "from" values whilst essentially still grouping per hour.

The next $group is more "organizational", as the recorded "counts" for each hour are kept in an array whilst all the data is rolled up to be grouped per "uid". This basically gives you all the "data" you really need for the result, but of course the $addToSet operations here have just added "arrays within arrays" of the distinct sets determined per hour.

In order to reduce these to truly distinct lists per "uid", it is necessary to deconstruct each array using $unwind and then group back into just the distinct "sets". The same $addToSet compacts this, and the $first operations just take the "first" values of the other fields, which are already all the same for the target "per uid" data. We are happy with those, so just keep them as they are.
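The flattening that the double $unwind plus re-$group achieves can be sketched client-side in a couple of lines (the input array here is assumed data for illustration):

```javascript
// "Arrays within arrays": one distinct list per hour, for one uid
const perHourFrom = [["aaa", "bbb"], ["bbb", "ccc"]];

// The two $unwind stages correspond to flattening one level of nesting;
// $addToSet in the re-group corresponds to collecting into a Set
const distinctFrom = new Set(perHourFrom.flat());

console.log(distinctFrom.size); // 3 — later surfaced as "from_count" via $size
```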

The final stage(s) here are essentially "cosmetic" in nature and could equally be achieved in client-side code. Since there is not data present for every single hour interval, the results need to be mapped into an array of values representing each hour. The two approaches vary with the capabilities of the operators available in each version.

In the MongoDB 3.2 release there are the $filter and $arrayElemAt operators, which effectively let you create the logic to "transpose" an input source of all possible index positions (24 hours) into the values already determined for the counts from those hours in the available data. This is basically a "direct lookup" of the values already recorded for each available hour: where one exists, its count is transposed into the full array; where it is not present, a default value of 0 is used in its place.
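The $map / $filter / $arrayElemAt / $ifNull combination can be mirrored client-side like this (the sparse input pairs are assumed sample data):

```javascript
// Sparse per-hour counts as produced by the earlier $group stages
const temp_hours = [{ index: 18, count: 4 }];

const hours = Array.from({ length: 24 }, (_, el) => {
  // $filter keeps entries whose index equals the current hour ($$el),
  // and the inner $map projects out just the count
  const matched = temp_hours.filter(t => t.index === el).map(t => t.count);
  // $arrayElemAt takes the first match; $ifNull substitutes 0 when absent
  return matched.length ? matched[0] : 0;
});

console.log(hours[18], hours[0], hours.length); // 4 0 24
```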

Without those operators, performing this "match up" essentially means de-normalizing both arrays (the recorded data and the full 24 positions) in order to compare and transpose. This is what happens in the second approach, with a simple comparison of the "index" values to see whether there was a result for that hour. The $max operator is used here mainly because of the two $unwind statements, where each recorded value from the source data is reproduced for every possible index position. This "compacts" everything down to just the values wanted per "index hour".
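The de-normalize-and-compact step can likewise be sketched client-side (assumed sample data); every recorded pair meets every index position, and $max of the $cond keeps only the matching contribution:

```javascript
const temp_hours = [{ index: 3, count: 1 }, { index: 18, count: 4 }];
const indexes = Array.from({ length: 24 }, (_, i) => i); // the $literal 0..23

const perIndex = new Map();
for (const idx of indexes) {      // $unwind "$indexes"
  for (const t of temp_hours) {   // $unwind "$temp_hours"
    // $cond: a matching pair contributes its count, every other pair 0
    const contribution = idx === t.index ? t.count : 0;
    perIndex.set(idx, Math.max(perIndex.get(idx) || 0, contribution));
  }
}

// The $sort on the index followed by $push rebuilds the ordered array
const hours = indexes.map(i => perIndex.get(i));
console.log(hours[3], hours[18], hours[0]); // 1 4 0
```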

In that latter approach it then becomes important to $sort on the grouping _id value. This is because it contains the "index" position, which is needed when moving this content back into an array you expect to be ordered. That is of course the final $group stage here, where the ordered positions are put into an array with $push.

Back to the "distinct lists": the $size operator is used in both cases to determine the "length", and therefore the "count", of distinct values in the lists for "to" and "from". This is the only real constraint on MongoDB 2.6 at least, where it can otherwise be replaced by simply "unwinding" each array individually and then grouping back on the _id already present in order to count the array entries in each set. It's a basic process, but as you should see, the $size operator is the better option here for overall performance.

As a final note, your expected output data is a little off: possibly the entry with "ddd" in "from" was intended to also have "ddd" in "to", but it is instead recorded as "bbb". This reduces the distinct "to" count of the third "uid" grouping by one entry. Of course, the logical results given the source data are sound:

{ "_id" : 1000000, "count" : 3, "from_count" : 2, "to_count" : 2, "hours" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0 ] }
{ "_id" : 2000000, "count" : 2, "from_count" : 1, "to_count" : 1, "hours" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0 ] }
{ "_id" : 3000000, "count" : 5, "from_count" : 5, "to_count" : 4, "hours" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0 ] }

N.B. The source also has a typo: on every line the delimiter before the timestamp field is a : instead of a comma.
