How can I decrease unwind stages in an aggregation pipeline for nested documents?


Problem description

I am new to MongoDB and am trying to work with nested documents. I have the following query:

    db.EndpointData.aggregate([
{ "$group" : { "_id" : "$EndpointId", "RequestCount" : { "$sum" : 1 }, "FirstActivity" : { "$min" : "$DateTime" }, "LastActivity" : { "$max" : "$DateTime" }, "Tags" : { "$push" : "$Tags" } } }, 
{ "$unwind" : "$Tags" }, 
{ "$unwind" : "$Tags" }, 
{ "$group" : { "_id" : "$_id", "RequestCount" : { "$first" : "$RequestCount" }, "Tags" : { "$push" : "$Tags" }, "FirstActivity" : { "$first" : "$FirstActivity" }, "LastActivity" : { "$first" : "$LastActivity" } } }, 
{ "$unwind" : "$Tags" }, 
{ "$unwind" : "$Tags.Sensors" }, 
{ "$group" : { "_id" : { "EndpointId" : "$_id", "Uid" : "$Tags.Uid", "Type" : "$Tags.Sensors.Type" }, "RequestCount" : { "$first" : "$RequestCount" }, "FirstActivity" : { "$first" : "$FirstActivity" }, "LastActivity" : { "$first" : "$LastActivity" } } }, 
{ "$group" : { "_id" : { "EndpointId" : "$_id.EndpointId", "Uid" : "$_id.Uid" }, "count" : { "$sum" : 1 }, "RequestCount" : { "$first" : "$RequestCount" }, "FirstActivity" : { "$first" : "$FirstActivity" }, "LastActivity" : { "$first" : "$LastActivity" } } }, 
{ "$group" : { "_id" : "$_id.EndpointId", "TagCount" : { "$sum" : 1 }, "SensorCount" : { "$sum" : "$count" }, "RequestCount" : { "$first" : "$RequestCount" }, "FirstActivity" : { "$first" : "$FirstActivity" }, "LastActivity" : { "$first" : "$LastActivity" } } }])

My data structure is as follows:

{
  "_id": "6aef51dfaf42ea1b70d0c4db",  
  "EndpointId": "98799bcc-e86f-4c8a-b340-8b5ed53caf83",  
  "DateTime": "2018-05-06T19:05:02.666Z",
  "Url": "test",
  "Tags": [
    {
      "Uid": "C1:3D:CA:D4:45:11",
      "Type": 1,
      "DateTime": "2018-05-06T19:05:02.666Z",
      "Sensors": [
        {
          "Type": 1,
          "Value": { "$numberDecimal": "-95" }
        },
        {
          "Type": 2,
          "Value": { "$numberDecimal": "-59" }
        },
        {
          "Type": 3,
          "Value": { "$numberDecimal": "11.029802536740132" }
        }
      ]
    },
    {
      "Uid": "C1:3D:CA:D4:45:11",
      "Type": 1,
      "DateTime": "2018-05-06T19:05:02.666Z",
      "Sensors": [
        {
          "Type": 1,
          "Value": { "$numberDecimal": "-92" }
        },
        {
          "Type": 2,
          "Value": { "$numberDecimal": "-59" }
        }
      ]
    }   
  ]
}

This query works and returns correct results. I count the tags, the sensors, and the number of requests for each EndpointId. The problem is that when I work with a large amount of data (about 10,000,000 documents) I run into memory problems. The 4 $unwind stages in this query seem to be the cause. How can I reduce the number of unwinds in this query?

Recommended answer

As long as your data has unique sensor and tag readings per document, which the data you have presented so far appears to, you simply don't need $unwind at all.

In fact, all you really need is a single $group:

db.endpoints.aggregate([
  // In reality you would $match to limit the selection of documents
  { "$match": { 
    "DateTime": { "$gte": new Date("2018-05-01"), "$lt": new Date("2018-06-01") }
  }},
  { "$group": {
    "_id": "$EndpointId",
    "FirstActivity" : { "$min" : "$DateTime" },
    "LastActivity" : { "$max" : "$DateTime" },
    "RequestCount": { "$sum": 1 },
    "TagCount": {
      "$sum": {
        "$size": { "$setUnion": ["$Tags.Uid",[]] }
      }
    },
    "SensorCount": {
      "$sum": {
        "$sum": {
          "$map": {
            "input": { "$setUnion": ["$Tags.Uid",[]] },
            "as": "tag",
            "in": {
              "$size": {
                "$reduce": {
                  "input": {
                    "$filter": {
                      "input": {
                        "$map": {
                          "input": "$Tags",
                          "in": {
                            "Uid": "$$this.Uid",
                            "Type": "$$this.Sensors.Type"
                          }
                        }
                      },
                      "cond": { "$eq": [ "$$this.Uid", "$$tag" ] }
                    }
                  },
                  "initialValue": [],
                  "in": { "$setUnion": [ "$$value", "$$this.Type" ] }
                }
              }
            }
          }
        }
      }
    }
  }}
])
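
To make the nested expression easier to follow, here is a minimal plain JavaScript sketch of what the TagCount and SensorCount expressions compute for one document before $sum adds them up per endpoint. The helper name countTagsAndSensors and the trimmed sample document are assumptions made for illustration; they are not part of the pipeline or of any MongoDB API.

```javascript
// Hypothetical helper: mirrors the per-document logic of the single $group.
function countTagsAndSensors(doc) {
  // Like { $setUnion: ["$Tags.Uid", []] }: the distinct tag Uids in this document
  const uids = [...new Set(doc.Tags.map((tag) => tag.Uid))];

  // For each distinct Uid, union the sensor types of every tag entry with that Uid
  const sensorCount = uids
    .map((uid) => {
      const types = new Set();
      doc.Tags.filter((tag) => tag.Uid === uid).forEach((tag) =>
        tag.Sensors.forEach((sensor) => types.add(sensor.Type))
      );
      return types.size;
    })
    .reduce((sum, n) => sum + n, 0);

  return { tagCount: uids.length, sensorCount };
}

// Trimmed, made-up document in the same shape as the question's data
const sampleDoc = {
  Tags: [
    { Uid: "C1:3D:CA:D4:45:11", Sensors: [{ Type: 1 }, { Type: 2 }, { Type: 3 }] },
    { Uid: "C1:3D:CA:D4:45:11", Sensors: [{ Type: 1 }, { Type: 2 }] },
    { Uid: "E5:FA:2A:35:AF:DD", Sensors: [{ Type: 1 }] },
  ],
};

console.log(countTagsAndSensors(sampleDoc)); // { tagCount: 2, sensorCount: 4 }
```

The repeated Uid is counted once, and its sensor types are merged before counting, which is what the $filter and $reduce with $setUnion do inside the pipeline.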

Or if you actually do need to accumulate the "unique" values of "Sensors" and "Tags" across different documents, then you still need initial $unwind statements to get the right grouping, but nowhere near as many as you presently have:

db.endpoints.aggregate([
  // In reality you would $match to limit the selection of documents
  { "$match": { 
    "DateTime": { "$gte": new Date("2018-05-01"), "$lt": new Date("2018-06-01") }
  }},
  { "$unwind": "$Tags" },
  { "$unwind": "$Tags.Sensors" },
  { "$group": {
    "_id": {
      "EndpointId": "$EndpointId",
      "Uid": "$Tags.Uid",
      "Type": "$Tags.Sensors.Type"
    },
    "FirstActivity": { "$min": "$DateTime" },
    "LastActivity": { "$max": "$DateTime" },
    "RequestCount": { "$addToSet": "$_id" }
  }},
  { "$group": {
    "_id": {
      "EndpointId": "$_id.EndpointId",
      "Uid": "$_id.Uid",
    },
    "FirstActivity": { "$min": "$FirstActivity" },
    "LastActivity": { "$max": "$LastActivity" },
    "count": { "$sum": 1 },
    "RequestCount": { "$addToSet": "$RequestCount" }
  }},
  { "$group": {
    "_id": "$_id.EndpointId",
    "FirstActivity": { "$min": "$FirstActivity" },
    "LastActivity": { "$max": "$LastActivity" },
    "TagCount": { "$sum": 1 },
    "SensorCount": { "$sum": "$count" },
    "RequestCount": { "$addToSet": "$RequestCount" }
  }},
  { "$addFields": {
    "RequestCount": {
      "$size": {
        "$reduce": {
          "input": {
            "$reduce": {
              "input": "$RequestCount",
              "initialValue": [],
              "in": { "$setUnion": [ "$$value", "$$this" ] }
            }
          },
          "initialValue": [],
          "in": { "$setUnion": [ "$$value", "$$this" ] }
        }
      }
    }
  }}
],{ "allowDiskUse": true })
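
The final $addFields stage is needed because three levels of $addToSet leave RequestCount as an array of arrays of arrays. A rough plain JavaScript sketch of the double $reduce follows. The union helper is a hypothetical stand-in for $setUnion, and the ids are made-up strings rather than real ObjectId values.

```javascript
// Hypothetical stand-in for $setUnion: elements are compared by their JSON form
// so that nested arrays can be de-duplicated as well as plain values.
function union(a, b) {
  const seen = new Map();
  for (const item of [...a, ...b]) seen.set(JSON.stringify(item), item);
  return [...seen.values()];
}

// Shape after the three $group stages: arrays of arrays of arrays of document ids
const requestCount = [
  [["doc1", "doc2"], ["doc1"]],
  [["doc2"], ["doc3"]],
];

// Inner $reduce: merge the outer level, leaving an array of id arrays
const idArrays = requestCount.reduce((value, current) => union(value, current), []);

// Outer $reduce: merge the id arrays into one set of distinct ids
const distinctIds = idArrays.reduce((value, current) => union(value, current), []);

console.log(distinctIds.length); // 3
```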

From MongoDB 4.0 you can use $toString on the ObjectId in _id and simply merge the unique keys with $mergeObjects in order to keep the RequestCount. This is cleaner and a bit more scalable than pushing nested array content and flattening it:

db.endpoints.aggregate([
  // In reality you would $match to limit the selection of documents
  { "$match": { 
    "DateTime": { "$gte": new Date("2018-05-01"), "$lt": new Date("2018-06-01") }
  }},
  { "$unwind": "$Tags" },
  { "$unwind": "$Tags.Sensors" },
  { "$group": {
    "_id": {
      "EndpointId": "$EndpointId",
      "Uid": "$Tags.Uid",
      "Type": "$Tags.Sensors.Type"
    },
    "FirstActivity": { "$min": "$DateTime" },
    "LastActivity": { "$max": "$DateTime" },
    "RequestCount": {
      "$mergeObjects": {
        "$arrayToObject": [[{ "k": { "$toString": "$_id" }, "v": 1 }]]
      }
    }
  }},
  { "$group": {
    "_id": {
      "EndpointId": "$_id.EndpointId",
      "Uid": "$_id.Uid",
    },
    "FirstActivity": { "$min": "$FirstActivity" },
    "LastActivity": { "$max": "$LastActivity" },
    "count": { "$sum": 1 },
    "RequestCount": { "$mergeObjects": "$RequestCount" }
  }},
  { "$group": {
    "_id": "$_id.EndpointId",
    "FirstActivity": { "$min": "$FirstActivity" },
    "LastActivity": { "$max": "$LastActivity" },
    "TagCount": { "$sum": 1 },
    "SensorCount": { "$sum": "$count" },
    "RequestCount": { "$mergeObjects": "$RequestCount" }
  }},
  { "$addFields": {
    "RequestCount": {
      "$size": {
        "$objectToArray": "$RequestCount"
      }
    }
  }}
],{ "allowDiskUse": true })
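
The idea behind the $mergeObjects variant is to use object keys as a set. A small plain JavaScript sketch, under the assumption that Object.assign is a fair analogy for $mergeObjects here; the rows are made-up values in the shape each $group would pass along.

```javascript
// Each grouped row carries an object keyed by the stringified document _id,
// like { $arrayToObject: [[{ k: { $toString: "$_id" }, v: 1 }]] }.
const rows = [
  { "5aef51dfaf42ea1b70d0c4db": 1 },
  { "5aef51dfaf42ea1b70d0c4db": 1 },
  { "5aef51e0af42ea1b70d0c4dc": 1 },
];

// Object.assign plays the role of $mergeObjects: duplicate keys collapse into one
const merged = Object.assign({}, ...rows);

// Like { $size: { $objectToArray: "$RequestCount" } }: count the distinct keys
const requestTotal = Object.keys(merged).length;

console.log(requestTotal); // 2
```

Because every stage only ever holds a flat object of keys, there is no nested array to flatten at the end.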

Either form returns the same data, though the order of keys in the result may vary:

{
        "_id" : "89799bcc-e86f-4c8a-b340-8b5ed53caf83",
        "FirstActivity" : ISODate("2018-05-06T19:05:02.666Z"),
        "LastActivity" : ISODate("2018-05-06T19:05:02.666Z"),
        "RequestCount" : 2,
        "TagCount" : 4,
        "SensorCount" : 16
}

The result is obtained from these sample documents, which you originally gave as a sample source in the original question on the topic:

{
    "_id" : ObjectId("5aef51dfaf42ea1b70d0c4db"),    
    "EndpointId" : "89799bcc-e86f-4c8a-b340-8b5ed53caf83",    
    "DateTime" : ISODate("2018-05-06T19:05:02.666Z"),
    "Url" : "test",
    "Tags" : [ 
        {
            "Uid" : "C1:3D:CA:D4:45:11",
            "Type" : 1,
            "DateTime" : ISODate("2018-05-06T19:05:02.666Z"),
            "Sensors" : [ 
                {
                    "Type" : 1,
                    "Value" : NumberDecimal("-95")
                }, 
                {
                    "Type" : 2,
                    "Value" : NumberDecimal("-59")
                }, 
                {
                    "Type" : 3,
                    "Value" : NumberDecimal("11.029802536740132")
                }, 
                {
                    "Type" : 4,
                    "Value" : NumberDecimal("27.25")
                }, 
                {
                    "Type" : 6,
                    "Value" : NumberDecimal("2924")
                }
            ]
        },         
        {
            "Uid" : "C1:3D:CA:D4:45:11",
            "Type" : 1,
            "DateTime" : ISODate("2018-05-06T19:05:02.666Z"),
            "Sensors" : [ 
                {
                    "Type" : 1,
                    "Value" : NumberDecimal("-95")
                }, 
                {
                    "Type" : 2,
                    "Value" : NumberDecimal("-59")
                }, 
                {
                    "Type" : 3,
                    "Value" : NumberDecimal("11.413037961112279")
                }, 
                {
                    "Type" : 4,
                    "Value" : NumberDecimal("27.25")
                }, 
                {
                    "Type" : 6,
                    "Value" : NumberDecimal("2924")
                }
            ]
        },          
        {
            "Uid" : "E5:FA:2A:35:AF:DD",
            "Type" : 1,
            "DateTime" : ISODate("2018-05-06T19:05:02.666Z"),
            "Sensors" : [ 
                {
                    "Type" : 1,
                    "Value" : NumberDecimal("-97")
                }, 
                {
                    "Type" : 2,
                    "Value" : NumberDecimal("-58")
                }, 
                {
                    "Type" : 3,
                    "Value" : NumberDecimal("10.171658037099185")
                }
            ]
        }
    ]
}

/* 2 */
{
    "_id" : ObjectId("5aef51e0af42ea1b70d0c4dc"),    
    "EndpointId" : "89799bcc-e86f-4c8a-b340-8b5ed53caf83",    
    "Url" : "test",
    "Tags" : [ 
        {
            "Uid" : "E2:02:00:18:DA:40",
            "Type" : 1,
            "DateTime" : ISODate("2018-05-06T19:05:04.574Z"),
            "Sensors" : [ 
                {
                    "Type" : 1,
                    "Value" : NumberDecimal("-98")
                }, 
                {
                    "Type" : 2,
                    "Value" : NumberDecimal("-65")
                }, 
                {
                    "Type" : 3,
                    "Value" : NumberDecimal("7.845424441900629")
                }, 
                {
                    "Type" : 4,
                    "Value" : NumberDecimal("0.0")
                }, 
                {
                    "Type" : 6,
                    "Value" : NumberDecimal("3012")
                }
            ]
        }, 
        {
            "Uid" : "12:3B:6A:1A:B7:F9",
            "Type" : 1,
            "DateTime" : ISODate("2018-05-06T19:05:04.574Z"),
            "Sensors" : [ 
                {
                    "Type" : 1,
                    "Value" : NumberDecimal("-95")
                }, 
                {
                    "Type" : 2,
                    "Value" : NumberDecimal("-59")
                }, 
                {
                    "Type" : 3,
                    "Value" : NumberDecimal("12.939770381907275")
                }
            ]
        }
    ]
}

The bottom line is that you have two options. You can use the first form given here, which accumulates "within each document" and then "per endpoint" in a single stage, and is the most efficient. Or, if you actually need to identify values such as the "Uid" on the tags or the "Type" on the sensors that occur more than once across any combination of documents grouped by endpoint, you need one of the forms that use $unwind.

The sample data you have supplied so far only shows that these values are "unique within each document", so the first form is the best choice if that holds for all of the remaining data.

If it does not, then "unwinding" the two nested arrays in order to "aggregate the detail across documents" is the only way to approach this. You can limit the date range or apply other criteria, since most "queries" have some bounds and do not actually work on the "whole" collection, but the main fact remains that the arrays are "unwound", creating essentially a copy of the document for every array member.
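
To see why this gets expensive, here is a simplified plain JavaScript model of the two $unwind stages. The functions unwindTags and unwindSensors are hypothetical helpers written for this illustration, and the input is a trimmed, made-up document.

```javascript
// Simplified model of { $unwind: "$Tags" }: one copy of the document per tag.
// Documents with a missing or empty array produce no output, as with $unwind defaults.
function unwindTags(docs) {
  return docs.flatMap((doc) =>
    (doc.Tags || []).map((tag) => ({ ...doc, Tags: tag }))
  );
}

// Simplified model of { $unwind: "$Tags.Sensors" }, applied after unwindTags
function unwindSensors(docs) {
  return docs.flatMap((doc) =>
    (doc.Tags.Sensors || []).map((sensor) => ({
      ...doc,
      Tags: { ...doc.Tags, Sensors: sensor },
    }))
  );
}

const input = [
  {
    EndpointId: "example-endpoint",
    Tags: [
      { Uid: "A", Sensors: [{ Type: 1 }, { Type: 2 }, { Type: 3 }] },
      { Uid: "B", Sensors: [{ Type: 1 }, { Type: 2 }] },
    ],
  },
];

const afterTags = unwindTags(input);
const afterSensors = unwindSensors(afterTags);

console.log(afterTags.length);    // 2
console.log(afterSensors.length); // 5
```

One stored document becomes five in the pipeline, each carrying a copy of the top-level fields, so the working set grows with the total number of sensor readings rather than with the number of documents.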

The point about optimization is that you only need to do this "twice", as there are only two arrays. Doing successive $group to $unwind to $group is always a sure sign that you are doing something really wrong. Once you "take something apart" you should only ever need to "put it back together" once. Doing so in a series of graded steps, as demonstrated here, is the "once" approach, and that is what optimizes the pipeline.

The problem of scale still remains, so:

  • Add other realistic constraints to the query to reduce the documents processed, maybe even do so in "batches" and combine results
  • Add the allowDiskUse option to the pipeline to let temporary storage be used (as demonstrated in the commands above).
  • Consider that "nested arrays" are probably not the best storage method for the analysis you want to do. It's always more efficient when you know you need to $unwind to simply write the data in that "unwound" form directly into a collection.
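
As a rough sketch of the "batches" suggestion, the plain JavaScript below builds monthly date ranges that could be used in a $match, and combines two partial results for the same endpoint. The helpers monthRanges and mergeResults are hypothetical. Adding the counts together is only valid for the single $group form, where counts are accumulated per document; it would over-count if distinct values had to be tracked across documents in different batches.

```javascript
// Hypothetical helper: [start, end) UTC date ranges, one per month, for use in $match
function monthRanges(year, fromMonth, toMonth) {
  const ranges = [];
  for (let m = fromMonth; m <= toMonth; m++) {
    ranges.push({
      $gte: new Date(Date.UTC(year, m - 1, 1)),
      $lt: new Date(Date.UTC(year, m, 1)),
    });
  }
  return ranges;
}

// Hypothetical helper: combine two partial results for the same endpoint.
// Assumes counts are additive across batches (the single $group form only).
function mergeResults(a, b) {
  return {
    _id: a._id,
    FirstActivity: a.FirstActivity < b.FirstActivity ? a.FirstActivity : b.FirstActivity,
    LastActivity: a.LastActivity > b.LastActivity ? a.LastActivity : b.LastActivity,
    RequestCount: a.RequestCount + b.RequestCount,
    TagCount: a.TagCount + b.TagCount,
    SensorCount: a.SensorCount + b.SensorCount,
  };
}

// Made-up partial results from two batches
const may = {
  _id: "example-endpoint",
  FirstActivity: new Date("2018-05-06T19:05:02.666Z"),
  LastActivity: new Date("2018-05-20T08:00:00.000Z"),
  RequestCount: 2, TagCount: 4, SensorCount: 16,
};
const june = {
  _id: "example-endpoint",
  FirstActivity: new Date("2018-06-01T00:10:00.000Z"),
  LastActivity: new Date("2018-06-03T12:00:00.000Z"),
  RequestCount: 1, TagCount: 2, SensorCount: 5,
};

const ranges = monthRanges(2018, 5, 6);
const combined = mergeResults(may, june);
console.log(ranges.length, combined.RequestCount, combined.SensorCount); // 2 3 21
```

Each range object can be dropped into the pipeline as the value of "DateTime" in the $match stage, one aggregation run per range.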
