在node.js mongodb中求和嵌套数组 [英] Sum nested array in node.js mongodb

查看:63
本文介绍了在node.js mongodb中求和嵌套数组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在mongodb中有一个看起来像这样的模式.

first_level:[{
    first_item  : String,
    second_level:[{
        second_item: String,
        third_level:[{
            third_item :String,
            forth_level :[{//4th level
                    price               : Number, // 5th level
                    sales_date          : Date, 
                    quantity_sold       : Number
                }]
        }]
    }]
}]

1).我想根据中的匹配条件添加quantity_sold first_item,second_item,third_item和sales_date

2).我还想查找特定日期所有已售出数量的平均值.

3).我还想查找特定日期所有已售出数量的平均值,并有相应的价格.

我对如何解决这个问题感到非常困惑,我来自sql 背景,所以这很令人困惑

解决方案

让我们从一个基本的免责声明开始,因为在这里,回答问题的主体已经在 Double 也适用于 Triple Quadrupal ANY 嵌套级别,基本上相同的原则总是.

任何答案的另一个要点也是不要嵌套数组,因为该答案也对此进行了解释(并且我已经多次重复了 次) ),无论您出于什么原因而想" ,而实际上却并没有给您带来您所认为的好处.实际上,嵌套" 确实使生活变得更加困难.

嵌套问题

从关系"模型转换数据结构的主要误解几乎总是被解释为每个关联模型的添加嵌套数组级别" .您在此处呈现的内容也不例外,因为它很可能是"normalized" ,因此每个子数组都包含与其父项相关的项.

MongoDB是一个基于文档"的数据库,因此它几乎可以使您执行此操作,或者实际上可以执行您基本上想要的任何数据结构内容.但是,这并不意味着以这种形式的数据易于使用,或者对于实际目的而言确实是实用的.

让我们用一些实际数据填写该模式以演示:

{
  "_id": 1,
  "first_level": [
    {
      "first_item": "A",
      "second_level": [
        {
          "second_item": "A",
          "third_level": [
            { 
              "third_item": "A",
              "forth_level": [
                { 
                  "price": 1,
                  "sales_date": new Date("2018-10-31"),
                  "quantity": 1
                },
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-01"),
                  "quantity": 1
                },
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-02"),
                  "quantity": 1
                },
              ]
            },
            { 
              "third_item": "B",
              "forth_level": [
                { 
                  "price": 1,
                  "sales_date": new Date("2018-10-31"),
                  "quantity": 1
                },
              ]
            }
          ]
        },
        {
          "second_item": "A",
          "third_level": [
            { 
              "third_item": "B",
              "forth_level": [
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-03"),
                  "quantity": 1
                },
              ]
            }
          ]
        }
      ]
    },
    {
      "first_item": "A",
      "second_level": [
        {
          "second_item": "B",
          "third_level": [
            { 
              "third_item": "A",
              "forth_level": [
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-03"),
                  "quantity": 1
                },
              ]
            }
          ]
        }
      ]
    }
  ]
},
{
  "_id": 2,
  "first_level": [
    {
      "first_item": "A",
      "second_level": [
        {
          "second_item": "A",
          "third_level": [
            { 
              "third_item": "A",
              "forth_level": [
                { 
                  "price": 2,
                  "sales_date": new Date("2018-11-03"),
                  "quantity": 1
                },
                { 
                  "price": 1,
                  "sales_date": new Date("2018-10-31"),
                  "quantity": 1
                },
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-03"),
                  "quantity": 1
                }
              ]
            }
          ]
        }
      ]
    }
  ]
},
{
  "_id": 3,
  "first_level": [
    {
      "first_item": "A",
      "second_level": [
        {
          "second_item": "B",
          "third_level": [
            { 
              "third_item": "A",
              "forth_level": [
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-03"),
                  "quantity": 1
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

与问题的结构有点"不同,但出于演示目的,它具有我们需要研究的内容.主要在文档中有一个数组,其中的项带有子数组,而子数组又具有子数组,依此类推.当然,这里的规范化" 是通过每个级别"上的标识符作为项目类型"或您实际拥有的任何东西来实现的.

核心问题是您只想从这些嵌套数组中获取一些"数据,而MongoDB确实只想返回文档",这意味着您需要进行一些操作才能达到这些匹配条件子项目".

即使在正确" 的问题上,选择与所有这些子条件"都匹配的文档也需要大量使用点符号" 对于那些多种条件.如果没有 $elemMatch 语句,您将无法获得准确的组合",然后获取在 any 数组元素上满足条件的文档.

实际上,过滤出数组内容" 实际上是其他区别的一部分:

db.collection.aggregate([
  { "$match": {
    "first_level": {
      "$elemMatch": {
        "first_item": "A",
        "second_level": {
          "$elemMatch": {
            "second_item": "A",
            "third_level": {
              "$elemMatch": {
                "third_item": "A",
                "forth_level": {
                  "$elemMatch": {
                    "sales_date": {
                      "$gte": new Date("2018-11-01"),
                      "$lt": new Date("2018-12-01")
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }},
  { "$addFields": {
    "first_level": {
      "$filter": {
        "input": {
          "$map": {
            "input": "$first_level",
            "in": {
              "first_item": "$$this.first_item",
              "second_level": {
                "$filter": {
                  "input": {
                    "$map": {
                      "input": "$$this.second_level",
                      "in": {
                        "second_item": "$$this.second_item",
                        "third_level": {
                          "$filter": {
                            "input": {
                              "$map": {
                                "input": "$$this.third_level",
                                 "in": {
                                   "third_item": "$$this.third_item",
                                   "forth_level": {
                                     "$filter": {
                                       "input": "$$this.forth_level",
                                       "cond": {
                                         "$and": [
                                           { "$gte": [ "$$this.sales_date", new Date("2018-11-01") ] },
                                           { "$lt": [ "$$this.sales_date", new Date("2018-12-01") ] }
                                         ]
                                       }
                                     }
                                   }
                                 } 
                              }
                            },
                            "cond": {
                              "$and": [
                                { "$eq": [ "$$this.third_item", "A" ] },
                                { "$gt": [ { "$size": "$$this.forth_level" }, 0 ] }
                              ]
                            }
                          }
                        }
                      }
                    }
                  },
                  "cond": {
                    "$and": [
                      { "$eq": [ "$$this.second_item", "A" ] },
                      { "$gt": [ { "$size": "$$this.third_level" }, 0 ] }
                    ]
                  }
                }
              }
            }
          }
        },
        "cond": {
          "$and": [
            { "$eq": [ "$$this.first_item", "A" ] },
            { "$gt": [ { "$size": "$$this.second_level" }, 0 ] }
          ]
        } 
      }
    }
  }},
  { "$unwind": "$first_level" },
  { "$unwind": "$first_level.second_level" },
  { "$unwind": "$first_level.second_level.third_level" },
  { "$unwind": "$first_level.second_level.third_level.forth_level" },
  { "$group": {
    "_id": {
      "date": "$first_level.second_level.third_level.forth_level.sales_date",
      "price": "$first_level.second_level.third_level.forth_level.price",
    },
    "quantity_sold": {
      "$avg": "$first_level.second_level.third_level.forth_level.quantity"
    } 
  }},
  { "$group": {
    "_id": "$_id.date",
    "prices": {
      "$push": {
        "price": "$_id.price",
        "quanity_sold": "$quantity_sold"
      }
    },
    "quanity_sold": { "$avg": "$quantity_sold" }
  }}
])

最好将其描述为混乱"和涉及".我们不仅可以使用 $elemMatch $elemMatch 不只是一口而过,但随后我们有了随后的 $filter $map 进行每个数组级别的处理.如前所述,无论实际上有多少个级别,这都是这种模式.

您也可以选择 $unwind 和<一个href ="https://docs.mongodb.com/manual/reference/operator/aggregation/match/" rel ="nofollow noreferrer"> $match 组合,而不是过滤数组,但这确实会导致不必要的内容之前 $unwind 的额外开销已删除,因此在MongoDB的现代发行版中, $filter 首先从数组开始.

这里的终点是您要 $group 由实际上位于数组内部的元素组成,因此您最终需要 $unwind 在此之前数组的每个级别.

通常,使用sales_dateprice属性进行 first 累加,然后将后续阶段添加到

{ "$group": {
  "_id": {
    "date": {
      "$dateFromParts": {
        "year": { "$year": "$first_level.second_level.third_level.forth_level.sales_date" },
        "month": { "$month": "$first_level.second_level.third_level.forth_level.sales_date" },
        "day": { "$dayOfMonth": "$first_level.second_level.third_level.forth_level.sales_date" }
      }
    }.
    "price": "$first_level.second_level.third_level.forth_level.price"
  }
  ...
}}

使用 $dateFromParts 和其他日期聚合运算符,以提取日期"信息并显示可以以这种形式追溯到过去.

开始非正规化

从上面的混乱"中应该清楚的是,使用嵌套数组并不是一件容易的事.在MongoDB 3.6之前的版本中,通常甚至无法原子地更新这样的结构,即使您甚至从未更新过它们或基本上没有替换整个数组,它们仍然不容易查询.这就是您所显示的.

必须具有父文档中的数组内容的地方,通常建议拼合" 非规范化" 这样的结构.这可能与关系思维相反,但实际上出于性能原因,这是处理此类数据的最佳方法:

{
  "_id": 1,
  "data": [
    {
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-10-31"),
      "quantity": 1
    },

    { 
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-11-01"),
      "quantity": 1
    },
    { 
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-11-02"),
      "quantity": 1
    },
    { 
      "first_item": "A",
      "second_item": "A",
      "third_item": "B",
      "price": 1,
      "sales_date": new Date("2018-10-31"),
      "quantity": 1
    },
    {
     "first_item": "A",
     "second_item": "A",
     "third_item": "B",
     "price": 1,
     "sales_date": new Date("2018-11-03"),
     "quantity": 1
    },
    {
      "first_item": "A",
      "second_item": "B",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-11-03"),
      "quantity": 1
     },
  ]
},
{
  "_id": 2,
  "data": [
    {
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 2,
      "sales_date": new Date("2018-11-03"),
      "quantity": 1
    },
    { 
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-10-31"),
      "quantity": 1
    },
    { 
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-11-03"),
      "quantity": 1
    }
  ]
},
{
  "_id": 3,
  "data": [
    {
      "first_item": "A",
      "second_item": "B",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-11-03"),
      "quantity": 1
     }
  ]
}

与原始显示的所有数据相同,但实际上我们只是将所有内容放到每个父文档中的单个扁平化数组中,而不是嵌套.当然,这意味着要对各个数据点进行重复,但是查询复杂性和性能上的差异应该是显而易见的:

db.collection.aggregate([
  { "$match": {
    "data": {
      "$elemMatch": {
        "first_item": "A",
        "second_item": "A",
        "third_item": "A",
        "sales_date": {
          "$gte": new Date("2018-11-01"),
          "$lt": new Date("2018-12-01")
        }
      }
    }
  }},
  { "$addFields": {
    "data": {
      "$filter": {
        "input": "$data",
         "cond": {
           "$and": [
             { "$eq": [ "$$this.first_item", "A" ] },
             { "$eq": [ "$$this.second_item", "A" ] },
             { "$eq": [ "$$this.third_item", "A" ] },
             { "$gte": [ "$$this.sales_date", new Date("2018-11-01") ] },
             { "$lt": [ "$$this.sales_date", new Date("2018-12-01") ] }
           ]
         }
      }
    }
  }},
  { "$unwind": "$data" },
  { "$group": {
    "_id": {
      "date": "$data.sales_date",
      "price": "$data.price",
    },
    "quantity_sold": { "$avg": "$data.quantity" }
  }},
  { "$group": {
    "_id": "$_id.date",
    "prices": {
      "$push": {
        "price": "$_id.price",
        "quantity_sold": "$quantity_sold"
      }
    },
    "quantity_sold": { "$avg": "$quantity_sold" }
  }}
])

现在,而不是嵌套那些 $elemMatch 调用,类似地,对于 $filter 表达式,所有内容都更加清晰和易于阅读,并且处理起来非常简单.另一个优点是,您实际上甚至可以为查询中使用的数组中的元素的索引编制索引.这是嵌套模型的一个约束,其中MongoDB根本不允许这样的对数组中数组的键上的多键索引" .对于单个阵列,这是允许的,并且可以用来提高性能.

数组内容过滤" 之后的所有内容都完全相同,不同的是它只是像"data.sales_date"这样的路径名,而不是前一个结构中冗长的"first_level.second_level.third_level.forth_level.sales_date". /p>

何时不嵌入

最后,另一个大的误解是,所有关系需要翻译为嵌入数组中.这确实不是MongoDB的意图,并且只意味着在要对数据进行单次检索而不是对联接"的情况下,将相关"数据保留在同一文档中的数组中.

此处经典的订单/详细信息"模型通常适用于在现代世界中要在同一屏幕"上显示订单"的标题"的详细信息,例如客户地址,订单总数等. 订单"上不同订单项的详细信息.

在RDBMS诞生之初,典型的80字符乘25行的屏幕仅在一个屏幕上具有这样的标题"信息,然后所购买的所有产品的详细信息行都在另一个屏幕上.因此,自然而然地存在将它们存储在单独的表中的某种程度的常识.随着人们对此类屏幕"的更多了解,您通常希望查看整个内容,或者至少要看到标头"以及此类订单"的头几行.

因此,为什么这样的安排才有意义,因为MongoDB一次返回了一个包含相关数据的文档".不需要对单独的渲染屏幕进行单独的请求,也不需要对此类数据进行联接",因为它已经是预联接"的.

考虑是否需要-也称为完全"非规范化

因此,如果您几乎大部分时间都不知道实际上对处理此类数组中的大多数数据感兴趣,那么将它们全部单独放在一个集合中通常是更有意义的.另一个属性以识别父母",如果偶尔需要这种加入":

{
  "_id": 1,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-10-31"),
  "quantity": 1
},
{ 
  "_id": 2,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-11-01"),
  "quantity": 1
},
{ 
  "_id": 3,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-11-02"),
  "quantity": 1
},
{ 
  "_id": 4,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "A",
  "third_item": "B",
  "price": 1,
  "sales_date": new Date("2018-10-31"),
  "quantity": 1
},
{
  "_id": 5,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "A",
  "third_item": "B",
  "price": 1,
  "sales_date": new Date("2018-11-03"),
  "quantity": 1
},
{
  "_id": 6,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "B",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-11-03"),
  "quantity": 1
},
{
  "_id": 7,
  "parent_id": 2,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 2,
  "sales_date": new Date("2018-11-03"),
  "quantity": 1
},
{ 
  "_id": 8,
  "parent_id": 2,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-10-31"),
  "quantity": 1
},
{ 
  "_id": 9,
  "parent_id": 2,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-11-03"),
  "quantity": 1
},
{
  "_id": 10,
  "parent_id": 3,
  "first_item": "A",
  "second_item": "B",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-11-03"),
  "quantity": 1
}

再次使用相同的数据,但是这一次是在完全独立的文档中,充其量只有在您实际需要将其用于其他用途的情况下,才能引用父级.请注意,这里的聚合完全不与父数据相关,并且还可以通过简单地将其存储在单独的集合中来明确带来额外性能和降低的复杂性的地方:

db.collection.aggregate([
  { "$match": {
    "first_item": "A",
    "second_item": "A",
    "third_item": "A",
    "sales_date": {
      "$gte": new Date("2018-11-01"),
      "$lt": new Date("2018-12-01")
    }
  }},
  { "$group": {
    "_id": {
      "date": "$sales_date",
      "price": "$price"
    },
    "quantity_sold": { "$avg": "$quantity" }
  }},
  { "$group": {
    "_id": "$_id.date",
    "prices": {
      "$push": {
        "price": "$_id.price",
        "quantity_sold": "$quantity_sold"
      }
    },
    "quantity_sold": { "$avg": "$quantity_sold" }
  }}
])

由于所有内容均已成为文档,因此无需过滤数组" 或具有其他任何复杂性.您要做的就是选择匹配的文档并汇总结果,并且始终执行与最后两个步骤完全相同的两个步骤.

为了仅获得最终结果,它的效果远胜于以上任何一种方法.该查询实际上只涉及详细"数据,因此,最佳做法是将细节与父级完全分开,因为它总是会提供最佳的性能优势.

这里的要点是,其余应用程序 Never 的实际访问模式都需要返回整个数组内容,因此无论如何都不应将其嵌入.同样,大多数写"操作似乎永远也不需要接触相关的父对象,这是该方法起作用或不起作用的另一个决定因素.

结论

通常的信息是,作为一般规则,您永远不要嵌套数组.最多应该在相关的父文档中保留一个带有部分非规范化数据的奇异"数组,并且在其余访问模式实际上根本不使用父级和子级的情况下,则应该真正分离数据.

大"变化是,您认为规范化数据实际上很好的所有原因,最终成为了此类嵌入式文档系统的敌人.避免使用联接"总是好的,但是创建复杂的嵌套结构以使联接"数据的外观也永远不会真正为您带来好处.

处理思维"就是规范化的成本通常最终要花在额外的存储上以及最终存储中维护重复和非规范化数据的维护上.

还请注意,以上所有形式都返回相同的结果集.简洁的示例数据仅包含单个项目,或者在最多有多个价格点的情况下,平均"仍为1,这是相当不错的,因为这是所有值的总和.但是,解释该内容的内容已经非常长,因此实际上只是通过示例":

{
        "_id" : ISODate("2018-11-01T00:00:00Z"),
        "prices" : [
                {
                        "price" : 1,
                        "quantity_sold" : 1
                }
        ],
        "quantity_sold" : 1
}
{
        "_id" : ISODate("2018-11-02T00:00:00Z"),
        "prices" : [
                {
                        "price" : 1,
                        "quantity_sold" : 1
                }
        ],
        "quantity_sold" : 1
}
{
        "_id" : ISODate("2018-11-03T00:00:00Z"),
        "prices" : [
                {
                        "price" : 1,
                        "quantity_sold" : 1
                },
                {
                        "price" : 2,
                        "quantity_sold" : 1
                }
        ],
        "quantity_sold" : 1
}

I have a schema in mongodb which looks like this.

first_level:[{
    first_item  : String,
    second_level:[{
        second_item: String,
        third_level:[{
            third_item :String,
            forth_level :[{//4th level
                    price               : Number, // 5th level
                    sales_date          : Date, 
                    quantity_sold       : Number
                }]
        }]
    }]
}]

1). I want to add quantity_sold based on matching criteria in first_item, second_item, third_item and sales_date

2). I also want to find average of all the quantity_sold in a specific date.

3). I also want to find average of all the quantity_sold in a specific date with there corresponding prices.

I have been very confused as to how I can go about this, I am coming from sql background so this is quite confusing

解决方案

Let's start with a basic disclaimer in that the main body of what answers the problem has already been answered here at Find in Double Nested Array MongoDB. And "for the record" the Double also applies to Triple or Quadrupal or ANY level of nesting as basically the same principle ALWAYS.

The other main point of any answer is also Don't NEST Arrays, since as is explained in that answer as well ( and I've repeated this many times ), whatever reason you "think" you have for "nesting" actually does not give you the advantages that you percieve it will. In fact "nesting" is really just making life far more difficult.

Nested Problems

The main misconception of any translation of a data structure from a "relational" model is just about always interpreted as "add a nested array level" for each and every associated model. What you are presenting here is no exception to this misconception as it very much appears to be "normalized" so that each sub-array contains the related items to it's parent.

MongoDB is a "document" based database, so it pretty much allows you to do this or in fact any data structure content you basically want. That does not however mean the data in such a form is easy to work with or indeed practical for the actual purpose.

Let's fill out the schema with some actual data to demonstrate:

{
  "_id": 1,
  "first_level": [
    {
      "first_item": "A",
      "second_level": [
        {
          "second_item": "A",
          "third_level": [
            { 
              "third_item": "A",
              "forth_level": [
                { 
                  "price": 1,
                  "sales_date": new Date("2018-10-31"),
                  "quantity": 1
                },
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-01"),
                  "quantity": 1
                },
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-02"),
                  "quantity": 1
                },
              ]
            },
            { 
              "third_item": "B",
              "forth_level": [
                { 
                  "price": 1,
                  "sales_date": new Date("2018-10-31"),
                  "quantity": 1
                },
              ]
            }
          ]
        },
        {
          "second_item": "A",
          "third_level": [
            { 
              "third_item": "B",
              "forth_level": [
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-03"),
                  "quantity": 1
                },
              ]
            }
          ]
        }
      ]
    },
    {
      "first_item": "A",
      "second_level": [
        {
          "second_item": "B",
          "third_level": [
            { 
              "third_item": "A",
              "forth_level": [
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-03"),
                  "quantity": 1
                },
              ]
            }
          ]
        }
      ]
    }
  ]
},
{
  "_id": 2,
  "first_level": [
    {
      "first_item": "A",
      "second_level": [
        {
          "second_item": "A",
          "third_level": [
            { 
              "third_item": "A",
              "forth_level": [
                { 
                  "price": 2,
                  "sales_date": new Date("2018-11-03"),
                  "quantity": 1
                },
                { 
                  "price": 1,
                  "sales_date": new Date("2018-10-31"),
                  "quantity": 1
                },
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-03"),
                  "quantity": 1
                }
              ]
            }
          ]
        }
      ]
    }
  ]
},
{
  "_id": 3,
  "first_level": [
    {
      "first_item": "A",
      "second_level": [
        {
          "second_item": "B",
          "third_level": [
            { 
              "third_item": "A",
              "forth_level": [
                { 
                  "price": 1,
                  "sales_date": new Date("2018-11-03"),
                  "quantity": 1
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

That's a "little" different from the structure in the question but for demonstration purposes it has the things we need to look at. Mainly there is an array in the document which has items with a sub-array, which in turn has items in a sub-array and so on. The "normalizing" here is of course by the identifiers on each "level" as an "item type" or whatever you actually have.

The core problem is you just want "some" of the data from within these nested arrays, and MongoDB really just wants to return the "document", which means you need to do some manipulation in order just to get to those matching "sub-items".

Even on the issue of "correctly" selecting the document which matches all of these "sub-criteria" requires extensive use of $elemMatch in order to get the correct combination of conditions on each level of array elements. You cannot use straight up "Dot Notation" due to the need for those multiple conditions. Without the $elemMatch statements you don't get the exact "combination" and just get documents where the condition was true on any array element.

As for actually "filtering out the array contents" then that is actually the part of additional difference:

db.collection.aggregate([
  { "$match": {
    "first_level": {
      "$elemMatch": {
        "first_item": "A",
        "second_level": {
          "$elemMatch": {
            "second_item": "A",
            "third_level": {
              "$elemMatch": {
                "third_item": "A",
                "forth_level": {
                  "$elemMatch": {
                    "sales_date": {
                      "$gte": new Date("2018-11-01"),
                      "$lt": new Date("2018-12-01")
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }},
  { "$addFields": {
    "first_level": {
      "$filter": {
        "input": {
          "$map": {
            "input": "$first_level",
            "in": {
              "first_item": "$$this.first_item",
              "second_level": {
                "$filter": {
                  "input": {
                    "$map": {
                      "input": "$$this.second_level",
                      "in": {
                        "second_item": "$$this.second_item",
                        "third_level": {
                          "$filter": {
                            "input": {
                              "$map": {
                                "input": "$$this.third_level",
                                 "in": {
                                   "third_item": "$$this.third_item",
                                   "forth_level": {
                                     "$filter": {
                                       "input": "$$this.forth_level",
                                       "cond": {
                                         "$and": [
                                           { "$gte": [ "$$this.sales_date", new Date("2018-11-01") ] },
                                           { "$lt": [ "$$this.sales_date", new Date("2018-12-01") ] }
                                         ]
                                       }
                                     }
                                   }
                                 } 
                              }
                            },
                            "cond": {
                              "$and": [
                                { "$eq": [ "$$this.third_item", "A" ] },
                                { "$gt": [ { "$size": "$$this.forth_level" }, 0 ] }
                              ]
                            }
                          }
                        }
                      }
                    }
                  },
                  "cond": {
                    "$and": [
                      { "$eq": [ "$$this.second_item", "A" ] },
                      { "$gt": [ { "$size": "$$this.third_level" }, 0 ] }
                    ]
                  }
                }
              }
            }
          }
        },
        "cond": {
          "$and": [
            { "$eq": [ "$$this.first_item", "A" ] },
            { "$gt": [ { "$size": "$$this.second_level" }, 0 ] }
          ]
        } 
      }
    }
  }},
  { "$unwind": "$first_level" },
  { "$unwind": "$first_level.second_level" },
  { "$unwind": "$first_level.second_level.third_level" },
  { "$unwind": "$first_level.second_level.third_level.forth_level" },
  { "$group": {
    "_id": {
      "date": "$first_level.second_level.third_level.forth_level.sales_date",
      "price": "$first_level.second_level.third_level.forth_level.price",
    },
    "quantity_sold": {
      "$avg": "$first_level.second_level.third_level.forth_level.quantity"
    } 
  }},
  { "$group": {
    "_id": "$_id.date",
    "prices": {
      "$push": {
        "price": "$_id.price",
        "quanity_sold": "$quantity_sold"
      }
    },
    "quanity_sold": { "$avg": "$quantity_sold" }
  }}
])

This is best described as "messy" and "involved". Not only is our initial query for document selection with the $elemMatch more than a mouthful, but then we have the subsequent $filter and $map processing for every array level. As mentioned earlier, this is the pattern no matter how many levels there actually are.

You could alternately do a $unwind and $match combination instead of filtering the arrays in place, but this does cause additional overhead to $unwind before the unwanted content is removed, so in modern releases of MongoDB it's generally better practice to $filter from the array first.

The end place here is you want to $group by elements that are actually inside the array, so you end up needing to $unwind each level of the arrays anyway before this.

The actual "grouping" is then generally straightforward using the sales_date and price properties for the first accumulation, and then adding a subsequent stage to $push the different price values you want to accumulate an average for within each date as a second accumulation.

NOTE: The actual handling of dates may well vary in practical use depending on the granularity at which you store them. In this sample the dates are all just already rounded to the start of each "day". If you actually need to accumulate real "datetime" values, then you probably really want a construct like this or similar:

{ "$group": {
  "_id": {
    "date": {
      "$dateFromParts": {
        "year": { "$year": "$first_level.second_level.third_level.forth_level.sales_date" },
        "month": { "$month": "$first_level.second_level.third_level.forth_level.sales_date" },
        "day": { "$dayOfMonth": "$first_level.second_level.third_level.forth_level.sales_date" }
      }
    }.
    "price": "$first_level.second_level.third_level.forth_level.price"
  }
  ...
}}

Using $dateFromParts and other date aggregation operators to extract the "day" information and present the date back in that form for accumulation.

Starting to Denormalize

What should be clear from the "mess" above is that working with nested arrays is not exactly easy. Such structures were generally not even possible to atomically update in releases prior to MongoDB 3.6, and even if you never even updated them or lived with replacing basically the whole array, they still are not simple to query. This is what you are being shown.

Where you must have array content within a parent document it is generally advised to "flatten" and "denormalize" such structures. This may appear contrary to relational thinking, but it's actually the best way to handle such data for performance reasons:

{
  "_id": 1,
  "data": [
    {
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-10-31"),
      "quantity": 1
    },

    { 
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-11-01"),
      "quantity": 1
    },
    { 
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-11-02"),
      "quantity": 1
    },
    { 
      "first_item": "A",
      "second_item": "A",
      "third_item": "B",
      "price": 1,
      "sales_date": new Date("2018-10-31"),
      "quantity": 1
    },
    {
     "first_item": "A",
     "second_item": "A",
     "third_item": "B",
     "price": 1,
     "sales_date": new Date("2018-11-03"),
     "quantity": 1
    },
    {
      "first_item": "A",
      "second_item": "B",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-11-03"),
      "quantity": 1
     },
  ]
},
{
  "_id": 2,
  "data": [
    {
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 2,
      "sales_date": new Date("2018-11-03"),
      "quantity": 1
    },
    { 
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-10-31"),
      "quantity": 1
    },
    { 
      "first_item": "A",
      "second_item": "A",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-11-03"),
      "quantity": 1
    }
  ]
},
{
  "_id": 3,
  "data": [
    {
      "first_item": "A",
      "second_item": "B",
      "third_item": "A",
      "price": 1,
      "sales_date": new Date("2018-11-03"),
      "quantity": 1
     }
  ]
}

That's all the same data as originally shown, yet instead of nesting we actually just put everything into a singular flattened array within each parent document. Sure this means duplication of various data points, but the difference in query complexity and performance should be self evident:

db.collection.aggregate([
  { "$match": {
    "data": {
      "$elemMatch": {
        "first_item": "A",
        "second_item": "A",
        "third_item": "A",
        "sales_date": {
          "$gte": new Date("2018-11-01"),
          "$lt": new Date("2018-12-01")
        }
      }
    }
  }},
  { "$addFields": {
    "data": {
      "$filter": {
        "input": "$data",
         "cond": {
           "$and": [
             { "$eq": [ "$$this.first_item", "A" ] },
             { "$eq": [ "$$this.second_item", "A" ] },
             { "$eq": [ "$$this.third_item", "A" ] },
             { "$gte": [ "$$this.sales_date", new Date("2018-11-01") ] },
             { "$lt": [ "$$this.sales_date", new Date("2018-12-01") ] }
           ]
         }
      }
    }
  }},
  { "$unwind": "$data" },
  { "$group": {
    "_id": {
      "date": "$data.sales_date",
      "price": "$data.price",
    },
    "quantity_sold": { "$avg": "$data.quantity" }
  }},
  { "$group": {
    "_id": "$_id.date",
    "prices": {
      "$push": {
        "price": "$_id.price",
        "quantity_sold": "$quantity_sold"
      }
    },
    "quantity_sold": { "$avg": "$quantity_sold" }
  }}
])

Now instead of nesting those $elemMatch calls and similarly for the $filter expressions, everything is much clearer and easy to read and really quite simple in processing. There is another advantage in that you actually can even index the keys of the elements in the array as used in the query. That was a constraint of the nested model where MongoDB simply won't allow such "Multikey indexing" on keys of arrays within arrays. With a single array this is allowed and can be used to improve performance.

Everything after the "array content filtering" then remains exactly the same, with the exception it's just path names like "data.sales_date" as opposed to the long winded "first_level.second_level.third_level.forth_level.sales_date" from the previous structure.

When NOT to Embed

Finally the other big misconception is is that ALL Relations need to be translated as embedding within arrays. This really was never the intent of MongoDB and you were only ever meant to keep "related" data within the same document in an array in the case where it meant doing a single retrieval of data as opposed to "joins".

The classic "Order/Details" model here typically applies where in the modern world you want to display "header" for an "Order" with details such as customer address, order total and so on within the same "screen" as the details of different line items on the "Order".

Way back in the inception of the RDBMS, the typical 80 character by 25 line screen simply had such "header" information on one screen, then the detail lines for everything purchased was on a different screen. So naturally there was some level of common sense to store those in separate tables. As the world moved to more detail on such "screens" you typically want to see the whole thing, or at least the "header" and the first so many lines of such an "order".

Hence why this sort of arrangement makes sense to put into an array, since MongoDB returns a "document" containing the related data all at once. No need for separate requests for separate rendered screens and no need for "joins" on such data since it's already "pre-joined" as it were.

Consider if you need it - AKA "Fully" Denormalize

So in cases where you pretty much know you're not actually interested in dealing with most of the data in such arrays most of the time, it generally makes more sense to simply put it all in one collection on it's own with merely another property in order to identify the "parent" should such "joining" be occasionally required:

{
  "_id": 1,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-10-31"),
  "quantity": 1
},
{ 
  "_id": 2,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-11-01"),
  "quantity": 1
},
{ 
  "_id": 3,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-11-02"),
  "quantity": 1
},
{ 
  "_id": 4,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "A",
  "third_item": "B",
  "price": 1,
  "sales_date": new Date("2018-10-31"),
  "quantity": 1
},
{
  "_id": 5,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "A",
  "third_item": "B",
  "price": 1,
  "sales_date": new Date("2018-11-03"),
  "quantity": 1
},
{
  "_id": 6,
  "parent_id": 1,
  "first_item": "A",
  "second_item": "B",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-11-03"),
  "quantity": 1
},
{
  "_id": 7,
  "parent_id": 2,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 2,
  "sales_date": new Date("2018-11-03"),
  "quantity": 1
},
{ 
  "_id": 8,
  "parent_id": 2,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-10-31"),
  "quantity": 1
},
{ 
  "_id": 9,
  "parent_id": 2,
  "first_item": "A",
  "second_item": "A",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-11-03"),
  "quantity": 1
},
{
  "_id": 10,
  "parent_id": 3,
  "first_item": "A",
  "second_item": "B",
  "third_item": "A",
  "price": 1,
  "sales_date": new Date("2018-11-03"),
  "quantity": 1
}

Again it's the same data, but just this time in completely separate documents with a reference to the parent at best in the case where you might actually need it for another purpose. Note that the aggregations here all do not relate to the parent data at all and it's also clear where the additional performance and removed complexity come in by simply storing in a separate collection:

db.collection.aggregate([
  { "$match": {
    "first_item": "A",
    "second_item": "A",
    "third_item": "A",
    "sales_date": {
      "$gte": new Date("2018-11-01"),
      "$lt": new Date("2018-12-01")
    }
  }},
  { "$group": {
    "_id": {
      "date": "$sales_date",
      "price": "$price"
    },
    "quantity_sold": { "$avg": "$quantity" }
  }},
  { "$group": {
    "_id": "$_id.date",
    "prices": {
      "$push": {
        "price": "$_id.price",
        "quantity_sold": "$quantity_sold"
      }
    },
    "quantity_sold": { "$avg": "$quantity_sold" }
  }}
])

Since everything is already a document there is no need to "filter down arrays" or have any of the other complexity. All you are doing is selecting the matching documents and aggregating the results, with exactly the same two final steps that have been present all along.

For the purpose of just getting to the final results, this performs far better than either above alternative. The query in question is really only concerned with the "detail" data, therefore the best course of action is to separate the detail from the parent completely as it's always going to provide the best performance benefit.

And the overall point here is where the actual access pattern of the rest of the application NEVER needs to return the entire array content, then it probably should not have been embedded anyway. Seemingly most "write" operations should similarly never need to touch the related parent anyway, and that's another deciding factor where this works or does not.

Conclusion

The general message is again that as a general rule you should never nest arrays. At most you should keep a "singular" array with partially denormalized data within the related parent document, and where the remaining access patterns really do not use the parent and child in tandem much at all, then the data really should be separated.

The "big" change is that all the reasons why you think normalizing data is actually good, turns out to be the enemy of such embedded document systems. Avoiding "joins" is always good, but creating complex nested structure to have the appearance of "joined" data never really works out for your benefit either.

The cost of dealing with what you "think" is normalization usually ends up outwaying the additional storage and maintenance of duplicated and denormalized data within your eventual storage.

Also note that all forms above return the same result set. It's pretty derivative in that the sample data for brevity only includes singular items, or at most where there are multiple price points the "average" is still 1 since that's what all the values are anyway. But the content to explain this is already exceedingly long so it's really just "by example":

{
        "_id" : ISODate("2018-11-01T00:00:00Z"),
        "prices" : [
                {
                        "price" : 1,
                        "quantity_sold" : 1
                }
        ],
        "quantity_sold" : 1
}
{
        "_id" : ISODate("2018-11-02T00:00:00Z"),
        "prices" : [
                {
                        "price" : 1,
                        "quantity_sold" : 1
                }
        ],
        "quantity_sold" : 1
}
{
        "_id" : ISODate("2018-11-03T00:00:00Z"),
        "prices" : [
                {
                        "price" : 1,
                        "quantity_sold" : 1
                },
                {
                        "price" : 2,
                        "quantity_sold" : 1
                }
        ],
        "quantity_sold" : 1
}

这篇关于在node.js mongodb中求和嵌套数组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆