Mongodb聚合管道的大小和速度问题 [英] Mongodb aggregation pipeline size and speed issue

查看:64
本文介绍了Mongodb聚合管道的大小和速度问题的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图使用mongodb聚合查询来联接($ lookup)两个集合,然后分别计算联接数组中的所有唯一值. *注意:我不一定知道metaDataMap数组中的哪些字段(键).而且我不想统计或包括地图中可能存在或可能不存在的字段.这就是为什么聚合查询看起来像它的原因.

I'm trying to use a mongodb aggregation query to join($lookup) two collections and then distinct count all the unique values in the joined array. *Note: I don't necessarily know what fields(keys) are in the metaDataMap array. And I don't want to count or include fields that might or might not exist in the Map. So that's why the aggregation query looks like it does.

所以我的两个收藏看起来像这样:events-

So my two collections look like this: events-

{
"_id" : "1",
"name" : "event1",
"objectsIds" : [ "1", "2", "3" ],
}

对象

{
"_id" : "1",
"name" : "object1",
"metaDataMap" : { 
                     "SOURCE" : ["ABC", "DEF"],
                     "DESTINATION" : ["XYZ", "PDQ"],
                     "TYPE" : []
                }
},
{
"_id" : "2",
"name" : "object2",
"metaDataMap" : { 
                     "SOURCE" : ["RST", "LNE"],
                     "TYPE" : ["text"]
                }
},
{
"_id" : "3",
"name" : "object3",
"metaDataMap" : { 
                     "SOURCE" : ["NOP"],
                     "DESTINATION" : ["PHI", "NYC"],
                     "TYPE" : ["video"]
                }
}

我的结果是

{
_id:"SOURCE", count:5
_id:"DESTINATION", count: 4
_id:"TYPE", count: 2
}

到目前为止,我是这样的:

What I have so far is this:

db.events.aggregate([
{$match: {"_id" : id}}

,{$lookup: {"from" : "objects",
        "localField" : "objectsIds",
        "foreignField" : "_id",
        "as" : "objectResults"}}

,{$unwind: "$objectResults"} //Line 1
,{$project: {x: "$objectResults.metaDataMap"}} //Line 2


,{$unwind: "$x"}
,{$project: {"_id":0}}

,{$project: {x: {$objectToArray: "$x"}}}
,{$unwind: "$x"}

,{$group: {_id: "$x.k", tmp: {$push: "$x.v"}}}

,{$addFields: {tmp: {$reduce:{
input: "$tmp",
initialValue:[],
in:{$concatArrays: [ "$$value", "$$this"]}
    }}
}}

,{$unwind: "$tmp"}
,{$group: {_id: "$_id", uniqueVals: {$addToSet: "$tmp"}}}

,{$addFields: {count: {"$size":"$uniqueVals"}}}
,{$project: {_id: "$_id", count: "$count"}}
]);

我的问题是我标记了第1和2行.上面的方法可以工作,但是metaDataMap数组字段(objectsResults.metaDataMap)中的25,000个值大约需要50秒.例如,在对象1 metaDataMap SOURCE数组中有25,000个值.那是减慢速度的方法.我执行此操作的另一种更快的方法是将行1和行2替换为:

My issue is were I marked line 1&2. The above works but takes around 50 seconds for 25,000 values in the metaDataMap array fields(objectsResults.metaDataMap). So for example having a 25,000 values in object 1 metaDataMap SOURCE array. That's way to slow. My other faster way to do it was to replace line 1&2 with:

 ,{$project: {x: "$objectResults.metaDataMap"}} //Line 1
 ,{$unwind: "$x"} //Line 2

这是更快的方式(不到3秒),但是只能在具有约10,000个或更少项目的数据集上运行.更高的值会出现错误,提示超出最大文档大小".

This is way faster (under 3 seconds) but can only be run on datasets that have ~10,000 items or less. Anything higher and I get a error saying "exceeds maximum document size".

请帮助!

推荐答案

如果您可以更改object集合上的架构设计以包含parent_id字段,则可以立即删除以下内容的前4个阶段:您的管道(第一个$match$lookup$unwind$project).这将使对Line 1Line 2的担忧消失.

If you're able to alter your schema design on the object collection to include a parent_id field, you can immediately remove the first 4 stages of your pipeline (the first $match, $lookup, $unwind, and $project). This will make the concern about Line 1 and Line 2 disappear.

例如,object集合中的文档如下所示:

For example, a document in the object collection would look like:

{
  "_id": "1",
  "name": "object1",
  "metaDataMap": {
    "SOURCE": [
      "ABC",
      "DEF"
    ],
    "DESTINATION": [
      "XYZ",
      "PDQ"
    ],
    "TYPE": [ ]
  },
  "parent_id": "1"
}

因此,您不需要昂贵的$lookup$unwind.然后可以将前4个阶段替换为:

Thus you don't need the expensive $lookup and $unwind. The first 4 stages can then be replaced with:

{$match: {parent_id: id}}

基于这个想法,我对管道进行了进一步的优化,结果是:

Based on this idea, I did further optimization of the pipeline, which resulted in:

db.objects.aggregate([
     {$match: {parent_id: id}}
    ,{$project: {metaDataMap: {$filter: {input: {$objectToArray: '$metaDataMap'}, cond: {$ne: [[], '$$this.v']}}}}}
    ,{$unwind: '$metaDataMap'}
    ,{$unwind: '$metaDataMap.v'}
    ,{$group: {_id: '$metaDataMap.k', val: {$addToSet: '$metaDataMap.v'}}}
    ,{$project: {count: {$size: '$val'}}}
])

这将输出:

{ "_id": "TYPE", "count": 2 }
{ "_id": "DESTINATION", "count": 4 }
{ "_id": "SOURCE", "count": 5 }

这篇关于Mongodb聚合管道的大小和速度问题的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆