Mongodb aggregation $group, restrict length of array

Question

I want to group all the documents according to a field, but restrict the number of documents grouped for each value.

Each message has a conversation_ID. I need to get at most 10 messages for each conversation_ID.

I am able to group according to the following command, but can't figure out how to restrict the number of grouped documents other than by slicing the results:

Message.aggregate({'$group':{_id:'$conversation_ID',msgs:{'$push':{msgid:'$_id'}}}})

How do I limit the length of the msgs array to 10 for each conversation_ID?

Answer

Modern

From MongoDB 3.6 there is a "novel" approach to this by using $lookup to perform a "self join" in much the same way as the original cursor processing demonstrated below.

Since in this release you can specify a "pipeline" argument to $lookup as a source for the "join", this essentially means you can use $match and $limit to gather and "limit" the entries for the array:

db.messages.aggregate([
  { "$group": { "_id": "$conversation_ID" } },
  { "$lookup": {
    "from": "messages",
    "let": { "conversation": "$_id" },
    "pipeline": [
      { "$match": { "$expr": { "$eq": [ "$conversation_ID", "$$conversation" ] } }},
      { "$limit": 10 },
      { "$project": { "_id": 1 } }
    ],
    "as": "msgs"
  }}
])

You can optionally add a further projection after the $lookup in order to make the array items simply the values rather than documents with an _id key, but the basic result is there by simply doing the above.

There is still the outstanding SERVER-9277 which actually requests a "limit to push" directly, but using $lookup in this way is a viable alternative in the interim.

NOTE: There is also the $slice operator, which was introduced after the original answer was written and is mentioned by the "outstanding JIRA issue" in the original content. Whilst you can get the same result with small result sets, it still involves "pushing everything" into the array and then later limiting the final array output to the desired length.

So that's the main distinction, and why it's generally not practical to use $slice for large results. But of course it can be used instead in cases where it is practical.

There are a few more details on either alternate usage in mongodb group values by multiple fields.


Original

As stated earlier, this is not impossible but certainly a horrible problem.

Actually, if your main concern is that your resulting arrays are going to be exceptionally large, then your best approach is to submit an individual query for each distinct "conversation_ID" and then combine your results. In MongoDB 2.6 shell syntax, which might need some tweaking depending on what your language implementation actually is:

var results = [];
db.messages.aggregate([
    { "$group": {
        "_id": "$conversation_ID"
    }}
]).forEach(function(doc) {
    db.messages.aggregate([
        { "$match": { "conversation_ID": doc._id } },
        { "$limit": 10 },
        { "$group": {
            "_id": "$conversation_ID",
            "msgs": { "$push": "$_id" }
        }}
    ]).forEach(function(res) {
        results.push( res );
    });
});

But it all depends on whether that is what you are trying to avoid. So on to the real answer:

The first issue here is that there is no function to "limit" the number of items that are "pushed" into an array. It is certainly something we would like, but the functionality does not presently exist.

The second issue is that even when pushing all items into an array, you cannot use $slice, or any similar operator in the aggregation pipeline. So there is no present way to get just the "top 10" results from a produced array with a simple operation.

But you can actually produce a set of operations to effectively "slice" on your grouping boundaries. It is fairly involved, and for example here I will reduce the array elements "sliced" to "six" only. The main reason here is to demonstrate the process and show how to do this without being destructive with arrays that do not contain the total you want to "slice" to.

Given a sample of documents:

{ "_id" : 1, "conversation_ID" : 123 }
{ "_id" : 2, "conversation_ID" : 123 }
{ "_id" : 3, "conversation_ID" : 123 }
{ "_id" : 4, "conversation_ID" : 123 }
{ "_id" : 5, "conversation_ID" : 123 }
{ "_id" : 6, "conversation_ID" : 123 }
{ "_id" : 7, "conversation_ID" : 123 }
{ "_id" : 8, "conversation_ID" : 123 }
{ "_id" : 9, "conversation_ID" : 123 }
{ "_id" : 10, "conversation_ID" : 123 }
{ "_id" : 11, "conversation_ID" : 123 }
{ "_id" : 12, "conversation_ID" : 456 }
{ "_id" : 13, "conversation_ID" : 456 }
{ "_id" : 14, "conversation_ID" : 456 }
{ "_id" : 15, "conversation_ID" : 456 }
{ "_id" : 16, "conversation_ID" : 456 }

You can see there that when grouping by your conditions you will get one array with eleven elements and another with "five". What you want to do here is reduce both to the top "six" without "destroying" the array that will only match "five" elements.

And with the following query:

db.messages.aggregate([
    { "$group": {
        "_id": "$conversation_ID",
        "first": { "$first": "$_id" },
        "msgs": { "$push": "$_id" },
    }},
    { "$unwind": "$msgs" },
    { "$project": {
        "msgs": 1,
        "first": 1,
        "seen": { "$eq": [ "$first", "$msgs" ] }
    }},
    { "$sort": { "seen": 1 }},
    { "$group": {
        "_id": "$_id",
        "msgs": { 
            "$push": {
               "$cond": [ { "$not": "$seen" }, "$msgs", false ]
            }
        },
        "first": { "$first": "$first" },
        "second": { "$first": "$msgs" }
    }},
    { "$unwind": "$msgs" },
    { "$project": {
        "msgs": 1,
        "first": 1,
        "second": 1,
        "seen": { "$eq": [ "$second", "$msgs" ] }
    }},
    { "$sort": { "seen": 1 }},
    { "$group": {
        "_id": "$_id",
        "msgs": { 
            "$push": {
               "$cond": [ { "$not": "$seen" }, "$msgs", false ]
            }
        },
        "first": { "$first": "$first" },
        "second": { "$first": "$second" },
        "third": { "$first": "$msgs" }
    }},
    { "$unwind": "$msgs" },
    { "$project": {
        "msgs": 1,
        "first": 1,
        "second": 1,
        "third": 1,
        "seen": { "$eq": [ "$third", "$msgs" ] },
    }},
    { "$sort": { "seen": 1 }},
    { "$group": {
        "_id": "$_id",
        "msgs": { 
            "$push": {
               "$cond": [ { "$not": "$seen" }, "$msgs", false ]
            }
        },
        "first": { "$first": "$first" },
        "second": { "$first": "$second" },
        "third": { "$first": "$third" },
        "forth": { "$first": "$msgs" }
    }},
    { "$unwind": "$msgs" },
    { "$project": {
        "msgs": 1,
        "first": 1,
        "second": 1,
        "third": 1,
        "forth": 1,
        "seen": { "$eq": [ "$forth", "$msgs" ] }
    }},
    { "$sort": { "seen": 1 }},
    { "$group": {
        "_id": "$_id",
        "msgs": { 
            "$push": {
               "$cond": [ { "$not": "$seen" }, "$msgs", false ]
            }
        },
        "first": { "$first": "$first" },
        "second": { "$first": "$second" },
        "third": { "$first": "$third" },
        "forth": { "$first": "$forth" },
        "fifth": { "$first": "$msgs" }
    }},
    { "$unwind": "$msgs" },
    { "$project": {
        "msgs": 1,
        "first": 1,
        "second": 1,
        "third": 1,
        "forth": 1,
        "fifth": 1,
        "seen": { "$eq": [ "$fifth", "$msgs" ] }
    }},
    { "$sort": { "seen": 1 }},
    { "$group": {
        "_id": "$_id",
        "msgs": { 
            "$push": {
               "$cond": [ { "$not": "$seen" }, "$msgs", false ]
            }
        },
        "first": { "$first": "$first" },
        "second": { "$first": "$second" },
        "third": { "$first": "$third" },
        "forth": { "$first": "$forth" },
        "fifth": { "$first": "$fifth" },
        "sixth": { "$first": "$msgs" },
    }},
    { "$project": {
         "first": 1,
         "second": 1,
         "third": 1,
         "forth": 1,
         "fifth": 1,
         "sixth": 1,
         "pos": { "$const": [ 1,2,3,4,5,6 ] }
    }},
    { "$unwind": "$pos" },
    { "$group": {
        "_id": "$_id",
        "msgs": {
            "$push": {
                "$cond": [
                    { "$eq": [ "$pos", 1 ] },
                    "$first",
                    { "$cond": [
                        { "$eq": [ "$pos", 2 ] },
                        "$second",
                        { "$cond": [
                            { "$eq": [ "$pos", 3 ] },
                            "$third",
                            { "$cond": [
                                { "$eq": [ "$pos", 4 ] },
                                "$forth",
                                { "$cond": [
                                    { "$eq": [ "$pos", 5 ] },
                                    "$fifth",
                                    { "$cond": [
                                        { "$eq": [ "$pos", 6 ] },
                                        "$sixth",
                                        false
                                    ]}
                                ]}
                            ]}
                        ]}
                    ]}
                ]
            }
        }
    }},
    { "$unwind": "$msgs" },
    { "$match": { "msgs": { "$ne": false } }},
    { "$group": {
        "_id": "$_id",
        "msgs": { "$push": "$msgs" }
    }}
])

You get the top results in the array, up to six entries:

{ "_id" : 123, "msgs" : [ 1, 2, 3, 4, 5, 6 ] }
{ "_id" : 456, "msgs" : [ 12, 13, 14, 15 ] }


As you can see here, loads of fun.

After you have initially grouped, you basically want to "pop" the $first value off of the stack for the array results. To simplify this process a little, we actually do this in the initial operation. So the process becomes:

  • $unwind the array
  • Compare to the values already seen with an $eq equality match
  • $sort the results to "float" the unseen ( false ) values to the top ( this still retains order )
  • $group back again and "pop" the $first unseen value as the next member on the stack. Also this uses the $cond operator to replace "seen" values in the array stack with false to help in the evaluation ( see the annotated excerpt below ).

The final action with $cond is there to make sure that future iterations are not just adding the last value of the array over and over where the "slice" count is greater than the number of array members.

That whole process needs to be repeated for as many items as you wish to "slice". Since we already found the "first" item in the initial grouping, that means n-1 iterations for the desired slice result.

The final steps are really just an optional illustration of converting everything back into arrays for the result as finally shown. So really just conditionally pushing items or false back by their matching position and finally "filtering" out all the false values so the end arrays have "six" and "five" members respectively.

So there is not a standard operator to accommodate this, and you cannot just "limit" the push to 5 or 10 or whatever items in the array. But if you really have to do it, then this is your best approach.

You could possibly approach this with mapReduce and forsake the aggregation framework altogether. The approach I would take ( within reasonable limits ) would be to effectively have an in-memory hash-map on the server and accumulate arrays to that, while using JavaScript slice to "limit" the results:

db.messages.mapReduce(
    function () {

        if ( !stash.hasOwnProperty(this.conversation_ID) ) {
            stash[this.conversation_ID] = [];
        }

        if ( stash[this.conversation_ID].length < maxLen ) {
            stash[this.conversation_ID].push( this._id );
            emit( this.conversation_ID, 1 );
        }

    },
    function(key,values) {
        return 1;   // really just want to keep the keys
    },
    { 
        "scope": { "stash": {}, "maxLen": 10 },
        "finalize": function(key,value) {
            return { "msgs": stash[key] };                
        },
        "out": { "inline": 1 }
    }
)

So that just basically builds up the "in-memory" object matching the emitted "keys" with an array never exceeding the maximum size you want to fetch from your results. Additionally this does not even bother to "emit" the item when the maximum stack is met.

The reduce part actually does nothing other than essentially just reduce to a "key" and a single value. So just in case our reducer did not get called, as would be true if only 1 value existed for a key, the finalize function takes care of mapping the "stash" keys to the final output.

The effectiveness of this varies with the size of the output, and JavaScript evaluation is certainly not fast, but it is possibly faster than processing large arrays in a pipeline.

Vote up the JIRA issues to actually have a "slice" operator or even a "limit" on "$push" and "$addToSet", which would both be handy. Personally I am hoping that at least some modification can be made to the $map operator to expose the "current index" value when processing. That would effectively allow "slicing" and other operations.

Really you would want to code this up to "generate" all of the required iterations. If the answer here gets enough love and/or other time pending that I have in tuits, then I might add some code to demonstrate how to do this. It is already a reasonably long response.

Code to generate the pipeline:

var key = "$conversation_ID";
var val = "$_id";
var maxLen = 10;

var stack = [];
var pipe = [];
var fproj = { "$project": { "pos": { "$const": []  } } };

for ( var x = 1; x <= maxLen; x++ ) {

    fproj["$project"][""+x] = 1;
    fproj["$project"]["pos"]["$const"].push( x );

    var rec = {
        "$cond": [ { "$eq": [ "$pos", x ] }, "$"+x ]
    };
    if ( stack.length == 0 ) {
        rec["$cond"].push( false );
    } else {
        var lval = stack.pop();
        rec["$cond"].push( lval );
    }

    stack.push( rec );

    if ( x == 1) {
        pipe.push({ "$group": {
           "_id": key,
           "1": { "$first": val },
           "msgs": { "$push": val }
        }});
    } else {
        pipe.push({ "$unwind": "$msgs" });
        var proj = {
            "$project": {
                "msgs": 1
            }
        };
        
        proj["$project"]["seen"] = { "$eq": [ "$"+(x-1), "$msgs" ] };
       
        var grp = {
            "$group": {
                "_id": "$_id",
                "msgs": {
                    "$push": {
                        "$cond": [ { "$not": "$seen" }, "$msgs", false ]
                    }
                }
            }
        };

        for ( var n = x; n >= 1; n-- ) {
            if ( n != x ) 
                proj["$project"][""+n] = 1;
            grp["$group"][""+n] = ( n == x ) ? { "$first": "$msgs" } : { "$first": "$"+n };
        }

        pipe.push( proj );
        pipe.push({ "$sort": { "seen": 1 } });
        pipe.push(grp);
    }
}

pipe.push(fproj);
pipe.push({ "$unwind": "$pos" });
pipe.push({
    "$group": {
        "_id": "$_id",
        "msgs": { "$push": stack[0] }
    }
});
pipe.push({ "$unwind": "$msgs" });
pipe.push({ "$match": { "msgs": { "$ne": false } }});
pipe.push({
    "$group": {
        "_id": "$_id",
        "msgs": { "$push": "$msgs" }
    }
}); 

That builds the basic iterative approach up to maxLen with the steps from $unwind to $group. Also embedded in there are details of the final projections required and the "nested" conditional statement. The last is basically the approach taken on this question:

Does MongoDB's $in clause guarantee order?
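
The generated array of stages can then be inspected and passed straight to aggregate from the shell, for example:

printjson( pipe );                 // inspect the generated pipeline stages
db.messages.aggregate( pipe );     // run the generated pipeline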
