MongoDB汇总组内的排序和限制 [英] Mongodb aggregate sort and limit within group

查看:109
本文介绍了MongoDB汇总组内的排序和限制的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个具有以下架构的待售商品集合:

I have a collection of items for sale with the following schema:

var itemSchema = new Schema({
    "category" : { type : Schema.Types.ObjectId, ref : 'Category' },
    "merchant" : { type : Schema.Types.ObjectId, ref : 'Merchant' },
    "rating" : Number
})

我继承了一个聚合查询,该查询返回与按商户分组的类别匹配的项目,其中分组按组中的最高评分排序:

I have inherited an aggregate query which returns items matching a category, grouped by merchant, with the groups sorted by the maximum rating in the group:

Item.aggregate([
      { "$match" : { category : categoryId, merchant : { $ne : null }}},
      { "$group" : { _id : "$merchant", 
                    rating : { $max : "$rating" }, 
                    items : { $push : "$$ROOT" }}},
      { "$sort" : { rating : -1 }}
    ], { allowDiskUse : true })
    .skip(skip)
    .limit(limit)

此后,代码继续按评分对每个组中的项目进行排序,并删除除每个组中评分最高的2个项目之外的所有项目.

After this the code goes on to sort the items within each group by rating, and remove all but the top 2 highest rated items of each group.

作为汇总功能的一部分,是否可以在组中执行这种排序和限制,以便汇总仅返回每个组中两个评分最高的项目?

Is it possible to perform this sort and limit within the groups as a part of the aggregate function, so that the aggregate will only return the two highest rated items per group?

推荐答案

基本问题

在可预见的不久的将来,尝试在聚合框架中执行此操作不是最明智的主意.当然,主要问题出在您已有的代码中的这一行:

The Basic Problem

It's not the wisest idea out there to try and do this in the aggregation framework at current on in the forseeable near future. The main problem of course comes from this line in the code you already have:

"items" : { "$push": "$$ROOT" }

这就意味着,基本上需要发生的事情是,需要将分组键中的所有对象都放入一个数组中,以便在以后的任何代码中到达前N个"结果.

And that means exactly that, in that what needs to basically happen is that all objects within the grouping key need to be pushed into an array in order to get to the "top N" results in any later code.

这显然无法扩展,因为最终该数组本身的大小可以非常想象地超过16MB的BSON限制,并且不占用分组文档中的其余数据.这里的主要问题是不可能将推送"仅限制为一定数量的项目.关于这种事情,有一个长期存在的JIRA问题.

This clearly does not scale as eventutally the size of that array itself can very conceivably exceed the BSON limit of 16MB, and regarless of the rest of the data in the grouped document. The main catch here being that it is not possible to "limit the push" to just a certain number of items. There is a long standing JIRA issue on just such a thing.

仅出于这个原因,最实用的方法是对每个分组键的前N个"项目运行单独的查询.这些甚至不必是.aggregate()语句(取决于数据),并且实际上可以是仅限制所需的前N个"值的任何内容.

For that reason alone, the most practical approach to this is to run individual queries for the "top N" items for each grouping key. These do not even need to be .aggregate() statments ( depending on the data ) and can really be anything that simply limits the "top N" values you want.

您的体系结构似乎与mongoose一起位于node.js上,但是任何支持异步IO和并行执行查询的方法将是最佳选择.理想情况下,它具有自己的API库,该库支持将那些查询的结果合并为一个响应.

Your architecture appears to be on node.js with mongoose, but anything that supports async IO and parallel execution of queries is going to be the best option. Ideally something with it's own API library that supports combining the results of those queries into a single response.

例如,有一个使用您的体系结构和可用库的简化示例清单(尤其是 async )做到这一点并准确地合并结果:

For example there is this simplified example listing using your architecture and available libraries ( notably async ) that does this parallel and combined results exactly:

var async = require('async'),
    mongoose = require('mongoose'),
    Schema = mongoose.Schema;

mongoose.connect('mongodb://localhost/test');

var data = [
  { "merchant": 1, "rating": 1 },
  { "merchant": 1, "rating": 2 },
  { "merchant": 1, "rating": 3 },
  { "merchant": 2, "rating": 1 },
  { "merchant": 2, "rating": 2 },
  { "merchant": 2, "rating": 3 }
];

var testSchema = new Schema({
  merchant: Number,
  rating: Number
});

var Test = mongoose.model( 'Test', testSchema, 'test' );

async.series(
  [
    function(callback) {
      Test.remove({},callback);
    },
    function(callback) {
      async.each(data,function(item,callback) {
        Test.create(item,callback);
      },callback);
    },
    function(callback) {
      async.waterfall(
        [
          function(callback) {
            Test.distinct("merchant",callback);
          },
          function(merchants,callback) {
            async.concat(
              merchants,
              function(merchant,callback) {
                Test.find({ "merchant": merchant })
                  .sort({ "rating": -1 })
                  .limit(2)
                  .exec(callback);
              },
              function(err,results) {
                console.log(JSON.stringify(results,undefined,2));
                callback(err);
              }
            );
          }
        ],
        callback
      );
    }
  ],
  function(err) {
    if (err) throw err;
    mongoose.disconnect();
  }
);

这将导致输出中每个商家的前2个结果:

This results in just the top 2 results for each merchant in the output:

[
  {
    "_id": "560d153669fab495071553ce",
    "merchant": 1,
    "rating": 3,
    "__v": 0
  },
  {
    "_id": "560d153669fab495071553cd",
    "merchant": 1,
    "rating": 2,
    "__v": 0
  },
  {
    "_id": "560d153669fab495071553d1",
    "merchant": 2,
    "rating": 3,
    "__v": 0
  },
  {
    "_id": "560d153669fab495071553d0",
    "merchant": 2,
    "rating": 2,
    "__v": 0
  }
]

这实际上是处理此问题的最有效方法,因为它仍然需要多个查询,因此会占用资源.但是,如果您尝试将所有文​​档存储在一个数组中并对其进行处理,则聚合管道中的资源不会被消耗掉.

It really is the most efficient way to process this though it's going to take resources since it still is multiple queries. But nowhere near the resources eaten up in the aggregation pipeline if you attempt to store all documents in an array and process it.

对于这一行,有可能考虑到文档数量不会导致违反BSON限制的情况,因此可以做到这一点.当前版本的MongoDB的方法对此并不适用,但是即将发布的版本(在撰写本文时,截至3.1.8 dev分支已做到这一点)至少在聚合管道中引入了$slice运算符.因此,如果您对聚合操作比较聪明,请使用 $sort ,然后可以很容易地挑选出数组中已排序的项目:

To that line, it is possible considering that the number of documents does not cause a breach in the BSON limit that this can be done. Methods with the current release of MongoDB are not great for this, but the upcoming release ( as of writing, 3.1.8 dev branch does this ) at least introduces a $slice operator to the aggregation pipeline. So if you are smarter about the aggregation operation and use a $sort first, then the already sorted items in the array can be picked out easily:

var async = require('async'),
    mongoose = require('mongoose'),
    Schema = mongoose.Schema;

mongoose.connect('mongodb://localhost/test');

var data = [
  { "merchant": 1, "rating": 1 },
  { "merchant": 1, "rating": 2 },
  { "merchant": 1, "rating": 3 },
  { "merchant": 2, "rating": 1 },
  { "merchant": 2, "rating": 2 },
  { "merchant": 2, "rating": 3 }
];

var testSchema = new Schema({
  merchant: Number,
  rating: Number
});

var Test = mongoose.model( 'Test', testSchema, 'test' );

async.series(
  [
    function(callback) {
      Test.remove({},callback);
    },
    function(callback) {
      async.each(data,function(item,callback) {
        Test.create(item,callback);
      },callback);
    },
    function(callback) {
      Test.aggregate(
        [
          { "$sort": { "merchant": 1, "rating": -1 } },
          { "$group": {
            "_id": "$merchant",
            "items": { "$push": "$$ROOT" }
          }},
          { "$project": {
            "items": { "$slice": [ "$items", 2 ] }
          }}
        ],
        function(err,results) {
          console.log(JSON.stringify(results,undefined,2));
          callback(err);
        }
      );
    }
  ],
  function(err) {
    if (err) throw err;
    mongoose.disconnect();
  }
);

与产生的基本结果相同,因为首先对数组中的前2个项目进行了排序".

Which yields the same basic result as the top 2 items are "sliced" from the array once they were sorted first.

在当前版本中它实际上也是可能的",但是具有相同的基本约束,因为这仍然涉及在首先对内容进行排序之后将所有内容推入数组.它仅需要一种迭代"方法.您可以对此进行编码,以生成用于更大条目的聚合管道,但是仅显示两个"应该表明尝试这样做不是一个好主意:

It is also actually "possible" in current releases, but with the same basic constraints in that this still involves pushing all content into an array after sorting the content first. It just takes an "iterative" approach. You can code this out to produce the aggregation pipeline for greater entries, but just showing "two" should show it's not a really great idea to try:

var async = require('async'),
    mongoose = require('mongoose'),
    Schema = mongoose.Schema;

mongoose.connect('mongodb://localhost/test');

var data = [
  { "merchant": 1, "rating": 1 },
  { "merchant": 1, "rating": 2 },
  { "merchant": 1, "rating": 3 },
  { "merchant": 2, "rating": 1 },
  { "merchant": 2, "rating": 2 },
  { "merchant": 2, "rating": 3 }
];

var testSchema = new Schema({
  merchant: Number,
  rating: Number
});

var Test = mongoose.model( 'Test', testSchema, 'test' );

async.series(
  [
    function(callback) {
      Test.remove({},callback);
    },
    function(callback) {
      async.each(data,function(item,callback) {
        Test.create(item,callback);
      },callback);
    },
    function(callback) {
      Test.aggregate(
        [
          { "$sort": { "merchant": 1, "rating": -1 } },
          { "$group": {
            "_id": "$merchant",
            "items": { "$push": "$$ROOT" }
          }},
          { "$unwind": "$items" },
          { "$group": {
            "_id": "$_id",
            "first": { "$first": "$items" },
            "items": { "$push": "$items" }
          }},
          { "$unwind": "$items" },
          { "$redact": {
            "$cond": [
              { "$eq": [ "$items", "$first" ] },
              "$$PRUNE",
              "$$KEEP"
            ]
          }},
          { "$group": {
            "_id": "$_id",
            "first": { "$first": "$first" },
            "second": { "$first": "$items" }
          }},
          { "$project": {
            "items": {
              "$map": {
                "input": ["A","B"],
                "as": "el",
                "in": {
                  "$cond": [
                    { "$eq": [ "$$el", "A" ] },
                    "$first",
                    "$second"
                  ]
                }
              }
            }
          }}
        ],
        function(err,results) {
          console.log(JSON.stringify(results,undefined,2));
          callback(err);
        }
      );
    }
  ],
  function(err) {
    if (err) throw err;
    mongoose.disconnect();
  }
);

再一次,尽管在较早版本中可能"(这是使用2.6引入的功能来缩短,因为您已经标记了$$ROOT),但是基本步骤是存储数组,然后使用 $first 并进行比较(并可能其他)到数组中的项目以将其删除,然后从堆栈中删除下一个优先"项,直到最终完成前N个".

And again while "possible" in earlier versions ( this is using 2.6 introduced features to shorten since you already tag $$ROOT ), the basic steps are storing the array and then getting each item "off the stack" using $first and comparing that ( and potentially others ) to items within the array to remove them and then get the "next first" item off that stack until your "top N" is eventually done.

直到有一天这种操作允许将$push聚合累加器中的项目限制为一定数量,这才不是真正的聚合操作.

Until the day comes that there is such an operation that allows the items in a $push aggregation accumulator to be limited to a certain count, then this is not really a practical operation for aggregate.

如果这些结果中的数据足够小,则可以执行此操作,并且如果数据库服务器的规格足以提供真正的优势,那么它甚至可能甚至比客户端处理更有效率.但是有机会在大多数合理使用的实际应用中都不会出现这种情况.

You can do it, if the data you have in these results is small enough, and it might even just be more efficient than the client side processing if the database servers are of sufficient spec to provide a real advantage. But the chances are that neither is going to be the case in most real applications of reasonable usage.

最好的选择是使用首先演示的并行查询"选项.它总是可以很好地扩展的,并且不需要编码"这样的逻辑,即特定的分组可能至少不会返回所需的全部前N个"项目,也不会弄清楚如何保留它们(更长的示例被省略了) ),因为它只执行每个查询并合并结果.

The best bet is to use the "parallel query" option demonstrated first. It's always going to scale well, and there is no need to "code around" such logic that a particular grouping might not return at least the total "top N" items required and work out how to retain them ( much longer example of that omitted ) as it simply performs each query and combines the results.

使用并行查询.它将比您拥有的编码方法更好,并且将大大优于长期展示的聚合方法.至少直到有更好的选择为止.

Use parallel queries. It's going to be better than the coded approach you have, and it's going to outperform the aggregation approach demonstrated by a long way. Until there is a better option at least.

这篇关于MongoDB汇总组内的排序和限制的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆