Mongodb 聚合排序和组内限制 [英] Mongodb aggregate sort and limit within group

查看:14
本文介绍了Mongodb 聚合排序和组内限制的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一系列具有以下架构的待售商品:

I have a collection of items for sale with the following schema:

var itemSchema = new Schema({
    "category" : { type : Schema.Types.ObjectId, ref : 'Category' },
    "merchant" : { type : Schema.Types.ObjectId, ref : 'Merchant' },
    "rating" : Number
})

我继承了一个聚合查询,它返回匹配一个类别的项目,按商家分组,组按组中的最大评级排序:

I have inherited an aggregate query which returns items matching a category, grouped by merchant, with the groups sorted by the maximum rating in the group:

Item.aggregate([
      { "$match" : { category : categoryId, merchant : { $ne : null }}},
      { "$group" : { _id : "$merchant", 
                    rating : { $max : "$rating" }, 
                    items : { $push : "$$ROOT" }}},
      { "$sort" : { rating : -1 }}
    ], { allowDiskUse : true })
    .skip(skip)
    .limit(limit)

此后,代码继续按评分对每个组中的项目进行排序,并删除除每个组中评分最高的前 2 个项目之外的所有项目.

After this the code goes on to sort the items within each group by rating, and remove all but the top 2 highest rated items of each group.

是否可以作为聚合函数的一部分在组内执行这种排序和限制,以便聚合将只返回每个组中评分最高的两个项目?

Is it possible to perform this sort and limit within the groups as a part of the aggregate function, so that the aggregate will only return the two highest rated items per group?

推荐答案

基本问题

在可预见的不久的将来,在当前的聚合框架中尝试执行此操作并不是最明智的想法.主要问题当然来自您已经拥有的代码中的这一行:

The Basic Problem

It's not the wisest idea out there to try and do this in the aggregation framework at current on in the forseeable near future. The main problem of course comes from this line in the code you already have:

"items" : { "$push": "$$ROOT" }

这意味着,基本上需要发生的是分组键内的所有对象都需要被推入一个数组中,以便在任何后续代码中获得前 N 个"结果.

And that means exactly that, in that what needs to basically happen is that all objects within the grouping key need to be pushed into an array in order to get to the "top N" results in any later code.

这显然不能扩展,因为最终该数组本身的大小可能会超过 16MB 的 BSON 限制,而不管分组文档中的其余数据.这里的主要问题是不可能限制推送"到一定数量的项目.有一个长期存在的 JIRA 问题就这样的事情.

This clearly does not scale as eventutally the size of that array itself can very conceivably exceed the BSON limit of 16MB, and regarless of the rest of the data in the grouped document. The main catch here being that it is not possible to "limit the push" to just a certain number of items. There is a long standing JIRA issue on just such a thing.

仅出于这个原因,最实用的方法是对每个分组键的前 N"项运行单独的查询.这些甚至不需要是 .aggregate() 语句(取决于数据),实际上可以是任何简单地限制您想要的前 N"值的东西.

For that reason alone, the most practical approach to this is to run individual queries for the "top N" items for each grouping key. These do not even need to be .aggregate() statments ( depending on the data ) and can really be anything that simply limits the "top N" values you want.

您的架构似乎在 node.jsmongoose 上,但任何支持异步 IO 和并行查询执行的将是最佳选择.理想情况下,它有自己的 API 库,支持将这些查询的结果组合成单个响应.

Your architecture appears to be on node.js with mongoose, but anything that supports async IO and parallel execution of queries is going to be the best option. Ideally something with it's own API library that supports combining the results of those queries into a single response.

例如,有一个使用您的架构和可用库的简化示例列表(特别是 async ) 精确地执行此并行和组合结果:

For example there is this simplified example listing using your architecture and available libraries ( notably async ) that does this parallel and combined results exactly:

var async = require('async'),
    mongoose = require('mongoose'),
    Schema = mongoose.Schema;

mongoose.connect('mongodb://localhost/test');

var data = [
  { "merchant": 1, "rating": 1 },
  { "merchant": 1, "rating": 2 },
  { "merchant": 1, "rating": 3 },
  { "merchant": 2, "rating": 1 },
  { "merchant": 2, "rating": 2 },
  { "merchant": 2, "rating": 3 }
];

var testSchema = new Schema({
  merchant: Number,
  rating: Number
});

var Test = mongoose.model( 'Test', testSchema, 'test' );

async.series(
  [
    function(callback) {
      Test.remove({},callback);
    },
    function(callback) {
      async.each(data,function(item,callback) {
        Test.create(item,callback);
      },callback);
    },
    function(callback) {
      async.waterfall(
        [
          function(callback) {
            Test.distinct("merchant",callback);
          },
          function(merchants,callback) {
            async.concat(
              merchants,
              function(merchant,callback) {
                Test.find({ "merchant": merchant })
                  .sort({ "rating": -1 })
                  .limit(2)
                  .exec(callback);
              },
              function(err,results) {
                console.log(JSON.stringify(results,undefined,2));
                callback(err);
              }
            );
          }
        ],
        callback
      );
    }
  ],
  function(err) {
    if (err) throw err;
    mongoose.disconnect();
  }
);

这只会导致输出中每个商家的前 2 个结果:

This results in just the top 2 results for each merchant in the output:

[
  {
    "_id": "560d153669fab495071553ce",
    "merchant": 1,
    "rating": 3,
    "__v": 0
  },
  {
    "_id": "560d153669fab495071553cd",
    "merchant": 1,
    "rating": 2,
    "__v": 0
  },
  {
    "_id": "560d153669fab495071553d1",
    "merchant": 2,
    "rating": 3,
    "__v": 0
  },
  {
    "_id": "560d153669fab495071553d0",
    "merchant": 2,
    "rating": 2,
    "__v": 0
  }
]

这确实是最有效的处理方式,尽管它会占用资源,因为它仍然是多个查询.但是,如果您尝试将所有文​​档存储在一个数组中并对其进行处理,那么在聚合管道中消耗的资源远不及.

It really is the most efficient way to process this though it's going to take resources since it still is multiple queries. But nowhere near the resources eaten up in the aggregation pipeline if you attempt to store all documents in an array and process it.

对于这一行,考虑到文档数量不会导致违反 BSON 限制,这是可以做到的.当前版本的 MongoDB 中的方法对此并不好,但即将发布的版本(在撰写本文时,3.1.8 dev 分支会这样做)至少向聚合管道引入了一个 $slice 运算符.因此,如果您对聚合操作更聪明并使用 先$sort,然后就可以很容易的挑出数组中已经排序好的项目:

To that line, it is possible considering that the number of documents does not cause a breach in the BSON limit that this can be done. Methods with the current release of MongoDB are not great for this, but the upcoming release ( as of writing, 3.1.8 dev branch does this ) at least introduces a $slice operator to the aggregation pipeline. So if you are smarter about the aggregation operation and use a $sort first, then the already sorted items in the array can be picked out easily:

var async = require('async'),
    mongoose = require('mongoose'),
    Schema = mongoose.Schema;

mongoose.connect('mongodb://localhost/test');

var data = [
  { "merchant": 1, "rating": 1 },
  { "merchant": 1, "rating": 2 },
  { "merchant": 1, "rating": 3 },
  { "merchant": 2, "rating": 1 },
  { "merchant": 2, "rating": 2 },
  { "merchant": 2, "rating": 3 }
];

var testSchema = new Schema({
  merchant: Number,
  rating: Number
});

var Test = mongoose.model( 'Test', testSchema, 'test' );

async.series(
  [
    function(callback) {
      Test.remove({},callback);
    },
    function(callback) {
      async.each(data,function(item,callback) {
        Test.create(item,callback);
      },callback);
    },
    function(callback) {
      Test.aggregate(
        [
          { "$sort": { "merchant": 1, "rating": -1 } },
          { "$group": {
            "_id": "$merchant",
            "items": { "$push": "$$ROOT" }
          }},
          { "$project": {
            "items": { "$slice": [ "$items", 2 ] }
          }}
        ],
        function(err,results) {
          console.log(JSON.stringify(results,undefined,2));
          callback(err);
        }
      );
    }
  ],
  function(err) {
    if (err) throw err;
    mongoose.disconnect();
  }
);

产生相同的基本结果,因为前 2 个项目在首先排序后从数组中切片".

Which yields the same basic result as the top 2 items are "sliced" from the array once they were sorted first.

在当前版本中它实际上也是可能的",但具有相同的基本约束,因为这仍然涉及先对内容进行排序后将所有内容推送到数组中.它只需要一种迭代"的方法.您可以将其编码以生成更多条目的聚合管道,但仅显示两个"应该表明尝试这不是一个好主意:

It is also actually "possible" in current releases, but with the same basic constraints in that this still involves pushing all content into an array after sorting the content first. It just takes an "iterative" approach. You can code this out to produce the aggregation pipeline for greater entries, but just showing "two" should show it's not a really great idea to try:

var async = require('async'),
    mongoose = require('mongoose'),
    Schema = mongoose.Schema;

mongoose.connect('mongodb://localhost/test');

var data = [
  { "merchant": 1, "rating": 1 },
  { "merchant": 1, "rating": 2 },
  { "merchant": 1, "rating": 3 },
  { "merchant": 2, "rating": 1 },
  { "merchant": 2, "rating": 2 },
  { "merchant": 2, "rating": 3 }
];

var testSchema = new Schema({
  merchant: Number,
  rating: Number
});

var Test = mongoose.model( 'Test', testSchema, 'test' );

async.series(
  [
    function(callback) {
      Test.remove({},callback);
    },
    function(callback) {
      async.each(data,function(item,callback) {
        Test.create(item,callback);
      },callback);
    },
    function(callback) {
      Test.aggregate(
        [
          { "$sort": { "merchant": 1, "rating": -1 } },
          { "$group": {
            "_id": "$merchant",
            "items": { "$push": "$$ROOT" }
          }},
          { "$unwind": "$items" },
          { "$group": {
            "_id": "$_id",
            "first": { "$first": "$items" },
            "items": { "$push": "$items" }
          }},
          { "$unwind": "$items" },
          { "$redact": {
            "$cond": [
              { "$eq": [ "$items", "$first" ] },
              "$$PRUNE",
              "$$KEEP"
            ]
          }},
          { "$group": {
            "_id": "$_id",
            "first": { "$first": "$first" },
            "second": { "$first": "$items" }
          }},
          { "$project": {
            "items": {
              "$map": {
                "input": ["A","B"],
                "as": "el",
                "in": {
                  "$cond": [
                    { "$eq": [ "$$el", "A" ] },
                    "$first",
                    "$second"
                  ]
                }
              }
            }
          }}
        ],
        function(err,results) {
          console.log(JSON.stringify(results,undefined,2));
          callback(err);
        }
      );
    }
  ],
  function(err) {
    if (err) throw err;
    mongoose.disconnect();
  }
);

再说一次,虽然在早期版本中是可能的"(这是使用 2.6 引入的功能来缩短,因为您已经标记了 $$ROOT ),基本步骤是存储数组,然后获取每个项目使用 $first 并将其(可能还有其他)与数组中的项目进行比较以删除它们,然后从该堆栈中取出下一个第一个"项目,直到您的前 N 个"最终完成.

And again while "possible" in earlier versions ( this is using 2.6 introduced features to shorten since you already tag $$ROOT ), the basic steps are storing the array and then getting each item "off the stack" using $first and comparing that ( and potentially others ) to items within the array to remove them and then get the "next first" item off that stack until your "top N" is eventually done.

直到有一天有这样的操作允许$push聚合累加器中的项目被限制在一定数量内,那么这不是真正的聚合操作.

Until the day comes that there is such an operation that allows the items in a $push aggregation accumulator to be limited to a certain count, then this is not really a practical operation for aggregate.

如果您在这些结果中拥有的数据足够小,您就可以做到,如果数据库服务器的规格足以提供真正的优势,它甚至可能比客户端处理更有效.但在大多数合理使用的实际应用中,情况可能都不是这样.

You can do it, if the data you have in these results is small enough, and it might even just be more efficient than the client side processing if the database servers are of sufficient spec to provide a real advantage. But the chances are that neither is going to be the case in most real applications of reasonable usage.

最好的办法是使用首先演示的并行查询"选项.它总是会很好地扩展,并且没有必要围绕"这样的逻辑进行编码",以至于特定的分组可能不会返回至少所需的前 N"项总数,并确定如何保留它们(省略了更长的示例)) 因为它只是执行每个查询并合并结果.

The best bet is to use the "parallel query" option demonstrated first. It's always going to scale well, and there is no need to "code around" such logic that a particular grouping might not return at least the total "top N" items required and work out how to retain them ( much longer example of that omitted ) as it simply performs each query and combines the results.

使用并行查询.它会比您拥有的编码方法更好,并且将远远超过所展示的聚合方法.直到至少有更好的选择.

Use parallel queries. It's going to be better than the coded approach you have, and it's going to outperform the aggregation approach demonstrated by a long way. Until there is a better option at least.

这篇关于Mongodb 聚合排序和组内限制的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆