Mongodb聚合排序和组内限制 [英] Mongodb aggregate sort and limit within group

查看:14
本文介绍了Mongodb聚合排序和组内限制的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一组具有以下架构的待售商品:

I have a collection of items for sale with the following schema:

var itemSchema = new Schema({
    "category" : { type : Schema.Types.ObjectId, ref : 'Category' },
    "merchant" : { type : Schema.Types.ObjectId, ref : 'Merchant' },
    "rating" : Number
})

我继承了一个聚合查询,它返回匹配某个类别的商品,按商家分组,分组按组中的最高评分排序:

I have inherited an aggregate query which returns items matching a category, grouped by merchant, with the groups sorted by the maximum rating in the group:

Item.aggregate([
      { "$match" : { category : categoryId, merchant : { $ne : null }}},
      { "$group" : { _id : "$merchant", 
                    rating : { $max : "$rating" }, 
                    items : { $push : "$$ROOT" }}},
      { "$sort" : { rating : -1 }}
    ], { allowDiskUse : true })
    .skip(skip)
    .limit(limit)

在此之后,代码继续按评分对每个组中的项目进行排序,并删除每个组中评分最高的前 2 个项目以外的所有项目.

After this the code goes on to sort the items within each group by rating, and remove all but the top 2 highest rated items of each group.

是否可以在组内执行这种排序和限制作为聚合函数的一部分,以便聚合只返回每个组中评分最高的两个项目?

Is it possible to perform this sort and limit within the groups as a part of the aggregate function, so that the aggregate will only return the two highest rated items per group?

推荐答案

基本问题

在可预见的不久的将来,在当前的聚合框架中尝试这样做并不是最明智的想法.主要问题当然来自您已经拥有的代码中的这一行:

The Basic Problem

It's not the wisest idea out there to try and do this in the aggregation framework at current on in the forseeable near future. The main problem of course comes from this line in the code you already have:

"items" : { "$push": "$$ROOT" }

这正是意味着,基本上需要发生的是分组键中的所有对象都需要被推入一个数组,以便在任何后续代码中到达前 N 个"结果.

And that means exactly that, in that what needs to basically happen is that all objects within the grouping key need to be pushed into an array in order to get to the "top N" results in any later code.

这显然无法扩展,因为最终该数组本身的大小可以非常想象地超过 16MB 的 BSON 限制,而不管分组文档中的其余数据如何.这里的主要问题是不可能将推送限制"到特定数量的项目.长期存在的 JIRA 问题就是这样的事情.

This clearly does not scale as eventutally the size of that array itself can very conceivably exceed the BSON limit of 16MB, and regarless of the rest of the data in the grouped document. The main catch here being that it is not possible to "limit the push" to just a certain number of items. There is a long standing JIRA issue on just such a thing.

仅出于这个原因,最实用的方法是对每个分组键的前 N"项运行单独的查询.这些甚至不需要是 .aggregate() statments(取决于数据),实际上可以是任何简单地限制您想要的top N"值的东西.

For that reason alone, the most practical approach to this is to run individual queries for the "top N" items for each grouping key. These do not even need to be .aggregate() statments ( depending on the data ) and can really be anything that simply limits the "top N" values you want.

您的架构似乎在 node.jsmongoose 上,但任何支持异步 IO 和并行执行查询的东西都将是最佳选择.理想情况下,它有自己的 API 库,支持将这些查询的结果组合成一个响应.

Your architecture appears to be on node.js with mongoose, but anything that supports async IO and parallel execution of queries is going to be the best option. Ideally something with it's own API library that supports combining the results of those queries into a single response.

例如,使用您的架构和可用库(特别是 async ) 完全实现了这种并行和组合的结果:

For example there is this simplified example listing using your architecture and available libraries ( notably async ) that does this parallel and combined results exactly:

var async = require('async'),
    mongoose = require('mongoose'),
    Schema = mongoose.Schema;

mongoose.connect('mongodb://localhost/test');

var data = [
  { "merchant": 1, "rating": 1 },
  { "merchant": 1, "rating": 2 },
  { "merchant": 1, "rating": 3 },
  { "merchant": 2, "rating": 1 },
  { "merchant": 2, "rating": 2 },
  { "merchant": 2, "rating": 3 }
];

var testSchema = new Schema({
  merchant: Number,
  rating: Number
});

var Test = mongoose.model( 'Test', testSchema, 'test' );

async.series(
  [
    function(callback) {
      Test.remove({},callback);
    },
    function(callback) {
      async.each(data,function(item,callback) {
        Test.create(item,callback);
      },callback);
    },
    function(callback) {
      async.waterfall(
        [
          function(callback) {
            Test.distinct("merchant",callback);
          },
          function(merchants,callback) {
            async.concat(
              merchants,
              function(merchant,callback) {
                Test.find({ "merchant": merchant })
                  .sort({ "rating": -1 })
                  .limit(2)
                  .exec(callback);
              },
              function(err,results) {
                console.log(JSON.stringify(results,undefined,2));
                callback(err);
              }
            );
          }
        ],
        callback
      );
    }
  ],
  function(err) {
    if (err) throw err;
    mongoose.disconnect();
  }
);

这会导致输出中每个商家的前 2 个结果:

This results in just the top 2 results for each merchant in the output:

[
  {
    "_id": "560d153669fab495071553ce",
    "merchant": 1,
    "rating": 3,
    "__v": 0
  },
  {
    "_id": "560d153669fab495071553cd",
    "merchant": 1,
    "rating": 2,
    "__v": 0
  },
  {
    "_id": "560d153669fab495071553d1",
    "merchant": 2,
    "rating": 3,
    "__v": 0
  },
  {
    "_id": "560d153669fab495071553d0",
    "merchant": 2,
    "rating": 2,
    "__v": 0
  }
]

这确实是处理这个问题的最有效方法,尽管它会占用资源,因为它仍然是多个查询.但是,如果您尝试将所有文​​档存储在一个数组中并对其进行处理,那么聚合管道中消耗的资源远不及.

It really is the most efficient way to process this though it's going to take resources since it still is multiple queries. But nowhere near the resources eaten up in the aggregation pipeline if you attempt to store all documents in an array and process it.

对于该行,考虑到文档数量不会导致违反 BSON 限制,这是可能的.当前版本的 MongoDB 方法对此并不好,但即将发布的版本(在撰写本文时,3.1.8 dev 分支会这样做)至少将 $slice 运算符引入聚合管道.因此,如果您对聚合操作更聪明并使用 先$sort,然后就可以轻松挑出数组中已经排序好的项目了:

To that line, it is possible considering that the number of documents does not cause a breach in the BSON limit that this can be done. Methods with the current release of MongoDB are not great for this, but the upcoming release ( as of writing, 3.1.8 dev branch does this ) at least introduces a $slice operator to the aggregation pipeline. So if you are smarter about the aggregation operation and use a $sort first, then the already sorted items in the array can be picked out easily:

var async = require('async'),
    mongoose = require('mongoose'),
    Schema = mongoose.Schema;

mongoose.connect('mongodb://localhost/test');

var data = [
  { "merchant": 1, "rating": 1 },
  { "merchant": 1, "rating": 2 },
  { "merchant": 1, "rating": 3 },
  { "merchant": 2, "rating": 1 },
  { "merchant": 2, "rating": 2 },
  { "merchant": 2, "rating": 3 }
];

var testSchema = new Schema({
  merchant: Number,
  rating: Number
});

var Test = mongoose.model( 'Test', testSchema, 'test' );

async.series(
  [
    function(callback) {
      Test.remove({},callback);
    },
    function(callback) {
      async.each(data,function(item,callback) {
        Test.create(item,callback);
      },callback);
    },
    function(callback) {
      Test.aggregate(
        [
          { "$sort": { "merchant": 1, "rating": -1 } },
          { "$group": {
            "_id": "$merchant",
            "items": { "$push": "$$ROOT" }
          }},
          { "$project": {
            "items": { "$slice": [ "$items", 2 ] }
          }}
        ],
        function(err,results) {
          console.log(JSON.stringify(results,undefined,2));
          callback(err);
        }
      );
    }
  ],
  function(err) {
    if (err) throw err;
    mongoose.disconnect();
  }
);

这产生的基本结果与前 2 个项目在首先排序后从数组中切片"的基本结果相同.

Which yields the same basic result as the top 2 items are "sliced" from the array once they were sorted first.

在当前版本中它实际上也是可能的",但具有相同的基本约束,因为这仍然涉及在首先对内容进行排序后将所有内容推入一个数组中.它只是采用迭代"方法.您可以对此进行编码以生成更大条目的聚合管道,但仅显示两个"应该表明尝试这样做并不是一个好主意:

It is also actually "possible" in current releases, but with the same basic constraints in that this still involves pushing all content into an array after sorting the content first. It just takes an "iterative" approach. You can code this out to produce the aggregation pipeline for greater entries, but just showing "two" should show it's not a really great idea to try:

var async = require('async'),
    mongoose = require('mongoose'),
    Schema = mongoose.Schema;

mongoose.connect('mongodb://localhost/test');

var data = [
  { "merchant": 1, "rating": 1 },
  { "merchant": 1, "rating": 2 },
  { "merchant": 1, "rating": 3 },
  { "merchant": 2, "rating": 1 },
  { "merchant": 2, "rating": 2 },
  { "merchant": 2, "rating": 3 }
];

var testSchema = new Schema({
  merchant: Number,
  rating: Number
});

var Test = mongoose.model( 'Test', testSchema, 'test' );

async.series(
  [
    function(callback) {
      Test.remove({},callback);
    },
    function(callback) {
      async.each(data,function(item,callback) {
        Test.create(item,callback);
      },callback);
    },
    function(callback) {
      Test.aggregate(
        [
          { "$sort": { "merchant": 1, "rating": -1 } },
          { "$group": {
            "_id": "$merchant",
            "items": { "$push": "$$ROOT" }
          }},
          { "$unwind": "$items" },
          { "$group": {
            "_id": "$_id",
            "first": { "$first": "$items" },
            "items": { "$push": "$items" }
          }},
          { "$unwind": "$items" },
          { "$redact": {
            "$cond": [
              { "$eq": [ "$items", "$first" ] },
              "$$PRUNE",
              "$$KEEP"
            ]
          }},
          { "$group": {
            "_id": "$_id",
            "first": { "$first": "$first" },
            "second": { "$first": "$items" }
          }},
          { "$project": {
            "items": {
              "$map": {
                "input": ["A","B"],
                "as": "el",
                "in": {
                  "$cond": [
                    { "$eq": [ "$$el", "A" ] },
                    "$first",
                    "$second"
                  ]
                }
              }
            }
          }}
        ],
        function(err,results) {
          console.log(JSON.stringify(results,undefined,2));
          callback(err);
        }
      );
    }
  ],
  function(err) {
    if (err) throw err;
    mongoose.disconnect();
  }
);

虽然在早期版本中可能"(这是使用 2.6 引入的功能来缩短,因为您已经标记了 $$ROOT ),但基本步骤是存储数组,然后获取每个项目使用 $first 并将其(可能还有其他)与数组中的项目进行比较以删除它们,然后从堆栈中取出下一个第一个"项目,直到最终完成前 N 个".

And again while "possible" in earlier versions ( this is using 2.6 introduced features to shorten since you already tag $$ROOT ), the basic steps are storing the array and then getting each item "off the stack" using $first and comparing that ( and potentially others ) to items within the array to remove them and then get the "next first" item off that stack until your "top N" is eventually done.

直到有一天,有这样一个操作可以让 $push 聚合累加器中的项目被限制到一定的数量,那么这对于聚合来说并不是真正的实际操作.

Until the day comes that there is such an operation that allows the items in a $push aggregation accumulator to be limited to a certain count, then this is not really a practical operation for aggregate.

如果您在这些结果中拥有的数据足够小,那么您可以做到这一点,并且如果数据库服务器具有足够的规格以提供真正的优势,它甚至可能比客户端处理更有效.但在大多数合理使用的实际应用中,情况可能都不是这样.

You can do it, if the data you have in these results is small enough, and it might even just be more efficient than the client side processing if the database servers are of sufficient spec to provide a real advantage. But the chances are that neither is going to be the case in most real applications of reasonable usage.

最好的办法是使用首先演示的并行查询"选项.它总是可以很好地扩展,并且不需要围绕"这样的逻辑进行编码",即特定分组可能不会至少返回所需的总前 N"项并计算出如何保留它们(省略了更长的示例) 因为它只是执行每个查询并组合结果.

The best bet is to use the "parallel query" option demonstrated first. It's always going to scale well, and there is no need to "code around" such logic that a particular grouping might not return at least the total "top N" items required and work out how to retain them ( much longer example of that omitted ) as it simply performs each query and combines the results.

使用并行查询.它会比你所拥有的编码方法更好,并且它会在很长一段时间内优于聚合方法.直到至少有更好的选择.

Use parallel queries. It's going to be better than the coded approach you have, and it's going to outperform the aggregation approach demonstrated by a long way. Until there is a better option at least.

这篇关于Mongodb聚合排序和组内限制的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆