Is it possible to sort, group and limit efficiently in Mongo with a pipeline?

Question

Given users with an index on age:

{ name: 'Bob',
  age:   21  }

{ name: 'Cathy',
  age:   21  }

{ name: 'Joe',
  age:   33  }

To get the output:

[ 
  { _id: 21,
    names: ['Bob', 'Cathy'] },
  { _id: 33,
    names: ['Joe'] }
]

Is it possible to sort, group and limit by age?

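db.users.aggregate(
   [
      {
        // Sort by age first, hoping that $group keeps this order
        $sort: {
           age: 1
        }
      },
      {
        // One output document per age, collecting the names
        $group : {
           _id : '$age',
           names: { $push: '$name' }
        }
      },
      {
        $limit: 10
      }
   ]
)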

I did some research, but it's not clear whether it is possible to sort first and then group. In my testing, the group loses the sort order, but I don't see why.

If the group preserved the sort order, then the sort and limit could greatly reduce the required processing. It would only need to do enough work to "fill" the limit of 10 groups.

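So:

  1. Does group preserve the sort order? Or do I need to group and then sort?
  2. Is it possible to sort, group and limit while doing only enough processing to return the limit? Or does it need to process the entire collection and then limit?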

Recommended answer

To answer your first question: $group does not preserve the order. There are open change requests that also shed some light on the background, but it doesn't look like the product will be changed to preserve the order of the input documents:

  • https://jira.mongodb.org/browse/SERVER-24799
  • https://jira.mongodb.org/browse/SERVER-4507
  • https://jira.mongodb.org/browse/SERVER-21022

Two things can be said in general. First, you generally want to group first and then sort. The reason is that sorting fewer elements (which is what grouping generally produces) is faster than sorting all input documents.
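To make the group-then-sort order concrete, here is a minimal sketch in plain JavaScript that mimics the three stages on an in-memory array. It is not how MongoDB executes a pipeline; the helper names `groupByAge` and `groupSortLimit` are made up for illustration, and the sample documents are the ones from the question.

```javascript
// In-memory stand-in for the users collection (sample data from the question,
// deliberately out of order).
const users = [
  { name: 'Joe', age: 33 },
  { name: 'Bob', age: 21 },
  { name: 'Cathy', age: 21 },
];

// Group first: collect the names per age, similar in spirit to
// { $group: { _id: '$age', names: { $push: '$name' } } }.
function groupByAge(docs) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc.age)) groups.set(doc.age, []);
    groups.get(doc.age).push(doc.name);
  }
  return [...groups].map(([age, names]) => ({ _id: age, names }));
}

// Then sort the (usually much smaller) list of groups and apply the limit.
function groupSortLimit(docs, limit) {
  return groupByAge(docs)
    .sort((a, b) => a._id - b._id)
    .slice(0, limit);
}

const result = groupSortLimit(users, 10);
console.log(JSON.stringify(result));
```

With three users and two distinct ages, the sort only has to order two groups instead of three documents; the gap grows with the number of documents per group.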

Second, MongoDB makes sure to sort as efficiently, and as little, as possible. The documentation states:

When a $sort immediately precedes a $limit in the pipeline, the $sort operation only maintains the top n results as it progresses, where n is the specified limit, and MongoDB only needs to store n items in memory. This optimization still applies when allowDiskUse is true and the n items exceed the aggregation memory limit.
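The idea of "only maintaining the top n results" can be illustrated with a small plain JavaScript sketch. This is an assumption about the general technique (a bounded, sorted buffer), not MongoDB's actual implementation; `topN` is a hypothetical helper and the ages are made-up sample values.

```javascript
// Stream through the input while keeping at most n values in memory.
function topN(values, n) {
  const kept = []; // always sorted ascending, never longer than n
  for (const value of values) {
    // Buffer is full and this value is not smaller than the current
    // largest kept value, so it cannot be part of the result.
    if (kept.length === n && value >= kept[n - 1]) continue;

    // Find the insertion position that keeps the buffer sorted.
    let i = kept.length;
    while (i > 0 && kept[i - 1] > value) i--;
    kept.splice(i, 0, value);

    // Drop the largest value if the buffer grew past n.
    if (kept.length > n) kept.pop();
  }
  return kept;
}

const ages = [33, 21, 47, 18, 21, 60, 25];
const smallest = topN(ages, 3);
console.log(smallest);
```

The buffer never holds more than n + 1 values at a time, which is why a $sort followed directly by a $limit is much cheaper in memory than a full sort.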

So this code gets the job done in your case:

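db.users.aggregate([{
    // Group first: one document per age, collecting the names
    $group: {
        _id: '$age',
        names: { $push: '$name' }
    }
}, {
    // Then sort the groups by age (the group key)
    $sort: {
        '_id': 1
    }
}, {
    // $sort directly followed by $limit only keeps the top 10 in memory
    $limit: 10
}])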

EDIT following your comments:

I agree with what you say. Taking your logic a little further, I would go as far as saying: if $group were smart enough to use an index, it shouldn't even require a $sort stage at the start. Unfortunately, it isn't (probably not yet). As things stand today, $group will never use an index, and it won't take shortcuts based on the stages that follow it ($limit in this case). Also see this link, where someone ran some basic tests.

The aggregation framework is still pretty young, so I guess there is a lot of work being done to make the aggregation pipeline smarter and faster.

There are answers here on StackOverflow (e.g. here) where people suggest using an upfront $sort stage in order to somehow "force" MongoDB to use an index. This, however, slowed down my tests significantly (1 million records of your sample shape, using different random distributions).

When it comes to the performance of an aggregation pipeline, $match stages at the start are what really help the most. If you can limit the total number of records that need to go through the pipeline from the beginning, then that's your best bet - obviously... ;)
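As a rough sketch of that advice, the pipeline below puts a $match stage in front of the group, sort and limit stages from the answer. The age range of 18 to 40 is a purely hypothetical filter chosen for illustration; the pipeline is built as a plain array so it can be inspected without a database connection.

```javascript
// Hypothetical filter: only users aged 18 to 40 enter the pipeline.
// A $match at the very start can use the index on age.
const pipeline = [
  { $match: { age: { $gte: 18, $lte: 40 } } },
  { $group: { _id: '$age', names: { $push: '$name' } } },
  { $sort: { _id: 1 } },
  { $limit: 10 },
];

// In the mongo shell this would be run as: db.users.aggregate(pipeline)
console.log(JSON.stringify(pipeline, null, 2));
```

Whether this helps depends on how selective the filter is: the fewer documents pass the $match, the less work the $group stage has to do.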
