在 MongoDB 中如何使用索引进行排序? [英] How does sorting with an index work in MongoDB?

查看:109
本文介绍了在 MongoDB 中如何使用索引进行排序?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想知道在 MongoDB 中如何使用索引进行排序.有一个一对 文章,但实际上并没有描述如何排序进行或时间复杂度.到目前为止,对 SO 和整个互联网的搜索都没有找到任何相关的信息.

I'm wondering how sorting with an index actually works in MongoDB. There are a couple articles in the MongoDB documentation, but they don't actually describe how the sort proceeds or the time complexity. Searches of SO and the interweb in general so far haven't turned up anything relevant.

假设集合中有 a 个文档,find() 子句匹配 b 个文档,返回的文档有 c 个限制、a >> b >> cc 是一些合适的大数,使得返回的集合不能适合内存 - 例如,假设有 100 万个文档.

Let's assume there are a documents in a collection, the find() clause matches b documents, there's a limit of c documents returned, a >> b >> c, and c is some suitably large number such that the returned set cannot fit in memory - let's say 1M documents, for example.

在操作开始时,存在需要排序的b个文档和一个大小为a的排序树索引,用于文档排序的特征.

At the start of the operation, there exist b documents that needs to be sorted and a sorted tree index of size a for the feature the documents will be sorted by.

我可以想象:

A) 按顺序遍历索引,并为每个 ObjectID 遍历 b 文档的列表.返回匹配项,直到达到 c.这将是 O(ab).

A) traverse the index in order, and for each ObjectID traverse the list of b documents. Return matches until c is reached. This would be O(ab).

B) 作为 A),但首先在 b 文档中构建 ObjectID 的哈希集.这是 O(a),但需要 O(b) 内存.

B) as A), but build a hashset of the ObjectIDs in the b documents first. This is O(a), but takes O(b) memory.

我试图考虑基于遍历一组 b 文档的排序,但似乎无法想出比 O(b log b),这并不比没有索引的排序好.

I've tried to consider sorts based on traversing the set of b documents, but can't seem to come up with anything faster than O(b log b), which is no better than sorting without an index.

我假设(但也许我错了)每种排序都不需要索引扫描,那么排序实际上是如何工作的?

I assume (but maybe I'm wrong) that every sort doesn't require an index scan, so how does the sort actually work?

更新:

凯文的回答和提供的链接大大缩小了问题的范围,但我想确认/澄清几点:

Kevin's answer and provided link narrow down the question a lot, but I'd like to confirm/clarify a few points:

  1. 据我所知,如果您想避免内存中排序,则不能对查询和排序使用不同的索引.当我阅读这个页面时,它看起来好像你可以(或者至少,它没有指定一种方式或另一种方式),但这似乎是不正确的.本质上,对文档进行排序是因为它们在查询期间按照索引的顺序进行查找,因此按照索引的顺序返回.对吗?
  2. 查询复合索引时,排序索引必须是复合索引中的第一个索引,查询是等式的索引除外.如果不是,则在内存中执行排序.对吗?
  3. 排序如何处理 $in$or 查询?例如,假设查询是

  1. As I understand it, you cannot use different indexes for the query and the sort if you want to avoid an in-memory sort. When I read this page it appeared as though you could (or at least, it didn't specify one way or the other), but that seems to be incorrect. Essentially, the documents are sorted because they're looked up in the order of the index during the query and therefore returned in the order of the index. Right?
  2. When querying on a compound index, the the sorting index must be the first index in the compound index, except for indexes where the query is an equality. If not, the sort is performed in memory. Right?
  3. How does sorting work with $in or $or queries? For example, assume the query is

{a: {$in: [4, 6, 2, 1, 3, 10]}, b: {$gt: 1, $lt: 6}}

... 并且在 ab 上有一个复合索引,按顺序排列.如果排序在 ab 上,排序将如何工作?$or 更加复杂,因为据我了解,$or 查询本质上被拆分为多个单独的查询.$or 查询是否总是内存中排序,至少对于合并单独查询的结果?

... and there's a compound index on a and b in that order. How would the sort work in the cases the sort is on a or b? $or is even more complicated since, as I understand it, $or queries are essentially split into multiple separate queries. Are $or queries always an in-memory sort, at least for merging the results of the separate queries?

推荐答案

MongoDB 中的索引存储在 B 树结构中,其中每个索引条目指向磁盘上的特定位置.使用 B 树结构也意味着 MongoDB 索引按排序顺序存储,始终按顺序遍历,并且 MongoDB 通过索引按排序顺序获取一系列文档的成本很低.

Indexes in MongoDB are stored in a B-tree structure, where each index entry points to a specific location on-disk. Using a B-tree structure also means that a MongoDB index is stored in a sorted order, always traversed in-order, and is cheap for MongoDB to fetch a series of documents in a sorted order via indexes.

更新:B 树结构适用于 MMAPv1 存储引擎,但 WiredTiger 存储引擎的实现略有不同(自 MongoDB 3.2 起默认).基本思想保持不变,以排序顺序遍历索引的成本很低.

Update: The B-tree structure is true for the MMAPv1 storage engine, but is implemented slightly differently by the WiredTiger storage engine (default since MongoDB 3.2). The basic idea remains the same, where it's cheap to traverse the index in a sorted order.

查询中的 SORT 阶段(即内存中排序)限制为 32MB 的内存使用.如果 SORT 阶段超过此限制,则查询将失败.这个限制可以通过利用索引的排序特性来避免,这样 MongoDB 就可以返回带有 sort() 参数的查询,而无需执行内存中的排序.

A SORT stage (i.e. in-memory sort) in a query is limited to 32MB of memory use. A query will fail if the SORT stage exceeds this limit. This limit can be sidestepped by utilizing the sorted nature of indexes, so that MongoDB can return a query with a sort() parameter without performing an in-memory sort.

假设查询的形状如下:

    db.a.find({b:{$gt:100}, c:{$gt:200}}).sort(...)

集合 a 的索引为:

    db.a.createIndex({b:1,c:1})

在查询中指定 sort() 阶段时,有两种可能的情况:

There are two possible scenarios when a sort() stage is specified in the query:

1.MongoDB 不能使用索引的排序特性,必须执行内存中的SORT 阶段.

1. MongoDB cannot use the sorted nature of the index and must perform an in-memory SORT stage.

这是查询不能使用索引前缀"时的结果.例如:

This is the outcome if the query cannot use the "index prefix". For example:

    db.a.find({b:{$gt:100}, c:{$gt:200}}).sort({c:1})

在上面的查询中,索引 {b:1,c:1} 可用于:

In the query above, the index {b:1,c:1} can be used to:

  • 为查询的 {b:{$gt:100}} 部分匹配 b 大于 100 的文档.
  • 但是,不能保证返回的文档按照 c 进行排序.
  • Match documents having b greater than 100 for the {b:{$gt:100}} portion of the query.
  • However, there is no guarantee that the returned documents are sorted in terms of c.

因此,MongoDB别无选择,只能执行内存中排序.此查询的 explain() 输出将有一个 SORT 阶段.此 SORT 阶段将限制为 32MB 的内存使用.

Therefore, MongoDB has no choice but to perform an in-memory sort. The explain() output of this query will have a SORT stage. This SORT stage would be limited to 32MB of memory use.

2.MongoDB 可以使用索引的排序特性.

这是查询使用的结果:

  • 与索引顺序匹配的排序键,以及
  • 指定与索引相同的顺序(即索引 {b:1,c:1} 可用于 sort({b:1,c:1})sort({b:-1,c:-1}) 但不是 sort({b:1,c:-1}))
  • Sort keys that matches the order of the index, and
  • Specifies the same ordering as the index (i.e. the index {b:1,c:1} can be used for sort({b:1,c:1}) or sort({b:-1,c:-1}) but not sort({b:1,c:-1}))

例如:

    db.a.find({b:{$gt:100}, c:{$gt:200}}).sort({b:1})

在上面的查询中,索引 {b:1,c:1} 可用于:

In the query above, the index {b:1,c:1} can be used to:

  • 为查询的 {b:{$gt:100}} 部分匹配 b 大于 100 的文档.
  • 在这种情况下,MongoDB 可以保证返回的文档按照 b 进行排序.
  • Match documents having b greater than 100 for the {b:{$gt:100}} portion of the query.
  • In this case, MongoDB can guarantee that the returned documents are sorted in terms of b.

上述查询的 explain() 输出将具有 SORT 阶段.此外,带有和不带有 sort() 的查询的 explain() 输出是相同的.本质上,我们免费获得 sort().

The explain() output of the query above will not have a SORT stage. Also, the explain() output of the query with and without sort() are identical. In essence, we are getting the sort() for free.

了解这个主题的一个有价值的资源是优化 MongoDB 复合索引.请注意,这篇博文写于 2012 年.虽然某些术语可能已经过时,但博文的技术性仍然相关.

A worthwhile resource to understand this subject is Optimizing MongoDB Compound Indexes. Please note that this blog post was written way back in 2012. Although some of the terminology may be outdated, the technicality of the post is still relevant.

后续问题更新

  1. MongoDB 使用大多数查询仅使用一个索引.例如,为了避免查询中的内存SORT 阶段

  1. MongoDB uses only one index for most queries. So for example, to avoid an in-memory SORT stage in the query

db.a.find({a:1}).sort({b:1})

索引必须同时覆盖ab两个字段;例如需要一个复合索引,例如 {a:1,b:1}.您不能有两个单独的索引 {a:1}{b:1},并且期望 {a:1} 索引是用于相等部分,{b:1} 索引用于排序部分.在这种情况下,MongoDB 将选择两个索引之一.

the index must cover both a and b fields at the same time; e.g. a compound index such as {a:1,b:1} is required. You cannot have two separate indexes {a:1} and {b:1}, and expect the {a:1} index to be used for the equality part, and the {b:1} index to be used for the sort part. In this case, MongoDB will choose one of the two indexes.

因此,对结果进行排序是正确的,因为它们是按照索引的顺序查找和返回的.

Therefore, it is correct that the results are sorted because they are looked up and returned in the order of the index.

为了避免使用复合索引进行内存中排序,索引的第一部分必须满足查询的相等部分第二部分必须迎合查询的排序部分(如上面(1)的解释所示).

To avoid having an in-memory sort using a compound index, the first part of the index must cater to the equality part of the query, and the second part must cater to the sort part of the query (as shown in the explanation for (1) above).

如果您有这样的查询:

db.a.find({}).sort({a:1})

索引 {a:1,b:1} 可用于排序部分(因为您基本上返回整个集合).如果您的查询如下所示:

the index {a:1,b:1} can be used for the sort part (since you're basically returning the whole collection). And if your query looks like this:

db.a.find({a:1}).sort({b:1})

相同的索引 {a:1,b:1} 也可以用于查询的两个部分.还有:

the same index {a:1,b:1} can also be used for both parts of the query. Also:

db.a.find({a:1,b:1})

也可以使用相同的索引{a:1,b:1}

注意这里的模式:find() 后跟 sort() 参数遵循索引顺序 {a:1,b:1}.因此复合索引必须按等式->排序排序.

Notice the pattern here: the find() followed by sort() parameters follow the index order {a:1,b:1}. Therefore a compound index must be ordered by equality -> sort.

关于不同类型排序的更新

如果一个字段在文档之间具有不同的类型(例如,如果 a 在一个文档中是字符串,在其他文档中是数字,在另一个文档中是布尔值),排序如何进行?

If a field has different types between documents (e.g. if a is string in one document, number in others, boolean in yet another), how do the sort proceed?

答案是MongoDB BSON 类型比较顺序.解释一下手册页,顺序是:

The answer is MongoDB BSON type comparison order. To paraphrase the manual page, the order is:

  1. MinKey(内部类型)
  2. 数字(整数、长整数、双精度数、小数)
  3. 符号、字符串
  4. 对象
  5. 数组
  6. BinData
  7. 对象 ID
  8. 布尔值
  9. 日期
  10. 时间戳
  11. 正则表达式
  12. MaxKey(内部类型)

所以从上面使用升序的例子来看,包含数字的文档将首先出现,然后是字符串,然后是布尔值.

So from the example above using ascending order, documents containing numbers will appear first, then strings, then boolean.

这篇关于在 MongoDB 中如何使用索引进行排序?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆