MongoDB indexes covering missing values

Problem description

I'd like to perform efficient operations of this form with MongoDB:

db.getCollection('x').find({a:{$ne:null}})

My understanding is that an index on a will not include documents that are missing the field a. So queries of the form {a:{$ne:null}} need to scan for those documents (i.e. they can't rely solely on the index to find all the matching documents).

I'm considering a MongoDB feature request (if one hasn't been submitted already) to allow indices to optionally include documents with missing values. I'm wondering:

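  1. Is there some way to speed up the query above in current MongoDB versions? Note that simply always adding a value for the field is a fine answer, but it's not possible in my situation.
  2. Is this a sensible MongoDB feature request? I don't know much about how indexes are implemented, but as far as I can tell it seems like it should be possible (although not every index would need it, only certain indexes, at the programmer's discretion).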

I know there are a lot of questions here about indices and "null" (the null value versus a missing value, etc.), but I spent a bit of time and couldn't find a direct answer to this question.

As a real example, I have a collection with ~80 million documents. About 1,000 of those documents are missing the field a. I'd like to be able to iterate over those documents that are missing a (in any order). One workaround is to make sure they're never missing a and just set it to -1 or some other particular value. That seems a bit silly to me - there should be a way to have MongoDB do that for me under the hood.

Recommended answer

There are two relevant features of an index that permit fast lookups (compared to performing a collection scan):

  • Ordering of values
  • Compactness of the indexed values

If you have values a and b, you can say that a comes before b, lexicographically. If you have documents {a: 2, b: 5} and {b: 4, a: 3}, there is no single ordering of these documents that would satisfy typical queries. For example, if you want all of the a values in order you would expect 2, 3, but if you want the b values in order you would expect 4, 5, which requires the document order to be reversed.
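
The point about conflicting orderings can be checked with a small sketch in plain JavaScript (no database involved). The helper name orderBy is made up for this illustration; the two documents are the ones from the paragraph above.

```javascript
// The two documents from the example above.
const docs = [
  { a: 2, b: 5 },
  { b: 4, a: 3 },
];

// Order the documents by one field, ascending (copying so `docs` is untouched).
const orderBy = (field) => [...docs].sort((x, y) => x[field] - y[field]);

const byA = orderBy("a").map((d) => d.a); // values of a, in a-order
const byB = orderBy("b").map((d) => d.b); // values of b, in b-order

console.log(byA); // [ 2, 3 ]
console.log(byB); // [ 4, 5 ]

// The first document differs between the two orderings,
// so no single document order serves both queries.
console.log(orderBy("a")[0] === orderBy("b")[0]); // false
```

Each index keeps its own ordering for exactly this reason: the order that suits a is the reverse of the order that suits b.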

When a database stores the index on disk, the values are stored in index order (whatever that might be for the particular index, e.g. collation affects this). Generally there is no single ordering that is usable for collection documents overall, hence collection documents are unordered.

When you query by index, you take the value being searched and essentially execute a binary search using the index because the data in the index is sorted.
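
As a rough illustration of that lookup, here is a minimal sketch that treats an index as a flat sorted array of [value, documentId] pairs. This is a simplification: real databases typically use tree structures such as B-trees rather than a flat array, and the names indexOnA and lookup, as well as the document ids, are invented for the example.

```javascript
// Hypothetical index on field `a`: [value, documentId] pairs kept sorted by value.
const indexOnA = [
  [2, "doc7"],
  [3, "doc2"],
  [5, "doc9"],
  [8, "doc1"],
  [13, "doc4"],
];

// Classic binary search; returns the document id for `target`, or null if absent.
// `steps` counts how many index entries were inspected along the way.
function lookup(index, target) {
  let lo = 0;
  let hi = index.length - 1;
  let steps = 0;
  while (lo <= hi) {
    const mid = Math.floor((lo + hi) / 2);
    steps += 1;
    const [value, id] = index[mid];
    if (value === target) return { id, steps };
    if (value < target) lo = mid + 1;
    else hi = mid - 1;
  }
  return { id: null, steps };
}

console.log(lookup(indexOnA, 8)); // found after inspecting 2 entries
console.log(lookup(indexOnA, 4)); // not present: id is null
```

The number of inspected entries grows with the logarithm of the index size, which is what makes the sorted order valuable.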

The second reason to use the index is that, if you are scanning the collection, the entire document typically needs to be retrieved from disk and skipped over for each document. If you have a 100 GB collection and you are doing a scan, you might need to skip over 100 GB of data. If the same collection has a 100 MB index on some field (because the index only stores the values in that field and not the entire document's worth of data), and the database performs a complete index scan, it only has to traverse 100 MB of data.
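
The size difference can be made concrete with a toy model. The sketch below assumes a made-up collection in which each document has a small field a plus a large payload, and it uses serialized JSON length as a crude stand-in for bytes on disk. None of this reflects MongoDB's actual storage format.

```javascript
// Hypothetical collection: each document carries a small indexed field `a`
// plus a large payload that a query on `a` never needs to look at.
const collection = Array.from({ length: 1000 }, (_, i) => ({
  _id: i,
  a: i % 50,
  payload: "x".repeat(1000),
}));

// Toy index on `a`: only the indexed value and a pointer (_id) to the document.
const indexEntries = collection
  .map((doc) => [doc.a, doc._id])
  .sort((x, y) => x[0] - y[0]);

// Use serialized JSON length as a rough proxy for the amount of data to read.
const collectionSize = JSON.stringify(collection).length;
const indexSize = JSON.stringify(indexEntries).length;

console.log({ collectionSize, indexSize });
console.log(`index is ~${Math.round(collectionSize / indexSize)}x smaller`);
```

A full pass over indexEntries touches far less data than a full pass over collection, which is the same effect as the 100 MB versus 100 GB comparison above.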

Now, to your question about storing lack of values in an index.

From the index's standpoint, the "lack of a value" in different documents is the same value. You lose the ability to do binary search when all of your values are identical. So if you are looking for that "lack of value" document, the index will give you back all of the documents in the collection that lack the value, and then you have to do a scan through them anyway to filter by whatever other conditions you have. Since this generally produces poor selectivity, databases don't bother with the index and do a collection scan in the first place.
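
A small sketch of that argument, again in plain JavaScript. It assumes, purely for illustration, a toy index that files every document lacking a under the single key null. The documents, the status field, and all variable names are hypothetical.

```javascript
// Hypothetical documents; some lack field `a` entirely.
const documents = [
  { _id: 1, a: 7, status: "open" },
  { _id: 2, status: "closed" },
  { _id: 3, a: 4, status: "open" },
  { _id: 4, status: "open" },
  { _id: 5, status: "closed" },
];

// Toy index on `a`: map each key to the ids of the documents that carry it.
// Assumption for this sketch: a missing field is stored under the single key null.
const index = new Map();
for (const doc of documents) {
  const key = doc.a === undefined ? null : doc.a;
  if (!index.has(key)) index.set(key, []);
  index.get(key).push(doc._id);
}

// Looking up "missing" returns every document lacking `a` in one bucket...
const missingIds = index.get(null);

// ...and the index cannot narrow it further, so any other condition
// (here status === "open") means fetching and checking each document.
const byId = new Map(documents.map((d) => [d._id, d]));
const openAndMissing = missingIds
  .map((id) => byId.get(id))
  .filter((doc) => doc.status === "open");

console.log(missingIds); // [ 2, 4, 5 ]
console.log(openAndMissing); // [ { _id: 4, status: 'open' } ]
```

Because every "missing" entry shares one key, the bucket is only as selective as the fraction of documents that lack the field.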

And, most likely you want some other fields out of your query, not the field that doesn't have a value. So now you want the index to store complete documents, which defeats the point of compactness.
