MongoDB find performance: single compound index vs. two single-field indexes


Question

I'm looking for advice on which indexing strategy to use in MongoDB 3.4.

Let's suppose we have a people collection whose documents have the following shape:

{
    _id: 10,
    name: "Bob",
    age: 32,
    profession: "Hacker"
}

Let's imagine that a web API to query the collection is exposed and that the only possible filters are by name or by age.
A sample call to the API will be something like: http://myAwesomeWebSite/people?name="Bob"&age=25

Such a call will be translated into the following query: db.people.find({name: "Bob", age: 25}).

To better clarify our scenario, consider that:

  • the field name is already in our documents and we already have an index on that field
  • we are going to add the new field age because of some new features of our application
  • the database is only accessible via the web API mentioned above, and the most important requirement is to expose a very fast web API
  • all calls to the web API will apply a filter on both fields, name and age (put another way, all calls to the web API will follow the same pattern, which is the one shown above)

That said, we have to decide which of the following indexing options offers the best performance:

  • one compound index: {name: 1, age: 1}
  • two single-field indexes: {name: 1} and {age: 1}
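As a rough sketch, the two options could be set up side by side as follows. The collection names people_compound and people_single are hypothetical (the question only says that two collections with identical documents were used), and db is assumed to behave like the mongo shell database handle.

```javascript
// Sketch only: `db` is assumed to be the mongo shell database handle.
// The collection names below are hypothetical, chosen for illustration.
function createCandidateIndexes(db) {
  // Option A: one compound index, name first, then age.
  db.getCollection("people_compound").createIndex({ name: 1, age: 1 });

  // Option B: two independent single-field indexes on the same data.
  const single = db.getCollection("people_single");
  single.createIndex({ name: 1 });
  single.createIndex({ age: 1 });
}
```

Keeping the two options in separate collections makes it possible to run the same query against each one and compare the plans without dropping and rebuilding indexes between runs.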

According to some simple tests, the single compound index seems to perform much better than the two single-field indexes.

When executing a single query via the mongo shell, the explain() method suggests that with a single compound index you can query the database nearly ten times faster than with two single-field indexes.
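When comparing explain() results, it usually helps to look at more than the elapsed time. The sketch below pulls the plan stages and the examined-key and examined-document counters out of an explain document. It assumes the classic output shape produced by explain("executionStats") (queryPlanner.winningPlan and executionStats); the helper names are hypothetical, and the sample object is hand-written for illustration, not output from the tests described in the question.

```javascript
// Walk the winning plan and list its stages, e.g. FETCH -> IXSCAN(name_1_age_1).
// An index intersection would typically show up as AND_SORTED or AND_HASH
// with several IXSCAN children under `inputStages`.
function collectStages(plan, out = []) {
  if (!plan) return out;
  out.push(plan.indexName ? `${plan.stage}(${plan.indexName})` : plan.stage);
  if (plan.inputStage) collectStages(plan.inputStage, out);
  (plan.inputStages || []).forEach((child) => collectStages(child, out));
  return out;
}

// Reduce an explain("executionStats") document to the numbers worth comparing.
function summarizeExplain(explainDoc) {
  const stats = explainDoc.executionStats;
  return {
    stages: collectStages(explainDoc.queryPlanner.winningPlan),
    nReturned: stats.nReturned,
    keysExamined: stats.totalKeysExamined,
    docsExamined: stats.totalDocsExamined,
    millis: stats.executionTimeMillis,
  };
}

// Hand-written sample document (illustrative values only).
const sampleExplain = {
  queryPlanner: {
    winningPlan: {
      stage: "FETCH",
      inputStage: { stage: "IXSCAN", indexName: "name_1_age_1" },
    },
  },
  executionStats: {
    nReturned: 1,
    totalKeysExamined: 1,
    totalDocsExamined: 1,
    executionTimeMillis: 0,
  },
};

console.log(summarizeExplain(sampleExplain));
```

In the mongo shell, the input would come from db.people.find({name: "Bob", age: 25}).explain("executionStats"). Comparing keysExamined and docsExamined against nReturned tends to be more stable than a single timing, which can be affected by caching.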

This difference seems less dramatic in a more realistic scenario where, instead of executing a single query via the mongo shell, multiple calls are made to two different URLs of a Node.js web application. Both URLs execute a query against the database and return the fetched data as a JSON array, one using a collection with the single compound index and the other using a collection with two single-field indexes (both collections contain exactly the same documents).
In this test the single compound index still seems to be the best choice in terms of performance, but this time the difference is less marked.

Based on the test results, we are considering using the single compound index approach.

Does anyone have experience with this topic? Are we missing any important consideration (maybe some disadvantage of big compound indexes)?

Recommended answer

Given a plain standard query (with no limit() or sort() or anything fancy applied) that has a filter condition on two fields (such as name and age in your example), in order to find the resulting documents, MongoDB will do one of the following:

  1. do a full collection scan (read every document in the entire collection, parse the BSON, find the values in question, test them against the input and return or discard each document): this is very I/O intensive and hence slow.
  2. use one index that holds one of the fields (use the index tree to locate the relevant subset of documents, then scan them): depending on your data distribution and index selectivity, this can be very fast or provide barely any benefit (imagine an index on age in a dataset of millions of people between 30 and 40 years old: every lookup would still yield a very large number of documents).
  3. use two indexes that together contain both fields in question (load both indexes, perform key lookups, then calculate the intersection of the results): again, depending on your data distribution, this may or may not give you better performance. It should, however, in most cases be faster than #2. I would, however, be surprised if it was really 10x slower than #4 (as you mentioned).
  4. use a compound index (two successive key lookups lead directly to the required documents): this will be the fastest option of all, given that it requires the fewest and cheapest operations to get to the right documents. To ensure the greatest level of reuse (not performance, which is not affected by this), you should in general put the most selective field first, so in your case probably name and not age, given that a lot of people will have the same age (low selectivity) compared to name (higher selectivity). But that choice also depends on your concrete scenario and the queries you intend to run against your database. There is a pretty good article on the web about how best to define a compound index, taking various aspects of your specific situation into account: https://emptysqua.re/blog/optimizing-mongodb-compound-indexes
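The four strategies above can be sketched with a toy in-memory model that counts how many index keys and documents each one has to touch. This is only an illustration under stated assumptions: plain Maps stand in for B-tree indexes, the data set is synthetic, all function names are hypothetical, and none of this reflects how MongoDB's query planner actually chooses or executes a plan.

```javascript
// Synthetic data: 1000 distinct names and 10 distinct ages, combined so that
// each (name, age) pair occurs equally often.
function buildPeople(n) {
  const people = [];
  for (let i = 0; i < n; i++) {
    people.push({ _id: i, name: `name${i % 1000}`, age: 30 + (Math.floor(i / 1000) % 10) });
  }
  return people;
}

// A Map from key to document positions stands in for an index.
function buildIndex(docs, keyOf) {
  const index = new Map();
  docs.forEach((doc, pos) => {
    const key = keyOf(doc);
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(pos);
  });
  return index;
}

function runStrategies(docs, name, age) {
  const matches = (d) => d.name === name && d.age === age;
  const byName = buildIndex(docs, (d) => d.name);
  const byAge = buildIndex(docs, (d) => d.age);
  const byNameAge = buildIndex(docs, (d) => `${d.name}|${d.age}`);

  // 1. Collection scan: every document is read and tested.
  const scan = { keysExamined: 0, docsExamined: docs.length, returned: docs.filter(matches).length };

  // 2. Single index on name: fetch all documents with that name, filter by age.
  const nameHits = byName.get(name) || [];
  const single = {
    keysExamined: nameHits.length,
    docsExamined: nameHits.length,
    returned: nameHits.filter((pos) => matches(docs[pos])).length,
  };

  // 3. Intersection: read both key ranges, fetch only the common positions.
  const ageHits = new Set(byAge.get(age) || []);
  const common = nameHits.filter((pos) => ageHits.has(pos));
  const intersection = {
    keysExamined: nameHits.length + ageHits.size,
    docsExamined: common.length,
    returned: common.length,
  };

  // 4. Compound index: one key range points directly at the matching documents.
  const compoundHits = byNameAge.get(`${name}|${age}`) || [];
  const compound = {
    keysExamined: compoundHits.length,
    docsExamined: compoundHits.length,
    returned: compoundHits.length,
  };

  return { scan, single, intersection, compound };
}

console.log(runStrategies(buildPeople(100000), "name7", 33));
```

With this particular distribution the intersection fetches few documents but has to walk the large, unselective age key range first, which is one way it can end up well behind the compound index. With different data the balance between options 2 and 3 can shift, so the counts are best read as an intuition aid rather than a prediction.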

Other aspects to consider: index updates come at a price, since every index has to be maintained on writes. However, if all you care about is raw read speed and you only have a few updates every now and again, then you should go for more/bigger indexes.

And last but not least (!), the well-worn bottom-line advice: profile the hell out of your system using real data and, ideally, realistic load scenarios. Also keep measuring as your data and system change over time.
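When profiling under load, a single average can hide slow outliers, so it may be worth summarizing response times by percentile. The following is a minimal sketch using the nearest-rank method; the function names are hypothetical and the sample latencies are made up for illustration, not measurements from the scenario above.

```javascript
// Nearest-rank percentile over an ascending array of latencies (milliseconds).
function percentile(sortedMs, p) {
  const rank = Math.ceil((p / 100) * sortedMs.length);
  return sortedMs[Math.max(rank, 1) - 1];
}

// Summarize a batch of request latencies collected during a load test.
function summarizeLatencies(samplesMs) {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  return {
    count: sorted.length,
    median: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    max: sorted[sorted.length - 1],
  };
}

// Made-up sample values, for illustration only.
const sampleLatenciesMs = [12, 15, 11, 40, 13, 14, 12, 90, 13, 12];
console.log(summarizeLatencies(sampleLatenciesMs));
```

Collecting one such summary per indexing strategy, on the same data and under the same load, gives a fairer comparison than timing a single query.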

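Further reading: https://docs.mongodb.com/manual/core/query-optimization/index.html

https://dba.stackexchange.com/questions/158240/mongodb-index-intersection-does-not-eliminate-the-need-for-creating-compound-in

Index intersection vs. compound index?

MongoDB compound index vs. index intersection

How does the order of compound indexes matter in MongoDB performance-wise?

In MongoDB, I am using a big query; how should I create a compound index or single indexes so that my response time improves?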
