MongoDB find performance: single compound index vs. two single-field indexes


Question

I'm looking for advice on which indexing strategy to use in MongoDB 3.4.

Suppose we have a people collection whose documents have the following shape:

{
    _id: 10,
    name: "Bob",
    age: 32,
    profession: "Hacker"
}

Let's imagine that a web API for querying the collection is exposed and that the only possible filters are by name or by age.
A sample call to the API looks something like: http://myAwesomeWebSite/people?name="Bob"&age=25

Such a call is translated into the following query: db.people.find({name: "Bob", age: 25}).

To better clarify our scenario, consider that:

  • the field name is already in our documents and we already have an index on that field
  • we are going to add the new field age because of some new features of our application
  • the database is only accessible via the web API mentioned above, and the most important requirement is to expose a very fast web API
  • all calls to the web API will apply a filter on both fields, name and age (put another way, all calls to the web API will follow the same pattern, which is the one shown above)

That said, we have to decide which of the following indexing options offers the best performance:

  • One compound index: {name: 1, age: 1}
  • Two single-field indexes: {name: 1} and {age: 1}
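A minimal sketch of how the two candidate layouts could be set up for a side-by-side test, written as a plain function so it can be pasted into the shell. The collection names peopleCompound and peopleSingle and the function name are hypothetical; `db` is assumed to be the shell's database handle, which exposes getCollection() and createIndex().

```javascript
// Sketch only: `db` is assumed to be a mongo shell database handle.
// The collection names below are hypothetical; both collections are
// expected to hold exactly the same documents.
function createCandidateIndexes(db) {
  const created = [];

  // Option 1: one compound index covering both filter fields.
  created.push(
    db.getCollection("peopleCompound").createIndex({ name: 1, age: 1 })
  );

  // Option 2: two independent single-field indexes.
  const single = db.getCollection("peopleSingle");
  created.push(single.createIndex({ name: 1 }));
  created.push(single.createIndex({ age: 1 }));

  // Return whatever createIndex() reported for each call.
  return created;
}
```

Keeping the two layouts in separate collections means the query planner can never mix them, so each measurement reflects exactly one strategy.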

According to some simple tests, the single compound index seems to perform much better than the two single-field indexes.

When executing a single query via the mongo shell, the explain() method suggests that with a single compound index you can query the database nearly ten times faster than with two single-field indexes.
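To compare explain() results from the two layouts, a small helper like the following could condense each result. It is only a sketch: the function names are hypothetical, and it assumes the plan shape of an unsharded deployment, where queryPlanner.winningPlan nests its children under inputStage or inputStages and executionStats carries nReturned, totalKeysExamined, totalDocsExamined and executionTimeMillis.

```javascript
// Walk a plan tree and collect the stage names and index names it uses.
function collectStages(plan, acc = { stages: [], indexes: [] }) {
  if (!plan) return acc;
  acc.stages.push(plan.stage);
  if (plan.indexName) acc.indexes.push(plan.indexName);
  if (plan.inputStage) collectStages(plan.inputStage, acc);
  for (const child of plan.inputStages || []) collectStages(child, acc);
  return acc;
}

// Condense an explain("executionStats") result into a few comparable numbers.
// Assumption: `explainResult` has the unsharded queryPlanner/executionStats shape.
function summarizeExplain(explainResult) {
  const stats = explainResult.executionStats || {};
  const { stages, indexes } = collectStages(
    explainResult.queryPlanner.winningPlan
  );
  return {
    stages,
    indexes,
    nReturned: stats.nReturned,
    keysExamined: stats.totalKeysExamined,
    docsExamined: stats.totalDocsExamined,
    millis: stats.executionTimeMillis,
  };
}
```

In the shell this could be called as summarizeExplain(db.people.find({name: "Bob", age: 25}).explain("executionStats")). The keys and documents examined are usually a more stable basis for comparison than the elapsed milliseconds of a single run.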

This difference seems to be less dramatic in a more realistic scenario where, instead of executing a single query via the mongo shell, multiple calls are made to two different URLs of a Node.js web application. Both URLs execute a query against the database and return the fetched data as a JSON array, one using a collection with the single compound index and the other using a collection with two single-field indexes (both collections contain exactly the same documents).
In this test the single compound index still seems to be the best choice in terms of performance, but this time the difference is less marked.
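A rough sketch of how such an end-to-end comparison could be scripted in Node.js. The helper names are hypothetical, and `operation` stands for whatever issues one request (for example, an HTTP call to one of the two URLs). Reporting a median and a 95th percentile rather than a single timing makes the two variants easier to compare.

```javascript
// Summarize latency samples (in milliseconds) with simple rank-based picks.
function summarizeLatencies(samplesMs) {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const pick = (q) =>
    sorted[Math.min(sorted.length - 1, Math.floor(q * sorted.length))];
  return { runs: sorted.length, medianMs: pick(0.5), p95Ms: pick(0.95) };
}

// Run an async operation `runs` times, one after another, and time each call.
// `operation` is a caller-supplied function, e.g. one request to a test URL.
async function measureLatency(operation, runs = 100) {
  const samples = [];
  for (let i = 0; i < runs; i++) {
    const start = process.hrtime.bigint();
    await operation();
    // Convert elapsed nanoseconds to milliseconds.
    samples.push(Number(process.hrtime.bigint() - start) / 1e6);
  }
  return summarizeLatencies(samples);
}
```

Running the same number of calls against each URL, ideally after a warm-up pass, keeps caching effects from favoring whichever variant happens to be measured second.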

Based on the test results, we are considering using the single compound index approach.

Does anyone have experience with this topic? Are we missing any important considerations (maybe some disadvantage of big compound indexes)?

Recommended answer

Given a plain standard query (with no limit() or sort() or anything fancy applied) that has a filter condition on two fields (such as name and age in your example), MongoDB will find the resulting documents in one of the following ways:

  1. do a full collection scan (read every document in the entire collection, parse the BSON, find the values in question, test them against the input and return/discard each document): This is extremely I/O intensive and hence slow.
  2. use one index that holds one of the fields (use the index tree to locate the relevant subset of documents, followed by a scan of them): Depending on your data distribution/index selectivity this can be very fast or provide barely any benefit (imagine an index on age in a dataset of millions of people between 30 and 40 years old: every lookup would still yield a huge number of documents).
  3. use two indexes that together contain both fields in question (load both indexes, perform key lookups, then calculate the intersection of the results): Again, depending on your data distribution, this may or may not give you great(er) performance. It should, however, in most cases be faster than #2. I would be surprised, though, if it were really 10x slower than #4 (as you mentioned).
  4. use a compound index (two subsequent key lookups lead directly to the required documents): This will be the fastest option of all, given that it requires the fewest and cheapest operations to get to the right documents. To ensure the greatest level of reuse (not performance, which is not affected by this), you should in general put the most selective field first, so in your case probably name and not age, given that a lot of people will have the same age (low selectivity) compared to name (higher selectivity). But that choice also depends on your concrete scenario and the queries you intend to run against your database. There is a pretty good article on the web about how best to define a compound index, taking various aspects of your specific situation into account: https://emptysqua.re/blog/optimizing-mongodb-compound-indexes
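The four strategies above can be illustrated with a toy in-memory model. This is only a sketch for building intuition, not a description of MongoDB's internals: plain Map objects stand in for index trees, and the function names are hypothetical. It counts how many documents each strategy would have to fetch and inspect for a {name, age} filter.

```javascript
// Toy model: a Map from key to the positions of matching documents
// stands in for an index. This is NOT how MongoDB stores indexes.
function buildIndex(docs, keyFn) {
  const index = new Map();
  docs.forEach((doc, pos) => {
    const key = keyFn(doc);
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(pos);
  });
  return index;
}

// Count the documents each strategy must fetch for the filter {name, age}.
function compareStrategies(docs, name, age) {
  const byName = buildIndex(docs, (d) => d.name);
  const byAge = buildIndex(docs, (d) => d.age);
  const byNameAge = buildIndex(docs, (d) => JSON.stringify([d.name, d.age]));

  const nameHits = byName.get(name) || [];
  const ageHits = new Set(byAge.get(age) || []);
  const intersection = nameHits.filter((pos) => ageHits.has(pos));
  const compoundHits = byNameAge.get(JSON.stringify([name, age])) || [];

  return {
    collectionScan: docs.length, // 1. every document is inspected
    singleIndexOnName: nameHits.length, // 2. fetch name matches, then filter by age
    indexIntersection: intersection.length, // 3. fetch only the intersection ...
    intersectionKeysRead: nameHits.length + ageHits.size, // ... after reading both key ranges
    compoundIndex: compoundHits.length, // 4. index entries already match both fields
  };
}
```

Strategies 3 and 4 fetch the same documents in this model, but the intersection first has to read every matching entry from both indexes, and with a low-selectivity field such as age that extra key-reading work can dominate.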

Other aspects to consider: index updates come at a price, since every write has to maintain each affected index. However, if all you care about is raw read speed and you only have a few updates every now and again, then you should go for more/bigger indexes.

And last but not least (!), the well-worn bottom-line advice: profile the hell out of your system using real data and perhaps even realistic load scenarios. And keep measuring as your data/system changes over time.

Further reading: https://docs.mongodb.com/manual/core/query-optimization/index.html

https://dba.stackexchange.com/questions/158240/mongodb-index-intersection-does-not-eminate-the-neide-for-necreated-compound-in

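Index intersection vs. compound index?

MongoDB compound index vs. index intersection

Does the order of fields in a compound index matter for MongoDB performance?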
