MongoDB querying performance for over 5 million records


Problem description

We've recently hit more than 2 million records in one of our main collections, and we've now started to suffer from major performance issues on that collection.

The documents in the collection have about 8 fields that you can filter on through the UI, and the results are supposed to be sorted by the timestamp field recording when the document was processed.

I've added several compound indexes with the filtered fields and the timestamp, e.g.:

db.events.ensureIndex({somefield: 1, timestamp:-1})

I've also added a couple of indexes for using several filters at once, hoping to achieve better performance. But some filters still take an awfully long time to perform.

I've made sure, using explain(), that the queries do use the indexes I've created, but performance is still not good enough.

I was wondering if sharding is the way to go now... but we will soon start to have about 1 million new records per day in that collection, so I'm not sure it will scale well.

Example query:

> db.audit.find({'userAgent.deviceType': 'MOBILE', 'user.userName': {$in: ['nickey@acme.com']}}).sort({timestamp: -1}).limit(25).explain()
{
        "cursor" : "BtreeCursor user.userName_1_timestamp_-1",
        "isMultiKey" : false,
        "n" : 0,
        "nscannedObjects" : 30060,
        "nscanned" : 30060,
        "nscannedObjectsAllPlans" : 120241,
        "nscannedAllPlans" : 120241,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 1,
        "nChunkSkips" : 0,
        "millis" : 26495,
        "indexBounds" : {
                "user.userName" : [
                        [
                                "nickey@acme.com",
                                "nickey@acme.com"
                        ]
                ],
                "timestamp" : [
                        [
                                {
                                        "$maxElement" : 1
                                },
                                {
                                        "$minElement" : 1
                                }
                        ]
                ]
        },
        "server" : "yarin:27017"
}

Please note that deviceType has only 2 values in my collection.

Recommended answer

This is searching for a needle in a haystack. We'd need some explain() output for the queries that don't perform well. Unfortunately, even that would fix the problem only for that particular query, so here's a strategy on how to approach this:

  1. Ensure it's not because of insufficient RAM and excessive paging.
  2. Enable the DB profiler (using db.setProfilingLevel(1, timeout), where timeout is the threshold in milliseconds that a query or command may take; anything slower will be logged).
  3. Inspect the slow queries in db.system.profile and run them manually using explain() (see the shell sketch after this list).
  4. Try to identify the slow operations in the explain() output, such as scanAndOrder or a large nscanned.
  5. Reason about the selectivity of the query and whether it's possible to improve the query with an index at all. If not, consider disallowing the filter setting for the end user, or showing a warning dialog that the operation might be slow.
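
A minimal shell sketch of steps 2-4; the 100 ms threshold and the somefield value are placeholder assumptions, not values from the question:

// Log anything slower than 100 ms (placeholder threshold):
db.setProfilingLevel(1, 100)

// Later, list the slowest logged operations first:
db.system.profile.find().sort({millis: -1}).limit(10)

// Re-run a suspicious query manually with explain():
db.events.find({somefield: 'x'}).sort({timestamp: -1}).explain()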

A key problem is that you're apparently allowing your users to combine filters at will. Without index intersection, that will blow up the number of required indexes dramatically.
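
To put a rough number on "dramatically", assuming the 8 filterable fields from the question and the simplification that every non-empty filter combination gets its own compound index (ending in the timestamp for the sort):

// One index per non-empty subset of the 8 filter fields:
var indexesNeeded = Math.pow(2, 8) - 1   // = 255
// MongoDB allows at most 64 indexes per collection, so covering
// every filter combination with its own index cannot work.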

Also, blindly throwing an index at every possible query is a very bad strategy. It's important to structure the queries and make sure the indexed fields have sufficient selectivity.

Let's say you have a query for all users with status "active" and some other criteria. But of the 5 million users, 3 million are active and 2 million aren't, so over 5 million entries there are only two different values. Such an index doesn't usually help. It's better to search for the other criteria first, then scan the results. On average, when returning 100 documents, you'll have to scan 167 documents, which won't hurt performance too badly. But it's not that simple. If the primary criterion is the joined_at date of the user and the likelihood of users discontinuing use with time is high, you might end up having to scan thousands of documents before finding a hundred matches.
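
The 167 above is simple arithmetic, spelled out here with the numbers from the example:

// 3 of 5 million users are active, so a random document matches
// the status filter with probability 0.6:
var selectivity = 3000000 / 5000000      // = 0.6
// Expected documents scanned to return 100 matches:
var expectedScanned = 100 / selectivity  // = 166.7, i.e. about 167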

So the optimization depends very much on the data (not only its structure, but also the data itself), its internal correlations and your query patterns.

Things get worse when the data is too big for the RAM, because then having an index is great, but scanning (or even simply returning) the results might require fetching a lot of data from disk randomly, which takes a lot of time.

The best way to control this is to limit the number of different query types, disallow queries on low-selectivity information, and try to prevent random access to old data.
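
One way to enforce "a limited number of query types" is an application-side whitelist of filter combinations known to be backed by an index. This is only a sketch; the helper and the listed combinations are hypothetical, with field names borrowed from the question:

// Hypothetical whitelist of index-backed filter combinations:
var allowedFilterSets = [
    ['user.userName'],
    ['user.userName', 'userAgent.deviceType']
]
// Reject (or warn about) any filter combination not on the list:
function isAllowed(filterFields) {
    var key = filterFields.slice().sort().join(',')
    return allowedFilterSets.some(function (set) {
        return set.slice().sort().join(',') === key
    })
}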

If all else fails and you really need that much flexibility in filters, it might be worthwhile to consider a separate search DB that supports index intersections, fetch the mongo ids from there, and then get the results from mongo using $in. But that is fraught with its own perils.

-- Edit --

The explain you posted is a beautiful example of the problem with scanning low-selectivity fields. Apparently, there are a lot of documents for "nickey@acme.com". Now, finding those documents and sorting them descending by timestamp is pretty fast, because it's supported by high-selectivity indexes. Unfortunately, since there are only two device types, mongo needs to scan 30060 documents to find the first one that matches 'MOBILE'.

I assume this is some kind of web tracking, and the user's usage pattern makes the query slow (if he switched between mobile and web on a daily basis, the query would be fast).

Making this particular query faster could be done using a compound index that contains the device type, e.g. using

a) ensureIndex({'user.userName': 1, 'userAgent.deviceType': 1, 'timestamp': -1})

b) ensureIndex({'userAgent.deviceType': 1, 'user.userName': 1, 'timestamp': -1})
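
With either index in place, re-running the query with explain() should confirm the improvement. The following shows the expected outcome, not output from the actual system:

db.audit.find({'userAgent.deviceType': 'MOBILE',
               'user.userName': 'nickey@acme.com'})
        .sort({timestamp: -1}).limit(25).explain()
// The cursor should now name the new compound index, and nscanned
// should drop from 30060 to about the number of returned documents.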

Unfortunately, that means that queries like find({'user.userName': 'foo'}).sort({timestamp: -1}) can't use the same index anymore, so, as described, the number of indexes will grow very quickly.

I'm afraid there's no very good solution for this using mongodb at this time.
