MongoDB querying performance for over 5 million records

Question

We've recently hit more than 2 million records in one of our main collections, and we've started to suffer from major performance issues on that collection.

The documents in the collection have about 8 fields which you can filter by through the UI, and the results are supposed to be sorted by a timestamp field that records when the entry was processed.

I've added several compound indexes with the filtered fields and the timestamp, e.g.:

db.events.ensureIndex({somefield: 1, timestamp:-1})

I've also added a couple of indexes combining several filters at once, hoping to achieve better performance. But some filters still take an awfully long time to perform.
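
For example, an index combining two filter fields with the timestamp looks something like this (anotherfield is just a placeholder, like somefield above):

db.events.ensureIndex({somefield: 1, anotherfield: 1, timestamp: -1})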

I've made sure, using explain(), that the queries do use the indexes I've created, but performance is still not good enough.

I was wondering if sharding is the way to go now... but we will soon start to have about 1 million new records per day in that collection, so I'm not sure it will scale well.

Edit: query example:

> db.audit.find({'userAgent.deviceType': 'MOBILE', 'user.userName': {$in: ['nickey@acme.com']}}).sort({timestamp: -1}).limit(25).explain()
{
        "cursor" : "BtreeCursor user.userName_1_timestamp_-1",
        "isMultiKey" : false,
        "n" : 0,
        "nscannedObjects" : 30060,
        "nscanned" : 30060,
        "nscannedObjectsAllPlans" : 120241,
        "nscannedAllPlans" : 120241,
        "scanAndOrder" : false,
        "indexOnly" : false,
        "nYields" : 1,
        "nChunkSkips" : 0,
        "millis" : 26495,
        "indexBounds" : {
                "user.userName" : [
                        [
                                "nickey@acme.com",
                                "nickey@acme.com"
                        ]
                ],
                "timestamp" : [
                        [
                                {
                                        "$maxElement" : 1
                                },
                                {
                                        "$minElement" : 1
                                }
                        ]
                ]
        },
        "server" : "yarin:27017"
}

Please note that deviceType has only 2 values in my collection.

Answer

This is searching for a needle in a haystack. We'd need some explain() output for the queries that don't perform well. Unfortunately, even that would fix the problem only for that particular query, so here's a strategy on how to approach this:


  1. Ensure it's not because of insufficient RAM and excessive paging.
  2. Enable the DB profiler (using db.setProfilingLevel(1, timeout), where timeout is the threshold in milliseconds for how long a query or command may take; anything slower will be logged).
  3. Inspect the slow queries in db.system.profile and run them manually using explain() (see the sketch after this list).
  4. Try to identify the slow operations in the explain() output, such as scanAndOrder or a large nscanned, etc.
  5. Reason about the selectivity of the query and whether it's possible to improve the query using an index at all. If not, consider disallowing that filter setting for the end user, or show them a warning dialog that the operation might be slow.
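
A minimal sketch of steps 2 and 3 in the mongo shell (the 100 ms threshold and the example filter are just placeholders):

db.setProfilingLevel(1, 100)                                      // log any operation slower than 100 ms
db.system.profile.find().sort({millis: -1}).limit(10).pretty()    // inspect the slowest recorded operations first
db.audit.find({'userAgent.deviceType': 'MOBILE'}).sort({timestamp: -1}).explain()   // re-run a suspect query manually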

A key problem is that you're apparently allowing your users to combine filters at will. Without index intersection, that will blow up the number of required indexes dramatically.

Also, blindly throwing an index at every possible query is a very bad strategy. It's important to structure the queries and make sure the indexed fields have sufficient selectivity.

Let's say you have a query for all users with status "active" and some other criteria. But of the 5 million users, 3 million are active and 2 million aren't, so over 5 million entries there are only two distinct values. An index on such a field doesn't usually help. It's better to search on the other criteria first, then scan the results: with 3 of the 5 million users active, a scanned document matches the status filter with probability 0.6, so returning 100 documents means scanning roughly 100 / 0.6 ≈ 167 documents on average, which won't hurt performance too badly. But it's not that simple. If the primary criterion is the joined_at date of the user and the likelihood of users discontinuing use over time is high, you might end up having to scan thousands of documents before finding a hundred matches.
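
A minimal sketch of that approach in the mongo shell (the users collection and the concrete date are illustrative assumptions, not from the original setup):

db.users.ensureIndex({joined_at: -1})    // index only the selective criterion
db.users.find({status: 'active', joined_at: {$lt: ISODate('2014-01-01')}}).sort({joined_at: -1}).limit(100)   // status is filtered while scanning the index results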

So the optimization depends very much on the data (not only its structure, but also the data itself), its internal correlations and your query patterns.

Things get worse when the data is too big for the RAM, because then, having an index is great, but scanning (or even simply returning) the results might require fetching a lot of data from disk at random, which takes a lot of time.

The best way to control this is to limit the number of different query types, disallow queries on low selectivity information and try to prevent random access to old data.

If all else fails and if you really need that much flexibility in filters, it might be worthwhile to consider a separate search DB that supports index intersections, fetch the mongo ids from there and then get the results from mongo using $in. But that is fraught with its own perils.
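
A rough sketch of the mongo side of that pattern (the ObjectId values stand in for whatever ids the external search engine returns):

var ids = [ObjectId("5280abcd1234567890abcdef"), ObjectId("5280abcd1234567890abcdee")]   // ids fetched from the search DB
db.audit.find({_id: {$in: ids}}).sort({timestamp: -1})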

-- Edit --

The explain you posted is a beautiful example of the problem with scanning low-selectivity fields. Apparently, there are a lot of documents for "nickey@acme.com". Now, finding those documents and sorting them descending by timestamp is pretty fast, because it's supported by high-selectivity indexes. Unfortunately, since there are only two device types, mongo needs to scan 30060 documents to find the first one that matches 'MOBILE'.

I assume this is some kind of web tracking, and the user's usage pattern makes the query slow (if he switched between mobile and web on a daily basis, the query would be fast).

Making this particular query faster could be done using a compound index that contains the device type, e.g. using

a) ensureIndex({'username': 1, 'userAgent.deviceType' : 1, 'timestamp' :-1})

b) ensureIndex({'userAgent.deviceType' : 1, 'username' : 1, 'timestamp' :-1})

Unfortunately, that means that queries like find({"username" : "foo"}).sort({"timestamp" : -1}); can't use the same index anymore, so, as described, the number of indexes will grow very quickly.

I'm afraid there's no very good solution for this using mongodb at this time.
