MongoDB Aggregation Performance

Problem Description

We have a problem with aggregation queries running for a long time (a couple of minutes).

Collection:

We have a collection of 250 million documents with about 20 fields per document; the total size of the collection is 110GB.

We have indexes over "our_id" and dtKey fields.
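
For reference, a minimal sketch of what those two single-field indexes presumably look like (the exact index options are not given in the post, so this is an assumption):

db.our_collection.createIndex({ "our_id": 1 });   // ensureIndex() on older shells
db.our_collection.createIndex({ "dtKey": 1 });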

Hardware:

Memory:

24GB RAM (6 × 4GB DIMMs, 1333 MHz)

Disk:

An 11TB LVM volume built from 4 × 3TB disks:

  • 600MB/s maximum instantaneous data transfers.

  • 7200 RPM spindle. Average latency = 4.16ms

  • RAID 0

CPU:

2 × E5-2420 @ 1.90GHz, a total of 12 cores with 24 threads. Dell R420.

Problem: We are trying to run the following aggregation query:

db.our_collection.aggregate(
    [
        {
            "$match":
            {
                "$and":
                    [
                        {"dtKey":{"$gte":20140916}},
                        {"dtKey":{"$lt":20141217}},
                        {"our_id":"111111111"}
                    ]
            }
        },
        {
            "$project":
            {
                "field1":1,
                "date":1
            }
        },
        {
            "$group":
            {
                "_id":
                {
                    "day":{"$dayOfYear":"$date"},
                    "year":{"$year":"$date"}
                },
                "field1":{"$sum":"$field1"}
            }
        }
    ]
);

This query takes a couple of minutes to run, and while it is running we can see the following:

  • The current Mongo operation yields more than 300K times
  • On iostat we see ~100% disk utilization

After this query is done, the data seems to be in the cache and the same query can be run again in a split second.

After running it for 3-4 users, it seems that the first one's data has already been evicted from the cache and the query takes a long time again.

We have tested a count on the matching part and seen that we have users with 50K documents as well as users with 500K documents.
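
A minimal sketch of that count test, with the collection name and criteria assumed from the first pipeline above:

db.our_collection.count({
    "our_id": "111111111",
    "dtKey": { "$gte": 20140916, "$lt": 20141217 }
});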

We tried to get only the matching part:

db.pub_stats.aggregate(
    [
        {
            "$match":
            {
                "$and":
                    [
                        {"dtKey":{"$gte":20140916}},
                        {"dtKey":{"$lt":20141217}},
                        {" our_id ":"112162107"}
                    ]
            }
        }
    ]
);

And this query seems to take approximately 300-500MB of memory,

But after running the full query, it seems to take 3.5GB of memory.

Questions:

  1. Why does the aggregation pipeline take so much memory?
  2. How can we improve performance so that it runs in a reasonable time for an HTTP request?

Solution

  1. Why does the aggregation pipeline take so much memory?

Just performing a $match doesn't have to read the actual data; it can be done on the indexes. Through the projection's access of field1, the actual documents will have to be read, and they will probably be cached as well.
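
A sketch of how to observe that difference, assuming the compound index on { our_id, dtKey } suggested further below: the first query only returns indexed fields and can be covered by the index, while the second projects field1 and date and therefore has to fetch every matching document.

// Covered: all projected fields live in the index, so no documents are read.
db.our_collection.find(
    { "our_id": "111111111", "dtKey": { "$gte": 20140916, "$lt": 20141217 } },
    { "_id": 0, "our_id": 1, "dtKey": 1 }
).explain();

// Not covered: field1 and date only exist in the documents, so each match is fetched.
db.our_collection.find(
    { "our_id": "111111111", "dtKey": { "$gte": 20140916, "$lt": 20141217 } },
    { "_id": 0, "field1": 1, "date": 1 }
).explain();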

Also, grouping can be expensive. Normally, it should report an error if your grouping stage requires more than 100MB of memory - what version are you using? It needs to scan the entire result set before yielding any output, and MongoDB will have to store at least a pointer or an index for each element in the groups. I guess the key reason for the memory increase is the former.
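
If the $group stage does hit that 100MB limit, the aggregation can be allowed to spill to disk instead of erroring out (an option available since MongoDB 2.6); a minimal sketch reusing the pipeline from the question:

db.our_collection.aggregate(
    [ /* the same $match, $project and $group stages as above */ ],
    { "allowDiskUse": true }   // lets blocking stages write temporary files to disk
);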

  2. How can we improve performance so that it runs in a reasonable time for an HTTP request?

Your dtKey appears to encode time, and the grouping is also done based on time. I'd try to exploit that fact - for instance, by precomputing aggregates for each day and our_id combination - which makes a lot of sense if there are no further criteria and the data doesn't change much anymore.
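
A minimal sketch of such a roll-up, assuming a hypothetical daily_summaries collection and a job that folds in one day at a time (collection, field and variable names are illustrative, not from the original post):

var day = 20141216;   // hypothetical dtKey of the day to roll up
db.our_collection.aggregate([
    { "$match": { "dtKey": day } },
    { "$group": {
        "_id": { "our_id": "$our_id", "dtKey": "$dtKey" },
        "field1": { "$sum": "$field1" }
    } }
]).forEach(function (doc) {
    // One small summary document per (our_id, day); re-running the job is idempotent.
    db.daily_summaries.update(
        { "_id": doc._id },
        { "$set": { "field1": doc.field1 } },
        { "upsert": true }
    );
});

The per-user, per-date-range query then only has to sum a few dozen summary documents per user instead of scanning 50K-500K raw documents.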

Otherwise, I'd try to move the {"our_id":"111111111"} criterion to the first position, because equality should always precede range queries. I guess the query optimizer of the aggregation framework is smart enough, but it's worth a try. Also, you might want to try turning your two indexes into a single compound index { our_id, dtKey }. Index intersection is supported now, but I'm not sure how efficient it really is. Use the built-in profiler and .explain() to analyze your query.
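
A sketch of those suggestions combined - the compound index plus the reordered $match (the explain option shown is the aggregation-framework variant; drop it to actually run the query):

db.our_collection.createIndex({ "our_id": 1, "dtKey": 1 });   // equality field first, then the range field

db.our_collection.aggregate(
    [
        {
            "$match":
            {
                "our_id": "111111111",                          // equality criterion first
                "dtKey": { "$gte": 20140916, "$lt": 20141217 }  // then the range
            }
        },
        {
            "$project": { "field1": 1, "date": 1 }
        },
        {
            "$group":
            {
                "_id": { "day": { "$dayOfYear": "$date" }, "year": { "$year": "$date" } },
                "field1": { "$sum": "$field1" }
            }
        }
    ],
    { "explain": true }
);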

Lastly, MongoDB is designed for write-heavy use, and scanning data sets of hundreds of GB from disk in a matter of milliseconds simply isn't computationally feasible. If your dataset is larger than your RAM, you'll be facing massive IO delays on the scale of tens of milliseconds and upwards, tens or hundreds of thousands of times over, because of all the required disk operations. Remember that with random access you'll never get anywhere close to the theoretical sequential disk transfer rates. If you can't precompute, I guess you'll need a lot more RAM. Maybe SSDs would help, but that is all just guesswork.
