MongoDB seems to choose the wrong index when doing aggregate


Question

A test MongoDB instance (version 3.0.1) is running on Amazon EC2 (kernel 3.14.33-26.47.amzn1.x86_64, t2.medium: 2 vCPUs, 4 GB memory).

It has a collection "access_log" (about 40,000,000 records, 1,000,000 per day) with several indexes on it:

...

db.access_log.ensureIndex({ visit_dt: 1, 'username': 1 })

db.access_log.ensureIndex({ visit_dt: 1, 'file': 1 })
...

The following "aggregate" is extremely slow (it takes several hours):

db.access_log.aggregate([
    { "$match": { "visit_dt": { "$gte": ISODate('2015-03-09'), "$lt": ISODate('2015-03-11') } } },
    { "$project": { "file": 1,  "_id": 0 } },
    { "$group": { "_id": "$file", "count": { "$sum": 1 } } },
    { "$sort": { "count": -1 } }
])

All fields needed for this aggregation are included in the second index, { visit_dt: 1, 'file': 1 }, that is, "visit_dt_1_file_1".

So I am confused about why MongoDB does not use this index and picks the other one instead.

The explain output always shows the following, which I do not understand at all.

Could you please help? Thanks a lot!

> db.access_log.aggregate([
...     { "$match": { "visit_dt": { "$gte": ISODate('2015-03-09'), "$lt": ISODate('2015-03-11') } } },
...     { "$project": { "file": 1,  "_id": 0 } },
...     { "$group": { "_id": "$file", "count": { "$sum": 1 } } },
...     { "$sort": { "count": -1 } }
... ], { explain: true } );
{
        "stages" : [
                {
                        "$cursor" : {
                                "query" : {
                                        "visit_dt" : {
                                                "$gte" : ISODate("2015-03-09T00:00:00Z"),
                                                "$lt" : ISODate("2015-03-11T00:00:00Z")
                                        }
                                },
                                "fields" : {
                                        "file" : 1,
                                        "_id" : 0
                                },
                                "queryPlanner" : {
                                        "plannerVersion" : 1,
                                        "namespace" : "xxxx.access_log",
                                        "indexFilterSet" : false,
                                        "parsedQuery" : {
                                                "$and" : [
                                                        {
                                                                "visit_dt" : {
                                                                        "$lt" : ISODate("2015-03-11T00:00:00Z")
                                                                }
                                                        },
                                                        {
                                                                "visit_dt" : {
                                                                        "$gte" : ISODate("2015-03-09T00:00:00Z")
                                                                }
                                                        }
                                                ]
                                        },
                                        "winningPlan" : {
                                                "stage" : "FETCH",
                                                "inputStage" : {
                                                        "stage" : "IXSCAN",
                                                        "keyPattern" : {
                                                                "visit_dt" : 1,
                                                                "username" : 1
                                                        },
                                                        "indexName" : "visit_dt_1_username_1",
                                                        "isMultiKey" : false,
                                                        "direction" : "forward",
                                                        "indexBounds" : {
                                                                "visit_dt" : [
                                                                        "[new Date(1425859200000), new Date(1426032000000))"
                                                                ],
                                                                "username" : [
                                                                        "[MinKey, MaxKey]"
                                                                ]
                                                        }
                                                }
                                        },
                                        "rejectedPlans" : [
  ...
                                                {
                                                        "stage" : "FETCH",
                                                        "inputStage" : {
                                                                "stage" : "IXSCAN",
                                                                "keyPattern" : {
                                                                        "visit_dt" : 1,
                                                                        "file" : 1
                                                                },
                                                                "indexName" : "visit_dt_1_file_1",
                                                                "isMultiKey" : false,
                                                                "direction" : "forward",
                                                                "indexBounds" : {
                                                                        "visit_dt" : [
                                                                                "[new Date(1425859200000), new Date(1426032000000))"
                                                                        ],
                                                                        "file" : [
                                                                                "[MinKey, MaxKey]"
                                                                        ]
                                                                }
                                                        }
                                                },
...
                                        ]
                                }
                        }
                },
                {
                        "$project" : {
                                "_id" : false,
                                "file" : true
                        }
                },
                {
                        "$group" : {
                                "_id" : "$file",
                                "count" : {
                                        "$sum" : {
                                                "$const" : 1
                                        }
                                }
                        }
                },
                {
                        "$sort" : {
                                "sortKey" : {
                                        "count" : -1
                                }
                        }
                }
        ],
        "ok" : 1
}
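As a reading aid for the explain output above, here is a minimal sketch in plain JavaScript (Node.js, not the 3.0 mongo shell) that walks a plan and reports which index it uses and whether it contains a FETCH stage. The helper name summarizePlan is hypothetical, and the sketch assumes single-child plans linked through "inputStage", which is the only shape shown in this output. The sample object is a trimmed copy of the winning plan. Note that both the winning plan and the rejected "visit_dt_1_file_1" plan shown above include a FETCH stage.

```javascript
// Hypothetical helper: walk a plan (as found under queryPlanner.winningPlan)
// and collect its stage names and the names of any indexes it scans.
// Assumption: every stage has at most one child, stored in "inputStage".
function summarizePlan(plan) {
  const stages = [];
  const indexes = [];
  let node = plan;
  while (node) {
    stages.push(node.stage);
    if (node.indexName) {
      indexes.push(node.indexName);
    }
    node = node.inputStage;
  }
  return {
    stages: stages,
    indexes: indexes,
    // A FETCH stage means full documents are loaded, not only index keys.
    fetchesDocuments: stages.includes("FETCH")
  };
}

// Trimmed copy of the winning plan from the explain output above.
const winningPlan = {
  stage: "FETCH",
  inputStage: {
    stage: "IXSCAN",
    keyPattern: { visit_dt: 1, username: 1 },
    indexName: "visit_dt_1_username_1"
  }
};

const summary = summarizePlan(winningPlan);
console.log(summary.indexes[0], summary.fetchesDocuments);
// prints: visit_dt_1_username_1 true
```

Running the same helper over each entry of "rejectedPlans" would make it easier to compare the candidate plans side by side.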


Recommended answer

You might want to read the docs regarding $sort performance:


$sort operator can take advantage of an index when placed at the beginning of the pipeline or placed before the $project, $unwind, and $group aggregation operators. If $project, $unwind, or $group occur prior to the $sort operation, $sort cannot use any indexes.
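The ordering rule quoted above can be sketched as a small check in plain JavaScript (Node.js). The function name sortIndexEligibility is hypothetical. The sketch only looks at stage order as described in the quote; it does not know whether a suitable index actually exists, and it is not how the server itself decides. The sample pipeline is the one from the question.

```javascript
// Stage operators that, per the quoted documentation, prevent a later
// $sort from using an index.
const BLOCKING_STAGES = ["$project", "$unwind", "$group"];

// Hypothetical helper: for each $sort stage, report whether it appears
// before any blocking stage in the pipeline.
function sortIndexEligibility(pipeline) {
  let blocked = false;
  const result = [];
  pipeline.forEach(function (stage, position) {
    const operator = Object.keys(stage)[0];
    if (operator === "$sort") {
      result.push({ position: position, sortKey: stage.$sort, canUseIndex: !blocked });
    }
    if (BLOCKING_STAGES.includes(operator)) {
      blocked = true;
    }
  });
  return result;
}

// The pipeline from the question (Date is used in place of the shell's ISODate).
const original = [
  { $match: { visit_dt: { $gte: new Date("2015-03-09"), $lt: new Date("2015-03-11") } } },
  { $project: { file: 1, _id: 0 } },
  { $group: { _id: "$file", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
];

console.log(sortIndexEligibility(original));
```

For the question's pipeline, the only $sort sits after both $project and $group, so by the quoted rule it cannot use an index.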

Also, keep in mind that it is called an 'aggregation pipeline' for a reason. It simply doesn't matter where you sort after matching. So the solution should be pretty simple:

db.access_log.aggregate([
  {
       "$match": { 
          "visit_dt": {
             "$gte": ISODate('2015-03-09'),
             "$lt": ISODate('2015-03-11')
           },
           "file": {"$exists": true }
        } 
  },
  { "$sort": { "file": 1 } },
  { "$project": { "file": 1,  "_id": 0 } },
  { "$group": { "_id": "$file", "count": { "$sum": 1 } } },
  { "$sort": { "count": -1 } }
])

The check whether the file field exists might be unnecessary if the field is guaranteed to exist in every record. It does not hurt, as there is an index on the field. The same goes for the additional sort: since we made sure that only documents containing a file field enter the pipeline, the index should be used.
