Query performance issue for large nested data in MongoDB


Problem description

I'm trying to query results from a large dataset called 'tasks', containing 187,297 documents, which is nested inside another dataset called 'workers', which in turn is nested inside a collection called 'production_units'.


production_units -> workers -> tasks

(BTW this is a simplified version of production_units):

[{
    "_id": ObjectId("5aca27b926974863ed9f01ab"),
    "name": "Z",
    "workers": [{
        "name": "X Y",
        "worker_number": 655,
        "employed": false,
        "_id": ObjectId("5aca27bd26974863ed9f0425"),
        "tasks": [{
            "_id": ObjectId("5ac9f6c2e1a668d6d39c1fd1"),
            "inbound_order_number": 3296,
            "task_number": 90,
            "minutes_elapsed": 120,
            "date": "2004-11-30",
            "start": 1101823200,
            "pieces_actual": 160,
            "pause_from": 1101812400,
            "pause_to": 1101814200
        }]
    }]
}]

In order to accomplish this I have used the following aggregation command:

db.production_units.aggregate([{
    '$project': {
        'workers': '$workers'
    }
}, {
    '$unwind': '$workers'
}, {
    '$project': {
        'tasks': '$workers.tasks',
        'worker_number': '$workers.worker_number'
    }
}, {
    '$unwind': '$tasks'
}, {
    '$project': {
        'task_number': '$tasks.task_number',
        'pieces_actual': '$tasks.pieces_actual',
        'minutes_elapsed': '$tasks.minutes_elapsed',
        'worker_number': 1,
        'start': '$tasks.start',
        'inbound_order_number': '$tasks.inbound_order_number',
        'pause_from': '$tasks.pause_from',
        'date': '$tasks.date',
        '_id': '$tasks._id',
        'pause_to': '$tasks.pause_to'
    }
}, {
    '$match': {
        'start': {
            '$exists': true
        }
    }
}, {
    '$group': {
        'entries_count': {
            '$sum': 1
        },
        '_id': null,
        'entries': {
            '$push': '$$ROOT'
        }
    }
}, {
    '$project': {
        'entries_count': 1,
        '_id': 0,
        'entries': 1
    }
}, {
    '$unwind': '$entries'
}, {
    '$project': {
        'task_number': '$entries.task_number',
        'pieces_actual': '$entries.pieces_actual',
        'minutes_elapsed': '$entries.minutes_elapsed',
        'worker_number': '$entries.worker_number',
        'start': '$entries.start',
        'inbound_order_number': '$entries.inbound_order_number',
        'pause_from': '$entries.pause_from',
        'date': '$entries.date',
        'entries_count': 1,
        '_id': '$entries._id',
        'pause_to': '$entries.pause_to'
    }
}, {
    '$sort': {
        'start': 1
    }
}, {
    '$skip': 187290
}, {
    '$limit': 10
}], {
    allowDiskUse: true
})

The returned documents are:

{ "entries_count" : 187297, "task_number" : 100, "pieces_actual" : 68, "minutes_elapsed" : 102, "worker_number" : 411, "start" : 1594118400, "inbound_order_number" : 8569, "pause_from" : 1594119600, "date" : "2020-07-07", "_id" : ObjectId("5ac9f6d3e1a668d6d3a06351"), "pause_to" : 1594119600 } { "entries_count" : 187297, "task_number" : 130, "pieces_actual" : 20, "minutes_elapsed" : 30, "worker_number" : 549, "start" : 1596531600, "inbound_order_number" : 7683, "pause_from" : 1596538800, "date" : "2020-08-04", "_id" : ObjectId("5ac9f6cde1a668d6d39f1b26"), "pause_to" : 1596538800 } { "entries_count" : 187297, "task_number" : 210, "pieces_actual" : 84, "minutes_elapsed" : 180, "worker_number" : 734, "start" : 1601276400, "inbound_order_number" : 8330, "pause_from" : 1601290800, "date" : "2020-09-28", "_id" : ObjectId("5ac9f6d0e1a668d6d39fd677"), "pause_to" : 1601290800 } { "entries_count" : 187297, "task_number" : 20, "pieces_actual" : 64, "minutes_elapsed" : 90, "worker_number" : 114, "start" : 1601800200, "inbound_order_number" : 7690, "pause_from" : 1601809200, "date" : "2020-10-04", "_id" : ObjectId("5ac9f6cee1a668d6d39f3032"), "pause_to" : 1601811900 } { "entries_count" : 187297, "task_number" : 140, "pieces_actual" : 70, "minutes_elapsed" : 84, "worker_number" : 49, "start" : 1603721640, "inbound_order_number" : 4592, "pause_from" : 1603710000, "date" : "2020-10-26", "_id" : ObjectId("5ac9f6c8e1a668d6d39df664"), "pause_to" : 1603712700 } { "entries_count" : 187297, "task_number" : 80, "pieces_actual" : 20, "minutes_elapsed" : 30, "worker_number" : 277, "start" : 1796628600, "inbound_order_number" : 4655, "pause_from" : 1796641200, "date" : "2026-12-07", "_id" : ObjectId("5ac9f6c8e1a668d6d39e1fc0"), "pause_to" : 1796643900 } { "entries_count" : 187297, "task_number" : 40, "pieces_actual" : 79, "minutes_elapsed" : 123, "worker_number" : 96, "start" : 3802247580, "inbound_order_number" : 4592, "pause_from" : 3802244400, "date" : "2090-06-27", "_id" : ObjectId("5ac9f6c8e1a668d6d39de218"), "pause_to" : 3802244400 }

However, the query takes seconds to show the results, instead of a few milliseconds. This is the result returned by the profiler:

db.system.profile.findOne().millis
3216
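
For reference, db.system.profile is only populated while profiling is enabled; a minimal sketch of turning it on and reading back the latest entry (the database name my_db is taken from the explain output below):

// enable profiling for all operations (level 1 would log only slow ones)
db.setProfilingLevel(2)
// ...run the aggregation, then inspect the most recent profile entry:
db.system.profile.find().sort({ ts: -1 }).limit(1).pretty()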

(Update)

Even the following simplified count query takes 312 ms to execute, instead of a few milliseconds:

db.production_units.aggregate([{
        "$unwind": "$workers"
    }, {
        "$unwind": "$workers.tasks"
    },
    {
        "$count": "entries_count"
    }
])

This is what explain() returns for the query above:

{
    "stages" : [
        {
            "$cursor" : {
                "query" : {

                },
                "fields" : {
                    "workers" : 1,
                    "_id" : 0
                },
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "my_db.production_units",
                    "indexFilterSet" : false,
                    "parsedQuery" : {

                    },
                    "winningPlan" : {
                        "stage" : "COLLSCAN",
                        "direction" : "forward"
                    },
                    "rejectedPlans" : [ ]
                },
                "executionStats" : {
                    "executionSuccess" : true,
                    "nReturned" : 28,
                    "executionTimeMillis" : 13,
                    "totalKeysExamined" : 0,
                    "totalDocsExamined" : 28,
                    "executionStages" : {
                        "stage" : "COLLSCAN",
                        "nReturned" : 28,
                        "executionTimeMillisEstimate" : 0,
                        "works" : 30,
                        "advanced" : 28,
                        "needTime" : 1,
                        "needYield" : 0,
                        "saveState" : 1,
                        "restoreState" : 1,
                        "isEOF" : 1,
                        "invalidates" : 0,
                        "direction" : "forward",
                        "docsExamined" : 28
                    },
                    "allPlansExecution" : [ ]
                }
            }
        },
        {
            "$unwind" : {
                "path" : "$workers"
            }
        },
        {
            "$unwind" : {
                "path" : "$workers.tasks"
            }
        },
        {
            "$group" : {
                "_id" : {
                    "$const" : null
                },
                "entries_count" : {
                    "$sum" : {
                        "$const" : 1
                    }
                }
            }
        },
        {
            "$project" : {
                "_id" : false,
                "entries_count" : true
            }
        }
    ],
    "ok" : 1
}

I'm not an experienced DBA, so I don't know exactly what I'm missing in my aggregation pipeline to solve the performance issue I'm facing. I have also investigated the problem and done some research, but without finding any solution.

What am I missing?

Recommended answer

Without the explain() output of the query it's impossible to know for sure what the bottleneck is. However, here is some advice on how to improve this query:

The query contains 5 $project stages, when only one is actually needed. This can add a lot of overhead, especially when applied to a large number of documents. Instead, use dot notation to access nested fields, for example:

{ "$unwind": "$workers.tasks" }

Call $match as early as possible

$match filters documents out of the pipeline, so place it as early as possible so that later aggregation stages run on fewer documents (as done in the suggested query below).

As the query returns only 10 documents, there is no need to apply the $project stage to the 180,000 other docs.

This is likely to be the bottleneck. Make sure that the field workers.tasks.start is indexed (see MongoDB ensureIndex(), called createIndex() in current versions, for details).
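
For example, from the mongo shell (MongoDB automatically builds a multikey index here, since workers and tasks are arrays):

// index the nested start field used by $match and $sort
db.production_units.createIndex({ "workers.tasks.start": 1 })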

Instead of using the $group/$unwind stages to count the matching documents, run a second query in parallel that counts only the number of matching documents (as sketched after the two queries below).

The main query would now look like this:

db.collection.aggregate([
    // one document per worker, then one per task
    { "$unwind": "$workers" },
    { "$unwind": "$workers.tasks" },
    // filter as early as possible, before sorting
    {
        "$match": {
            "workers.tasks.start": { "$ne": null }
        }
    },
    { "$sort": { "workers.tasks.start": 1 } },
    // paginate before projecting, so $project only touches 10 documents
    { "$skip": 0 },
    { "$limit": 10 },
    // a single $project, reading nested fields via dot notation
    {
        "$project": {
            "task_number": "$workers.tasks.task_number",
            "pieces_actual": "$workers.tasks.pieces_actual",
            "minutes_elapsed": "$workers.tasks.minutes_elapsed",
            "worker_number": "$workers.worker_number",
            "start": "$workers.tasks.start",
            "inbound_order_number": "$workers.tasks.inbound_order_number",
            "pause_from": "$workers.tasks.pause_from",
            "date": "$workers.tasks.date",
            "_id": "$workers.tasks._id",
            "pause_to": "$workers.tasks.pause_to"
        }
    }
])

You can try it here: mongoplayground.net/p/yua7qspo2Jj

And the count query should be:

db.collection.aggregate([{
        "$unwind": "$workers"
    }, {
        "$unwind": "$workers.tasks"
    }, {
        "$match": {
            "workers.tasks.start": {
                "$ne": null
            }
        }
    },
    {
        "$count": "entries_count"
    }
])
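
To actually run the two pipelines at the same time, issue them concurrently from the application; a hypothetical Node.js driver sketch, where pagePipeline and countPipeline stand for the two aggregations above:

// run the page query and the count query in parallel
const [entries, countDocs] = await Promise.all([
    db.collection('production_units').aggregate(pagePipeline, { allowDiskUse: true }).toArray(),
    db.collection('production_units').aggregate(countPipeline).toArray()
]);
// $count returns [{ entries_count: N }], or [] when nothing matches
const entriesCount = countDocs.length ? countDocs[0].entries_count : 0;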

