Terribly degraded performance with other join conditions in $lookup (using pipeline)


Problem description

During a code review I decided to improve the performance of an existing query by rewriting an aggregation that looked like this:

    .aggregate([
        //difference starts here
        {
            "$lookup": {
                "from": "sessions",
                "localField": "_id",
                "foreignField": "_client",
                "as": "sessions"
            }
        },
        {
            $unwind: "$sessions"
        },
        {
            $match: {
                "sessions.deleted_at": null
            }
        },
        //difference ends here
        {
            $project: {
                name: client_name_concater,
                email: '$email',
                phone: '$phone',
                address: addressConcater,
                updated_at: '$updated_at',
            }
        }
    ]);

to this:

    .aggregate([
        //difference starts here
        {
            $lookup: {
                from: 'sessions',
                let: {
                    id: "$_id"
                },
                pipeline: [
                    {
                        $match: {
                            $expr: {
                                $and: [
                                    {
                                        $eq: ["$_client", "$$id"]
                                    },
                                    {
                                        $eq: ["$deleted_at", null]
                                    }
                                ]
                            }
                        }
                    }
                ],
                as: 'sessions'
            }
        },
        {
            $match: {
                "sessions": {$ne: []}
            }
        },
        //difference ends here
        {
            $project: {
                name: client_name_concater,
                email: '$email',
                phone: '$phone',
                address: addressConcater,
                updated_at: '$updated_at',
            }
        }
    ]);

I expected the second version to be faster, since it has one less stage, but the performance difference is massive in the opposite direction: the first query runs in ~40 ms on average, while the second takes between 3.5 and 5 seconds, roughly 100 times longer. The joined collection (sessions) has around 120 documents and this one about 152. Even if the absolute times were acceptable for that data size, why is there such a difference between the two? Isn't it basically the same thing, just moving the join condition into the pipeline alongside the main join condition? Am I missing something?

The functions and variables used in the projection (client_name_concater, addressConcater) are mostly static values or string concatenations and shouldn't affect the $lookup part.
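
For illustration only, expressions like these are usually plain $concat projections. A hypothetical sketch (the real definitions and field names are not shown in the post):

    // Hypothetical stand-ins for the projection helpers used above;
    // the actual definitions are not part of the question.
    const client_name_concater = { $concat: ["$first_name", " ", "$last_name"] };
    const addressConcater = { $concat: ["$address.street", ", ", "$address.city"] };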

Thanks

EDIT:

Added the query plans. For version 1:

    {
        "stages": [
            {
                "$cursor": {
                    "query": {
                        "$and": [
                            {
                                "deleted_at": null
                            },
                            {}
                        ]
                    },
                    "fields": {
                        "email": 1,
                        "phone": 1,
                        "updated_at": 1,
                        "_id": 1
                    },
                    "queryPlanner": {
                        "plannerVersion": 1,
                        "namespace": "test.clients",
                        "indexFilterSet": false,
                        "parsedQuery": {
                            "deleted_at": {
                                "$eq": null
                            }
                        },
                        "winningPlan": {
                            "stage": "COLLSCAN",
                            "filter": {
                                "deleted_at": {
                                    "$eq": null
                                }
                            },
                            "direction": "forward"
                        },
                        "rejectedPlans": []
                    }
                }
            },
            {
                "$lookup": {
                    "from": "sessions",
                    "as": "sessions",
                    "localField": "_id",
                    "foreignField": "_client",
                    "unwinding": {
                        "preserveNullAndEmptyArrays": false
                    }
                }
            },
            {
                "$project": {
                    "_id": true,
                    "email": "$email",
                    "phone": "$phone",
                    "updated_at": "$updated_at"
                }
            }
        ],
        "ok": 1
    }

For version 2:

    {
        "stages": [
            {
                "$cursor": {
                    "query": {
                        "deleted_at": null
                    },
                    "fields": {
                        "email": 1,
                        "phone": 1,
                        "sessions": 1,
                        "updated_at": 1,
                        "_id": 1
                    },
                    "queryPlanner": {
                        "plannerVersion": 1,
                        "namespace": "test.clients",
                        "indexFilterSet": false,
                        "parsedQuery": {
                            "deleted_at": {
                                "$eq": null
                            }
                        },
                        "winningPlan": {
                            "stage": "COLLSCAN",
                            "filter": {
                                "deleted_at": {
                                    "$eq": null
                                }
                            },
                            "direction": "forward"
                        },
                        "rejectedPlans": []
                    }
                }
            },
            {
                "$lookup": {
                    "from": "sessions",
                    "as": "sessions",
                    "let": {
                        "id": "$_id"
                    },
                    "pipeline": [
                        {
                            "$match": {
                                "$expr": {
                                    "$and": [
                                        {
                                            "$eq": [
                                                "$_client",
                                                "$$id"
                                            ]
                                        },
                                        {
                                            "$eq": [
                                                "$deleted_at",
                                                null
                                            ]
                                        }
                                    ]
                                }
                            }
                        }
                    ]
                }
            },
            {
                "$match": {
                    "sessions": {
                        "$not": {
                            "$eq": []
                        }
                    }
                }
            },
            {
                "$project": {
                    "_id": true,
                    "email": "$email",
                    "phone": "$phone",
                    "updated_at": "$updated_at"
                }
            }
        ],
        "ok": 1
    }
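
For reference, plans like the two above can be produced with the shell's explain helper; this sketch assumes the outer collection is clients, matching the test.clients namespace in the plans:

    // Returns the parsed stages and winning plan shown above
    // without running the pipeline to completion.
    db.clients.explain().aggregate([
        /* pipeline under test */
    ]);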

One thing of note: the joined sessions collection has some properties containing very large data (imported from elsewhere), so I am wondering whether that data is somehow inflating the amount of work the query does. But why would that affect the two $lookup versions differently?
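
One way to test that theory would be to project the large imported fields away inside the joined pipeline, so they never enter the $lookup output. A sketch (the kept field list is hypothetical):

    {
        $lookup: {
            from: 'sessions',
            let: { id: "$_id" },
            pipeline: [
                {
                    $match: {
                        $expr: {
                            $and: [
                                { $eq: ["$_client", "$$id"] },
                                { $eq: ["$deleted_at", null] }
                            ]
                        }
                    }
                },
                // Keep only the fields needed downstream; the large
                // imported properties are dropped here.
                { $project: { _id: 1, deleted_at: 1 } }
            ],
            as: 'sessions'
        }
    }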

Solution

The second version adds an aggregation pipeline execution for each document in the joined collection.

The documentation says:

Specifies the pipeline to run on the joined collection. The pipeline determines the resulting documents from the joined collection. To return all documents, specify an empty pipeline [].
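
For example, the minimal uncorrelated form with an empty pipeline attaches every document from the joined collection to every input document:

    // No `let` and an empty pipeline: all of `sessions` is
    // returned for each input document.
    {
        $lookup: {
            from: "sessions",
            pipeline: [],
            as: "all_sessions"
        }
    }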

The pipeline is executed for each document in the collection, not for each matched document.

Depending on how large the collection is (both number of documents and document size), this can add up to a significant amount of time. With roughly 150 documents on one side and 120 on the other, that could mean the $match expression being evaluated on the order of 150 × 120 ≈ 18,000 times per run.

"after removing the limit, the pipeline version jumped to over 10 seconds"

Makes sense: all of the additional documents exposed by removing the limit must also have the aggregation pipeline executed for them.

It is possible that per-document execution of the aggregation pipeline isn't as optimized as it could be. For example, if the pipeline is set up and torn down for each document, that alone could easily cost more than evaluating the $match conditions themselves.

Is there any case for using one over the other?

Executing an aggregation pipeline per joined document provides additional flexibility. If you need that flexibility, it may make sense to use the pipeline form, though performance needs to be considered regardless. If you don't, use the more performant plain join.
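
In this case the first version already is that more performant approach. If you would rather keep the sessions as a filtered array instead of unwinding, a plain equality $lookup followed by $filter gives the same effect without a per-document pipeline; a sketch, not taken from the original post:

    {
        $lookup: {
            from: "sessions",
            localField: "_id",
            foreignField: "_client",
            as: "sessions"
        }
    },
    {
        // Drop soft-deleted sessions after the equality join...
        $addFields: {
            sessions: {
                $filter: {
                    input: "$sessions",
                    as: "s",
                    cond: { $eq: ["$$s.deleted_at", null] }
                }
            }
        }
    },
    {
        // ...and discard clients with no remaining sessions.
        $match: { sessions: { $ne: [] } }
    }

The $filter runs once per client over an in-memory array, so the filtering cost stays proportional to the matched sessions rather than to a pipeline execution per document.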
