Memory overflow error when using aggregate and $group queries


Problem description

I have a collection named "allvoice" which has the following structure:

{
    "_id" : ObjectId("612599bb1cff80e6fc5cbf38"),
    "subscriber_id" : "e3365edb9c781a561107242a81c1a92b4269ef9a",
    "callednumber" : "559198e6f8814773551a457e53a614d603f9deab",
    "originaltimestamp" : "20200113033803",
    "duration" : "13",
    "maincost" : "255.6",
    "type" : "Internal",
    "type_network" : "local_network",
    "month" : "202001"
}
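For reference, the 14-digit `originaltimestamp` string (`YYYYMMDDhhmmss`) can be parsed into a `Date` outside the database with plain JavaScript. This is an illustrative sketch only; `parseOriginalTimestamp` is a hypothetical helper, not part of the question's code:

```javascript
// Hypothetical helper: parse a 14-digit "YYYYMMDDhhmmss" string into a UTC Date.
function parseOriginalTimestamp(ts) {
  const year  = Number(ts.slice(0, 4));
  const month = Number(ts.slice(4, 6)) - 1; // JS months are 0-based
  const day   = Number(ts.slice(6, 8));
  const hour  = Number(ts.slice(8, 10));
  const min   = Number(ts.slice(10, 12));
  const sec   = Number(ts.slice(12, 14));
  return new Date(Date.UTC(year, month, day, hour, min, sec));
}

console.log(parseOriginalTimestamp("20200113033803").toISOString());
```

This mirrors what the `$toDate` conversion in the pipeline below produces for the sample document (`ISODate("2020-01-13T03:38:03.000Z")`).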


With the field "originaltimestamp" holding the transaction time of each record, and to make it easier to query large volumes of data by day, I use an "aggregate" query with "$group" by date to get the subscriber transactions per day (daily transactions), and then write the result out to a collection named "dailyvoice". From then on, to query transaction data by date, I query the "dailyvoice" collection directly. Here is my query.

db.getCollection('allvoice').aggregate(
  [ 
  { 
      "$project": {
       "date": { "$toDate": "$originaltimestamp" },
       "subscriber_id":1,
       "callednumber":1,
       "originaltimestamp":1,
       "duration": 1,
       "maincost": 1,
       "type": 1,
       "type_network": 1,
       "month":1
  }},
 
  { "$group": {
    "_id": { "$dateToString": { "format": "%Y-%m-%d", "date": "$date" } },
    "data": { $push: "$$ROOT" } ,
    "count": { "$sum": 1 }
  }},
  {
    '$out': 'dailyvoice' 
  }
],  { allowDiskUse: true }
)


And the output of the "dailyvoice" collection is as follows:

{
    "_id" : "2020-01-13",
    "data" : [ 
        {
            "_id" : ObjectId("612599bb1cff80e6fc5cbf38"),
            "subscriber_id" : "e3365edb9c781a561107242a81c1a92b4269ef9a",
            "callednumber" : "559198e6f8814773551a457e53a614d603f9deab",
            "originaltimestamp" : "20200113033803",
            "duration" : "13",
            "maincost" : "255.6",
            "type" : "trong nuoc",
            "type_network" : "local_network",
            "month" : "202001",
            "date" : ISODate("2020-01-13T03:38:03.000Z")
        }, 
        {
            "_id" : ObjectId("612599bb1cff80e6fc5cbf39"),
            "subscriber_id" : "6cf5d711bfa12160eefe62b8bc9c914370eebd70",
            "callednumber" : "0241052d42e5491b0529733716fb6fb04804248f",
            "originaltimestamp" : "20200113041608",
            "duration" : "28",
            "maincost" : "644.0",
            "type" : "trong nuoc",
            "type_network" : "global_network",
            "month" : "202001",
            "date" : ISODate("2020-01-13T04:16:08.000Z")
        }, 
        {
            "_id" : ObjectId("612599bb1cff80e6fc5cbf3a"),
            "subscriber_id" : "3e554a5a920c469da9faf7375c5265c5cf6fb696",
            "callednumber" : "307219a71c028931a4b74f8f5f014ffa16005ee9",
            "originaltimestamp" : "20200113051416",
            "duration" : "202",
            "maincost" : "2961.4",
            "type" : "trong nuoc",
            "type_network" : "local_network",
            "month" : "202001",
            "date" : ISODate("2020-01-13T05:14:16.000Z")
        }
    ],
    "count" : 3.0
}


The problem I have here is that when the collection "allvoice" holds a small data set, the query works fine, but when it holds a large data set, about 114,513,872 records (documents), the query fails with an overflow ("PlanExecutor error during aggregation"). Is there a better solution than scaling up the server configuration? Please take a look and help me find a way to optimize this query. Thank you very much.


After optimizing the query to run month by month, I still get the error: "PlanExecutor error during aggregation :: caused by :: BSONObj size..."

db.getCollection('allvoice').aggregate(
    [

        { $match: { month: "202001" } },

        {
            "$group": {

                "_id": {
                    "$dateToString": {
                        "format": "%Y-%m-%d", "date": { "$toDate": "$originaltimestamp" }
                    }
                },

                "data": {
                    $push: {
                        "subscriber_id": "$subscriber_id",
                        "type": "$type",
                        // "originaltimestamp":"$originaltimestamp"
                        "date": { "$toDate": "$originaltimestamp" },
                        "month": "$month"
                    }
                },

                "count": { "$sum": 1 }
            }
        },
        
        {
            '$out': 'dailyvoice_202001'
        }
    ], { allowDiskUse: true }
)

Recommended answer

Some ideas:


You don't need the first $project stage in the query. And, you can include the { "$toDate": "$originaltimestamp" } within the $group stage's _id, as below:

"_id": { 
    "$dateToString": { 
        "format": "%Y-%m-%d", "date": { "$toDate": "$originaltimestamp" } 
    } 
}


About the $push: "$$ROOT" - instead of the $$ROOT, capture only the fields you need most (or important). This is to reduce the memory usage. For example:

"data": { 
    $push: { 
        "subscriber_id": "$subscriber_id",
        "type": "$type",
        // other required fields...
    } 
}
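To illustrate why pushing only selected fields helps, here is the same group-by-day logic written as plain JavaScript over an in-memory sample. This is an illustrative sketch only, not the server-side aggregation; the sample records and field choices are assumptions for the demo:

```javascript
// Illustrative sketch: group records by day, pushing only two fields per record
// instead of the whole document ($$ROOT), which reduces per-group size.
const records = [
  { subscriber_id: "a1", type: "Internal", originaltimestamp: "20200113033803", maincost: "255.6" },
  { subscriber_id: "b2", type: "Internal", originaltimestamp: "20200113041608", maincost: "644.0" },
  { subscriber_id: "c3", type: "Internal", originaltimestamp: "20200114051416", maincost: "2961.4" }
];

const byDay = {};
for (const r of records) {
  const ts = r.originaltimestamp;
  // Same key shape as $dateToString with format "%Y-%m-%d".
  const day = `${ts.slice(0, 4)}-${ts.slice(4, 6)}-${ts.slice(6, 8)}`;
  if (!byDay[day]) byDay[day] = { _id: day, data: [], count: 0 };
  // Push only the fields needed, mirroring the answer's advice.
  byDay[day].data.push({ subscriber_id: r.subscriber_id, type: r.type });
  byDay[day].count += 1;
}

console.log(Object.values(byDay));
```

Each group document now carries only the fields you chose to push; in the real pipeline this keeps the per-day output documents smaller.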


Finally, you can consider restricting the query to a set of dates at a time. This requires running the query more than once over different date ranges, but I think it may fare better overall. For example, match one month at a time on the month field. And this month field can be indexed for performance. This requires including a $match stage at the beginning (as the first stage) of the query, for example:

{ $match: { month: "202001" } }


And, this will query data for the month of January 2020.
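The index the answer mentions can be created with a standard `createIndex` call in the mongo shell (this assumes a running deployment with the `allvoice` collection; it is a sketch, not output from the original post):

```javascript
// Ascending index on "month" so the leading $match stage can use it.
db.allvoice.createIndex({ month: 1 })
```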

