Memory overflow error when using aggregate and $group queries
Question
I have a collection named "allvoice" which has the following structure:
{
"_id" : ObjectId("612599bb1cff80e6fc5cbf38"),
"subscriber_id" : "e3365edb9c781a561107242a81c1a92b4269ef9a",
"callednumber" : "559198e6f8814773551a457e53a614d603f9deab",
"originaltimestamp" : "20200113033803",
"duration" : "13",
"maincost" : "255.6",
"type" : "Internal",
"type_network" : "local_network",
"month" : "202001"
}
The field "originaltimestamp" holds the transaction time of each record. To make it easier to query large volumes of data by day, I use an aggregate with $group by date to collect the subscriber transactions per day (daily transactions), and then write the result out to a collection named "dailyvoice". From then on, to query transaction data by date, I query the "dailyvoice" collection directly. Here is my query:
db.getCollection('allvoice').aggregate(
[
{
"$project": {
"date": { "$toDate": "$originaltimestamp" },
"subscriber_id":1,
"callednumber":1,
"originaltimestamp":1,
"duration": 1,
"maincost": 1,
"type": 1,
"type_network": 1,
"month":1
}},
{ "$group": {
"_id": { "$dateToString": { "format": "%Y-%m-%d", "date": "$date" } },
"data": { $push: "$$ROOT" } ,
"count": { "$sum": 1 }
}},
{
'$out': 'dailyvoice'
}
], { allowDiskUse: true }
)
And the output in the "dailyvoice" collection is as follows:
{
"_id" : "2020-01-13",
"data" : [
{
"_id" : ObjectId("612599bb1cff80e6fc5cbf38"),
"subscriber_id" : "e3365edb9c781a561107242a81c1a92b4269ef9a",
"callednumber" : "559198e6f8814773551a457e53a614d603f9deab",
"originaltimestamp" : "20200113033803",
"duration" : "13",
"maincost" : "255.6",
"type" : "trong nuoc",
"type_network" : "local_network",
"month" : "202001",
"date" : ISODate("2020-01-13T03:38:03.000Z")
},
{
"_id" : ObjectId("612599bb1cff80e6fc5cbf39"),
"subscriber_id" : "6cf5d711bfa12160eefe62b8bc9c914370eebd70",
"callednumber" : "0241052d42e5491b0529733716fb6fb04804248f",
"originaltimestamp" : "20200113041608",
"duration" : "28",
"maincost" : "644.0",
"type" : "trong nuoc",
"type_network" : "global_network",
"month" : "202001",
"date" : ISODate("2020-01-13T04:16:08.000Z")
},
{
"_id" : ObjectId("612599bb1cff80e6fc5cbf3a"),
"subscriber_id" : "3e554a5a920c469da9faf7375c5265c5cf6fb696",
"callednumber" : "307219a71c028931a4b74f8f5f014ffa16005ee9",
"originaltimestamp" : "20200113051416",
"duration" : "202",
"maincost" : "2961.4",
"type" : "trong nuoc",
"type_network" : "local_network",
"month" : "202001",
"date" : ISODate("2020-01-13T05:14:16.000Z")
}
],
"count" : 3.0
}
The problem I have is that when the collection "allvoice" holds a small data set, the query works fine, but when it holds a large data set, about 114,513,872 documents, the query fails with an overflow error ("PlanExecutor error during aggregation"). Is there a better solution than upgrading the server? Please help me find a way to optimize this query. Thank you very much.
After I optimized the query to run month by month, I still get the error: "PlanExecutor error during aggregation :: caused by :: BSONObj size..."
db.getCollection('allvoice').aggregate(
[
{ $match: { month: "202001" } },
{
"$group": {
"_id": {
"$dateToString": {
"format": "%Y-%m-%d", "date": { "$toDate": "$originaltimestamp" }
}
},
"data": {
$push: {
"subscriber_id": "$subscriber_id",
"type": "$type",
// "originaltimestamp":"$originaltimestamp"
"date": { "$toDate": "$originaltimestamp" },
"month": "$month"
}
},
"count": { "$sum": 1 }
}
},
{
'$out': 'dailyvoice_202001'
}
], { allowDiskUse: true }
)
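For intuition, here is a plain-JavaScript sketch (not MongoDB code; the field names come from the sample documents above) of what this $group stage builds: one output document per day, whose data array holds every record of that day. With roughly 114 million input documents, a single busy day's array can easily exceed MongoDB's 16 MB BSON document limit, which is what the "BSONObj size" error reports.

```javascript
// Simulates the $group stage: one document per day with a "data" array.
// The array grows with the number of records in that day, without bound.
function groupByDay(records) {
  const groups = new Map();
  for (const r of records) {
    // Day key: first 8 digits of "originaltimestamp", reformatted the way
    // $dateToString { format: "%Y-%m-%d" } would render the parsed date.
    const t = r.originaltimestamp;
    const key = `${t.slice(0, 4)}-${t.slice(4, 6)}-${t.slice(6, 8)}`;
    if (!groups.has(key)) groups.set(key, { _id: key, data: [], count: 0 });
    const g = groups.get(key);
    g.data.push(r); // equivalent of $push: "$$ROOT" — this is what overflows
    g.count += 1;
  }
  return [...groups.values()];
}

const out = groupByDay([
  { originaltimestamp: "20200113033803", duration: "13" },
  { originaltimestamp: "20200113041608", duration: "28" },
  { originaltimestamp: "20200114051416", duration: "202" },
]);
console.log(out[0]._id, out[0].count); // 2020-01-13 2
```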
Answer
Some ideas:

You don't need the first $project stage in the query. And you can include the { "$toDate": "$originaltimestamp" } within the $group stage's _id, as below:
"_id": {
"$dateToString": {
"format": "%Y-%m-%d", "date": { "$toDate": "$originaltimestamp" }
}
}
About the $push: "$$ROOT" - instead of $$ROOT, capture only the fields you need most (or the important ones). This reduces memory usage. For example:
"data": {
$push: {
"subscriber_id": "$subscriber_id",
"type": "$type",
// other required fields...
}
}
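A rough size comparison illustrates the saving. This is plain JavaScript using JSON string lengths as a stand-in for BSON sizes (the exact byte counts differ, but the ratio is indicative); the field values are taken from the sample document in the question.

```javascript
// Pushing the whole record ($$ROOT) vs. only the suggested fields.
const full = {
  _id: "612599bb1cff80e6fc5cbf38",
  subscriber_id: "e3365edb9c781a561107242a81c1a92b4269ef9a",
  callednumber: "559198e6f8814773551a457e53a614d603f9deab",
  originaltimestamp: "20200113033803",
  duration: "13",
  maincost: "255.6",
  type: "Internal",
  type_network: "local_network",
  month: "202001",
};
const trimmed = { subscriber_id: full.subscriber_id, type: full.type };

// The trimmed element is well under half the size, so the per-day "data"
// array holds far more elements before hitting the 16 MB document limit.
console.log(JSON.stringify(full).length > 2 * JSON.stringify(trimmed).length); // true
```

Note that trimming only delays the problem: a sufficiently busy day can still overflow the 16 MB limit, whatever the per-element size.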
Finally, you can consider restricting the query to a set of dates at a time. This requires running the query more than once for different date ranges, but I think it may fare better overall. For example, match a month at a time on the month field. And this month field can be indexed for performance. This requires including a $match stage at the beginning (the first stage) of the query, for example:
{ $match: { month: "202001" } }
And this will query the data for January 2020.
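To run the job month by month, you would repeat the pipeline once per month value, writing each month to its own output collection (as the question's dailyvoice_202001 does). A small plain-JavaScript sketch of the driver loop's bookkeeping; buildMonths is a hypothetical helper, and the actual aggregate call per month is omitted:

```javascript
// Enumerate the "month" values for a year in the collection's YYYYMM format,
// and derive the per-month output collection name for each run.
function buildMonths(year) {
  return Array.from({ length: 12 }, (_, i) =>
    `${year}${String(i + 1).padStart(2, "0")}`);
}

const months = buildMonths(2020);
for (const m of months) {
  // For each m: run the aggregate with { $match: { month: m } } as the
  // first stage and { $out: `dailyvoice_${m}` } as the last stage.
  // (mongo shell call omitted here — this sketch only shows the loop.)
}
console.log(months[0], "->", `dailyvoice_${months[0]}`); // 202001 -> dailyvoice_202001
```

Creating an index on month before running the loop keeps each $match from scanning the whole collection.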