Aggregate $lookup Total size of documents in matching pipeline exceeds maximum document size
Question
I have a pretty simple $lookup aggregation query like the following:
{'$lookup':
{'from': 'edge',
'localField': 'gid',
'foreignField': 'to',
'as': 'from'}}
When I run this on a match with enough documents I get the following error:
Command failed with error 4568: 'Total size of documents in edge
matching { $match: { $and: [ { from: { $eq: "geneDatabase:hugo" }
}, {} ] } } exceeds maximum document size' on server
All attempts to limit the number of documents fail. allowDiskUse: true does nothing. Sending a cursor in does nothing. Adding a $limit into the aggregation also fails.
How can this be?

Then I see the error again. Where did that $match and $and and $eq come from? Is the aggregation pipeline behind the scenes farming out the $lookup call to another aggregation, one it runs on its own that I have no ability to provide limits for or use cursors with?

What is going on here?
Answer
As stated earlier in a comment, the error occurs because when performing the $lookup, which by default produces a target "array" within the parent document from the results of the foreign collection, the total size of the documents selected for that array causes the parent to exceed the 16MB BSON Limit.

The counter for this is to process with an $unwind which immediately follows the $lookup pipeline stage. This actually alters the behavior of $lookup so that instead of producing an array in the parent, the results are instead a "copy" of each parent for every document matched.
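As a minimal sketch of that shape (collection and field names taken from the question's query; this is illustrative, not the asker's full pipeline):

```javascript
// Sketch: the $unwind is placed immediately after the $lookup so that, on
// MongoDB 3.2+, the optimizer can coalesce the two stages. Each matched
// foreign document then streams through as its own "copy" of the parent
// instead of being packed into one array that can blow the 16MB limit.
const pipeline = [
  { $lookup: {
    from: 'edge',
    localField: 'gid',
    foreignField: 'to',
    as: 'from'
  }},
  // Must directly follow the $lookup and target its "as" field
  // for the coalescence to apply.
  { $unwind: '$from' }
];

console.log(JSON.stringify(pipeline[1])); // {"$unwind":"$from"}
```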
Pretty much just like regular usage of $unwind, with the exception that instead of processing as a "separate" pipeline stage, the unwinding action is actually added to the $lookup pipeline operation itself. Ideally you also follow the $unwind with a $match condition, which also creates a matching argument to be added to the $lookup. You can actually see this in the explain output for the pipeline.
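To make that concrete, here is a hedged sketch (reusing the question's field names, with an arbitrary illustrative _id range) of the $match placed right after the $unwind:

```javascript
// Sketch: filters on the unwound field, written after the $unwind, are the
// ones the optimizer can fold into the $lookup as its "matching" argument,
// which is what surfaces as "unwinding"/"matching" in explain output.
const pipeline = [
  { $lookup: {
    from: 'edge',
    localField: 'gid',
    foreignField: 'to',
    as: 'from'
  }},
  { $unwind: '$from' },
  // After coalescence this condition is effectively executed
  // against the foreign collection.
  { $match: { 'from._id': { $gte: 1, $lte: 5 } } }
];
```

Requesting the plan with `.aggregate(pipeline, { explain: true })`, as the listing below does, is how you can inspect the coalesced form.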
$lookup + $unwind Coalescence

New in version 3.2. When a $unwind immediately follows another $lookup, and the $unwind operates on the as field of the $lookup, the optimizer can coalesce the $unwind into the $lookup stage. This avoids creating large intermediate documents.

Best demonstrated with a listing that puts the server under stress by creating "related" documents that would exceed the 16MB BSON limit. Done as briefly as possible to both break and work around the BSON Limit:

const MongoClient = require('mongodb').MongoClient;
const uri = 'mongodb://localhost/test';
function data(data) {
console.log(JSON.stringify(data, undefined, 2))
}
(async function() {
let db;
try {
db = await MongoClient.connect(uri);
console.log('Cleaning....');
// Clean data
await Promise.all(
["source","edge"].map(c => db.collection(c).remove() )
);
console.log('Inserting...')
await db.collection('edge').insertMany(
Array(1000).fill(1).map((e,i) => ({ _id: i+1, gid: 1 }))
);
await db.collection('source').insert({ _id: 1 })
console.log('Fattening up....');
await db.collection('edge').updateMany(
{},
{ $set: { data: "x".repeat(100000) } }
);
// The full pipeline. Failing test uses only the $lookup stage
let pipeline = [
{ $lookup: {
from: 'edge',
localField: '_id',
foreignField: 'gid',
as: 'results'
}},
{ $unwind: '$results' },
{ $match: { 'results._id': { $gte: 1, $lte: 5 } } },
{ $project: { 'results.data': 0 } },
{ $group: { _id: '$_id', results: { $push: '$results' } } }
];
// List and iterate each test case
let tests = [
'Failing.. Size exceeded...',
'Working.. Applied $unwind...',
'Explain output...'
];
for (let [idx, test] of Object.entries(tests)) {
console.log(test);
try {
let currpipe = (( +idx === 0 ) ? pipeline.slice(0,1) : pipeline),
options = (( +idx === tests.length-1 ) ? { explain: true } : {});
await new Promise((end,error) => {
let cursor = db.collection('source').aggregate(currpipe,options);
for ( let [key, value] of Object.entries({ error, end, data }) )
cursor.on(key,value);
});
} catch(e) {
console.error(e);
}
}
} catch(e) {
console.error(e);
} finally {
db.close();
}
})();
After inserting some initial data, the listing will attempt to run an aggregate merely consisting of the $lookup, which will fail with the following error:

{ MongoError: Total size of documents in edge matching pipeline { $match: { $and : [ { gid: { $eq: 1 } }, {} ] } } exceeds maximum document size

Which is basically telling you the BSON limit was exceeded on retrieval.

By contrast, the next attempt adds the $unwind and $match pipeline stages.

Explain output:
pipeline stages {
"$lookup": {
"from": "edge",
"as": "results",
"localField": "_id",
"foreignField": "gid",
"unwinding": { // $unwind now is unwinding
"preserveNullAndEmptyArrays": false
},
"matching": { // $match now is matching
"$and": [ // and actually executed against
{ // the foreign collection
"_id": {
"$gte": 1
}
},
{
"_id": {
"$lte": 5
}
}
]
}
}
},
// $unwind and $match stages removed
{
"$project": {
"results": {
"data": false
}
}
},
{
"$group": {
"_id": "$_id",
"results": {
"$push": "$results"
}
}
}
And that result of course succeeds, because as the results are no longer being placed into the parent document, the BSON limit cannot be exceeded.

This really just happens as a result of adding the $unwind only, but the $match is added for example to show that this is also added into the $lookup stage, and that the overall effect is to "limit" the results returned in an effective way, since it's all done in that $lookup operation and no results other than those matching are actually returned.

By constructing in this way you can query for "referenced data" that would exceed the BSON limit and then, if you want, $group the results back into an array format, once they have been effectively filtered by the "hidden query" that is actually being performed by the $lookup.

As all the content above notes, the BSON Limit is a "hard" limit that you cannot breach, and this is generally why the $unwind is necessary as an interim step. There is however the limitation that the "LEFT JOIN" becomes an "INNER JOIN" by virtue of the $unwind where it cannot preserve the content. Also, even preserveNullAndEmptyArrays would negate the "coalescence" and still leave the intact array, causing the same BSON Limit problem.

MongoDB 3.6 adds new syntax to $lookup that allows a "sub-pipeline" expression to be used in place of the "local" and "foreign" keys. So instead of using the "coalescence" option as demonstrated, as long as the produced array does not also breach the limit, it is possible to put conditions in that pipeline which return the array "intact", and possibly with no matches, as would be indicative of a "LEFT JOIN".

The new expression would then be:

{ "$lookup": {
"from": "edge",
"let": { "gid": "$gid" },
"pipeline": [
{ "$match": {
"_id": { "$gte": 1, "$lte": 5 },
"$expr": { "$eq": [ "$$gid", "$to" ] }
}}
],
"as": "from"
}}
In fact, this is basically what MongoDB is doing "under the covers" with the previous syntax, since 3.6 uses $expr "internally" in order to construct the statement. The difference of course is that there is no "unwinding" option present in how the $lookup actually gets executed.

If no documents are actually produced as a result of the "pipeline" expression, then the target array within the master document will in fact be empty, just as a "LEFT JOIN" actually does, and as would be the normal behavior of $lookup without any other options.

However the output array MUST NOT cause the document where it is being created to exceed the BSON Limit. So it really is up to you to ensure that any "matching" content by the conditions stays under this limit, or the same error will persist, unless of course you actually use $unwind to effect the "INNER JOIN".
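As a closing sketch of that point (assumptions: the question's collection and field names, the large "data" field from the listing above, and an arbitrary cap of 50 documents), one way to keep the sub-pipeline form under the limit is to trim and cap the joined documents inside the "pipeline" itself:

```javascript
// Sketch: a 3.6+ sub-pipeline $lookup that keeps "LEFT JOIN" semantics while
// guarding the array size. $project drops a bulky field and $limit caps how
// many documents land in the array; both the field name ("data") and the
// cutoff (50) are illustrative assumptions, not recommendations.
const lookupStage = {
  $lookup: {
    from: 'edge',
    let: { gid: '$gid' },
    pipeline: [
      { $match: { $expr: { $eq: ['$$gid', '$to'] } } },
      { $project: { data: 0 } },  // drop large fields you do not need
      { $limit: 50 }              // cap the number of joined documents
    ],
    as: 'from'
  }
};
```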