Aggregate $lookup Total size of documents in matching pipeline exceeds maximum document size


Question

I have a pretty simple $lookup aggregation query like the following:

{'$lookup':
 {'from': 'edge',
  'localField': 'gid',
  'foreignField': 'to',
  'as': 'from'}}

When I run this on a match with enough documents I get the following error:

Command failed with error 4568: 'Total size of documents in edge
matching { $match: { $and: [ { from: { $eq: "geneDatabase:hugo" }
}, {} ] } } exceeds maximum document size' on server

All attempts to limit the number of documents fail. allowDiskUse: true does nothing. Sending a cursor in does nothing. Adding in a $limit into the aggregation also fails.

How is that even possible?

Then I see the error again. Where did that $match, $and, and $eq come from? Is the aggregation pipeline behind the scenes farming out the $lookup call to another aggregation, one it runs on its own that I have no ability to provide limits for or use cursors with?

What is going on here?

Answer

As stated earlier in comment, the error occurs because when performing the $lookup which by default produces a target "array" within the parent document from the results of the foreign collection, the total size of documents selected for that array causes the parent to exceed the 16MB BSON Limit.
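A quick back-of-the-envelope check makes the failure mode concrete. The numbers here match the stress listing later in this answer (1,000 foreign documents padded to roughly 100KB each), and are illustrative rather than exact BSON sizes:

```javascript
// Rough arithmetic: why a $lookup result array can breach the 16MB BSON limit.
// Figures mirror the stress listing below; actual BSON sizes include field
// names and type overhead, so the real total is slightly larger still.
const docCount = 1000;             // matching "edge" documents
const docBytes = 100000;           // ~100KB of padding per document
const arrayBytes = docCount * docBytes;
const bsonLimit = 16 * 1024 * 1024; // 16MB hard limit per document

console.log(arrayBytes);           // 100000000 — ~95MB of joined data
console.log(arrayBytes > bsonLimit); // true — the parent document cannot hold it
```

So even though each foreign document is individually small, the single array $lookup builds in the parent is roughly six times over the limit.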

The way to counter this is to add an $unwind immediately following the $lookup pipeline stage. This actually alters the behavior of $lookup so that instead of producing an array in the parent, the results are instead a "copy" of each parent for every document matched.

Pretty much just like regular usage of $unwind, with the exception that instead of processing as a "separate" pipeline stage, the unwinding action is actually added to the $lookup pipeline operation itself. Ideally you also follow the $unwind with a $match condition, which also creates a matching argument to also be added to the $lookup. You can actually see this in the explain output for the pipeline.

$lookup + $unwind Coalescence

New in version 3.2.

When a $unwind immediately follows another $lookup, and the $unwind operates on the as field of the $lookup, the optimizer can coalesce the $unwind into the $lookup stage. This avoids creating large intermediate documents.
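As a minimal sketch of the shape the optimizer looks for, the $unwind must come directly after the $lookup and must unwind the $lookup's "as" field. Collection and field names here ("edge", "results") are illustrative, matching the listing below:

```javascript
// Minimal pipeline shape that qualifies for $lookup + $unwind coalescence.
const pipeline = [
  { $lookup: {
    from: 'edge',
    localField: '_id',
    foreignField: 'gid',
    as: 'results'            // the target array field
  }},
  { $unwind: '$results' },   // must directly follow, unwinding the "as" field
  // A trailing $match on the unwound field can also be hoisted into $lookup:
  { $match: { 'results._id': { $lte: 5 } } }
];

// The unwound path matches the $lookup "as" field, so coalescence applies:
console.log(pipeline[1].$unwind === '$' + pipeline[0].$lookup.as); // true
```

If any other stage sits between the $lookup and the $unwind, the optimizer cannot coalesce them and the intermediate array is materialized in full.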

Best demonstrated with a listing that puts the server under stress by creating "related" documents that would exceed the 16MB BSON limit. Done as briefly as possible to both break and work around the BSON Limit:

const MongoClient = require('mongodb').MongoClient;

const uri = 'mongodb://localhost/test';

function data(data) {
  console.log(JSON.stringify(data, undefined, 2))
}

(async function() {

  let db;

  try {
    db = await MongoClient.connect(uri);

    console.log('Cleaning....');
    // Clean data
    await Promise.all(
      ["source","edge"].map(c => db.collection(c).remove() )
    );

    console.log('Inserting...')

    await db.collection('edge').insertMany(
      Array(1000).fill(1).map((e,i) => ({ _id: i+1, gid: 1 }))
    );
    await db.collection('source').insert({ _id: 1 })

    console.log('Fattening up....');
    await db.collection('edge').updateMany(
      {},
      { $set: { data: "x".repeat(100000) } }
    );

    // The full pipeline. Failing test uses only the $lookup stage
    let pipeline = [
      { $lookup: {
        from: 'edge',
        localField: '_id',
        foreignField: 'gid',
        as: 'results'
      }},
      { $unwind: '$results' },
      { $match: { 'results._id': { $gte: 1, $lte: 5 } } },
      { $project: { 'results.data': 0 } },
      { $group: { _id: '$_id', results: { $push: '$results' } } }
    ];

    // List and iterate each test case
    let tests = [
      'Failing.. Size exceeded...',
      'Working.. Applied $unwind...',
      'Explain output...'
    ];

    for (let [idx, test] of Object.entries(tests)) {
      console.log(test);

      try {
        let currpipe = (( +idx === 0 ) ? pipeline.slice(0,1) : pipeline),
            options = (( +idx === tests.length-1 ) ? { explain: true } : {});

        await new Promise((end,error) => {
          let cursor = db.collection('source').aggregate(currpipe,options);
          for ( let [key, value] of Object.entries({ error, end, data }) )
            cursor.on(key,value);
        });
      } catch(e) {
        console.error(e);
      }

    }

  } catch(e) {
    console.error(e);
  } finally {
    db.close();
  }

})();

After inserting some initial data, the listing will attempt to run an aggregate merely consisting of $lookup which will fail with the following error:

{ MongoError: Total size of documents in edge matching pipeline { $match: { $and : [ { gid: { $eq: 1 } }, {} ] } } exceeds maximum document size

Which is basically telling you the BSON limit was exceeded on retrieval.

By contrast, the next attempt adds the $unwind and $match pipeline stages.

Explain output:

  {
    "$lookup": {
      "from": "edge",
      "as": "results",
      "localField": "_id",
      "foreignField": "gid",
      "unwinding": {                        // $unwind now is unwinding
        "preserveNullAndEmptyArrays": false
      },
      "matching": {                         // $match now is matching
        "$and": [                           // and actually executed against 
          {                                 // the foreign collection
            "_id": {
              "$gte": 1
            }
          },
          {
            "_id": {
              "$lte": 5
            }
          }
        ]
      }
    }
  },
  // $unwind and $match stages removed
  {
    "$project": {
      "results": {
        "data": false
      }
    }
  },
  {
    "$group": {
      "_id": "$_id",
      "results": {
        "$push": "$results"
      }
    }
  }

And that result of course succeeds, because as the results are no longer being placed into the parent document then the BSON limit cannot be exceeded.

This really just happens as a result of adding $unwind alone, but the $match is included to show that it too gets added into the $lookup stage, and that the overall effect is to "limit" the results returned in an effective way, since it is all done inside that $lookup operation and no results other than those matching are actually returned.

By constructing the pipeline in this way you can query for "referenced data" that would otherwise exceed the BSON limit, and then, if desired, $group the results back into an array format once they have been effectively filtered by the "hidden query" that is actually being performed by $lookup.

As all the content above notes, the BSON Limit is a "hard" limit that you cannot breach, and this is generally why the $unwind is necessary as an interim step. There is however the limitation that the "LEFT JOIN" becomes an "INNER JOIN" by virtue of the $unwind where it cannot preserve the content. Also, even preserveNullAndEmptyArrays would negate the "coalescence" and still leave the intact array, causing the same BSON Limit problem.

MongoDB 3.6 adds new syntax to $lookup that allows a "sub-pipeline" expression to be used in place of the "local" and "foreign" keys. So instead of using the "coalescence" option as demonstrated, as long as the produced array does not also breach the limit it is possible to put conditions in that pipeline which returns the array "intact", and possibly with no matches as would be indicative of a "LEFT JOIN".

The new expression would be:

{ "$lookup": {
  "from": "edge",
  "let": { "gid": "$gid" },
  "pipeline": [
    { "$match": {
      "_id": { "$gte": 1, "$lte": 5 },
      "$expr": { "$eq": [ "$$gid", "$to" ] }
    }}          
  ],
  "as": "from"
}}

In fact this would be basically what MongoDB is doing "under the covers" with the previous syntax since 3.6 uses $expr "internally" in order to construct the statement. The difference of course is there is no "unwinding" option present in how the $lookup actually gets executed.

If no documents are actually produced as a result of the "pipeline" expression, then the target array within the master document will in fact be empty, just as a "LEFT JOIN" actually does and would be the normal behavior of $lookup without any other options.

However, the output array MUST NOT cause the document where it is being created to exceed the BSON Limit. So it really is up to you to ensure that any "matching" content stays under that limit by the conditions you supply, or the same error will persist, unless of course you actually use $unwind to effect the "INNER JOIN".
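Combining the two approaches, here is a sketch of the 3.6+ sub-pipeline form followed by the coalesced $unwind. Field names follow the question's "edge" collection; the `_id` bounds are illustrative filter conditions, not part of the join itself:

```javascript
// Sketch: 3.6 sub-pipeline $lookup form, followed by $unwind so the joined
// documents never have to accumulate into one oversized array in the parent.
const pipeline = [
  { $lookup: {
    from: 'edge',
    let: { gid: '$gid' },               // bind the local field for $expr
    pipeline: [
      { $match: {
        _id: { $gte: 1, $lte: 5 },      // illustrative pre-filter
        $expr: { $eq: [ '$$gid', '$to' ] } // the actual join condition
      }}
    ],
    as: 'from'
  }},
  // Optional: unwind for "INNER JOIN" semantics, avoiding a large array
  { $unwind: '$from' }
];

console.log(pipeline[1].$unwind === '$' + pipeline[0].$lookup.as); // true
```

Without the trailing $unwind, an empty "from" array is the "LEFT JOIN" result; with it, unmatched parents are dropped and no single document carries the whole joined array.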
