Aggregate $lookup Total size of documents in matching pipeline exceeds maximum document size

Problem Description

I have a pretty simple $lookup aggregation query like the following:

{'$lookup':
 {'from': 'edge',
  'localField': 'gid',
  'foreignField': 'to',
  'as': 'from'}}

When I run this on a match with enough documents I get the following error:

Command failed with error 4568: 'Total size of documents in edge
matching { $match: { $and: [ { from: { $eq: "geneDatabase:hugo" }
}, {} ] } } exceeds maximum document size' on server

All attempts to limit the number of documents fail. allowDiskUse: true does nothing. Sending a cursor in does nothing. Adding a $limit into the aggregation also fails.
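
For illustration, the shape of those attempts is something like the sketch below; the source collection name 'node' is hypothetical, since the question never names it. Neither option helps because $limit caps the number of parent documents and allowDiskUse only lets pipeline stages spill to disk, while the failure comes from the array the $lookup itself assembles inside a single parent document:

// Hypothetical reconstruction of the failing attempts
db.collection('node').aggregate([
  { '$limit': 1 },                 // caps parent documents, not the joined array
  { '$lookup': {
    'from': 'edge',
    'localField': 'gid',
    'foreignField': 'to',
    'as': 'from' } }
], { 'allowDiskUse': true });      // spills to disk, but the 16MB per-document
                                   // BSON limit still applies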

How can this be?

Then I see the error again. Where did that $match and $and and $eq come from? Is the aggregation pipeline behind the scenes farming out the $lookup call to another aggregation, one it runs on its own that I have no ability to provide limits for or use cursors with?

What is going on here?

Answer

As stated earlier in the comments, the error occurs because the $lookup by default produces a target "array" within the parent document from the results of the foreign collection, and the total size of the documents selected for that array causes the parent to exceed the 16MB BSON limit.
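
To picture why, consider the shape a $lookup result takes; the values below are purely illustrative, but the parent document and the embedded array must together fit in a single 16MB BSON document:

// One parent document after the $lookup from the question; the 'from'
// array holds a copy of every matching 'edge' document
{
  "_id": 1,
  "gid": "geneDatabase:hugo",
  "from": [
    { "_id": 100, "to": "geneDatabase:hugo" },
    { "_id": 101, "to": "geneDatabase:hugo" }
    // ...enough matches and the parent passes 16MB
  ]
}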

The counter for this is to process with an $unwind which immediately follows the $lookup pipeline stage. This actually alters the behavior of $lookup such that instead of producing an array in the parent, the results become a "copy" of the parent for every document matched.
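
Schematically, and again with illustrative values, the same data then flows through as one parent "copy" per joined document, so no single document ever carries the whole array:

// After the coalesced $unwind: each result is an individual document
// well under the 16MB limit
{ "_id": 1, "gid": "geneDatabase:hugo", "from": { "_id": 100, "to": "geneDatabase:hugo" } }
{ "_id": 1, "gid": "geneDatabase:hugo", "from": { "_id": 101, "to": "geneDatabase:hugo" } }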

Pretty much just like regular usage of $unwind, except that instead of processing as a "separate" pipeline stage, the unwinding action is actually added to the $lookup pipeline operation itself. Ideally you also follow the $unwind with a $match condition, which likewise creates a matching argument to be added to the $lookup. You can actually see this in the explain output for the pipeline.

The topic is actually covered (briefly) in the Aggregation Pipeline Optimization section of the core documentation:

$lookup + $unwind Coalescence

New in version 3.2.

When a $unwind immediately follows another $lookup, and the $unwind operates on the as field of the $lookup, the optimizer can coalesce the $unwind into the $lookup stage. This avoids creating large intermediate documents.
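
As a minimal sketch before the full listing, applying that pattern to the query from the question looks like this; because the $unwind immediately follows the $lookup and operates on its as field, the server coalesces the two stages and never materializes the oversized array:

[
  { '$lookup': {
    'from': 'edge',
    'localField': 'gid',
    'foreignField': 'to',
    'as': 'from' } },
  { '$unwind': '$from' }
  // a following $match on 'from.*' fields would also be absorbed into
  // the $lookup as a "matching" argument, as the explain output below shows
]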

This is best demonstrated with a listing that puts the server under stress by creating "related" documents that would exceed the 16MB BSON limit, done as briefly as possible to both break and work around that limit:

const MongoClient = require('mongodb').MongoClient;

const uri = 'mongodb://localhost/test';

function data(data) {
  console.log(JSON.stringify(data, undefined, 2))
}

(async function() {

  let db;

  try {
    db = await MongoClient.connect(uri);

    console.log('Cleaning....');
    // Clean data
    await Promise.all(
      ["source","edge"].map(c => db.collection(c).remove() )
    );

    console.log('Inserting...')

    await db.collection('edge').insertMany(
      Array(1000).fill(1).map((e,i) => ({ _id: i+1, gid: 1 }))
    );
    await db.collection('source').insert({ _id: 1 })

    console.log('Fattening up....');
    await db.collection('edge').updateMany(
      {},
      { $set: { data: "x".repeat(100000) } }
    );

    // The full pipeline. Failing test uses only the $lookup stage
    let pipeline = [
      { $lookup: {
        from: 'edge',
        localField: '_id',
        foreignField: 'gid',
        as: 'results'
      }},
      { $unwind: '$results' },
      { $match: { 'results._id': { $gte: 1, $lte: 5 } } },
      { $project: { 'results.data': 0 } },
      { $group: { _id: '$_id', results: { $push: '$results' } } }
    ];

    // List and iterate each test case
    let tests = [
      'Failing.. Size exceeded...',
      'Working.. Applied $unwind...',
      'Explain output...'
    ];

    for (let [idx, test] of Object.entries(tests)) {
      console.log(test);

      try {
        let currpipe = (( +idx === 0 ) ? pipeline.slice(0,1) : pipeline),
            options = (( +idx === tests.length-1 ) ? { explain: true } : {});

        await new Promise((end,error) => {
          let cursor = db.collection('source').aggregate(currpipe,options);
          for ( let [key, value] of Object.entries({ error, end, data }) )
            cursor.on(key,value);
        });
      } catch(e) {
        console.error(e);
      }

    }

  } catch(e) {
    console.error(e);
  } finally {
    if (db) db.close();   // guard: connect() may have failed before db was set
  }

})();
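
A side note on the listing: it targets the 2.x Node.js driver API. On a 3.x or newer driver (an assumption about your environment), connect() resolves to a MongoClient rather than a Db, and remove()/insert() have been superseded, so a minimal adaptation looks like:

// Sketch for the 3.x+ driver: only the connection handling and the
// deprecated helpers change; the pipeline code itself is identical
const client = await MongoClient.connect(uri);
const db = client.db('test');

await db.collection('edge').deleteMany({});          // instead of remove()
await db.collection('source').insertOne({ _id: 1 }); // instead of insert()

await client.close();                                // close the client, not the db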

After inserting some initial data, the listing will attempt to run an aggregate consisting merely of the $lookup, which will fail with the following error:

{ MongoError: Total size of documents in edge matching pipeline { $match: { $and : [ { gid: { $eq: 1 } }, {} ] } } exceeds maximum document size

Which is basically telling you the BSON limit was exceeded on retrieval.

By contrast, the next attempt adds the $unwind and $match pipeline stages.

Explain output:

  {
    "$lookup": {
      "from": "edge",
      "as": "results",
      "localField": "_id",
      "foreignField": "gid",
      "unwinding": {                        // $unwind now is unwinding
        "preserveNullAndEmptyArrays": false
      },
      "matching": {                         // $match now is matching
        "$and": [                           // and actually executed against 
          {                                 // the foreign collection
            "_id": {
              "$gte": 1
            }
          },
          {
            "_id": {
              "$lte": 5
            }
          }
        ]
      }
    }
  },
  // $unwind and $match stages removed
  {
    "$project": {
      "results": {
        "data": false
      }
    }
  },
  {
    "$group": {
      "_id": "$_id",
      "results": {
        "$push": "$results"
      }
    }
  }

And that result of course succeeds, because as the results are no longer being placed into the parent document, the BSON limit cannot be exceeded.

This really happens as a result of adding the $unwind alone, but the $match is included to demonstrate that it is also added into the $lookup stage, and that the overall effect is to "limit" the results returned in an effective way, since it is all done within that $lookup operation and no results other than those matching are actually returned.

By constructing the pipeline in this way you can query for "referenced data" that would exceed the BSON limit, and then, if you want, $group the results back into an array format once they have been effectively filtered by the "hidden query" that is actually performed by the $lookup.

As all of the above notes, the BSON limit is a "hard" limit that you cannot breach, and this is generally why the $unwind is necessary as an interim step. There is, however, the limitation that the "LEFT JOIN" becomes an "INNER JOIN" by virtue of the $unwind, since it cannot preserve the content. Even preserveNullAndEmptyArrays would negate the "coalescence" and still leave the array intact, causing the same BSON limit problem.

MongoDB 3.6 adds new syntax to $lookup that allows a "sub-pipeline" expression to be used in place of the "local" and "foreign" keys. So instead of using the "coalescence" option as demonstrated, as long as the produced array does not also breach the limit, it is possible to put conditions in that pipeline which return the array "intact", possibly with no matches, as would be indicative of a "LEFT JOIN".

The new expression would then be:

{ "$lookup": {
  "from": "edge",
  "let": { "gid": "$gid" },
  "pipeline": [
    { "$match": {
      "_id": { "$gte": 1, "$lte": 5 },
      "$expr": { "$eq": [ "$$gid", "$to" ] }
    }}          
  ],
  "as": "from"
}}

In fact, this is basically what MongoDB is doing "under the covers" with the previous syntax, since 3.6 uses $expr "internally" in order to construct the statement. The difference of course is that there is no "unwinding" option present in how the $lookup actually gets executed.

If no documents are actually produced as a result of the "pipeline" expression, then the target array within the master document will in fact be empty, just as a "LEFT JOIN" actually behaves, and as would be the normal behavior of $lookup without any other options.

However, the output array MUST NOT cause the document where it is being created to exceed the BSON limit. So it really is up to you to ensure that any "matching" content under the conditions stays below this limit, or the same error will persist, unless of course you actually use $unwind to effect an "INNER JOIN".
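
As a sketch of that last point, here is the 3.6 form adapted to the 'source' and 'edge' collections from the listing above; the $project inside the sub-pipeline is my addition rather than part of the original snippet, trimming the large field so the produced array stays under the limit while the "LEFT JOIN" behavior is preserved:

// Assumes MongoDB 3.6+ and the collections created by the listing
let results = await db.collection('source').aggregate([
  { $lookup: {
    from: 'edge',
    let: { gid: '$_id' },                   // bind the local join key
    pipeline: [
      { $match: {
        _id: { $gte: 1, $lte: 5 },          // narrow before the array is built
        $expr: { $eq: [ '$gid', '$$gid' ] } // the actual join condition
      }},
      { $project: { data: 0 } }             // drop the 100KB filler field
    ],
    as: 'results'
  }}
]).toArray();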
