mongodb按多个字段分组的值 [英] mongodb group values by multiple fields

查看:101
本文介绍了mongodb按多个字段分组的值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

例如,我有以下文件:

{
  "addr": "address1",
  "book": "book1"
},
{
  "addr": "address2",
  "book": "book1"
},
{
  "addr": "address1",
  "book": "book5"
},
{
  "addr": "address3",
  "book": "book9"
},
{
  "addr": "address2",
  "book": "book5"
},
{
  "addr": "address2",
  "book": "book1"
},
{
  "addr": "address1",
  "book": "book1"
},
{
  "addr": "address15",
  "book": "book1"
},
{
  "addr": "address9",
  "book": "book99"
},
{
  "addr": "address90",
  "book": "book33"
},
{
  "addr": "address4",
  "book": "book3"
},
{
  "addr": "address5",
  "book": "book1"
},
{
  "addr": "address77",
  "book": "book11"
},
{
  "addr": "address1",
  "book": "book1"
}

以此类推.


我如何提出一个请求,该请求将描述每个地址的前N个地址和前M个书?

预期结果示例:

address1 | book_1:5
| book_2:10
| book_3:50
|总计:65
______________________
地址2 | book_1:10
| book_2:10
| ...
| book_M:10
|总计:M * 10
...
______________________
地址N | book_1:20
| book_2:20
| ...
| book_M:20
|总计:M * 20

and so on.


How can I make a request, which will describe the top N addresses and the top M books per address?

Example of expected result:

address1 | book_1: 5
| book_2: 10
| book_3: 50
| total: 65
______________________
address2 | book_1: 10
| book_2: 10
|...
| book_M: 10
| total: M*10
...
______________________
addressN | book_1: 20
| book_2: 20
|...
| book_M: 20
| total: M*20

推荐答案

TLDR摘要

在现代的MongoDB版本中,您可以使用 $slice
a>刚好达到基本聚合结果.对于大型"结果,请为每个分组运行并行查询(示例列表位于答案的结尾),或等待
SERVER-9377 进行解析,这将允许对数组中$push的项数进行限制".

TLDR Summary

In modern MongoDB releases you can brute force this with $slice just off the basic aggregation result. For "large" results, run parallel queries instead for each grouping ( a demonstration listing is at the end of the answer ), or wait for SERVER-9377 to resolve, which would allow a "limit" to the number of items to $push to an array.

db.books.aggregate([
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.addr",
        "books": { 
            "$push": { 
                "book": "$_id.book",
                "count": "$bookCount"
            },
        },
        "count": { "$sum": "$bookCount" }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 2 },
    { "$project": {
        "books": { "$slice": [ "$books", 2 ] },
        "count": 1
    }}
])


MongoDB 3.6预览版

仍然无法解析 SERVER-9377 ,但在此版本中表达式的新非相关"选项作为参数而不是"localFields""foreignFields"选项.然后,这允许与另一个管道表达式进行自我联接",在其中我们可以应用 $limit ,以返回"top-n"结果.


MongoDB 3.6 Preview

Still not resolving SERVER-9377, but in this release $lookup allows a new "non-correlated" option which takes an "pipeline" expression as an argument instead of the "localFields" and "foreignFields" options. This then allows a "self-join" with another pipeline expression, in which we can apply $limit in order to return the "top-n" results.

db.books.aggregate([
  { "$group": {
    "_id": "$addr",
    "count": { "$sum": 1 }
  }},
  { "$sort": { "count": -1 } },
  { "$limit": 2 },
  { "$lookup": {
    "from": "books",
    "let": {
      "addr": "$_id"
    },
    "pipeline": [
      { "$match": { 
        "$expr": { "$eq": [ "$addr", "$$addr"] }
      }},
      { "$group": {
        "_id": "$book",
        "count": { "$sum": 1 }
      }},
      { "$sort": { "count": -1  } },
      { "$limit": 2 }
    ],
    "as": "books"
  }}
])

这里的另一个附加功能当然是能够使用 $limit 结果分开.

The other addition here is of course the ability to interpolate the variable through $expr using $match to select the matching items in the "join", but the general premise is a "pipeline within a pipeline" where the inner content can be filtered by matches from the parent. Since they are both "pipelines" themselves we can $limit each result separately.

这将是运行并行查询的第二个最佳选择,并且如果 $match ,并且能够在子管道"处理中使用索引.因此,它没有使用所引用问题所要求的"$push的限制",而是实际上提供了一些效果更好的东西.

This would be the next best option to running parallel queries, and actually would be better if the $match were allowed and able to use an index in the "sub-pipeline" processing. So which is does not use the "limit to $push" as the referenced issue asks, it actually delivers something that should work better.

您似乎偶然发现了最重要的"N"问题.从某种意义上说,您的问题很容易解决,尽管没有您要求的确切限制:

You seem have stumbled upon the top "N" problem. In a way your problem is fairly easy to solve though not with the exact limiting that you ask for:

db.books.aggregate([
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.addr",
        "books": { 
            "$push": { 
                "book": "$_id.book",
                "count": "$bookCount"
            },
        },
        "count": { "$sum": "$bookCount" }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 2 }
])

现在,您会得到如下结果:

Now that will give you a result like this:

{
    "result" : [
            {
                    "_id" : "address1",
                    "books" : [
                            {
                                    "book" : "book4",
                                    "count" : 1
                            },
                            {
                                    "book" : "book5",
                                    "count" : 1
                            },
                            {
                                    "book" : "book1",
                                    "count" : 3
                            }
                    ],
                    "count" : 5
            },
            {
                    "_id" : "address2",
                    "books" : [
                            {
                                    "book" : "book5",
                                    "count" : 1
                            },
                            {
                                    "book" : "book1",
                                    "count" : 2
                            }
                    ],
                    "count" : 3
            }
    ],
    "ok" : 1
}

因此,这与您所要求的有所不同,尽管我们确实获得了地址值的最佳结果,但书"的基本选择不仅限于所需数量的结果.

So this differs from what you are asking in that, while we do get the top results for the address values the underlying "books" selection is not limited to only a required amount of results.

事实证明这很难做到,但是可以做到,尽管复杂度随所需匹配项的数量而增加.为了简单起见,我们最多只能将其保留在2个匹配项中:

This turns out to be very difficult to do, but it can be done though the complexity just increases with the number of items you need to match. To keep it simple we can keep this at 2 matches at most:

db.books.aggregate([
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.addr",
        "books": { 
            "$push": { 
                "book": "$_id.book",
                "count": "$bookCount"
            },
        },
        "count": { "$sum": "$bookCount" }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 2 },
    { "$unwind": "$books" },
    { "$sort": { "count": 1, "books.count": -1 } },
    { "$group": {
        "_id": "$_id",
        "books": { "$push": "$books" },
        "count": { "$first": "$count" }
    }},
    { "$project": {
        "_id": {
            "_id": "$_id",
            "books": "$books",
            "count": "$count"
        },
        "newBooks": "$books"
    }},
    { "$unwind": "$newBooks" },
    { "$group": {
      "_id": "$_id",
      "num1": { "$first": "$newBooks" }
    }},
    { "$project": {
        "_id": "$_id",
        "newBooks": "$_id.books",
        "num1": 1
    }},
    { "$unwind": "$newBooks" },
    { "$project": {
        "_id": "$_id",
        "num1": 1,
        "newBooks": 1,
        "seen": { "$eq": [
            "$num1",
            "$newBooks"
        ]}
    }},
    { "$match": { "seen": false } },
    { "$group":{
        "_id": "$_id._id",
        "num1": { "$first": "$num1" },
        "num2": { "$first": "$newBooks" },
        "count": { "$first": "$_id.count" }
    }},
    { "$project": {
        "num1": 1,
        "num2": 1,
        "count": 1,
        "type": { "$cond": [ 1, [true,false],0 ] }
    }},
    { "$unwind": "$type" },
    { "$project": {
        "books": { "$cond": [
            "$type",
            "$num1",
            "$num2"
        ]},
        "count": 1
    }},
    { "$group": {
        "_id": "$_id",
        "count": { "$first": "$count" },
        "books": { "$push": "$books" }
    }},
    { "$sort": { "count": -1 } }
])

因此,实际上这将为您提供来自前两个地址"条目的前2个书".

So that will actually give you the top 2 "books" from the top two "address" entries.

但是,为了我的钱,请保留第一种形式,然后简单地切片"返回第一个"N"个元素的数组元素.

But for my money, stay with the first form and then simply "slice" the elements of the array that are returned to take the first "N" elements.

该演示代码适用于v8.x和v10.x发行版的NodeJS的当前LTS版本.这主要是针对async/await语法的,但是在通用流程中确实没有任何这样的限制,并且几乎没有改动即可适应普通的承诺,甚至可以恢复至普通的回调实现.

The demonstration code is appropriate for usage with current LTS versions of NodeJS from v8.x and v10.x releases. That's mostly for the async/await syntax, but there is nothing really within the general flow that has any such restriction, and adapts with little alteration to plain promises or even back to plain callback implementation.

index.js

const { MongoClient } = require('mongodb');
const fs = require('mz/fs');

const uri = 'mongodb://localhost:27017';

const log = data => console.log(JSON.stringify(data, undefined, 2));

(async function() {

  try {
    const client = await MongoClient.connect(uri);

    const db = client.db('bookDemo');
    const books = db.collection('books');

    let { version } = await db.command({ buildInfo: 1 });
    version = parseFloat(version.match(new RegExp(/(?:(?!-).)*/))[0]);

    // Clear and load books
    await books.deleteMany({});

    await books.insertMany(
      (await fs.readFile('books.json'))
        .toString()
        .replace(/\n$/,"")
        .split("\n")
        .map(JSON.parse)
    );

    if ( version >= 3.6 ) {

    // Non-correlated pipeline with limits
      let result = await books.aggregate([
        { "$group": {
          "_id": "$addr",
          "count": { "$sum": 1 }
        }},
        { "$sort": { "count": -1 } },
        { "$limit": 2 },
        { "$lookup": {
          "from": "books",
          "as": "books",
          "let": { "addr": "$_id" },
          "pipeline": [
            { "$match": {
              "$expr": { "$eq": [ "$addr", "$$addr" ] }
            }},
            { "$group": {
              "_id": "$book",
              "count": { "$sum": 1 },
            }},
            { "$sort": { "count": -1 } },
            { "$limit": 2 }
          ]
        }}
      ]).toArray();

      log({ result });
    }

    // Serial result procesing with parallel fetch

    // First get top addr items
    let topaddr = await books.aggregate([
      { "$group": {
        "_id": "$addr",
        "count": { "$sum": 1 }
      }},
      { "$sort": { "count": -1 } },
      { "$limit": 2 }
    ]).toArray();

    // Run parallel top books for each addr
    let topbooks = await Promise.all(
      topaddr.map(({ _id: addr }) =>
        books.aggregate([
          { "$match": { addr } },
          { "$group": {
            "_id": "$book",
            "count": { "$sum": 1 }
          }},
          { "$sort": { "count": -1 } },
          { "$limit": 2 }
        ]).toArray()
      )
    );

    // Merge output
    topaddr = topaddr.map((d,i) => ({ ...d, books: topbooks[i] }));
    log({ topaddr });

    client.close();

  } catch(e) {
    console.error(e)
  } finally {
    process.exit()
  }

})()

books.json

{ "addr": "address1",  "book": "book1"  }
{ "addr": "address2",  "book": "book1"  }
{ "addr": "address1",  "book": "book5"  }
{ "addr": "address3",  "book": "book9"  }
{ "addr": "address2",  "book": "book5"  }
{ "addr": "address2",  "book": "book1"  }
{ "addr": "address1",  "book": "book1"  }
{ "addr": "address15", "book": "book1"  }
{ "addr": "address9",  "book": "book99" }
{ "addr": "address90", "book": "book33" }
{ "addr": "address4",  "book": "book3"  }
{ "addr": "address5",  "book": "book1"  }
{ "addr": "address77", "book": "book11" }
{ "addr": "address1",  "book": "book1"  }

这篇关于mongodb按多个字段分组的值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆