mongodb 按多个字段对值进行分组 [英] mongodb group values by multiple fields

查看:287
本文介绍了mongodb 按多个字段对值进行分组的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

例如,我有这些文件:

{
  "addr": "address1",
  "book": "book1"
},
{
  "addr": "address2",
  "book": "book1"
},
{
  "addr": "address1",
  "book": "book5"
},
{
  "addr": "address3",
  "book": "book9"
},
{
  "addr": "address2",
  "book": "book5"
},
{
  "addr": "address2",
  "book": "book1"
},
{
  "addr": "address1",
  "book": "book1"
},
{
  "addr": "address15",
  "book": "book1"
},
{
  "addr": "address9",
  "book": "book99"
},
{
  "addr": "address90",
  "book": "book33"
},
{
  "addr": "address4",
  "book": "book3"
},
{
  "addr": "address5",
  "book": "book1"
},
{
  "addr": "address77",
  "book": "book11"
},
{
  "addr": "address1",
  "book": "book1"
}

等等.


我怎样才能提出请求,描述前N个地址和每个地址前M本书?

预期结果示例:

地址1 |book_1: 5
|book_2:10
|book_3:50
|总计:65
______________________
address2 |book_1:10
|book_2: 10
|...
|book_M: 10
|总计:M*10
...
____________
地址N |book_1: 20
|book_2:20
|...
|book_M: 20
|总计:M*20

and so on.


How can I make a request, which will describe the top N addresses and the top M books per address?

Example of expected result:

address1 | book_1: 5
| book_2: 10
| book_3: 50
| total: 65
______________________
address2 | book_1: 10
| book_2: 10
|...
| book_M: 10
| total: M*10
...
______________________
addressN | book_1: 20
| book_2: 20
|...
| book_M: 20
| total: M*20

推荐答案

TLDR 摘要

在现代 MongoDB 版本中,您可以使用 $slice 强制执行此操作 只是基本聚合结果.对于大"结果,而不是为每个分组运行并行查询(演示列表在答案的末尾),或者等待 SERVER-9377 解决,这将允许限制"到 $push 到数组的项目数.

TLDR Summary

In modern MongoDB releases you can brute force this with $slice just off the basic aggregation result. For "large" results, run parallel queries instead for each grouping ( a demonstration listing is at the end of the answer ), or wait for SERVER-9377 to resolve, which would allow a "limit" to the number of items to $push to an array.

db.books.aggregate([
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.addr",
        "books": { 
            "$push": { 
                "book": "$_id.book",
                "count": "$bookCount"
            },
        },
        "count": { "$sum": "$bookCount" }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 2 },
    { "$project": {
        "books": { "$slice": [ "$books", 2 ] },
        "count": 1
    }}
])


MongoDB 3.6 预览版

仍然无法解析 SERVER-9377,但在此版本中 $lookup 允许一个新的非相关"选项采用 pipeline" 表达式作为参数,而不是 localFields"foreignFields" 选项.这然后允许自加入"连接.使用另一个管道表达式,我们可以在其中应用 $limit 以返回top-n"结果.


MongoDB 3.6 Preview

Still not resolving SERVER-9377, but in this release $lookup allows a new "non-correlated" option which takes an "pipeline" expression as an argument instead of the "localFields" and "foreignFields" options. This then allows a "self-join" with another pipeline expression, in which we can apply $limit in order to return the "top-n" results.

db.books.aggregate([
  { "$group": {
    "_id": "$addr",
    "count": { "$sum": 1 }
  }},
  { "$sort": { "count": -1 } },
  { "$limit": 2 },
  { "$lookup": {
    "from": "books",
    "let": {
      "addr": "$_id"
    },
    "pipeline": [
      { "$match": { 
        "$expr": { "$eq": [ "$addr", "$$addr"] }
      }},
      { "$group": {
        "_id": "$book",
        "count": { "$sum": 1 }
      }},
      { "$sort": { "count": -1  } },
      { "$limit": 2 }
    ],
    "as": "books"
  }}
])

这里的另一个添加当然是使用 $match 来选择join"中的匹配项,但一般前提是pipeline in a pipeline";其中内部内容可以通过来自父级的匹配进行过滤.由于它们都是管道"我们可以自己$limit 每个结果分开.

The other addition here is of course the ability to interpolate the variable through $expr using $match to select the matching items in the "join", but the general premise is a "pipeline within a pipeline" where the inner content can be filtered by matches from the parent. Since they are both "pipelines" themselves we can $limit each result separately.

这将是运行并行查询的下一个最佳选择,如果 $match 被允许并且能够在子管道"中使用索引.加工.所以哪个不使用限制$push"?正如所引用的问题所要求的那样,它实际上提供了一些应该更好地工作的东西.

This would be the next best option to running parallel queries, and actually would be better if the $match were allowed and able to use an index in the "sub-pipeline" processing. So which is does not use the "limit to $push" as the referenced issue asks, it actually delivers something that should work better.

您似乎偶然发现了最上面的N"个字.问题.在某种程度上,您的问题很容易解决,尽管没有您要求的确切限制:

You seem have stumbled upon the top "N" problem. In a way your problem is fairly easy to solve though not with the exact limiting that you ask for:

db.books.aggregate([
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.addr",
        "books": { 
            "$push": { 
                "book": "$_id.book",
                "count": "$bookCount"
            },
        },
        "count": { "$sum": "$bookCount" }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 2 }
])

现在会给你这样的结果:

Now that will give you a result like this:

{
    "result" : [
            {
                    "_id" : "address1",
                    "books" : [
                            {
                                    "book" : "book4",
                                    "count" : 1
                            },
                            {
                                    "book" : "book5",
                                    "count" : 1
                            },
                            {
                                    "book" : "book1",
                                    "count" : 3
                            }
                    ],
                    "count" : 5
            },
            {
                    "_id" : "address2",
                    "books" : [
                            {
                                    "book" : "book5",
                                    "count" : 1
                            },
                            {
                                    "book" : "book1",
                                    "count" : 2
                            }
                    ],
                    "count" : 3
            }
    ],
    "ok" : 1
}

所以这与您所要求的不同,虽然我们确实获得了基础书籍"地址值的最高结果.选择不仅限于所需数量的结果.

So this differs from what you are asking in that, while we do get the top results for the address values the underlying "books" selection is not limited to only a required amount of results.

事实证明这很难做到,但可以做到,尽管复杂性会随着您需要匹配的项目数量而增加.为简单起见,我们最多可以将其保留为 2 场比赛:

This turns out to be very difficult to do, but it can be done though the complexity just increases with the number of items you need to match. To keep it simple we can keep this at 2 matches at most:

db.books.aggregate([
    { "$group": {
        "_id": {
            "addr": "$addr",
            "book": "$book"
        },
        "bookCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.addr",
        "books": { 
            "$push": { 
                "book": "$_id.book",
                "count": "$bookCount"
            },
        },
        "count": { "$sum": "$bookCount" }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 2 },
    { "$unwind": "$books" },
    { "$sort": { "count": 1, "books.count": -1 } },
    { "$group": {
        "_id": "$_id",
        "books": { "$push": "$books" },
        "count": { "$first": "$count" }
    }},
    { "$project": {
        "_id": {
            "_id": "$_id",
            "books": "$books",
            "count": "$count"
        },
        "newBooks": "$books"
    }},
    { "$unwind": "$newBooks" },
    { "$group": {
      "_id": "$_id",
      "num1": { "$first": "$newBooks" }
    }},
    { "$project": {
        "_id": "$_id",
        "newBooks": "$_id.books",
        "num1": 1
    }},
    { "$unwind": "$newBooks" },
    { "$project": {
        "_id": "$_id",
        "num1": 1,
        "newBooks": 1,
        "seen": { "$eq": [
            "$num1",
            "$newBooks"
        ]}
    }},
    { "$match": { "seen": false } },
    { "$group":{
        "_id": "$_id._id",
        "num1": { "$first": "$num1" },
        "num2": { "$first": "$newBooks" },
        "count": { "$first": "$_id.count" }
    }},
    { "$project": {
        "num1": 1,
        "num2": 1,
        "count": 1,
        "type": { "$cond": [ 1, [true,false],0 ] }
    }},
    { "$unwind": "$type" },
    { "$project": {
        "books": { "$cond": [
            "$type",
            "$num1",
            "$num2"
        ]},
        "count": 1
    }},
    { "$group": {
        "_id": "$_id",
        "count": { "$first": "$count" },
        "books": { "$push": "$books" }
    }},
    { "$sort": { "count": -1 } }
])

所以这实际上会给你排名前 2 的书籍"从前两个地址"开始条目.

So that will actually give you the top 2 "books" from the top two "address" entries.

但为了我的钱,保持第一种形式,然后简单地切片";返回的数组元素取第一个N"元素.

But for my money, stay with the first form and then simply "slice" the elements of the array that are returned to take the first "N" elements.

演示代码适用于 v8.x 和 v10.x 版本的当前 LTS 版本的 NodeJS.这主要是针对 async/await 语法的,但在一般流程中没有任何真正有任何此类限制的内容,并且几乎不改变普通承诺甚至返回普通回调实现.

The demonstration code is appropriate for usage with current LTS versions of NodeJS from v8.x and v10.x releases. That's mostly for the async/await syntax, but there is nothing really within the general flow that has any such restriction, and adapts with little alteration to plain promises or even back to plain callback implementation.

index.js

const { MongoClient } = require('mongodb');
const fs = require('mz/fs');

const uri = 'mongodb://localhost:27017';

const log = data => console.log(JSON.stringify(data, undefined, 2));

(async function() {

  try {
    const client = await MongoClient.connect(uri);

    const db = client.db('bookDemo');
    const books = db.collection('books');

    let { version } = await db.command({ buildInfo: 1 });
    version = parseFloat(version.match(new RegExp(/(?:(?!-).)*/))[0]);

    // Clear and load books
    await books.deleteMany({});

    await books.insertMany(
      (await fs.readFile('books.json'))
        .toString()
        .replace(/
$/,"")
        .split("
")
        .map(JSON.parse)
    );

    if ( version >= 3.6 ) {

    // Non-correlated pipeline with limits
      let result = await books.aggregate([
        { "$group": {
          "_id": "$addr",
          "count": { "$sum": 1 }
        }},
        { "$sort": { "count": -1 } },
        { "$limit": 2 },
        { "$lookup": {
          "from": "books",
          "as": "books",
          "let": { "addr": "$_id" },
          "pipeline": [
            { "$match": {
              "$expr": { "$eq": [ "$addr", "$$addr" ] }
            }},
            { "$group": {
              "_id": "$book",
              "count": { "$sum": 1 },
            }},
            { "$sort": { "count": -1 } },
            { "$limit": 2 }
          ]
        }}
      ]).toArray();

      log({ result });
    }

    // Serial result procesing with parallel fetch

    // First get top addr items
    let topaddr = await books.aggregate([
      { "$group": {
        "_id": "$addr",
        "count": { "$sum": 1 }
      }},
      { "$sort": { "count": -1 } },
      { "$limit": 2 }
    ]).toArray();

    // Run parallel top books for each addr
    let topbooks = await Promise.all(
      topaddr.map(({ _id: addr }) =>
        books.aggregate([
          { "$match": { addr } },
          { "$group": {
            "_id": "$book",
            "count": { "$sum": 1 }
          }},
          { "$sort": { "count": -1 } },
          { "$limit": 2 }
        ]).toArray()
      )
    );

    // Merge output
    topaddr = topaddr.map((d,i) => ({ ...d, books: topbooks[i] }));
    log({ topaddr });

    client.close();

  } catch(e) {
    console.error(e)
  } finally {
    process.exit()
  }

})()

books.json

{ "addr": "address1",  "book": "book1"  }
{ "addr": "address2",  "book": "book1"  }
{ "addr": "address1",  "book": "book5"  }
{ "addr": "address3",  "book": "book9"  }
{ "addr": "address2",  "book": "book5"  }
{ "addr": "address2",  "book": "book1"  }
{ "addr": "address1",  "book": "book1"  }
{ "addr": "address15", "book": "book1"  }
{ "addr": "address9",  "book": "book99" }
{ "addr": "address90", "book": "book33" }
{ "addr": "address4",  "book": "book3"  }
{ "addr": "address5",  "book": "book1"  }
{ "addr": "address77", "book": "book11" }
{ "addr": "address1",  "book": "book1"  }

这篇关于mongodb 按多个字段对值进行分组的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆