How to group by different fields


Problem Description

I want to find all users named 'Hans' and aggregate their 'age' and number of 'childs' by grouping them. Assume I have the following in my database 'users'.

{
    "_id" : "01",
    "user" : "Hans",
    "age" : "50",
    "childs" : "2"
}
{
    "_id" : "02",
    "user" : "Hans",
    "age" : "40",
    "childs" : "2"
}
{
    "_id" : "03",
    "user" : "Fritz",
    "age" : "40",
    "childs" : "2"
}
{
    "_id" : "04",
    "user" : "Hans",
    "age" : "40",
    "childs" : "1"
}
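For reference, the sample documents can be loaded into a test collection like this from the mongo shell (the collection name `docs` matches the aggregation examples in the answer; `insertMany` assumes MongoDB 3.2+):

```js
db.docs.insertMany([
    { "_id": "01", "user": "Hans",  "age": "50", "childs": "2" },
    { "_id": "02", "user": "Hans",  "age": "40", "childs": "2" },
    { "_id": "03", "user": "Fritz", "age": "40", "childs": "2" },
    { "_id": "04", "user": "Hans",  "age": "40", "childs": "1" }
])
```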

The result should look like this:

"result" : 
[
  { 
    "age" : 
      [
        {
          "value" : "50",
          "count" : "1"
        },
        {
          "value" : "40",
          "count" : "2"
        }
      ]
  },
  { 
    "childs" : 
      [
        {
          "value" : "2",
          "count" : "2"
        },
        {
          "value" : "1",
          "count" : "1"
        }
      ]
  }  
]

How can I achieve this?

Recommended Answer

This should almost be a MongoDB FAQ, mostly because it is a real example concept of how you should be altering your thinking from SQL processing and embracing what engines like MongoDB do.

The basic principle here is "MongoDB does not do joins". Any way of "envisioning" how you would construct SQL to do this essentially requires a "join" operation. The typical form is "UNION" which is in fact a "join".

So how to do it under a different paradigm? Well first, let's look at how *not* to do it, and understand the reasons why, even though it will of course work for your very small sample:

db.docs.aggregate([
    { "$group": {
        "_id": null,
        "age": { "$push": "$age" },
        "childs": { "$push": "$childs" }
    }},
    { "$unwind": "$age" },
    { "$group": {
        "_id": "$age",
        "count": { "$sum": 1  },
        "childs": { "$first": "$childs" }
    }},
    { "$sort": { "_id": -1 } },
    { "$group": {
        "_id": null,
        "age": { "$push": {
            "value": "$_id",
            "count": "$count"
        }},
        "childs": { "$first": "$childs" }
    }},
    { "$unwind": "$childs" },
    { "$group": {
        "_id": "$childs",
        "count": { "$sum": 1 },
        "age": { "$first": "$age" }
    }},
    { "$sort": { "_id": -1 } },
    { "$group": {
        "_id": null,
        "age": { "$first": "$age" },
        "childs": { "$push": {
            "value": "$_id",
            "count": "$count"
        }}
    }}
])

That will give you a result like this:

{
    "_id" : null,
    "age" : [
            {
                    "value" : "50",
                    "count" : 1
            },
            {
                    "value" : "40",
                    "count" : 3
            }
    ],
    "childs" : [
            {
                    "value" : "2",
                    "count" : 3
            },
            {
                    "value" : "1",
                    "count" : 1
            }
    ]
}

So why is this bad? The main problem should be apparent in the very first pipeline stage:

    { "$group": {
        "_id": null,
        "age": { "$push": "$age" },
        "childs": { "$push": "$childs" }
    }},

What we asked to do here is group up everything in the collection for the values we want and $push those results into an array. When things are small this works, but real-world collections would produce a "single document" in the pipeline that exceeds the allowed 16MB BSON limit. That is what is bad.

The rest of the logic follows the natural course by working with each array. But of course real world scenarios would almost always make this untenable.

You could avoid this somewhat by doing things like "duplicating" the documents as being of type "age" or "childs" and grouping the documents individually by type. But it's all a bit too "over-complex" and not a solid way of doing things.
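For completeness, that "duplicate by type" workaround could be sketched roughly like this (a sketch only, assuming MongoDB 3.2+ for array literals in $project, and the `docs` collection used elsewhere in this answer):

```js
db.docs.aggregate([
    // Fan each document out into one { type, value } entry per field
    { "$project": {
        "entries": [
            { "type": "age",    "value": "$age" },
            { "type": "childs", "value": "$childs" }
        ]
    }},
    { "$unwind": "$entries" },
    // Count occurrences of each distinct value, per type
    { "$group": {
        "_id": { "type": "$entries.type", "value": "$entries.value" },
        "count": { "$sum": 1 }
    }},
    // Collect the counts back into one array per type
    { "$group": {
        "_id": "$_id.type",
        "values": { "$push": { "value": "$_id.value", "count": "$count" } }
    }}
])
```

Every document still passes through the pipeline twice over, which is why the parallel-query approach described next scales better.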

The natural response is "what about a UNION?", but since MongoDB does not do the "join" then how to approach that?

Your best approach here, both architecturally and performance-wise, is to simply submit "both" queries (yes, two) in "parallel" to the server via your client API. As the results are received you then "combine" them into a single response, which you can then send back as a source of data to your eventual "client" application.

Different languages have different approaches to this, but the general case is to look for an "asynchronous processing" API that allows you to do this in tandem.

My example here uses node.js, since the "asynchronous" side is basically "built in" and reasonably intuitive to follow. The "combination" side of things can be any type of "hash/map/dict" table implementation; this just does it the simple way, for example only:

var async = require('async'),
    MongoClient = require('mongodb').MongoClient;

MongoClient.connect('mongodb://localhost/test',function(err,db) {

  var collection = db.collection('docs');

  async.parallel(
    [
      function(callback) {
        collection.aggregate(
          [
            { "$group": {
              "_id": "$age",
              "type": { "$first": { "$literal": "age" } },
              "count": { "$sum": 1 }
            }},
            { "$sort": { "_id": -1 } }
          ],
          callback
        );
      },
      function(callback) {
        collection.aggregate(
          [
            { "$group": {
              "_id": "$childs",
              "type": { "$first": { "$literal": "childs" } },
              "count": { "$sum": 1 }
            }},
            { "$sort": { "_id": -1 } }

          ],
          callback
        );
      }
    ],
    function(err,results) {
      if (err) throw err;
      var response = {};
      results.forEach(function(res) {
        res.forEach(function(doc) {
          if ( !response.hasOwnProperty(doc.type) )
            response[doc.type] = [];

          response[doc.type].push({
            "value": doc._id,
            "count": doc.count
          });
        });
      });

      console.log( JSON.stringify( response, null, 2 ) );
    }
  );
});

Which gives the cute result:

{
  "age": [
    {
      "value": "50",
      "count": 1
    },
    {
      "value": "40",
      "count": 3
    }
  ],
  "childs": [
    {
      "value": "2",
      "count": 3
    },
    {
      "value": "1",
      "count": 1
    }
  ]
}

So the key thing to note here is that the "separate" aggregation statements themselves are actually quite simple. The only thing you face is combining those in your final result. There are many approaches to "combining", particularly to deal with large results from each of the queries, but this is the basic example of the execution model.

The key points here:

  • Shuffling data in the aggregation pipeline is possible but not performant for large data sets.

  • Use a language implementation and API that support "parallel" and "asynchronous" execution so you can "load up" all or "most" of your operations at once.

  • The API should support some method of "combination", or otherwise allow a separate "stream" write to process each result set received into one.

  • Forget about the SQL way. The NoSQL way delegates the processing of things such as "joins" to your "data logic layer", which contains code as shown here. It does it this way because it is scalable to very large datasets. In large applications it is the job of your "data logic" handling nodes to deliver this to the end API.

This is fast compared to any other form of "wrangling" I could possibly describe. Part of "NoSQL" thinking is to "Unlearn what you have learned" and look at things a different way. And if that way doesn't perform better, then stick with the SQL approach for storage and query.

That's why alternatives exist.

