How to group by different fields


Problem Description


I want to find all users named 'Hans' and aggregate their 'age' and number of 'childs' by grouping them. Assuming I have following in my database 'users'.

{
    "_id" : "01",
    "user" : "Hans",
    "age" : "50",
    "childs" : "2"
}
{
    "_id" : "02",
    "user" : "Hans",
    "age" : "40",
    "childs" : "2"
}
{
    "_id" : "03",
    "user" : "Fritz",
    "age" : "40",
    "childs" : "2"
}
{
    "_id" : "04",
    "user" : "Hans",
    "age" : "40",
    "childs" : "1"
}

The result should be something like this:

"result" : 
[
  { 
    "age" : 
      [
        {
          "value" : "50",
          "count" : "1"
        },
        {
          "value" : "40",
          "count" : "2"
        }
      ]
  },
  { 
    "childs" : 
      [
        {
          "value" : "2",
          "count" : "2"
        },
        {
          "value" : "1",
          "count" : "1"
        }
      ]
  }  
]

How can I achieve this?

Solution

This should almost be a MongoDB FAQ, mostly because it is a real example concept of how you should be altering your thinking from SQL processing and embracing what engines like MongoDB do.

The basic principle here is "MongoDB does not do joins". Any way of "envisioning" how you would construct SQL to do this essentially requires a "join" operation. The typical form is "UNION" which is in fact a "join".

So how do you do it under a different paradigm? First, let's look at how not to do it and understand why, even though it will of course work for your very small sample:

The Hard Way

db.docs.aggregate([
    { "$group": {
        "_id": null,
        "age": { "$push": "$age" },
        "childs": { "$push": "$childs" }
    }},
    { "$unwind": "$age" },
    { "$group": {
        "_id": "$age",
        "count": { "$sum": 1  },
        "childs": { "$first": "$childs" }
    }},
    { "$sort": { "_id": -1 } },
    { "$group": {
        "_id": null,
        "age": { "$push": {
            "value": "$_id",
            "count": "$count"
        }},
        "childs": { "$first": "$childs" }
    }},
    { "$unwind": "$childs" },
    { "$group": {
        "_id": "$childs",
        "count": { "$sum": 1 },
        "age": { "$first": "$age" }
    }},
    { "$sort": { "_id": -1 } },
    { "$group": {
        "_id": null,
        "age": { "$first": "$age" },
        "childs": { "$push": {
            "value": "$_id",
            "count": "$count"
        }}
    }}
])

That will give you a result like this:

{
    "_id" : null,
    "age" : [
            {
                    "value" : "50",
                    "count" : 1
            },
            {
                    "value" : "40",
                    "count" : 3
            }
    ],
    "childs" : [
            {
                    "value" : "2",
                    "count" : 3
            },
            {
                    "value" : "1",
                    "count" : 1
            }
    ]
}

So why is this bad? The main problem should be apparent in the very first pipeline stage:

    { "$group": {
        "_id": null,
        "age": { "$push": "$age" },
        "childs": { "$push": "$childs" }
    }},

What we asked to do here is group up everything in the collection for the values we want and $push those results into an array. When things are small then this works, but real world collections would result in this "single document" in the pipeline that exceeds the 16MB BSON limit that is allowed. That is what is bad.
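To make that concrete, here is an in-memory analogue of that first pipeline stage — plain JavaScript for illustration only, not driver code — using the four sample documents from the question:

```javascript
// In-memory analogue of the first "$group" stage: every "age" and "childs"
// value from the whole collection is pushed into one accumulator object.
var docs = [
  { _id: "01", user: "Hans", age: "50", childs: "2" },
  { _id: "02", user: "Hans", age: "40", childs: "2" },
  { _id: "03", user: "Fritz", age: "40", childs: "2" },
  { _id: "04", user: "Hans", age: "40", childs: "1" }
];

var giant = docs.reduce(function (acc, d) {
  acc.age.push(d.age);
  acc.childs.push(d.childs);
  return acc;
}, { _id: null, age: [], childs: [] });

// Four documents make a tiny object, but both arrays grow linearly with the
// collection, while a single BSON document is capped at 16MB.
console.log(giant.age.length, giant.childs.length); // 4 4
```

With millions of documents, `giant` is the "single document" that blows past the limit.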

The rest of the logic follows the natural course by working with each array. But of course real world scenarios would almost always make this untenable.

You could avoid this somewhat by doing things like "duplicating" the documents to carry a "type" of either "age" or "childs" and grouping the documents individually by type. But it's all a bit too "over complex" and not a solid way of doing things.
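For illustration, here is a hedged in-memory sketch of that "duplicate by type" idea: each document is split into one `{ type, value }` record per field, and a single pass then groups by type plus value. This is plain JavaScript standing in for the pipeline, not an actual aggregation:

```javascript
// Each document is "duplicated" into one record per field we want to count.
var docs = [
  { _id: "01", user: "Hans", age: "50", childs: "2" },
  { _id: "02", user: "Hans", age: "40", childs: "2" },
  { _id: "03", user: "Fritz", age: "40", childs: "2" },
  { _id: "04", user: "Hans", age: "40", childs: "1" }
];

var records = [];
docs.forEach(function (d) {
  records.push({ type: "age", value: d.age });
  records.push({ type: "childs", value: d.childs });
});

// Group by type + value in one pass.
var counts = {};
records.forEach(function (r) {
  var key = r.type + ":" + r.value;
  counts[key] = (counts[key] || 0) + 1;
});

console.log(counts);
// { 'age:50': 1, 'childs:2': 3, 'age:40': 3, 'childs:1': 1 }
```

It gets the numbers, but note the same growth problem applies once the "records" live inside the pipeline instead of your own code.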

The natural response is "what about a UNION?", but since MongoDB does not do the "join" then how to approach that?


A Better Way ( aka A New Hope )

Your best approach here, both architecturally and performance-wise, is to simply submit "both" queries ( yes, two ) in "parallel" to the server via your client API. As the results are received you then "combine" them into a single response you can then send back as a source of data to your eventual "client" application.

Different languages have different approaches to this, but the general case is to look for an "asynchronous processing" API that allows you to do this in tandem.

My example here uses node.js, as the "asynchronous" side is basically "built in" and reasonably intuitive to follow. The "combination" side of things can be any type of "hash/map/dict" table implementation; this just does it the simple way, for example only:

var async = require('async'),
    MongoClient = require('mongodb').MongoClient;

MongoClient.connect('mongodb://localhost/test',function(err,db) {

  var collection = db.collection('docs');

  async.parallel(
    [
      function(callback) {
        collection.aggregate(
          [
            { "$group": {
              "_id": "$age",
              "type": { "$first": { "$literal": "age" } },
              "count": { "$sum": 1 }
            }},
            { "$sort": { "_id": -1 } }
          ],
          callback
        );
      },
      function(callback) {
        collection.aggregate(
          [
            { "$group": {
              "_id": "$childs",
              "type": { "$first": { "$literal": "childs" } },
              "count": { "$sum": 1 }
            }},
            { "$sort": { "_id": -1 } }

          ],
          callback
        );
      }
    ],
    function(err,results) {
      if (err) throw err;
      var response = {};
      results.forEach(function(res) {
        res.forEach(function(doc) {
          if ( !response.hasOwnProperty(doc.type) )
            response[doc.type] = [];

          response[doc.type].push({
            "value": doc._id,
            "count": doc.count
          });
        });
      });

      console.log( JSON.stringify( response, null, 2 ) );
    }
  );
});

Which gives the cute result:

{
  "age": [
    {
      "value": "50",
      "count": 1
    },
    {
      "value": "40",
      "count": 3
    }
  ],
  "childs": [
    {
      "value": "2",
      "count": 3
    },
    {
      "value": "1",
      "count": 1
    }
  ]
}

So the key thing to note here is that the "separate" aggregation statements themselves are actually quite simple. The only thing you face is combining those in your final result. There are many approaches to "combining", particularly to deal with large results from each of the queries, but this is the basic example of the execution model.
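As a hedged aside: newer MongoDB drivers also expose Promise-based results, so `Promise.all` can stand in for `async.parallel`. The sketch below stubs the two aggregate calls with canned results — the stub functions and their data are assumptions, here only so the combination step can be seen in isolation:

```javascript
// Combination step factored out, so it works no matter how the two result
// sets were fetched.
function combine(results) {
  var response = {};
  results.forEach(function (res) {
    res.forEach(function (doc) {
      if (!response.hasOwnProperty(doc.type)) response[doc.type] = [];
      response[doc.type].push({ value: doc._id, count: doc.count });
    });
  });
  return response;
}

// Stubs standing in for the two aggregate queries; each resolves with the
// shape those pipelines produce. Promise.all runs both "in parallel".
var ageQuery = function () {
  return Promise.resolve([
    { _id: "50", type: "age", count: 1 },
    { _id: "40", type: "age", count: 3 }
  ]);
};
var childsQuery = function () {
  return Promise.resolve([
    { _id: "2", type: "childs", count: 3 },
    { _id: "1", type: "childs", count: 1 }
  ]);
};

Promise.all([ageQuery(), childsQuery()]).then(function (results) {
  console.log(JSON.stringify(combine(results), null, 2));
});
```

The execution model is identical to the async library version: both queries in flight at once, one merge at the end.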


Key points here.

  • Shuffling data in the aggregation pipeline is possible but not performant for large data sets.

  • Use a language implementation and API that support "parallel" and "asynchronous" execution so you can "load up" all or "most" of your operations at once.

  • The API should support some method of "combination", or otherwise allow a separate "stream" writer to process each received result set into one.

  • Forget about the SQL way. The NoSQL way delegates the processing of such things as "joins" to your "data logic layer", which is what contains the code as shown here. It does it this way because it is scalable to very large datasets. It is rather the job of your "data logic" handling nodes in large applications to deliver this to the end API.

This is fast compared to any other form of "wrangling" I could possibly describe. Part of "NoSQL" thinking is to "Unlearn what you have learned" and look at things a different way. And if that way doesn't perform better, then stick with the SQL approach for storage and query.

That's why alternatives exist.
