分组聚合中的限制聚合 [英] Limit aggregation in grouped aggregation

查看:72
本文介绍了分组聚合中的限制聚合的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个这样的馆藏,但是有更多的数据.

I had a collection like this, but with much more data.

{
  _id: ObjectId("db759d014f70743495ef1000"),
  tracked_item_origin: "winword",
  tracked_item_type: "Software",
  machine_user: "mmm.mmm",
  organization_id: ObjectId("a91864df4f7074b33b020000"),
  group_id: ObjectId("20ea74df4f7074b33b520000"),
  tracked_item_id: ObjectId("1a050df94f70748419140000"),
  tracked_item_name: "Word",
  duration: 9540,
}

{
  _id: ObjectId("2b769d014f70743495fa1000"),
  tracked_item_origin: "http://www.facebook.com",
  tracked_item_type: "Site",
  machine_user: "gabriel.mello",
  organization_id: ObjectId("a91864df4f7074b33b020000"),
  group_id: ObjectId("3f6a64df4f7074b33b040000"),
  tracked_item_id: ObjectId("6f3466df4f7074b33b080000"),
  tracked_item_name: "Facebook",
  duration: 7920,
}

我进行汇总,然后返回这样的分组数据:

I do an aggregation, ho return grouped data like this:

{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"Twitter"}, "duration"=>288540},
{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"ANoticia"}, "duration"=>237300},
{"_id"=>{"tracked_item_type"=>"Site", "tracked_item_name"=>"Facebook"}, "duration"=>203460},
{"_id"=>{"tracked_item_type"=>"Software", "tracked_item_name"=>"Word"}, "duration"=>269760},
{"_id"=>{"tracked_item_type"=>"Software", "tracked_item_name"=>"Excel"}, "duration"=>204240}

简单的聚合代码:

AgentCollector.collection.aggregate(
  {'$match' => {group_id: '20ea74df4f7074b33b520000'}},
  {'$group' => {
    _id: {tracked_item_type: '$tracked_item_type', tracked_item_name: '$tracked_item_name'},
    duration: {'$sum' => '$duration'}
  }},
  {'$sort' => {
    '_id.tracked_item_type' => 1,
    duration: -1
  }}
)

有没有一种方法可以通过tracked_item_type键限制两个项目?前任. 2个站点和2个软件.

There is a way to limit only 2 items by tracked_item_type key? Ex. 2 Sites and 2 Softwares.

推荐答案

由于您的问题目前尚不清楚,我真的希望您的意思是您想指定两个Site键和2个Software键,因为这很不错,并且您可以将简单的答案添加到$ match阶段,如下所示:

As your question currently stands unclear, I really hope you mean that you want to specify two Site keys and 2 Software keys because that's a nice and simple answer that you can just add to your $match phase as in:

{$match: {
    group_id: "20ea74df4f7074b33b520000",
    tracked_item_name: {$in: ['Twitter', 'Facebook', 'Word', 'Excel' ] }
}},

我们都可以欢呼快乐!)

And we can all cheer and be happy ;)

但是,如果您的问题更具说服力,例如,按持续时间从结果中获得前2个SitesSoftware条目,那么我们非常感谢您产生了这种 amination .

If however your question is something more diabolical such as, getting the top 2 Sites and Software entries from the result by duration, then we thank you very much for spawning this abomination.

您的里程可能会因您实际想要执行的操作而有所不同,或者是否会因结果的庞大而爆炸.但这是您所要从事的工作的一个示例:

Your mileage may vary on what you actually want to do or whether this is going to blow up by the sheer size of your results. But this follows as an example of what you are in for:

db.collection.aggregate([

    // Match items first to reduce the set
    {$match: {group_id: "20ea74df4f7074b33b520000" }},

    // Group on the types and "sum" of duration
    {$group: {
        _id: {
            tracked_item_type: "$tracked_item_type",
            tracked_item_name: "$tracked_item_name"
         },
         duration: {$sum: "$duration"}
    }},

    // Sort by type and duration descending
    {$sort: { "_id.tracked_item_type": 1, duration: -1 }},

    /* The fun part */

    // Re-shape results to "sites" and "software" arrays 
    {$group: { 
        _id: null,
        sites: {$push:
            {$cond: [
                {$eq: ["$_id.tracked_item_type", "Site" ]},
                { _id: "$_id", duration: "$duration" },
                null
            ]}
        },
        software: {$push:
            {$cond: [
                {$eq: ["$_id.tracked_item_type", "Software" ]},
                { _id: "$_id", duration: "$duration" },
                null
            ]}
        }
    }},


    // Remove the null values for "software"
    {$unwind: "$software"},
    {$match: { software: {$ne: null} }},
    {$group: { 
        _id: "$_id",
        software: {$push: "$software"}, 
        sites: {$first: "$sites"} 
    }},

    // Remove the null values for "sites"
    {$unwind: "$sites"},
    {$match: { sites: {$ne: null} }},
    {$group: { 
        _id: "$_id",
        software: {$first: "$software"},
        sites: {$push: "$sites"} 
    }},


    // Project out software and limit to the *top* 2 results
    {$unwind: "$software"},
    {$project: { 
        _id: 0,
        _id: { _id: "$software._id", duration: "$software.duration" },
        sites: "$sites"
    }},
    {$limit : 2},


    // Project sites, grouping multiple software per key, requires a sort
    // then limit the *top* 2 results
    {$unwind: "$sites"},
    {$group: {
        _id: { _id: "$sites._id", duration: "$sites.duration" },
        software: {$push: "$_id" }
    }},
    {$sort: { "_id.duration": -1 }},
    {$limit: 2}

])  

现在产生的结果是* 不完全干净的结果集,这是理想的结果,但是它可以通过编程方式使用,并且比循环过滤以前的结果要好. (我的测试数据)

Now what that results in is *not exactly the clean set of results that would be ideal but it is something that can be programatically worked with, and better than filtering the previous results in a loop. (My data from testing)

{
    "result" : [
        {
            "_id" : {
                "_id" : {
                    "tracked_item_type" : "Site",
                    "tracked_item_name" : "Digital Blasphemy"
                 },
                 "duration" : 8000
            },
            "software" : [
                {
                    "_id" : {
                        "tracked_item_type" : "Software",
                        "tracked_item_name" : "Word"
                    },
                    "duration" : 9540
                },

                {
                    "_id" : {
                        "tracked_item_type" : "Software",
                        "tracked_item_name" : "Notepad"
                    },
                    "duration" : 4000
                }
            ]
        },
        {
            "_id" : {
                "_id" : {
                    "tracked_item_type" : "Site",
                    "tracked_item_name" : "Facebook"
                 },
                 "duration" : 7920
            },
            "software" : [
                {
                    "_id" : {
                        "tracked_item_type" : "Software",
                         "tracked_item_name" : "Word"
                    },
                    "duration" : 9540
                },
                {
                    "_id" : {
                        "tracked_item_type" : "Software",
                        "tracked_item_name" : "Notepad"
                    },
                    "duration" : 4000
                }
            ]
        }
    ],
    "ok" : 1
}

因此,您看到的是数组中的前2个Sites项,每个项中都嵌入了前2个Software项.聚合本身无法进一步清除此问题,因为我们需要重新合并以便拆分的项目才能执行此操作,而到目前为止,还没有可用于执行此操作的运算符行动.

So you see you get the top 2 Sites in the array, with the top 2 Software items embedded in each. Aggregation itself, cannot clear this up any further, because we would need to re-merge the items we split apart in order to do this, and as yet there is no operator that we could use to perform this action.

但这很有趣.这不是全部的方式,而是的方式,将其放入4个文档的响应中将是相对琐碎的代码.但是我的头已经疼了.

But that was fun. It's not all the way done, but most of the way, and making that into a 4 document response would be relatively trivial code. But my head hurts already.

这篇关于分组聚合中的限制聚合的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆