Count array occurrences across all documents with mongo

Question

I'm trying to pull data from a collection of documents that looks like this:

[
  {
    name: 'john',
    sex: 'male',
    hobbies: ['football', 'tennis', 'swimming']
  },
  {
    name: 'betty',
    sex: 'female',
    hobbies: ['football', 'tennis']
  },
  {
    name: 'frank',
    sex: 'male',
    hobbies: ['football', 'tennis']
  } 
]

I am trying to use the aggregation framework to present the data split by sex, counting the most common hobbies. The results should look something like this:

{
  _id: 'male',
  total: 2,
  hobbies: {
    football: 2,
    tennis: 2,
    swimming: 1
  }
},
{
  _id: 'female',
  total: 1,
  hobbies: {
    football: 1,
    tennis: 1
  }
}

So far I can get the total for each sex, but I'm not sure how I could use $unwind to get the totals for the hobbies array.

My code so far:

collection.aggregate([
        { 
            $group: { 
                _id: '$sex', 
                total: { $sum: 1 }
            }
        }
    ])

Recommended answer

Personally, I am not a big fan of transforming "data" into the names of keys in a result. The aggregation framework's principles tend to agree, as this sort of operation is not supported either.

So my personal preference is to keep "data" as "data" and accept that the processed output is actually better and more logical for a consistent object design:

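db.people.aggregate([
    // Count documents per sex and collect each document's hobbies array
    { "$group": {
        "_id": "$sex",
        "hobbies": { "$push": "$hobbies" },
        "total": { "$sum": 1 }
    }},
    // "hobbies" is now an array of arrays, so unwind twice
    { "$unwind": "$hobbies" },
    { "$unwind": "$hobbies" },
    // Count each hobby within each sex, carrying the per-sex total along
    { "$group": {
        "_id": {
            "sex": "$_id",
            "hobby": "$hobbies"
        },
        "total": { "$first": "$total" },
        "hobbyCount": { "$sum": 1 }
    }},
    // Regroup to one document per sex with an array of name/count pairs
    { "$group": {
        "_id": "$_id.sex",
        "total": { "$first": "$total" },
        "hobbies": {
            "$push": { "name": "$_id.hobby", "count": "$hobbyCount" }
        }
    }}
])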

Which produces a result like this:

[
    {
        "_id" : "female",
        "total" : 1,
        "hobbies" : [
            {
                "name" : "tennis",
                "count" : 1
            },
            {
                "name" : "football",
                "count" : 1
            }
        ]
    },
    {
        "_id" : "male",
        "total" : 2,
        "hobbies" : [
            {
                "name" : "swimming",
                "count" : 1
            },
            {
                "name" : "tennis",
                "count" : 2
            },
            {
                "name" : "football",
                "count" : 2
            }
        ]
    }
]

So the initial $group does the count per "sex" and stacks the hobbies up into an array of arrays. Then, to de-normalize, you $unwind twice to get singular items, $group to get the totals per hobby under each sex, and finally regroup into an array for each sex alone.
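To make the stages easier to follow, here is a minimal sketch that mirrors them in plain JavaScript on an in-memory copy of the question's documents. It does not talk to MongoDB; the helper name countHobbiesBySex is an assumption introduced only for illustration.

```javascript
// Sample documents from the question, held in memory (no MongoDB needed).
const people = [
  { name: 'john',  sex: 'male',   hobbies: ['football', 'tennis', 'swimming'] },
  { name: 'betty', sex: 'female', hobbies: ['football', 'tennis'] },
  { name: 'frank', sex: 'male',   hobbies: ['football', 'tennis'] }
];

// Hypothetical helper: mirrors the pipeline's stages with plain loops.
function countHobbiesBySex(docs) {
  // Stage 1 ($group): count documents per sex and collect the hobby arrays.
  const bySex = new Map();
  for (const doc of docs) {
    if (!bySex.has(doc.sex)) bySex.set(doc.sex, { total: 0, hobbies: [] });
    const group = bySex.get(doc.sex);
    group.total += 1;
    group.hobbies.push(doc.hobbies); // array of arrays, like $push
  }

  const result = [];
  for (const [sex, group] of bySex) {
    // Stages 2 and 3 (two $unwind): flatten to single hobby values.
    const flat = group.hobbies.flat();
    // Stage 4 ($group): count each hobby within this sex.
    const counts = new Map();
    for (const hobby of flat) counts.set(hobby, (counts.get(hobby) || 0) + 1);
    // Stage 5 ($group): rebuild one document per sex with name/count pairs.
    result.push({
      _id: sex,
      total: group.total,
      hobbies: Array.from(counts, ([name, count]) => ({ name, count }))
    });
  }
  return result;
}

console.log(JSON.stringify(countHobbiesBySex(people), null, 2));
```

The order of groups and hobbies here follows insertion order, whereas the order MongoDB returns from $group is not guaranteed.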

It's the same data, it has a consistent and organic structure that is easy to process, and MongoDB and the aggregation framework were quite happy to produce this output.
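As a small illustration of why the array form is easy to process, the sketch below filters and ranks hobbies using ordinary array methods. The sample document is copied from the output above; the helper names hobbiesAtLeast and topHobbies are assumptions made up for this example.

```javascript
// One aggregated document, copied from the output shown above.
const maleDoc = {
  _id: 'male',
  total: 2,
  hobbies: [
    { name: 'swimming', count: 1 },
    { name: 'tennis', count: 2 },
    { name: 'football', count: 2 }
  ]
};

// Hypothetical helper: names of hobbies shared by at least `min` people.
function hobbiesAtLeast(doc, min) {
  return doc.hobbies.filter(h => h.count >= min).map(h => h.name);
}

// Hypothetical helper: the `n` most common hobbies, highest count first.
function topHobbies(doc, n) {
  return doc.hobbies
    .slice()                           // copy so the input is not mutated
    .sort((a, b) => b.count - a.count) // numeric comparator, descending
    .slice(0, n)
    .map(h => h.name);
}

console.log(hobbiesAtLeast(maleDoc, 2));
console.log(topHobbies(maleDoc, 2));
```

Because every element has the same "name" and "count" fields, the same code works for any group without knowing the hobby names in advance.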

If you really must convert your data to the names of keys (and I still recommend that you do not, as it is not a good pattern to follow in design), then doing such a transformation from the final state is fairly trivial for client code. As a basic JavaScript example suitable for the shell:

var out = db.people.aggregate([
    { "$group": {
        "_id": "$sex",
        "hobbies": { "$push": "$hobbies" },
        "total": { "$sum": 1 }
    }},
    { "$unwind": "$hobbies" },
    { "$unwind": "$hobbies" },
    { "$group": {
        "_id": {
            "sex": "$_id",
            "hobby": "$hobbies"
        },
        "total": { "$first": "$total" },
        "hobbyCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.sex",
        "total": { "$first": "$total" },
        "hobbies": {
            "$push": { "name": "$_id.hobby", "count": "$hobbyCount" }
        }
    }}
]).toArray();

out.forEach(function(doc) {
    var obj = {};
    // Sort by count, highest first; the comparator must return a number, not a boolean
    doc.hobbies.sort(function(a,b) { return b.count - a.count });
    doc.hobbies.forEach(function(hobby) {
        obj[hobby.name] = hobby.count;
    });
    doc.hobbies = obj;
    printjson(doc);
});

You are then basically processing each cursor result into the desired output form, which isn't really an aggregation function that needs to run on the server anyway:

{
    "_id" : "female",
    "total" : 1,
    "hobbies" : {
        "tennis" : 1,
        "football" : 1
    }
}
{
    "_id" : "male",
    "total" : 2,
    "hobbies" : {
        "tennis" : 2,
        "football" : 2,
        "swimming" : 1
    }
}

It should also be fairly trivial to implement that sort of manipulation as stream processing of the cursor results, transforming as required, since it is basically the same logic.
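One possible shape for that stream-style processing is a generator that transforms documents one at a time as they are consumed. This is only a sketch: hobbiesToKeys and transformStream are hypothetical names, and an in-memory array stands in for the cursor. A real cursor may need forEach or an async loop instead, depending on the shell or driver.

```javascript
// Hypothetical helper: turn one aggregated document's hobbies array
// into an object keyed by hobby name.
function hobbiesToKeys(doc) {
  const keyed = {};
  // Copy before sorting so the input document is not mutated.
  const sorted = doc.hobbies.slice().sort((a, b) => b.count - a.count);
  for (const hobby of sorted) keyed[hobby.name] = hobby.count;
  return { _id: doc._id, total: doc.total, hobbies: keyed };
}

// Lazily transform any iterable of documents, one at a time.
function* transformStream(docs) {
  for (const doc of docs) yield hobbiesToKeys(doc);
}

// Stand-in for cursor results, copied from the aggregation output above.
const sampleResults = [
  { _id: 'female', total: 1, hobbies: [
    { name: 'tennis', count: 1 }, { name: 'football', count: 1 } ] },
  { _id: 'male', total: 2, hobbies: [
    { name: 'swimming', count: 1 }, { name: 'tennis', count: 2 },
    { name: 'football', count: 2 } ] }
];

for (const doc of transformStream(sampleResults)) {
  console.log(JSON.stringify(doc));
}
```

Each document is converted only when the consumer asks for it, so the whole result set never has to be held in its transformed form at once.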

On the other hand, you can always implement all the manipulation on the server using mapReduce instead:

db.people.mapReduce(
    function() {
        emit(
            this.sex,
            { 
                "total": 1,
                "hobbies": this.hobbies.map(function(key) {
                    return { "name": key, "count": 1 };
                })
            }
        );
    },
    function(key,values) {
        var obj  = {},
            reduced = {
                "total": 0,
                "hobbies": []
            };

        values.forEach(function(value) {
            reduced.total += value.total;
            value.hobbies.forEach(function(hobby) {
                if ( !obj.hasOwnProperty(hobby.name) )
                    obj[hobby.name] = 0;
                obj[hobby.name] += hobby.count;
            });
        });

        reduced.hobbies = Object.keys(obj).map(function(key) {
            return { "name": key, "count": obj[key] };
        }).sort(function(a,b) {
            // Highest count first; the comparator must return a number, not a boolean
            return b.count - a.count;
        });

        return reduced;
    },
    { 
        "out": { "inline": 1 },
        "finalize": function(key,value) {
            var obj = {};
            value.hobbies.forEach(function(hobby) {
                obj[hobby.name] = hobby.count;
            });
            value.hobbies = obj;
            return value;
        }
    }
)

mapReduce has its own distinct style of output, but the same principles are used in accumulation and manipulation, though it is not likely to be as efficient as the aggregation framework:

    "results" : [
        {
            "_id" : "female",
            "value" : {
                "total" : 1,
                "hobbies" : {
                    "football" : 1,
                    "tennis" : 1
                }
            }
        },
        {
            "_id" : "male",
            "value" : {
                "total" : 2,
                "hobbies" : {
                    "football" : 2,
                    "tennis" : 2,
                    "swimming" : 1
                }
            }
        }
    ]
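To see how the map, reduce and finalize steps fit together without a server, they can be exercised in plain JavaScript. The sketch below is an in-memory stand-in, not MongoDB's implementation: runMapReduce and the function names are hypothetical, and emit is passed in as an argument here rather than being a global as it is on the server.

```javascript
// Hypothetical in-memory driver; only mimics the map -> reduce -> finalize flow.
function runMapReduce(docs, map, reduce, finalize) {
  const emitted = new Map();
  const emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  // Call map with `this` bound to each document, as the server does.
  for (const doc of docs) map.call(doc, emit);

  const results = [];
  for (const [key, values] of emitted) {
    // Like MongoDB, skip reduce when a key has only one emitted value.
    const reduced = values.length > 1 ? reduce(key, values) : values[0];
    results.push({ _id: key, value: finalize(key, reduced) });
  }
  return results;
}

function mapPerson(emit) {
  emit(this.sex, {
    total: 1,
    hobbies: this.hobbies.map(name => ({ name, count: 1 }))
  });
}

// Returns the same shape that map emits, so it can be re-applied safely.
function reduceCounts(key, values) {
  const counts = {};
  const reduced = { total: 0, hobbies: [] };
  for (const value of values) {
    reduced.total += value.total;
    for (const hobby of value.hobbies) {
      counts[hobby.name] = (counts[hobby.name] || 0) + hobby.count;
    }
  }
  reduced.hobbies = Object.keys(counts)
    .map(name => ({ name, count: counts[name] }))
    .sort((a, b) => b.count - a.count);
  return reduced;
}

function finalizeKeys(key, value) {
  const keyed = {};
  for (const hobby of value.hobbies) keyed[hobby.name] = hobby.count;
  return { total: value.total, hobbies: keyed };
}

const samplePeople = [
  { name: 'john',  sex: 'male',   hobbies: ['football', 'tennis', 'swimming'] },
  { name: 'betty', sex: 'female', hobbies: ['football', 'tennis'] },
  { name: 'frank', sex: 'male',   hobbies: ['football', 'tennis'] }
];

console.log(JSON.stringify(
  runMapReduce(samplePeople, mapPerson, reduceCounts, finalizeKeys), null, 2));
```

The key design point is that reduce returns the same shape that map emits, since the server may call reduce again on partially reduced values; the conversion to keyed objects is therefore left to finalize.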

At the end of the day, I still say that the first form of processing is the most efficient and provides, to my mind, the most natural and consistent shape for the data output, without even attempting to convert the data points into the names of keys. It's probably best to follow that pattern, but if you really must, there are ways to manipulate the results into the desired form with various approaches to processing.
