Count array occurrences across all documents with mongo


Problem description

I'm trying to pull data from a collection of documents that looks like this:

[
  {
    name: 'john',
    sex: 'male',
    hobbies: ['football', 'tennis', 'swimming']
  },
  {
    name: 'betty',
    sex: 'female',
    hobbies: ['football', 'tennis']
  },
  {
    name: 'frank',
    sex: 'male',
    hobbies: ['football', 'tennis']
  } 
]

I am trying to use the aggregation framework to present the data split by sex, with a count of each hobby so the most common ones stand out. The results should look something like this:

{
  _id: 'male',
  total: 2,
  hobbies: {
    football: 2,
    tennis: 2,
    swimming: 1
  }
},
{
  _id: 'female',
  total: 1,
  hobbies: {
    football: 1,
    tennis: 1
  }
}

So far I can get the total for each sex, but I'm not sure how to use $unwind to get the totals for the hobbies array.

My code so far:

collection.aggregate([
        { 
            $group: { 
                _id: '$sex', 
                total: { $sum: 1 }
            }
        }
    ])

Solution

Personally, I am not a big fan of transforming "data" into the names of keys in a result. The aggregation framework principles tend to agree, as this sort of operation is not supported either.

So my personal preference is to keep "data" as "data" and accept that the processed output below is actually better and more logical for a consistent object design:

db.people.aggregate([
    { "$group": {
        "_id": "$sex",
        "hobbies": { "$push": "$hobbies" },
        "total": { "$sum": 1 }
    }},
    { "$unwind": "$hobbies" },
    { "$unwind": "$hobbies" },
    { "$group": {
        "_id": {
            "sex": "$_id",
            "hobby": "$hobbies"
        },
        "total": { "$first": "$total" },
        "hobbyCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.sex",
        "total": { "$first": "$total" },
        "hobbies": {
            "$push": { "name": "$_id.hobby", "count": "$hobbyCount" }
        }
    }}
])

Which produces a result like this:

[
    {
        "_id" : "female",
        "total" : 1,
        "hobbies" : [
            {
                "name" : "tennis",
                "count" : 1
            },
            {
                "name" : "football",
                "count" : 1
            }
        ]
    },
    {
        "_id" : "male",
        "total" : 2,
        "hobbies" : [
            {
                "name" : "swimming",
                "count" : 1
            },
            {
                "name" : "tennis",
                "count" : 2
            },
            {
                "name" : "football",
                "count" : 2
            }
        ]
    }
]

So the initial $group does the count per "sex" and stacks up the hobbies into an array of arrays. To de-normalize, you then $unwind twice to get singular items, $group to get the totals per hobby under each sex, and finally $group again to rebuild a single array for each sex.
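To make the intermediate shapes easier to follow, here is a rough sketch that emulates those stages in plain JavaScript on the sample documents from the question. It does not talk to MongoDB at all; the variable names (bySex, rows, pairCounts, regrouped, emulatedResult) are made up for illustration, and the ordering of the output is not meant to match what the server returns.

```javascript
// Plain JavaScript emulation of the pipeline stages; no MongoDB required.
// The sample documents are the ones from the question.
const people = [
  { name: "john", sex: "male", hobbies: ["football", "tennis", "swimming"] },
  { name: "betty", sex: "female", hobbies: ["football", "tennis"] },
  { name: "frank", sex: "male", hobbies: ["football", "tennis"] }
];

// Stage 1 ($group): one entry per sex, with a document count and an array of arrays.
const bySex = new Map();
for (const person of people) {
  if (!bySex.has(person.sex)) bySex.set(person.sex, { total: 0, hobbies: [] });
  const group = bySex.get(person.sex);
  group.total += 1;
  group.hobbies.push(person.hobbies);
}

// Stages 2 and 3 (two $unwind): flatten to one row per (sex, hobby) occurrence.
const rows = [];
for (const [sex, group] of bySex) {
  for (const hobbyList of group.hobbies) {
    for (const hobby of hobbyList) {
      rows.push({ sex: sex, total: group.total, hobby: hobby });
    }
  }
}

// Stage 4 ($group): count occurrences per (sex, hobby) pair, carrying the total along.
const pairCounts = new Map();
for (const row of rows) {
  const key = JSON.stringify([row.sex, row.hobby]);
  if (!pairCounts.has(key)) {
    pairCounts.set(key, { sex: row.sex, hobby: row.hobby, total: row.total, hobbyCount: 0 });
  }
  pairCounts.get(key).hobbyCount += 1;
}

// Stage 5 ($group): regroup per sex, pushing { name, count } pairs into one array.
const regrouped = new Map();
for (const pair of pairCounts.values()) {
  if (!regrouped.has(pair.sex)) {
    regrouped.set(pair.sex, { _id: pair.sex, total: pair.total, hobbies: [] });
  }
  regrouped.get(pair.sex).hobbies.push({ name: pair.hobby, count: pair.hobbyCount });
}

const emulatedResult = Array.from(regrouped.values());
console.log(JSON.stringify(emulatedResult, null, 2));
```

The rows array is the interesting part: after both unwinds there is one row per hobby occurrence, which is what lets the next grouping count them.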

It's the same data, it has a consistent and organic structure that is easy to process, and MongoDB and the aggregation framework were quite happy to produce this output.

If you really must convert your data to names of keys (and I still recommend you do not, as it is not a good pattern to follow in design), then doing such a transformation from the final state is fairly trivial in client code. As a basic JavaScript example suitable for the shell:

var out = db.people.aggregate([
    { "$group": {
        "_id": "$sex",
        "hobbies": { "$push": "$hobbies" },
        "total": { "$sum": 1 }
    }},
    { "$unwind": "$hobbies" },
    { "$unwind": "$hobbies" },
    { "$group": {
        "_id": {
            "sex": "$_id",
            "hobby": "$hobbies"
        },
        "total": { "$first": "$total" },
        "hobbyCount": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.sex",
        "total": { "$first": "$total" },
        "hobbies": {
            "$push": { "name": "$_id.hobby", "count": "$hobbyCount" }
        }
    }}
]).toArray();

out.forEach(function(doc) {
    var obj = {};
    // Sort by count, highest first; the comparator must return a number, not a boolean
    doc.hobbies.sort(function(a,b) { return b.count - a.count; });
    doc.hobbies.forEach(function(hobby) {
        obj[hobby.name] = hobby.count;
    });
    doc.hobbies = obj;
    printjson(doc);
});

Each cursor result is then processed on the client into the desired output form, which is not really an aggregation task that needs to run on the server anyway:

{
    "_id" : "female",
    "total" : 1,
    "hobbies" : {
        "tennis" : 1,
        "football" : 1
    }
}
{
    "_id" : "male",
    "total" : 2,
    "hobbies" : {
        "tennis" : 2,
        "football" : 2,
        "swimming" : 1
    }
}

It should also be fairly trivial to build that sort of manipulation into stream processing of the cursor result and transform each document as required, since it is basically just the same logic.
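As a minimal sketch of that idea, the per-document logic can be pulled into a small function and applied lazily to anything iterable. The names hobbiesToObject and transformEach are hypothetical, and the in-memory sampleResults array below only stands in for a real cursor; with an actual driver you would iterate the cursor in whatever way that driver provides.

```javascript
// Hypothetical helper: returns a copy of the document whose hobbies array of
// { name, count } pairs is replaced by an object keyed by hobby name.
function hobbiesToObject(doc) {
  // Copy before sorting so the input document is left untouched.
  const sorted = doc.hobbies.slice().sort(function (a, b) {
    return b.count - a.count; // highest count first
  });
  const hobbies = {};
  for (const hobby of sorted) {
    hobbies[hobby.name] = hobby.count;
  }
  return Object.assign({}, doc, { hobbies: hobbies });
}

// Hypothetical generator: transforms documents one at a time as they are pulled.
function* transformEach(docs) {
  for (const doc of docs) {
    yield hobbiesToObject(doc);
  }
}

// Stand-in for the aggregation output shown earlier.
const sampleResults = [
  { _id: "female", total: 1, hobbies: [
    { name: "tennis", count: 1 },
    { name: "football", count: 1 }
  ] },
  { _id: "male", total: 2, hobbies: [
    { name: "swimming", count: 1 },
    { name: "tennis", count: 2 },
    { name: "football", count: 2 }
  ] }
];

for (const doc of transformEach(sampleResults)) {
  console.log(JSON.stringify(doc));
}
```

Because the generator only holds one document at a time, the same function works whether the results arrive as an array or as a stream.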

On the other hand, you can always implement all the manipulation on the server using mapReduce instead:

db.people.mapReduce(
    function() {
        emit(
            this.sex,
            { 
                "total": 1,
                "hobbies": this.hobbies.map(function(key) {
                    return { "name": key, "count": 1 };
                })
            }
        );
    },
    function(key,values) {
        var obj  = {},
            reduced = {
                "total": 0,
                "hobbies": []
            };

        values.forEach(function(value) {
            reduced.total += value.total;
            value.hobbies.forEach(function(hobby) {
                if ( !obj.hasOwnProperty(hobby.name) )
                    obj[hobby.name] = 0;
                obj[hobby.name] += hobby.count;
            });
        });

        reduced.hobbies = Object.keys(obj).map(function(key) {
            return { "name": key, "count": obj[key] };
        }).sort(function(a,b) {
            // Highest count first; a sort comparator must return a number
            return b.count - a.count;
        });

        return reduced;
    },
    { 
        "out": { "inline": 1 },
        "finalize": function(key,value) {
            var obj = {};
            value.hobbies.forEach(function(hobby) {
                obj[hobby.name] = hobby.count;
            });
            value.hobbies = obj;
            return value;
        }
    }
)

mapReduce has its own distinct style of output, but the same principles of accumulation and manipulation apply, although it is unlikely to be as efficient as the aggregation framework:

   "results" : [
        {
            "_id" : "female",
            "value" : {
                "total" : 1,
                "hobbies" : {
                    "football" : 1,
                    "tennis" : 1
                }
            }
        },
        {
            "_id" : "male",
            "value" : {
                "total" : 2,
                "hobbies" : {
                    "football" : 2,
                    "tennis" : 2,
                    "swimming" : 1
                }
            }
        }
    ]
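One detail worth checking in a reducer like the one above is that its output has the same shape as the values emitted by the mapper, because MongoDB may call reduce more than once for the same key and feed earlier output back in as input. The sketch below reproduces the mapper's value and the reducer as plain JavaScript functions (mapValue and reduceValues are illustrative names, not part of any API) and compares a single-pass reduction against a two-pass one.

```javascript
// Plain JavaScript copy of the value the mapper emits for one document.
function mapValue(doc) {
  return {
    total: 1,
    hobbies: doc.hobbies.map(function (name) {
      return { name: name, count: 1 };
    })
  };
}

// Plain JavaScript copy of the reducer body, using a numeric comparator.
function reduceValues(key, values) {
  const counts = {};
  const reduced = { total: 0, hobbies: [] };

  values.forEach(function (value) {
    reduced.total += value.total;
    value.hobbies.forEach(function (hobby) {
      if (!Object.prototype.hasOwnProperty.call(counts, hobby.name)) {
        counts[hobby.name] = 0;
      }
      counts[hobby.name] += hobby.count;
    });
  });

  reduced.hobbies = Object.keys(counts).map(function (name) {
    return { name: name, count: counts[name] };
  }).sort(function (a, b) {
    return b.count - a.count; // highest count first
  });

  return reduced;
}

// The two "male" documents from the question.
const john = mapValue({ sex: "male", hobbies: ["football", "tennis", "swimming"] });
const frank = mapValue({ sex: "male", hobbies: ["football", "tennis"] });

// Reduce everything at once, then reduce in two passes and compare.
const onePass = reduceValues("male", [john, frank]);
const twoPass = reduceValues("male", [john, reduceValues("male", [frank])]);

console.log(JSON.stringify(onePass) === JSON.stringify(twoPass));
```

Keeping hobbies as an array of name and count pairs until finalize is what makes this work; converting to keyed objects inside reduce would change the shape of the value between passes.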

At the end of the day, I still say that the first form of processing is the most efficient and, to my mind, gives the most natural and consistent data output, without any attempt to convert data points into the names of keys. It's probably best to follow that pattern, but if you really must have keyed output, the approaches above show several ways to manipulate the results into the desired form.
