Need to find the most frequently occurring value of a field in an aggregate

Question

From a MongoDB aggregation returning a single record for each hour, I also need to know the 'mode', or most frequently occurring value, in a field.

So far I have selected the set of records between two dates, and I am returning a single record for each hour, including an average of a field value. But I also need the most frequent category, where the category number field contains 1, 2, 3 or 4.

var myName = "CollectionName"
//schema for mongoose
var mySchema = new Schema({
    dt: Date,
    value: Number,
    category: Number
});

var myDB = mongoose.createConnection('mongodb://localhost:27017/MYDB');

// Use the schema defined above (the original referenced an undefined evalSchema)
var myDBObj = myDB.model(myName, mySchema, myName);

The date math in the following $group creates a record for each hour in the day, and the $avg averages the price field.

But I can't figure out how to return the most frequent occurrence of 1, 2, 3 or 4 in the category field. There is no $mode aggregation operator, and I get the error "exception: unknown group operator '$mode'".

myDBObj.aggregate([
        {
            $match: { "dt": { $gt: new Date("October 13, 2010 12:00:00"), $lt: new Date("November 13, 2010 12:00:00") } }
        },{
            $group:  {
            "_id": {
                "dt": {
                    "$add": [
                        {
                            "$subtract": [
                                { "$subtract": ["$dt", new Date(0)] },
                                {
                                    "$mod": [
                                        { "$subtract": ["$dt", new Date(0)] },
                                        3600000//1000 * 60 * 60
                                    ]
                                }
                            ]
                        },
                        new Date(0)
                    ]
                }
            }, 
            "price": { "$avg": "$price" },
            "category" : { "$mode" : "$category"}
        }
        }], function (err, data) { if (err) { return next(err); } res.json(data); });

Is there a way to return the most common value contained in a field?

Do I need to use map-reduce functions? How would I combine them with the hourly aggregation above? Thank you for any help.

Recommended answer

Well, you cannot just "make up" operators. $mode is not an aggregation operator, and the only things you can use are those that actually exist.

So in order to return the category value that occurs the most within the grouped time period, it is necessary to group first on each of those values and return the count of occurrences. Then you can order these results by that count, and return the category value that recorded the highest count within that period:

    // Filter dates
    { "$match": { 
        "dt": { 
            "$gt": new Date("October 13, 2010 12:00:00"), 
            "$lt": new Date("November 13, 2010 12:00:00")
        } 
    }},

    // Group by hour and category, with avg and count
    { "$group": {
        "_id": {
            "dt": {
                "$add": [
                    {
                        "$subtract": [
                            { "$subtract": ["$dt", new Date(0)] },
                            {
                                "$mod": [
                                    { "$subtract": ["$dt", new Date(0)] },
                                    3600000//1000 * 60 * 60
                                ]
                            }
                        ]
                    },
                    new Date(0)
                ]
            },
            "category": "$category"
        }, 
        "price": { "$avg": "$price" },
        "count": { "$sum": 1 }
    }},
    // Sort on date and count
    { "$sort": { "_id.dt": 1, "count": -1 }},

    // Group on just the date, keeping the avg and the first category
    { "$group": {
        "_id": "$_id.dt",
        "price": { "$avg": "$price" },
        "category": { "$first": "$_id.category" }
    }}

So you $group on both date and category, and retain the category count via $sum. Then you $sort so the largest "count" is on top for each grouped date. Finally, you use $first in another $group that is applied to the date alone, in order to return the category with the largest count for each date.
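To make the three steps concrete, here is a minimal in-memory sketch of the same logic in plain JavaScript. It does not call MongoDB; it only mimics what the stages do on an ordinary array. The names `modePerHour` and `HOUR_MS` and the sample documents are made up for illustration.

```javascript
// In-memory illustration only: mimics the pipeline's logic on a plain array.
// `modePerHour`, `HOUR_MS` and the sample data are hypothetical.
const HOUR_MS = 1000 * 60 * 60;

function modePerHour(docs) {
  // Step 1: group on (hour, category) and count, like the first $group.
  const counts = new Map();
  for (const doc of docs) {
    const millis = doc.dt.getTime();
    const hour = millis - (millis % HOUR_MS);
    const key = hour + "|" + doc.category;
    const entry = counts.get(key) || { hour, category: doc.category, count: 0 };
    entry.count += 1;
    counts.set(key, entry);
  }

  // Step 2: order by hour ascending, then count descending, like $sort.
  const sorted = [...counts.values()].sort(
    (a, b) => a.hour - b.hour || b.count - a.count
  );

  // Step 3: keep only the first entry for each hour, like $first.
  const result = [];
  for (const entry of sorted) {
    const last = result[result.length - 1];
    if (!last || last.hour !== entry.hour) {
      result.push({
        hour: entry.hour,
        dt: new Date(entry.hour),
        category: entry.category,
        count: entry.count,
      });
    }
  }
  return result;
}

const sample = [
  { dt: new Date("2010-10-13T12:05:00Z"), category: 2 },
  { dt: new Date("2010-10-13T12:20:00Z"), category: 3 },
  { dt: new Date("2010-10-13T12:40:00Z"), category: 3 },
  { dt: new Date("2010-10-13T13:10:00Z"), category: 1 },
];
const modes = modePerHour(sample);
```

When two categories have the same count within an hour, which one comes first depends on the sort order. Adding the category itself as a final sort key would be one way to make that choice deterministic.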

Don't be tempted by operators like $max, as they do not work here. The key difference is the "tied" relation to the "record/document" produced for each category value. So it is not the maximum "count" you want, nor the maximum "category" value, but instead the category value that "produced" the largest count. Hence a $sort is needed here.
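The difference can be seen on a handful of rows in plain JavaScript. This is only an illustration of the reasoning, not MongoDB code, and the rows are invented to resemble what the first $group would emit for a single hour.

```javascript
// Hypothetical per-category rows for one hour, shaped like the output of the
// first $group stage.
const rows = [
  { category: 1, count: 2 },
  { category: 2, count: 7 },
  { category: 4, count: 3 },
];

// What a max-style accumulator gives: bare values, detached from their rows.
const maxCount = Math.max(...rows.map((r) => r.count));       // the largest count
const maxCategory = Math.max(...rows.map((r) => r.category)); // the largest category number

// What is actually wanted: the category of the row that has the largest count.
// Sorting by count (descending) and taking the first row keeps the two tied together.
const modeRow = [...rows].sort((a, b) => b.count - a.count)[0];
const modeCategory = modeRow.category;
```

Here neither `maxCount` nor `maxCategory` is the answer. Only `modeCategory`, read from the row that produced the largest count, is.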

Finally, some habits you "should" break:

  • Don't use non-UTC date strings as input unless you really know what you are doing. Dates are always going to be converted to UTC, so at least in test listings you should get used to specifying the date value that way.

  • It might look a bit cleaner the other way, but things like 1000 * 60 * 60 are a lot more descriptive of what the code is doing than 3600000. Same value, but one form indicates its time units at a glance.

  • Compounding _id when there is only a single value can also confuse issues. There is little point in accessing _id.dt if that was the only value present. When there is more than a single property within _id, then it is fine. But single values should just be assigned right back to _id alone. Nothing is gained otherwise, and a single value is quite clear.
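As a rough sketch, the three habits could be applied to the question's hourly average like this. The helper names `HOUR_MS`, `hourOf` and `hourlyAverageStages` are hypothetical, and the date range mirrors the one in the question under the assumption that it was meant as UTC.

```javascript
// Sketch only: `HOUR_MS`, `hourOf` and `hourlyAverageStages` are hypothetical
// names, and the date range assumes the question's dates were meant as UTC.
const HOUR_MS = 1000 * 60 * 60; // one hour, with the time units visible

// Builds the expression that truncates a date field to the start of its hour,
// using the same date math as the pipelines above.
function hourOf(field) {
  const millis = { $subtract: [field, new Date(0)] };
  return {
    $add: [{ $subtract: [millis, { $mod: [millis, HOUR_MS] }] }, new Date(0)],
  };
}

function hourlyAverageStages() {
  return [
    // ISO 8601 strings ending in "Z" are parsed as UTC.
    {
      $match: {
        dt: {
          $gt: new Date("2010-10-13T12:00:00Z"),
          $lt: new Date("2010-11-13T12:00:00Z"),
        },
      },
    },
    // Only one grouping value, so it is assigned to _id directly.
    { $group: { _id: hourOf("$dt"), price: { $avg: "$price" } } },
    { $sort: { _id: 1 } },
  ];
}

const stages = hourlyAverageStages();
```

The resulting array has the shape expected by an aggregate call such as the one in the question, and later stages can refer to the hour simply as `_id`.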
