MongoDB : Aggregation framework : Get last dated document per grouping ID


Question

I want to get the last document for each station with all other fields:

{
        "_id" : ObjectId("535f5d074f075c37fff4cc74"),
        "station" : "OR",
        "t" : 86,
        "dt" : ISODate("2014-04-29T08:02:57.165Z")
}
{
        "_id" : ObjectId("535f5d114f075c37fff4cc75"),
        "station" : "OR",
        "t" : 82,
        "dt" : ISODate("2014-04-29T08:02:57.165Z")
}
{
        "_id" : ObjectId("535f5d364f075c37fff4cc76"),
        "station" : "WA",
        "t" : 79,
        "dt" : ISODate("2014-04-29T08:02:57.165Z")
}

I need to have t and station for the latest dt per station. With the aggregation framework:

db.temperature.aggregate([
    { "$sort": { "dt": 1 } },
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" }, "t": { "$last": "$t" }
    }}
])

returns

{
        "result" : [
                {
                        "_id" : "WA",
                        "result" : ISODate("2014-04-29T08:02:57.165Z"),
                        "t" : 79
                },
                {
                        "_id" : "OR",
                        "result" : ISODate("2014-04-29T08:02:57.165Z"),
                        "t" : 82
                }
        ],
        "ok" : 1
}

Is this the most efficient way to do it?

Thanks

Answer

To directly answer your question, yes, it is the most efficient way. But I do think we need to clarify why this is so.

As suggested in the alternatives, one thing people look at is "sorting" your results before passing them to the $group stage, keyed on the "timestamp" value, so that everything arrives in "timestamp" order; hence the form:

db.temperature.aggregate([
    { "$sort": { "station": 1, "dt": -1 } },
    { "$group": {
        "_id": "$station", 
        "result": { "$first":"$dt"}, "t": {"$first":"$t"} 
    }}
])

And as stated you will of course want an index to reflect that in order to make the sort efficient:
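The answer refers to such an index without showing it; a sketch, assuming the keys simply mirror the sort stage (ensureIndex is the legacy shell helper; createIndex is the name on MongoDB 3.0+):

```javascript
// Compound index mirroring the $sort keys so the sort can walk the index.
db.temperature.ensureIndex({ "station": 1, "dt": -1 })
```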

However, and this is the real point: what seems to have been overlooked by others (if not by yourself) is that all of this data is likely being inserted already in time order, in that each reading is recorded as it is added.

So the beauty of this is that the _id field (with a default ObjectId) is already in "timestamp" order, as it itself actually contains a time value, and this makes the statement possible:

db.temperature.aggregate([
    { "$group": {
        "_id": "$station", 
        "result": { "$last":"$dt"}, "t": {"$last":"$t"} 
    }}
])
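To see what that $group computes, here is a minimal plain-JavaScript sketch (no driver needed) of the $last semantics over documents in insertion order — lastPerStation is a hypothetical helper for illustration, not a MongoDB API:

```javascript
// Emulates { $group: { _id: "$station", result: { $last: "$dt" }, t: { $last: "$t" } } }
// over documents already in insertion order: the last-seen doc per station wins.
function lastPerStation(docs) {
  const out = {};
  for (const d of docs) {
    out[d.station] = { result: d.dt, t: d.t };
  }
  return out;
}

// The three sample documents from the question, in _id (insertion) order:
const docs = [
  { station: "OR", t: 86, dt: "2014-04-29T08:02:57.165Z" },
  { station: "OR", t: 82, dt: "2014-04-29T08:02:57.165Z" },
  { station: "WA", t: 79, dt: "2014-04-29T08:02:57.165Z" }
];
const grouped = lastPerStation(docs);
// grouped.OR.t is 82 and grouped.WA.t is 79, matching the pipeline output
```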

And it is faster. Why? Well, you don't need to select an index (additional code to invoke), and you also don't need to "load" the index in addition to the document.

We already know the documents are in order (by _id), so the $last boundaries are perfectly valid. You are scanning everything anyway, and you could also "range" query on the _id values, which is equally valid for between two dates.
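The time value lives in the first 4 bytes of the ObjectId, which is why _id order doubles as creation-time order. A small sketch (makeIdPrefix is a hypothetical helper, not part of any driver):

```javascript
// The leading 4 bytes of an ObjectId are a big-endian Unix timestamp,
// so fixed-length hex _id strings compare in creation-time order.
function makeIdPrefix(date) {
  const secs = Math.floor(date.getTime() / 1000);
  return secs.toString(16).padStart(8, "0");
}

const earlier = makeIdPrefix(new Date("2014-04-29T08:00:00Z"));
const later = makeIdPrefix(new Date("2014-04-29T09:00:00Z"));
// earlier < later lexicographically, mirroring _id range queries between dates
```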

The only real thing to say here is that in "real world" usage it might just be more practical for you to $match between ranges of dates when doing this sort of accumulation, as opposed to getting the "first" and "last" _id values to define a "range", or something similar in your actual usage.

So where is the proof of this? Well, it is fairly easy to reproduce, so I just did so by generating some sample data:

var stations = [ 
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL",
    "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA",
    "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE",
    "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK",
    "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT",
    "VA", "WA", "WV", "WI", "WY"
];


// Insert 200,000 readings with random stations and temperatures,
// timestamped in insertion order.
for ( var i = 0; i < 200000; i++ ) {

    var station = stations[Math.floor(Math.random()*stations.length)];
    var t = Math.floor(Math.random() * ( 96 - 50 + 1 )) + 50;
    var dt = new Date();

    db.temperature.insert({
        station: station,
        t: t,
        dt: dt
    });

}

On my hardware (8GB laptop with spinny disk, which is not stellar, but certainly adequate) running each form of the statement clearly shows a notable pause with the version using an index and a sort (same keys on the index as in the sort statement). It is only a minor pause, but the difference is significant enough to notice.

Even looking at the explain output (version 2.6 and up, or actually present in 2.4.9 though not documented) you can see the difference: though the $sort is optimized out due to the presence of an index, the time taken appears to go to index selection and then loading the indexed entries. Including all fields for a "covered" index query makes no difference.

Also for the record, purely indexing the date and only sorting on the date values gives the same result. Possibly slightly faster, but still slower than the natural index form without the sort.

So as long as you can happily "range" on the first and last _id values, then it is true that using the natural index on the insertion order is actually the most efficient way to do this. Your real-world mileage may vary on whether this is practical for you or not, and it might simply end up being more convenient to implement the index and sorting on the date.

But if you were happy with using _id ranges, or greater than the "last" _id, in your query, then perhaps one tweak is in order, to return the values along with your results so you can in fact store and use that information in successive queries:

db.temperature.aggregate([
    // Get documents "greater than" the "highest" _id value found last time
    { "$match": {
        "_id": { "$gt":  ObjectId("536076603e70a99790b7845d") }
    }},

    // Do the grouping with addition of the returned field
    { "$group": {
        "_id": "$station", 
        "result": { "$last":"$dt"},
        "t": {"$last":"$t"},
        "lastDoc": { "$last": "$_id" } 
    }}
])

And if you were actually "following on" the results like that, then you can determine the maximum value of ObjectId from your results and use it in the next query.
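A sketch of that follow-on step in plain JavaScript — nextCheckpoint is a hypothetical helper; with default ObjectIds, the lexicographically largest hex string is also the most recently created:

```javascript
// Pick the highest lastDoc value from the $group output to use as the
// "$gt" bound of the $match stage in the next poll.
function nextCheckpoint(groups) {
  return groups.map(g => g.lastDoc).reduce((max, id) => (id > max ? id : max));
}

const groups = [
  { _id: "WA", lastDoc: "536076603e70a99790b7845d" },
  { _id: "OR", lastDoc: "536076703e70a99790b7845e" }
];
// Next query: { "$match": { "_id": { "$gt": ObjectId(nextCheckpoint(groups)) } } }
```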

Anyhow, have fun playing with that, but again, yes, in this case that query is the fastest way.
