MongoDB : Aggregation framework : Get last dated document per grouping ID

Question

I want to get the last document for each station with all other fields:

{
        "_id" : ObjectId("535f5d074f075c37fff4cc74"),
        "station" : "OR",
        "t" : 86,
        "dt" : ISODate("2014-04-29T08:02:57.165Z")
}
{
        "_id" : ObjectId("535f5d114f075c37fff4cc75"),
        "station" : "OR",
        "t" : 82,
        "dt" : ISODate("2014-04-29T08:02:57.165Z")
}
{
        "_id" : ObjectId("535f5d364f075c37fff4cc76"),
        "station" : "WA",
        "t" : 79,
        "dt" : ISODate("2014-04-29T08:02:57.165Z")
}

I need to have t and station for the latest dt per station. With the aggregation framework:

db.temperature.aggregate([
    { "$sort": { "dt": 1 } },
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" }, "t": { "$last": "$t" }
    }}
])

which returns:

{
        "result" : [
                {
                        "_id" : "WA",
                        "result" : ISODate("2014-04-29T08:02:57.165Z"),
                        "t" : 79
                },
                {
                        "_id" : "OR",
                        "result" : ISODate("2014-04-29T08:02:57.165Z"),
                        "t" : 82
                }
        ],
        "ok" : 1
}

Is this the most efficient way to do it?

Thanks

Answer

To directly answer your question, yes it is the most efficient way. But I do think we need to clarify why this is so.

As was suggested in alternatives, the one thing people look at is "sorting" your results before passing them to a $group stage, and what they are looking at is the "timestamp" value, so you would want to make sure that everything is in "timestamp" order, hence the form:

db.temperature.aggregate([
    { "$sort": { "station": 1, "dt": -1 } },
    { "$group": {
        "_id": "$station", 
        "result": { "$first":"$dt"}, "t": {"$first":"$t"} 
    }}
])

And as stated you will of course want an index to reflect that in order to make the sort efficient:
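
Something like the following would do, as a minimal sketch; the keys simply mirror the $sort stage above ( ensureIndex being the shell helper of the 2.4/2.6 era ):

db.temperature.ensureIndex({ "station": 1, "dt": -1 })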

However, and this is the real point: what seems to have been overlooked by others ( if not so for yourself ) is that all of this data is likely being inserted already in time order, in that each reading is recorded as it is added.

So the beauty of this is that the _id field ( with a default ObjectId ) is already in "timestamp" order, as it does itself actually contain a time value, and this makes the statement possible:

db.temperature.aggregate([
    { "$group": {
        "_id": "$station", 
        "result": { "$last":"$dt"}, "t": {"$last":"$t"} 
    }}
])

And it is faster. Why? Well, you don't need to select an index ( additional code to invoke ) and you don't need to "load" the index in addition to the document.

We already know the documents are in order ( by _id ), so the $last boundaries are perfectly valid. You are scanning everything anyway, and you could also "range" query on the _id values, which is equally valid for selecting between two dates.
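
As a sketch of that idea, the shell lets you construct an ObjectId whose embedded timestamp matches a given date, so a "range" on _id between two dates can look like this ( objectIdFromDate is a hypothetical helper, not part of MongoDB, and the dates are made up ):

// hypothetical helper: build an ObjectId whose embedded timestamp matches the date
function objectIdFromDate( d ) {
    return ObjectId( Math.floor( d.getTime() / 1000 ).toString(16) + "0000000000000000" );
}

// select documents whose _id falls between two dates, purely via the natural _id order
db.temperature.find({
    "_id": {
        "$gte": objectIdFromDate( new Date("2014-04-29") ),
        "$lt": objectIdFromDate( new Date("2014-04-30") )
    }
})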

The only real thing to say here, is that in "real world" usage, it might just be more practical for you to $match between ranges of dates when doing this sort of accumulation as opposed to getting the "first" and "last" _id values to define a "range" or something similar in your actual usage.
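
As a sketch ( the date window here is made up purely for illustration ):

db.temperature.aggregate([
    // only accumulate over the window of interest
    { "$match": {
        "dt": {
            "$gte": ISODate("2014-04-01T00:00:00Z"),
            "$lt": ISODate("2014-05-01T00:00:00Z")
        }
    }},
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" }, "t": { "$last": "$t" }
    }}
])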

So where is the proof of this? Well it is fairly easy to reproduce, so I just did so by generating some sample data:

var stations = [ 
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL",
    "GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA",
    "ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE",
    "NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK",
    "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT",
    "VA", "WA", "WV", "WI", "WY"
];


for ( var i = 0; i < 200000; i++ ) {

    // pick a random station code and a temperature between 50 and 96
    var station = stations[Math.floor(Math.random()*stations.length)];
    var t = Math.floor(Math.random() * ( 96 - 50 + 1 )) + 50;
    var dt = new Date();

    db.temperature.insert({
        station: station,
        t: t,
        dt: dt
    });

}

On my hardware (8GB laptop with spinny disk, which is not stellar, but certainly adequate) running each form of the statement clearly shows a notable pause with the version using an index and a sort ( same keys on index as the sort statement). It is only a minor pause, but the difference is significant enough to notice.

Even looking at the explain output ( version 2.6 and up, or actually there in 2.4.9 though not documented ) you can see the difference: though the $sort is optimized out due to the presence of an index, the time taken appears to go into index selection and then loading the indexed entries. Including all fields for a "covered" index query makes no difference.
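
If you want to look at that yourself, the explain option on .aggregate() is one way to do it ( a sketch, assuming a 2.6+ shell ):

db.temperature.aggregate(
    [
        { "$sort": { "station": 1, "dt": -1 } },
        { "$group": {
            "_id": "$station",
            "result": { "$first": "$dt" }, "t": { "$first": "$t" }
        }}
    ],
    { "explain": true }
)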

Also for the record, purely indexing the date and only sorting on the date values gives the same result. Possibly slightly faster, but still slower than the natural index form without the sort.
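
That form, for reference, being something like ( again just a sketch ):

db.temperature.ensureIndex({ "dt": 1 })

db.temperature.aggregate([
    { "$sort": { "dt": 1 } },
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" }, "t": { "$last": "$t" }
    }}
])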

So as long as you can happily "range" on the first and last _id values, then it is true that using the natural index on the insertion order is actually the most efficient way to do this. Your real world mileage may vary on whether this is practical for you or not and it might simply end up being more convenient to implement the index and sorting on the date.

But if you are happy with using _id ranges or "greater than the last _id" in your query, then one tweak is in order to get those values along with your results, so you can in fact store and use that information in successive queries:

db.temperature.aggregate([
    // Get documents "greater than" the "highest" _id value found last time
    { "$match": {
        "_id": { "$gt":  ObjectId("536076603e70a99790b7845d") }
    }},

    // Do the grouping with addition of the returned field
    { "$group": {
        "_id": "$station", 
        "result": { "$last":"$dt"},
        "t": {"$last":"$t"},
        "lastDoc": { "$last": "$_id" } 
    }}
])

And if you were actually "following on" the results like that then you can determine the maximum value of ObjectId from your results and use it in the next query.
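
A sketch of that "following on" pattern, assuming a 2.6+ shell where .aggregate() returns a cursor ( lastSeen is just an illustrative variable name ):

// run the grouping and keep the results around
var results = db.temperature.aggregate([
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" },
        "t": { "$last": "$t" },
        "lastDoc": { "$last": "$_id" }
    }}
]).toArray();

// the highest "lastDoc" _id seen in this run seeds the $match of the next run
var lastSeen = results
    .map(function( doc ){ return doc.lastDoc.str; })  // hex string of each ObjectId
    .sort()                                           // equal-length hex strings sort in time order
    .pop();

// the next run then only needs documents added since then:
// { "$match": { "_id": { "$gt": ObjectId(lastSeen) } } }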

Anyhow, have fun playing with that, but again Yes, in this case that query is the fastest way.
