Mongodb汇总,如何按间隔标准计数文档? [英] Mongodb aggregate, How to count documents by interval criteria?

查看:60
本文介绍了Mongodb汇总,如何按间隔标准计数文档?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的MongoDB文档看起来像这样:

my MongoDB document looks like this:

{StatCode : "...", LoadTime  : "..."}

例如,数据可能如下所示:

For example, the data might look like this:

+-----+----------+------------+
| _id | StatCode |  LoadTime  |
+-----+----------+------------+
|   1 |      200 |      0,345 |
|   2 |      200 |      0,234 |
|   3 |      200 |      0,396 |
|   4 |      200 |      1,234 |
|   5 |      200 |      2,564 |
|   6 |      200 |      0,437 |
|   7 |      301 |      0,523 |
|   8 |      301 |      0,628 |
|   9 |      301 |      0,712 |
|  10 |      200 |      1,784 | 
+-----+----------+------------+

我想按如下所示的LoadTime值获取计数组: 最慢(大于2),慢(在1和2之间),中(在0.5和1之间)和快速(在0.5以下)

I want to get the count group by LoadTime values like this : Slowest (more than 2), Slow (between 1 and 2), Medium (between 0,5 and 1) and Fast (below 0,5)

使用数据示例,结果将如下所示:

With Data example, the result would look like this:

+----------+-------+
| Info_id  | Count |
+----------+-------+
| Slowest  |     1 |
| Slow     |     2 |
| Medium   |     3 |
| Fast     |     4 |
+----------+-------+

编辑:注释Neil Lunn,来自MongoDB的示例文档

Remark Neil Lunn, sample documents from MongoDB

{
    "_id" : 1,
    "LoadTime" : NumberLong(345),
    "StatCode" : 200
}
{
    "_id" : 2,
    "LoadTime" : NumberLong(234),
    "StatCode" : 200
}
....
{
    "_id" : 9,
    "LoadTime" : NumberLong(712),
    "StatCode" : 301
}
{
    "_id" : 10,
    "LoadTime" : NumberLong( 1784),
    "StatCode" : 200
}

如何使用MongoDB聚合框架完成此任务?

How can I accomplish this using the MongoDB aggregation framework?

推荐答案

您想要的是 $ cond 运算符和许多嵌套条件,其中 $和.但这应该可以为您提供所需的一切.

What you do want is the $cond operator and quite a few nested conditions with $and. But this should give you exactly what you want.

db.collection.aggregate([
    {"$group": {
      "_id": {"$cond": [
          {"$gte": ["$LoadTime", 2000] },
          "Slowest",                                   // return "Slowest" where true
          {"$cond": [
              {"$and": [
                  {"$lt": ["$LoadTime", 2000] },
                  {"$gte": ["$LoadTime", 1000] }
              ]},
              "Slow",                                  // then "Slow" here where true
              {"$cond": [
                  {"$and": [
                      {"$lt": ["$LoadTime", 1000] },
                      {"$gte": ["$LoadTime", 500 ] }
                  ]},
                  "Medium",                            // then "Medium" where true
                  "Fast"                               // and finally "Fast" < 500
              ]}
          ]}
      ]},
      "count": {"$sum": 1}
    }},
    {"$sort": { "count": 1 }}
])

由于您的时间为毫秒,因此可以看到我要求进行修改的原因.

As your time is whole milliseconds you can see why I asked for the edit.

$ cond 三元 运算符,它需要三个参数:

So as $cond is a ternary operator, it takes three arguments being:

  • 要评估的条件,该条件返回布尔值
  • 条件为 true
  • 的返回值
  • 条件为 false
  • 的返回值
  • A condition to evaluate which returns a boolean
  • A return value where the condition is true
  • A return value where the condition is false

因此,您的想法是在整个过程中嵌套,对 false 进行 next 测试,直到找到匹配的条件,以及要返回的值.

Therefore the idea is that you nest the conditions throughout, moving to the next test on false until you have found a condition to match, and an value to return.

$ and 部分是要包含的条件的数组.这为您提供了范围.因此,在最长的部分中:

The $and part is a array of conditions to include. This gives you the ranges. So in the longest parts:

          {"$cond": [                             // Evaluate here
              {"$and": [                          // Within the range of the next 2
                  {"$lt": ["$LoadTime", 2000] },
                  {"$gte": ["$LoadTime", 1000] }
              ]},
              "Slow",                            // true condition - return
              {"$cond": [                        // false - move to next eval

级联后,您会在500毫秒内将快速"保留为times.

Cascading through you are left with "Fast" for times under 500 milliseconds.

这些keys的每一个都被发送到该组,而我们{ $sum: 1 }只是将它们分组在一起而获得计数.

Each of these keys is emitted to the group and we just { $sum: 1 } to get a count as they are grouped together.

如果您需要在自己的语言实现中使用,请在

If you need that in your own language implementation, the whole pipeline content within

aggregate(..)

只是JSON,因此如果您不愿手工翻译,或者像我一样懒惰,则可以将其解析为您的本机数据结构.

is just JSON, so you can parse that into your native data structure if translating by hand eludes you, or if like me you're just lazy.

由于评论,似乎有必要解释所显示查询的表格.因此,这里有用于澄清的编辑附录.

Due to the comments it seems necessary to explain the form of the presented query. So here the edit addendum for clarification.

学习使用聚合管道时,并且确实是良好实践,用于写出并测试一系列复杂的阶段或逻辑时,我发现通过一次一步一步实现可视化结果很有用.因此,在编写此类内容时,我的第一步步骤如下:

When learning use of the aggregation pipeline, and indeed good practice for writing out and testing a complex series of stages or logic, I find that it useful to visualise the results by implementing parts one step at a time. So in the case of writing such a thing my first step would be as follows:

db.collection.aggregate([
    {"$group": {
      "_id": {"$cond": [
          {"$gte": ["$LoadTime", 2000] },
          "Slowest",
          null
       ]}
    }}
])

现在,这将使我得到最慢"的计数,然后将 bucket 的所有其他内容放入null.因此,有一个阶段可以看到到目前为止的结果.但是,当进行 testing 测试时,我实际上会做这样的事情,然后继续构建链:

Now that would give me the count of "Slowest" as I would expect and then bucket everything else into null. So there is a stage where I see the results so far. But when testing I would actually do something like this before moving on to build up a chain:

db.collection.aggregate([
    {"$group": {
      "_id": {"$cond": [
          {"$and": [
              {"$lt": ["$LoadTime", 2000] },
              {"$gte": ["$LoadTime", 1000] }
          ]},
          "Slow",
          null
      ]}
    }}
])

所以我只是将慢"(介于 2000和1000之间)的结果与null存储桶中的所有其他内容一起使用.所以我的总人数保持不变.

So I am just getting the results for "Slow" (between 2000 and 1000) with everything else in the null bucket. So my overall count remains the same.

在指出的最终查询中, 这样嵌套的三元 条件,第一阶段已经已经评估了false测试的项目下一个运算符.这意味着它们 大于第一阶段已经测试过的值,从而无需测试该条件因此,该可以编写如下:

In the final query, as was pointed out, in a ternary condition that is nested such as this, the first stage has already evaluated false for the items being tested by the next operator. This means that they are not greater than the value which was already tested in the first stage, and that obviates the need to test for that condition so this could be written as follows:

db.collection.aggregate([
    {"$group": {
      "_id": {"$cond": [
          {"$gte": ["$LoadTime", 2000] },       // Caught everything over 2000
          "Slowest",
          {"$cond": [
              {"$gte": ["$LoadTime", 1000] }    // Catch things still over 1000
              "Slow",
              {"$cond": [                       // Things under 1000 go here

              // and so on

并且短路是评估,因为没有实际需要测试不会经历下一个逻辑条件的事物.

And that short circuits the evaluation as there is no real need to test for things that won't come through to the next logical condition.

因此完全是出于视觉原因,并且由于剪切和粘贴逻辑的绝对惰性,我们最终使用 $和条件以包装范围.但是对于那些不习惯的人来说, 三元的用法形式有一个清晰的视觉提示,表明在此阶段匹配的结果将介于2000ms1000ms的值之间,依此类推,这就是您希望在每个范围内得到的结果.

So purely for visual reasons and for sheer laziness of cut and paste logic, we end up with the expanded form using the $and condition to wrap the range. But for those not used to the usage of the ternary form there is a clear visual cue that the results being matched in this phase will fall between the values of 2000ms and 1000ms,and so on, which is what you want as a result in each range.

正如我说的那样,由于逻辑的工作原理而不必要,但它发展阶段,对于尚未的人来说很清楚 > 三元 表格的用法 $ cond 提供的内容.

As I said, unnecessary to have because of how the logic works, but it was a development phase, and is clear to the people who are yet to get their heads around usage of the ternary form that $cond provides.

这篇关于Mongodb汇总,如何按间隔标准计数文档?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆