在聚合期间围绕匹配查询对数据进行分区 [英] Partition data around a match query during aggregation

查看:44
本文介绍了在聚合期间围绕匹配查询对数据进行分区的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我一直试图解决的是在 mongo 查询中执行某种分区(按谓词拆分).我当前的查询看起来像:

What I have been trying to get my head around is to perform some kind of partitioning(split by predicate) in a mongo query. My current query looks like:

db.posts.aggregate([
 {"$match": { $and:[ {$or:[{"toggled":false},{"toggled":true, "status":"INACTIVE"}]}  ,  {"updatedAt":{$gte:1549786260000}} ] }},
 {"$unwind" :"$interests"},
 {"$group" : {"_id": {"iid": "$interests", "pid":"$publisher"}, "count": {"$sum" : 1}}},
 {"$project":{ _id: 0, "iid": "$_id.iid", "pid": "$_id.pid", "count": 1 }}
])

结果如下:

{
    "count" : 3.0,
    "iid" : "INT456",
    "pid" : "P789"
}
{
    "count" : 2.0,
    "iid" : "INT789",
    "pid" : "P789"
}
{
    "count" : 1.0,
    "iid" : "INT123",
    "pid" : "P789"
}
{
    "count" : 1.0,
    "iid" : "INT123",
    "pid" : "P123"
}

<小时>

到目前为止一切都很好,但后来我意识到对于匹配特定过滤器 {"toggled":true, "status":"INACTIVE"} 的文档,我宁愿 减少计数(-1).(考虑到最终值也可能为负.)


All good so far, but then I had realized that for the documents that match the specific filter {"toggled":true, "status":"INACTIVE"}, I would rather decrement the count (-1). (considering the eventual value can be negative as well.)

有没有办法在match之后以某种方式对数据进行分区,以确保对两个文档集合执行不同的grouping操作?

Is there a way to somehow partition the data after match to make sure different grouping operations are performed for both the collection of documents?

听起来与我正在寻找的东西相似的是$mergeObjects,或者$reduce,但与文档示例相关的内容不多.

Something that sounds similar to what I am looking for is $mergeObjects, or maybe $reduce, but not much that I can relate from the documentation examples.

注意:我能感觉到,处理这个问题的一种直接方法是执行两个查询,但我正在寻找一个查询来执行操作.

Note: I can sense, one straightforward way to deal with this would be to perform two queries, but I am looking for a single query to perform the operation.

上述输出的示例文档为:

Sample documents for the above output would be:

/* 1 */
{
    "_id" : ObjectId("5d1f7******"),
    "id" : "CON123",
    "title" : "Game",
    "content" : {},
    "status" : "ACTIVE",
    "toggle":false,
    "publisher" : "P789",
    "interests" : [ 
        "INT456"
    ],
    "updatedAt" : NumberLong(1582078628264)
}

/* 2 */
{
    "_id" : ObjectId("5d1f8******"),
    "id" : "CON456",
    "title" : "Home",
    "content" : {},
    "status" : "INACTIVE",
    "toggle":true,
    "publisher" : "P789",
    "interests" : [ 
        "INT456",
        "INT789"
    ],
    "updatedAt" : NumberLong(1582078628264)
}

/* 3 */
{
    "_id" : ObjectId("5d0e9******"),
    "id" : "CON654",
    "title" : "School",
    "content" : {},
    "status" : "ACTIVE",
    "toggle":false,
    "publisher" : "P789",
    "interests" : [ 
        "INT123",
        "INT456",
        "INT789"
    ],
    "updatedAt" : NumberLong(1582078628264)
}

/* 4 */
{
    "_id" : ObjectId("5d207*******"),
    "id" : "CON789",
    "title":"Stack",
    "content" : { },
    "status" : "ACTIVE",
    "toggle":false,
    "publisher" : "P123",
    "interests" : [ 
        "INT123"
    ],
    "updatedAt" : NumberLong(1582078628264)
}

<小时>

我期待的结果是


What I am looking forward to as a result though is

{
    "count" : 1.0, (2-1)
    "iid" : "INT456",
    "pid" : "P789"
}
{
    "count" : 0.0, (1-1)
    "iid" : "INT789",
    "pid" : "P789"
}
{
    "count" : 1.0,
    "iid" : "INT123",
    "pid" : "P789"
}
{
    "count" : 1.0,
    "iid" : "INT123",
    "pid" : "P123"
}

推荐答案

这种聚合给出了想要的结果.

This aggregation gives the desired result.

db.posts.aggregate( [
{ $match:  { updatedAt: { $gte: 1549786260000 } } },
{ $facet: {
        FALSE: [
            { $match: { toggle: false } },
            { $unwind : "$interests" },
            { $group : { _id : { iid: "$interests", pid: "$publisher" }, count: { $sum : 1 } } },
        ],
        TRUE: [
            { $match: { toggle: true, status: "INACTIVE" } },
            { $unwind : "$interests" },
            { $group : { _id : { iid: "$interests", pid: "$publisher" }, count: { $sum : -1 } } },
        ]
} },
{ $project: { result: { $concatArrays: [ "$FALSE", "$TRUE" ] } } },
{ $unwind: "$result" },
{ $replaceRoot: { newRoot: "$result" } },
{ $group : { _id : "$_id", count: { $sum : "$count" } } },
{ $project:{ _id: 0, iid: "$_id.iid", pid: "$_id.pid", count: 1 } }
] )


使用来自问题帖子的输入数据的查询输出:

The output from the query using the input data from the question post:

{ "count" : 1, "iid" : "INT123", "pid" : "P789" }
{ "count" : 1, "iid" : "INT123", "pid" : "P123" }
{ "count" : 0, "iid" : "INT789", "pid" : "P789" }
{ "count" : 1, "iid" : "INT456", "pid" : "P789" }


<小时>

这个查询用不同的方法(代码)得到相同的结果:

This query gets the same result with different approach (code):

db.posts.aggregate( [
  { 
      $match:  { updatedAt: { $gte: 1549786260000 } } 
  },
  { 
      $unwind : "$interests" 
  },
  { 
      $group : { 
          _id : { 
              iid: "$interests", 
              pid: "$publisher" 
          }, 
          count: { 
              $sum: {
                  $switch: {
                      branches: [
                        { case: { $eq: [ "$toggle", false ] },
                           then: 1 },
                        { case: { $and: [ { $eq: [ "$toggle", true] },  { $eq: [ "$status", "INACTIVE" ] } ] },
                           then: -1 }
                      ]
                  }          
              } 
          }
      } 
  },
  { 
      $project:{
           _id: 0, 
           iid: "$_id.iid", 
           pid: "$_id.pid", 
           count: 1 
      } 
  }
] )


注意:

facet 查询在同一组文档上运行两个 facet(TRUE 和 FALSE);它就像两个并行运行的查询.但是,存在一些重复的代码以及用于在管道中形成文档以获得所需输出的附加阶段.

The facet query runs the two facets (TRUE and FALSE) on the same set of documents; it is like two queries running in parallel. But, there is some duplication of code as well as additional stages for shaping the documents down the pipeline to get the desired output.

第二个查询避免了代码重复,并且聚合管道中的阶段要少得多.当输入数据集有大量文档要处理时,这将在性能方面有所不同.一般来说,较少的阶段意味着较少的文档迭代(因为一个阶段必须扫描从前一个阶段输出的文档).

The second query avoids the code duplication, and there are much lesser stages in the aggregation pipeline. This will make difference when the input dataset has a large number of documents to process - in terms of performance. In general, lesser stages means lesser iterations of the documents (as a stage has to scan the documents which are output from the previous stage).

这篇关于在聚合期间围绕匹配查询对数据进行分区的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆