MongoDB中的子查询 [英] Sub-query in MongoDB

查看:2348
本文介绍了MongoDB中的子查询的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在MongoDB中有两个集合,一个与用户有关,一个与操作有关.用户大致像:

I have two collections in MongoDB, one with users and one with actions. Users look roughly like:

{_id: ObjectId("xxxxx"), country: "UK",...} 

和类似的动作

{_id: ObjectId("yyyyy"), createdAt: ISODate(), user: ObjectId("xxxxx"),...}

我正在尝试统计事件和按国家/地区划分的不同用户.前半部分工作正常,但是当我尝试添加子查询以提取国家/地区时,我只会得到国家/地区的空值

I am trying to count events and distinct users split by country. The first half of which is working fine, however when I try to add in a sub-query to pull the country I only get nulls out for country

db.events.aggregate({
    $match: {
        createdAt: { $gte: ISODate("2013-01-01T00:00:00Z") },
        user: { $exists: true }
    }
},
{
    $group: {
        _id: {
            year: { $year: "$createdAt" },
            user_obj: "$user"
        },
        count: { $sum: 1 }
    }
},
{
    $group: {
        _id: {
            year: "$_id.year",
            country: db.users.findOne({ 
                _id: { $eq: "$_id.user_obj" },
                country: { $exists: true } 
            }).country
        },
        total: { $sum: "$count" },
        distinct: { $sum: 1 }
    }
})

推荐答案

没有人加入,只有我们负担


因此MongoDB不做连接".例如,您可能在shell中尝试过类似的操作:

No Joins in here, just us bears


So MongoDB "does not do joins". You might have tried something like this in the shell for example:

db.events.find().forEach(function(event) {
    event.user = db.user.findOne({ "_id": eventUser });
    printjson(event)
})

但是,这并不像您认为的那样起作用.它实际上按照它的样子进行操作,并在用户"集合上对从事件"集合返回的每个项目(往返"客户端")返回的所有查询运行一个查询,并且该查询不在服务器上运行.

But this does not do what you seem to think it does. It actually does exactly what it looks like and, runs a query on the "user" collection for every item that is returned from the "events" collection, both "to and from" the "client" and is not run on the server.

出于同样的原因,您在聚合管道中的嵌入"语句也不能那样工作.与上述不同,整个管道"逻辑在执行之前被发送到服务器.因此,如果您这样做是为了选择英国"用户:

For the same reasons your 'embedded' statement within an aggregation pipeline does not work like that. Unlike the above the "whole pipeline" logic is sent to the server before execution. So if you did something like this to 'select "UK" users:

db.events.aggregate([
    { "$match": {
        "user": { 
            "$in": db.users.distinct("_id",{ "country": "UK" })
        }
    }}
])

然后,该.distinct()查询实际上是在客户端"而不是服务器上评估的,因此对聚合管道中的任何文档值均不可用.因此,.distinct()首先运行,将其数组作为参数返回,然后将整个管道发送到服务器.这就是执行顺序.

Then that .distinct() query is actually evaluated on the "client" and not the server and therefore not having availability to any document values in the aggregation pipeline. So the .distinct() runs first, returns it's array as an argument and then the whole pipeline is sent to the server. That is the order of execution.

对于要运行的查询类型,您至少需要某种程度的反规范化.因此,您通常有两种选择:

You need at least some level of de-normalization for the sort of query you want to run to work. So you generally have two choices:

  1. 将整个用户对象数据嵌入事件数据中.

  1. Embed your whole user object data within the event data.

至少将一些"用户对象数据嵌入事件数据中.在这种情况下,国家"是因为您将要使用它.

At least embed "some" of the user object data within the event data. In this case "country" becasue you are going to use it.

因此,如果您在此遵循第二种"情况,并且至少将现有数据至少扩展"一下,以包括国家",如下所示:

So then if you follow the "second" case there and at least "extend" your existing data a little to include the "country" like this:

{
    "_id": ObjectId("yyyyy"), 
    "createdAt": ISODate(), 
    "user": {
        "_id": ObjectId("xxxxx"),
        "country": "UK"
    }
}

然后,聚合"过程变得简单:

Then the "aggregation" process becomes simple:

db.events.aggregate([
    { "$match": {
        "createdAt": { "$gte": ISODate("2013-01-01T00:00:00Z") },
        "user": { "$exists": true }
    }},
    { "$group": {
        "_id": {
            "year": { "$year": "$createdAt" },
            "user_id": "$user._id"
            "country": "$user.country"
        },
        "count": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.country",
        "total": { "$sum": "$count" },
        "distinct": { "$sum": 1 }
    }}
])

我们不正常


修复数据以将其所需的信息包含在我们不进行联接"的单个集合中是一个相对简单的过程.确实是上面原始查询示例的一个变体:

We're not normal


Fixing your data to include the information it needs on a single collection where we "do not do joins" is a relatively simple process. Just really a variant on the original query sample above:

var bulk = db.events.intitializeUnorderedBulkOp(),
    count = 0;

db.users.find().forEach(function(user) {
    // update multiple events for user
    bulk.find({ "user": user._id }).update({
        "$set": { "user": { "_id": user._id, "country": user.country } }
    });
    count++;

    // Send batch every 1000 
    if ( count % 1000 == 0 ) {
        bulk.execute();
        bulk = db.events.intitializeUnorderedBulkOp();
    }
});

// Clear any queued
if ( count % 1000 != 0 )
    bulk.execute();

这就是全部内容.对MongoDB服务器的单个查询可使用一个集合"和仅一个集合".甚至上面显示的出色的批量操作"仍然只能分批"到一个集合中.

So that's what it's all about. Individual queries to a MongoDB server get "one collection" and "one collection only" to work with. Even the fantastic "Bulk Operations" as shown above can still only be "batched" on a single collection.

如果要执行聚合相关属性"之类的操作,则必须"将这些属性包含在要为其聚合数据的集合中.将数据放在不同的集合中是完全可以的,例如,用户"通常会附加比"_id"和国家/地区"更多的信息.

If you want to do things like "aggregate on related properties", then you "must" contain those properties in the collection you are aggregating data for. It is perfectly okay to live with having data sitting in separate collections, as for instance "users" would generally have more information attached to them than just and "_id" and a "country".

但是这里的要点是,如果您需要国家"来分析用户"对事件"数据的分析,然后也将其包括在数据中.最有效的服务器联接是预联接",这是通常在此处实践中的理论.

But the point here is if you need "country" for analysis of "event" data by "user", then include it in the data as well. The most efficient server join is a "pre-join", which is the theory in practice here in general.

这篇关于MongoDB中的子查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆