Return limited number of records of a certain type, but unlimited number of other records?

Problem description

I have a query where I need to return 10 of "Type A" records, while returning all other records. How can I accomplish this?

Update: Admittedly, I could do this with two queries, but I wanted to avoid that, if possible, thinking it would be less overhead, and possibly more performant. My query already is an aggregation query that takes both kinds of records into account, I just need to limit the number of the one type of record in the results.

Update: the following is an example query that highlights the problem:

db.books.aggregate([
    {$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
    {$match:    {"type": "Fiction"}},
    {$project:  {
        'title': 1,
        'author': 1,
        'type': 1,
        'typeSortOrder': 
            {$add: [
                {$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
                {$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
                {$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
        ]},
    }},
    {$sort: {'typeSortOrder': 1}},
    {$limit: 10}
])

db.books.aggregate([
    {$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
    {$match:    {"type": "Horror"}},
    {$project:  {
        'title': 1,
        'author': 1,
        'type': 1,
        'typeSortOrder': 
            {$add: [
                {$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
                {$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
                {$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
        ]},
    }},
    {$sort: {'typeSortOrder': 1}},
    {$limit: 10}
])

db.books.aggregate([
    {$geoNear: {near: [-118.09771, 33.89244], distanceField: "distance", spherical: true}},
    {$match:    {"type": "Science"}},
    {$project:  {
        'title': 1,
        'author': 1,
        'type': 1,
        'typeSortOrder': 
            {$add: [
                {$cond: [{$eq: ['$type', "Fiction"]}, 1, 0]},
                {$cond: [{$eq: ['$type', "Science"]}, 0, 0]},
                {$cond: [{$eq: ['$type', "Horror"]}, 3, 0]}
        ]},
    }},
    {$sort: {'typeSortOrder': 1}},
    {$limit: 10}
])

I would like to have all these records returned in one query, but limit the type to at most 10 of any category. I realize that the typeSortOrder doesn't need to be conditional when the queries are broken out like this, I had it there for when the queries were one query, originally (which is where I would like to get back to).
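Since the three queries above differ only in their $match stage, a first step back toward a single query is one combined match stage; a minimal sketch (the $in list simply enumerates the three types from the example):

```javascript
// One $match stage covering all three types, so that the shared
// typeSortOrder projection is meaningful again in a single pipeline.
const matchStage = {
  "$match": { "type": { "$in": ["Fiction", "Science", "Horror"] } }
};
```

This still leaves the harder part: a trailing $limit applies to the whole result set, not per type.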

Answer

Problem


The results here are not impossible, but possibly impractical. As has been noted generally, you cannot "slice" an array or otherwise "limit" the number of results pushed onto one, and the method for doing this per "type" is essentially to use arrays.

The "impractical" part is usually about the number of results, where too large a result set is going to blow up the BSON document limit when "grouping". But, I'm going to consider this with some other recommendations on your "geo search" along with the ultimate goal to return 10 results of each "type" at most.

To first consider and understand the problem, let's look at a simplified "set" of data and the pipeline code necessary to return the "top 2 results" from each type:

{ "title": "Title 1", "author": "Author 1", "type": "Fiction", "distance": 1 },
{ "title": "Title 2", "author": "Author 2", "type": "Fiction", "distance": 2 },
{ "title": "Title 3", "author": "Author 3", "type": "Fiction", "distance": 3 },
{ "title": "Title 4", "author": "Author 4", "type": "Science", "distance": 1 },
{ "title": "Title 5", "author": "Author 5", "type": "Science", "distance": 2 },
{ "title": "Title 6", "author": "Author 6", "type": "Science", "distance": 3 },
{ "title": "Title 7", "author": "Author 7", "type": "Horror", "distance": 1 }

That's a simplified view of the data and somewhat representative of the state of documents after an initial query. Now comes the trick of how to use the aggregation pipeline to get the "nearest" two results for each "type":

db.books.aggregate([
    { "$sort": { "type": 1, "distance": 1 } },
    { "$group": {
        "_id": "$type",
        "1": { 
            "$first": {
                "_id": "$_id",
                "title": "$title",
                "author": "$author",
                "distance": "$distance"
            }
         },
         "books": {
             "$push": {
                "_id": "$_id",
                "title": "$title",
                "author": "$author",
                "distance": "$distance"
              }
         }
    }},
    { "$project": {
        "1": 1,
        "books": {
            "$cond": [
                { "$eq": [ { "$size": "$books" }, 1 ] },
                { "$literal": [false] },
                "$books"
            ]
        }
    }},
    { "$unwind": "$books" },
    { "$project": {
        "1": 1,
        "books": 1,
        "seen": { "$eq": [ "$1", "$books" ] }
    }},
    { "$sort": { "_id": 1, "seen": 1 } },
    { "$group": {
        "_id": "$_id",
        "1": { "$first": "$1" },
        "2": { "$first": "$books" },
        "books": {
            "$push": {
                "$cond": [ { "$not": "$seen" }, "$books", false ]
            }
        }
    }},
    { "$project": {
        "1": 1,
        "2": 2,
        "pos": { "$literal": [1,2] }
    }},
    { "$unwind": "$pos" },
    { "$group": {
        "_id": "$_id",
        "books": {
            "$push": {
                "$cond": [
                    { "$eq": [ "$pos", 1 ] },
                    "$1",
                    { "$cond": [
                        { "$eq": [ "$pos", 2 ] },
                        "$2",
                        false
                    ]}
                ]
            }
        }
    }},
    { "$unwind": "$books" },
    { "$match": { "books": { "$ne": false } } },
    { "$project": {
        "_id": "$books._id",
        "title": "$books.title",
        "author": "$books.author",
        "type": "$_id",
        "distance": "$books.distance",
        "sortOrder": {
            "$add": [
                { "$cond": [ { "$eq": [ "$_id", "Fiction" ] }, 1, 0 ] },
                { "$cond": [ { "$eq": [ "$_id", "Science" ] }, 0, 0 ] },
                { "$cond": [ { "$eq": [ "$_id", "Horror" ] }, 3, 0 ] }
            ]
        }
    }},
    { "$sort": { "sortOrder": 1 } }
])

Of course that is just two results, but it outlines the process for getting n results, which naturally is done in generated pipeline code. Before moving on to the code, the process deserves a walk-through.

After any query, the first thing to do here is $sort the results, which you basically want to do by both the "grouping key" (the "type") and the "distance", so that the "nearest" items are on top.

The reason for this is shown in the $group stages that will repeat. What is done is essentially "popping" the $first result off of each grouping stack, so that other documents are not lost; they are placed into an array using $push.

Just to be safe, the next stage is really only required after the "first step", but could optionally be added for similar filtering in the repetition. The main check here is that the resulting "array" is larger than just one item. Where it is not, the contents are replaced with a single value of false. The reason for which is about to become evident.

After this "first step" the real repetition cycle begins, where that array is then "de-normalized" with $unwind and then a $project is made in order to "match" the document that has been last "seen".

As only one of the documents will match this condition the results are again "sorted" in order to float the "unseen" documents to the top, while of course maintaining the grouping order. The next thing is similar to the first $group step, but where any kept positions are maintained and the "first unseen" document is "popped off the stack" again.

The document that was "seen" is then pushed back into the array, not as itself but as a value of false. This will not match the kept value, and it is generally the way to handle this without being "destructive" to the array contents, since you don't want the operation to fail should there not be enough matches to cover the n results required.

Cleaning up when complete, the next "projection" adds an array to the final documents now grouped by "type" representing each position in the n results required. When this array is unwound, the documents can again be grouped back together, but now all in a single array that possibly contains several false values but is n elements long.

Finally unwind the array again, use $match to filter out the false values, and project to the required document form.
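As a plain-JavaScript sanity check (a simulation only, not the aggregation itself), the "nearest two per type" outcome for the sample documents above can be computed directly:

```javascript
// Simulation: "nearest 2 per type" over the sample documents,
// mirroring what the aggregation pipeline above returns.
const books = [
  { title: "Title 1", author: "Author 1", type: "Fiction", distance: 1 },
  { title: "Title 2", author: "Author 2", type: "Fiction", distance: 2 },
  { title: "Title 3", author: "Author 3", type: "Fiction", distance: 3 },
  { title: "Title 4", author: "Author 4", type: "Science", distance: 1 },
  { title: "Title 5", author: "Author 5", type: "Science", distance: 2 },
  { title: "Title 6", author: "Author 6", type: "Science", distance: 3 },
  { title: "Title 7", author: "Author 7", type: "Horror", distance: 1 }
];

// Group by type, nearest first, then keep the first 2 of each group.
const grouped = {};
for (const b of [...books].sort((a, b) => a.distance - b.distance)) {
  (grouped[b.type] = grouped[b.type] || []).push(b);
}
const topTwo = Object.values(grouped).flatMap((arr) => arr.slice(0, 2));
// topTwo holds Titles 1, 2, 4, 5 and 7
```

Note the final ordering here just follows insertion order; the real pipeline orders the output explicitly with the sortOrder projection.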

The problem as stated earlier is with the number of results being filtered, as there is a real limit on the number of results that can be pushed into an array. That is mostly the BSON limit, but you also don't really want thousands of items even if they would still fit under that limit.

The trick here is keeping the initial "match" small enough that the "slicing operations" become practical. There are some things with the $geoNear pipeline process that can make this a possibility.

The obvious one is limit. By default this is 100, but you clearly want to have something in the range of:

(the number of categories you can possibly match) X ( required matches )

But if this is essentially a number not in the thousands, then there is already some help here.
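Concretely, with the three types from the example and 10 required matches per type, that formula works out as:

```javascript
// limit for $geoNear:
// (number of categories that can match) x (required matches per category)
const categories = ["Fiction", "Science", "Horror"];
const requiredPerCategory = 10;
const geoNearLimit = categories.length * requiredPerCategory; // 30
```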

The others are maxDistance and minDistance, where essentially you put upper and lower bounds on how "far out" to search. The max bound is the general limiter while the min bound is useful when "paging", which is the next helper.

When "upwardly paging", you can use the query argument in order to exclude the _id values of documents "already seen" using the $nin query. In much the same way, the minDistance can be populated with the "last seen" largest distance, or at least the smallest largest distance by "type". This allows some concept of filtering out things that have already been "seen" and getting another page.
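How those options might be combined for a next "page" is sketched below; the _id values and minimum distance are placeholders, not values from the question:

```javascript
// Hypothetical next-page $geoNear stage: skip documents already seen
// via $nin, and start from the smallest "last seen" distance by type.
const seenIds = ["id-1", "id-2"];   // placeholder: _ids already returned
const lastMinDistance = 2.5;        // placeholder: smallest largest distance seen

const nextPageGeoNear = {
  "$geoNear": {
    "near": [-118.09771, 33.89244],
    "distanceField": "distance",
    "spherical": true,
    "minDistance": lastMinDistance,
    "query": { "_id": { "$nin": seenIds } }
  }
};
```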

Really a topic in itself, but those are the general things to look for in reducing that initial match in order to make the process practical.

The general problem of returning "10 results at most, per type" is clearly going to want some code in order to generate the pipeline stages. No-one wants to type that out, and practically speaking you will probably want to change that number at some point.

So now to the code that can generate the monster pipeline. All the code is in JavaScript, but the principles are easy to translate:

var coords = [-118.09771, 33.89244];

var key = "$type";
var val = {
    "_id": "$_id",
    "title": "$title",
    "author": "$author",
    "distance": "$distance"
};
var maxLen = 10;

var stack = [];
var pipe = [];
var fproj = { "$project": { "pos": { "$literal": []  } } };

pipe.push({ "$geoNear": {
    "near": coords, 
    "distanceField": "distance", 
    "spherical": true
}});

pipe.push({ "$sort": {
    "type": 1, "distance": 1
}});

for ( var x = 1; x <= maxLen; x++ ) {

    fproj["$project"][""+x] = 1;
    fproj["$project"]["pos"]["$literal"].push( x );

    var rec = {
        "$cond": [ { "$eq": [ "$pos", x ] }, "$"+x ]
    };
    if ( stack.length == 0 ) {
        rec["$cond"].push( false );
    } else {
        var lval = stack.pop();
        rec["$cond"].push( lval );
    }

    stack.push( rec );

    if ( x == 1) {
        pipe.push({ "$group": {
           "_id": key,
           "1": { "$first": val },
           "books": { "$push": val }
        }});
        pipe.push({ "$project": {
           "1": 1,
           "books": {
               "$cond": [
                    { "$eq": [ { "$size": "$books" }, 1 ] },
                    { "$literal": [false] },
                    "$books"
               ]
           }
        }});
    } else {
        pipe.push({ "$unwind": "$books" });
        var proj = {
            "$project": {
                "books": 1
            }
        };

        proj["$project"]["seen"] = { "$eq": [ "$"+(x-1), "$books" ] };

        var grp = {
            "$group": {
                "_id": "$_id",
                "books": {
                    "$push": {
                        "$cond": [ { "$not": "$seen" }, "$books", false ]
                    }
                }
            }
        };

        for ( var n = x; n >= 1; n-- ) {
            if ( n != x ) 
                proj["$project"][""+n] = 1;
            grp["$group"][""+n] = ( n == x ) ? { "$first": "$books" } : { "$first": "$"+n };
        }

        pipe.push( proj );
        pipe.push({ "$sort": { "_id": 1, "seen": 1 } });
        pipe.push(grp);
    }
}

pipe.push(fproj);
pipe.push({ "$unwind": "$pos" });
pipe.push({
    "$group": {
        "_id": "$_id",
        "books": { "$push": stack[0] }
    }
});
pipe.push({ "$unwind": "$books" });
pipe.push({ "$match": { "books": { "$ne": false } }});
pipe.push({
    "$project": {
        "_id": "$books._id",
        "title": "$books.title",
        "author": "$books.author",
        "type": "$_id",
        "distance": "$books",
        "sortOrder": {
            "$add": [
                { "$cond": [ { "$eq": [ "$_id", "Fiction" ] }, 1, 0 ] },
                { "$cond": [ { "$eq": [ "$_id", "Science" ] }, 0, 0 ] },
                { "$cond": [ { "$eq": [ "$_id", "Horror" ] }, 3, 0 ] },
            ]
        }
    }
});
pipe.push({ "$sort": { "sortOrder": 1, "distance": 1 } });

Alternate


Of course the end result here and the general problem with all above is that you really only want the "top 10" of each "type" to return. The aggregation pipeline will do it, but at the cost of keeping more than 10 and then "popping off the stack" until 10 is reached.

An alternate approach is to "brute force" this with mapReduce and "globally scoped" variables. Not as nice, since the results are all in arrays, but it may be a practical approach:

db.collection.mapReduce(
    function () {

        if ( !stash.hasOwnProperty(this.type) ) {
            stash[this.type] = [];
        }

        if ( stash[this.type].length < maxLen ) {
            stash[this.type].push({
                "title": this.title,
                "author": this.author,
                "type": this.type,
                "distance": this.distance
            });
            emit( this.type, 1 );
        }

    },
    function(key,values) {
        return 1;   // really just want to keep the keys
    },
    { 
        "query": {
            "location": {
                "$nearSphere": [-118.09771, 33.89244]
            }
        },
        "scope": { "stash": {}, "maxLen": 10 },
        "finalize": function(key,value) {
            return { "msgs": stash[key] };                
        },
        "out": { "inline": 1 }
    }
)

This is a real cheat which just uses the "global scope" to keep a single object whose keys are the grouping keys. The results are pushed onto an array in that global object until the maximum length is reached. Results are already sorted by nearest, so the mapper just gives up doing anything with the current document once 10 are reached per key.

The reducer won't be called since only 1 document per key is emitted. The finalize then just "pulls" the value from the global and returns it in the result.

Simple, but of course you don't have all the $geoNear options if you really need them, and this form has the hard limit of 100 documents as the output from the initial query.
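The stash mechanics can be mirrored in plain JavaScript (a simulation of the mapper only, assuming the input is already sorted by nearest, as $nearSphere returns it):

```javascript
// Simulation of the mapper's logic: a single object keyed by type,
// keeping at most maxLen documents per key.
function stashByType(docs, maxLen) {
  const stash = {};
  for (const doc of docs) {
    if (!stash.hasOwnProperty(doc.type)) stash[doc.type] = [];
    if (stash[doc.type].length < maxLen) stash[doc.type].push(doc);
  }
  return stash;
}
```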
