MongoDB Calculate Values from Two Arrays, Sort and Limit

Problem Description

I have a MongoDB database storing float arrays. Assume a collection of documents in the following format:

{
    "id" : 0,
    "vals" : [ 0.8, 0.2, 0.5 ]
}

Given a query array, e.g., with values [ 0.1, 0.3, 0.4 ], I would like to compute a distance for every document in the collection (e.g., the sum of absolute differences; for the given document and query it would be abs(0.8 - 0.1) + abs(0.2 - 0.3) + abs(0.5 - 0.4) = 0.9).
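
To make the distance concrete, this is what I would compute in plain JavaScript on the client side (illustration only; the variable names are just for this example):

var query = [ 0.1, 0.3, 0.4 ];
var doc = { "vals": [ 0.8, 0.2, 0.5 ] };

// sum of the absolute element-wise differences
var distance = doc.vals.reduce(function(sum, el, idx) {
    return sum + Math.abs(el - query[idx]);
}, 0);
// distance is 0.9, up to floating-point rounding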

I tried to use the aggregation function of MongoDB to achieve this, but I can't work out how to iterate over the array. (I am not using the built-in geo operations of MongoDB, as the arrays can be rather long)

I also need to sort the results and limit to the top 100, so calculation after reading the data is not desired.

Solution

Current Processing is mapReduce

If you need to execute this on the server, sort the results, and keep only the top 100, then you could use mapReduce like so:

db.test.mapReduce(
    function() {
        // the query array to compare against
        var input = [0.1,0.3,0.4];

        // sum of absolute differences between this document's array and the input
        var value = Array.sum(this.vals.map(function(el,idx) {
            return Math.abs( el - input[idx] );
        }));

        // emit under a single key so every result reaches the same reducer,
        // and wrap it in an array so reduced output keeps the same shape
        emit(null,{ "output": [{ "_id": this._id, "value": value }]});
    },
    function(key,values) {
        var output = [];

        // flatten the emitted arrays into one working list
        values.forEach(function(value) {
            value.output.forEach(function(item) {
                output.push(item);
            });
        });

        // sort descending by computed distance; the comparator must return a number
        output.sort(function(a,b) {
            return b.value - a.value;
        });

        // keep only the top 100
        return { "output": output.slice(0,100) };
    },
    { "out": { "inline": 1 } }
)

So the mapper function does the calculation and outputs everything under the same key, so all results are sent to the reducer. The end output is going to be contained in an array in a single output document, so it is important both that all results are emitted with the same key value and that the value of each emit is itself an array, so that mapReduce can work properly.

The sorting and reduction are done in the reducer itself: as each emitted document is inspected, its elements are put into a single temporary array, sorted, and the top results are returned.

That is important, and it is exactly why the emitter produces the value as an array, even if it only holds a single element at first. mapReduce works by processing results in "chunks", so even if all emitted documents have the same key, they are not all processed at once. Rather, the reducer puts its results back into the queue of emitted results to be reduced, until there is only a single document left for that particular key.
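
To illustrate why the shapes must match, here is a standalone sketch in plain JavaScript (outside MongoDB, with made-up values) of a previously reduced "chunk" being fed back through the same reducer:

// the same reducer logic as above, written as a plain function
var reduceFn = function(key, values) {
    var output = [];
    values.forEach(function(value) {
        value.output.forEach(function(item) { output.push(item); });
    });
    output.sort(function(a, b) { return b.value - a.value; });
    return { "output": output.slice(0, 100) };
};

// hypothetical emitted values, reduced in two separate "chunks"
var chunkA = reduceFn(null, [
    { "output": [ { "_id": 1, "value": 0.9 } ] },
    { "output": [ { "_id": 2, "value": 2.1 } ] }
]);
var chunkB = reduceFn(null, [ { "output": [ { "_id": 3, "value": 1.4 } ] } ]);

// a reduced result re-enters the queue with exactly the same shape as an emit
var finalResult = reduceFn(null, [ chunkA, chunkB ]);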

I'm restricting the "slice" output here to 10 for brevity of listing, and including the stats to make a point, since the 100 reduce cycles called on this 10,000-document sample can be seen:

{
    "results" : [
        {
            "_id" : null,
            "value" : {
                "output" : [
                    {
                        "_id" : ObjectId("56558d93138303848b496cd4"),
                        "value" : 2.2
                    },
                    {
                        "_id" : ObjectId("56558d96138303848b49906e"),
                        "value" : 2.2
                    },
                    {
                        "_id" : ObjectId("56558d93138303848b496d9a"),
                        "value" : 2.1
                    },
                    {
                        "_id" : ObjectId("56558d93138303848b496ef2"),
                        "value" : 2.1
                    },
                    {
                        "_id" : ObjectId("56558d94138303848b497861"),
                        "value" : 2.1
                    },
                    {
                        "_id" : ObjectId("56558d94138303848b497b58"),
                        "value" : 2.1
                    },
                    {
                        "_id" : ObjectId("56558d94138303848b497ba5"),
                        "value" : 2.1
                    },
                    {
                        "_id" : ObjectId("56558d94138303848b497c43"),
                        "value" : 2.1
                    },
                    {
                        "_id" : ObjectId("56558d95138303848b49842b"),
                        "value" : 2.1
                    },
                    {
                        "_id" : ObjectId("56558d96138303848b498db4"),
                        "value" : 2.1
                    }
                ]
            }
        }
    ],
    "timeMillis" : 1758,
    "counts" : {
            "input" : 10000,
            "emit" : 10000,
            "reduce" : 100,
            "output" : 1
    },
    "ok" : 1
}

So this is a single document output, in the specific mapReduce format, where the "value" field contains an element which is an array of the sorted and limited results.
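
If you want to work with that array directly in the shell, one possible way is to capture the inline result and unwrap it. This is only a sketch; mapper and reducer are my own names for the two functions shown above:

var res = db.test.mapReduce(mapper, reducer, { "out": { "inline": 1 } });

// the sorted and limited array sits inside the single result document
var top = res.results[0].value.output;
top.forEach(function(doc) { printjson(doc); });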

Future Processing is Aggregate

At the time of writing, the latest stable release of MongoDB is 3.0, which lacks the functionality to make your operation possible. But the upcoming 3.2 release introduces new operators that make it possible:

db.test.aggregate([
    // unwind the array, keeping each element's original position in "index"
    { "$unwind": { "path": "$vals", "includeArrayIndex": "index" }},
    // re-group per document, summing the absolute difference for each element
    { "$group": {
        "_id": "$_id",
        "result": {
            "$sum": {
                "$abs": {
                    "$subtract": [ 
                        "$vals", 
                        // look up the matching element of the input array by index
                        { "$arrayElemAt": [ { "$literal": [0.1,0.3,0.4] }, "$index" ] } 
                    ]
                }
            }
        }
    }},
    // sort by computed distance (descending) and keep the top 100
    { "$sort": { "result": -1 } },
    { "$limit": 100 }
])

Limiting again to the same 10 results for brevity, you get output like this:

{ "_id" : ObjectId("56558d96138303848b49906e"), "result" : 2.2 }
{ "_id" : ObjectId("56558d93138303848b496cd4"), "result" : 2.2 }
{ "_id" : ObjectId("56558d96138303848b498e31"), "result" : 2.1 }
{ "_id" : ObjectId("56558d94138303848b497c43"), "result" : 2.1 }
{ "_id" : ObjectId("56558d94138303848b497861"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b499037"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b498db4"), "result" : 2.1 }
{ "_id" : ObjectId("56558d93138303848b496ef2"), "result" : 2.1 }
{ "_id" : ObjectId("56558d93138303848b496d9a"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b499182"), "result" : 2.1 }

This is made possible largely because $unwind has been modified to project a field into the results that contains the array index, and also because of $arrayElemAt, a new operator that can extract a single array element at a provided index.
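
To see what the modified $unwind emits, a quick check against a single document might look like the following sketch, where the $limit stage is only there for illustration:

db.test.aggregate([
    { "$limit": 1 },
    { "$unwind": { "path": "$vals", "includeArrayIndex": "index" } }
])

For a document with "vals" of [ 0.8, 0.2, 0.5 ] this yields three documents of the form { "_id": ..., "vals": 0.8, "index": NumberLong(0) }, where "index" holds the original array position that $arrayElemAt then consumes.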

This allows values to be "looked up" by index position from your input array, in order to apply the math to each element. The input array is wrapped in the existing $literal operator so that $arrayElemAt does not complain and recognizes it as an array (this seems to be a small bug at present, as other array functions don't have a problem with direct input), and the appropriate matching element is found by using the "index" field produced by $unwind for comparison.

The math is done by $subtract, together with another new operator in $abs, to match your required function. And since it was necessary to unwind the array in the first place, all of this is done inside a $group stage that accumulates all array members per document and adds the entries together via the $sum accumulator.
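
Traced through for the sample document from the question, the per-element work inside that $group stage is:

vals 0.8, index 0  ->  $arrayElemAt returns 0.1  ->  $abs of $subtract gives 0.7
vals 0.2, index 1  ->  $arrayElemAt returns 0.3  ->  $abs of $subtract gives 0.1
vals 0.5, index 2  ->  $arrayElemAt returns 0.4  ->  $abs of $subtract gives 0.1

The $sum accumulator then adds these to 0.9 for that _id, matching the expected result from the question.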

Finally all result documents are processed with $sort and then the $limit is applied to just return the top results.

Summary

Even with the new functionality about to become available to the aggregation framework for MongoDB, it is debatable which approach is actually more efficient for results. This is largely because there is still a need to $unwind the array content, which effectively produces a copy of each document per array member in the pipeline to be processed, and that generally causes overhead.

So whilst mapReduce is the only way to do this until the new release, it may actually outperform the aggregation statement depending on the amount of data to be processed, despite the fact that the aggregation framework runs on natively coded operators rather than translated JavaScript operations.

As with all things, testing is always recommended to see which case suits your purposes better and which gives the best performance for your expected processing.
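
A crude way to compare the two in the shell could look like the sketch below. It assumes mapper, reducer and pipeline hold the map function, reduce function and aggregation pipeline shown earlier (those variable names are mine), and the numbers it prints are only indicative:

function timeIt(label, fn) {
    var start = new Date();
    fn();
    print(label + ": " + (new Date() - start) + " ms");
}

timeIt("mapReduce", function() {
    db.test.mapReduce(mapper, reducer, { "out": { "inline": 1 } });
});

timeIt("aggregate", function() {
    db.test.aggregate(pipeline).toArray();
});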


Sample

Of course the expected result for the sample document provided in the question is 0.9 by the math applied. But purely for my own testing, here is a short listing used to generate some sample data, so I could at least verify that the mapReduce code was working as it should:

var bulk = db.test.initializeUnorderedBulkOp();

var x = 10000;

while ( x-- ) {
    // three random values, each rounded to a single decimal place
    var vals = [0,0,0].map(function() {
        return Math.round(Math.random()*10)/10;
    });

    bulk.insert({ "vals": vals });

    // flush a batch of 1000 inserts at a time
    if ( x % 1000 == 0) {
        bulk.execute();
        bulk = db.test.initializeUnorderedBulkOp();
    }
}

The arrays are made of totally random values rounded to a single decimal place, so there is not a lot of distribution in the results I listed as sample output.
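
As a quick sanity check on the generated data (just a sketch), you can confirm the document count and peek at a few of the random arrays before running either approach:

db.test.count()          // expect 10000
db.test.find().limit(3)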
