MongoDB Map/Reduce Array aggregation question

Question

I have a MongoDB collection whose documents use several levels of nesting, and I would like to extract from it a multidimensional array compiled from a subset of their fields. I have a solution that works for me right now, but I want to better understand the concept of 'idempotency' and its consequences for the reduce function.

{
  "host_name" : "gateway",
  "service_description" : "PING",
  "last_update" : 1305777787,
  "performance_object" : [
    [ "rta", 0.105, "ms", 100, 500, 0 ],
    [ "pl", 0, "%", 20, 60, 0 ]
  ]
}

And here are the map/reduce functions:

var M = function() {
  var hn = this.host_name, 
      sv = this.service_description, 
      ts = this.last_update;
  this.performance_object.forEach(function(P){
    emit( { 
      host: hn, 
      service: sv, 
      metric: P[0] 
    }, { 
      time: ts, 
      value: P[1] 
    } );
  });
}
var R = function(key,values) {
  var result = { 
    time: [], 
    value: [] 
  };
  values.forEach(function(V){
    result.time.push(V.time);
    result.value.push(V.value);
  });
  return result;
}
db.runCommand({
  mapreduce: <colname>,
  out: <col2name>,
  map: M,
  reduce: R
});

The data is returned in a useful structure, which I reformat and sort with finalize for graphing.

{
  "_id" : {
    "host" : "localhost",
    "service" : "Disk Space",
    "metric" : "/var/bck"
  },
  "value" : {
    "time" : [
      [ 1306719302, 1306719601, 1306719903, ... ],
      [ 1306736404, 1306736703, 1306737002, ... ],
      [ 1306766401, 1306766701, 1306767001, ... ]
    ],
    "value" : [
      [ 122, 23423, 25654, ... ],
      [ 336114, 342511, 349067, ... ],
      [ 551196, 551196, 551196, ... ]
    ]
  }
}

Finally...

 [ [1306719302,122], [1306719601,23423], [1306719903,25654], ... ]

TL;DR: What is the expected behavior, given the observed "chunking" of the array results?

I understand that the reduce function may be called multiple times on arrays of emitted values, which is why there are several "chunks" of the complete arrays rather than a single array. The array chunks are typically 25-50 items long, and it's easy enough to clean this up in finalize(): I concat() the arrays, interleave them as [time,value] pairs, and sort. But what I really want to know is whether this can get more complex:

1) Is the chunking observed because of my code, MongoDB's implementation, or the Map/Reduce algorithm itself?

2) Will there ever be deeper (recursive) nesting of array chunks in sharded configurations, or even just because of my hasty implementation? This would break the concat() method.

3) Is there simply a better strategy for getting array results as shown above?

I took Thomas' advice and rewrote it to emit arrays. It absolutely doesn't make any sense to split up the values.

var M = function() {
  var hn = this.host_name, 
      sv = this.service_description, 
      ts = this.last_update;
  this.performance_object.forEach(function(P){
    emit( { 
      host: hn, 
      service: sv, 
      metric: P[0] 
    }, { 
      value: [ ts, P[1] ] 
    } );
  });
}
var R = function(key,values) {
  var result = {
    value: [] 
  };
  values.forEach(function(V){
    result.value.push(V.value);
  });
  return result;
}
db.runCommand({
  mapreduce: <colname>,
  out: <col2name>,
  map: M,
  reduce: R
});

Now the output is similar to this:

{
  "_id" : {
    "host" : "localhost",
    "service" : "Disk Space",
    "metric" : "/var/bck"
  },
  "value" : {
    "value" : [
      [ [1306736404,336114],[1306736703,342511],[1306737002,349067], ... ],
      [ [1306766401,551196],[1306766701,551196],[1306767001,551196], ... ],
      [ [1306719302,122],[1306719601,122],[1306719903,122], ... ]
    ]
  }
}

And I used this finalize function to concatenate the array chunks and sort them.

...
var F = function(key,values) {
  return (Array.concat.apply([],values.value)).sort(function(a,b){ 
    if (a[0] < b[0]) return -1;
    if (a[0] > b[0]) return 1;
    return 0;
  });
}
db.runCommand({
  mapreduce: <colname>,
  out: <col2name>,
  map: M,
  reduce: R,
  finalize: F
});

Which works great:

{
  "_id" : {
    "host" : "localhost",
    "service" : "Disk Space",
    "metric" : "/mnt/bck"
  },
  "value" : [ [1306719302,122],[1306719601,122],[1306719903,122], ... ]
}

I guess the only question that's gnawing at me is whether this Array.concat.apply([],values.value) can be trusted to clean up the output of reduce all of the time.

I have modified the document structure since the original example given above, but this only changes the example by making the map function really simple.

I'm still trying to wrap my brain around why Array.prototype.push.apply(result, V.data) works so differently from result.push(V.data)... but it works.

var M = function() {
  emit( { 
    host: this.host, 
    service: this.service, 
    metric: this.metric
  } , { 
    data: [ [ this.timestamp, this.data ] ] 
  } );
}
var R = function(key,values) {
  var result = [];
  values.forEach(function(V){
    Array.prototype.push.apply(result, V.data);
  });
  return { data: result };
}
var F = function(key,values) {
  return values.data.sort(function(a,b){
    return (a[0]<b[0]) ? -1 : (a[0]>b[0]) ? 1 : 0;
  });
}

It produces the same output as the finalized result shown above.
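On the difference between Array.prototype.push.apply(result, V.data) and result.push(V.data) mentioned above: a minimal sketch in plain JavaScript, outside MongoDB, may help. The variable names (chunk, nested, flat) are made up for illustration. push() appends each of its arguments as one element, while apply() spreads an array into separate arguments, so the first form nests the chunk and the second appends its items.

```javascript
// A chunk of [time, value] pairs, as a previous reduce call might return in V.data.
var chunk = [[1306719302, 122], [1306719601, 122]];

// push(chunk) passes ONE argument, so the whole array becomes a single element.
var nested = [];
nested.push(chunk);

// push.apply(flat, chunk) is equivalent to flat.push(chunk[0], chunk[1]):
// each pair is appended as its own element.
var flat = [];
Array.prototype.push.apply(flat, chunk);

console.log(nested.length); // 1
console.log(flat.length);   // 2
```

This is why the reduce function above yields one flat list of pairs no matter how many partial results it is fed.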

Thanks, Thomas!

Recommended answer

  1. The "chunking" comes from your code: your reduce function's values parameter can contain either {time:<timestamp>,value:<value>} objects emitted from your map function, or {time:[<timestamps>],value:[<values>]} objects returned from a previous call to your reduce function.

I don't know if it will happen in practice, but it can happen in theory.
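To illustrate the point, here is a minimal sketch that runs the question's original reduce function in plain JavaScript, outside MongoDB. The split into two partial batches and the key "k" are assumptions made for the demonstration; how the server actually groups values is up to MongoDB.

```javascript
// The reduce function from the question: each V.time / V.value is pushed as ONE element.
function reduceOriginal(key, values) {
  var result = { time: [], value: [] };
  values.forEach(function (V) {
    result.time.push(V.time);
    result.value.push(V.value);
  });
  return result;
}

// Values as the map function would emit them for a single key.
var emitted = [
  { time: 1306719302, value: 122 },
  { time: 1306719601, value: 23423 },
  { time: 1306736404, value: 336114 },
  { time: 1306736703, value: 342511 }
];

// Hypothetical batching: two partial reduces, then a reduce over the partial results.
var partialA = reduceOriginal("k", emitted.slice(0, 2));
var partialB = reduceOriginal("k", emitted.slice(2));
var rereduced = reduceOriginal("k", [partialA, partialB]);
console.log(JSON.stringify(rereduced.time));
// [[1306719302,1306719601],[1306736404,1306736703]]

// Feeding that result into yet another reduce call nests one level deeper.
var deeper = reduceOriginal("k", [rereduced, { time: 1306766401, value: 551196 }]);
console.log(JSON.stringify(deeper.time));
// [[[1306719302,1306719601],[1306736404,1306736703]],1306766401]
```

The second result mixes nesting depths, which a single concat() pass in finalize would not flatten. Whether a given deployment ever produces that call pattern is not something this sketch can show.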

Simply have your map function emit the same kind of objects that your reduce function returns, i.e. emit(<id>, {time: [ts], value: [P[1]]}), and change your reduce function accordingly, i.e. Array.prototype.push.apply(result.time, V.time) and similarly for result.value.
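A minimal sketch of that suggestion, again in plain JavaScript outside MongoDB. The helper emittedValue and the name reduceFlat are hypothetical, introduced only to mimic what the map function would emit and to keep the example self-contained. The check at the end compares reducing everything in one call against reducing in two steps.

```javascript
// Shape the map function would emit for one measurement: single-element arrays.
function emittedValue(ts, v) {
  return { time: [ts], value: [v] };
}

// Reduce that appends the ITEMS of each incoming array, so its output has the
// same shape as its input and can safely be fed back into itself.
function reduceFlat(key, values) {
  var result = { time: [], value: [] };
  values.forEach(function (V) {
    Array.prototype.push.apply(result.time, V.time);
    Array.prototype.push.apply(result.value, V.value);
  });
  return result;
}

var vals = [
  emittedValue(1306719302, 122),
  emittedValue(1306719601, 23423),
  emittedValue(1306719903, 25654)
];

// One call over all values versus a partial reduce followed by a re-reduce.
var allAtOnce = reduceFlat("k", vals);
var inTwoSteps = reduceFlat("k", [reduceFlat("k", vals.slice(0, 2)), vals[2]]);

console.log(JSON.stringify(allAtOnce) === JSON.stringify(inTwoSteps)); // true
```

Because both paths give the same flat arrays, no chunk cleanup is needed afterwards; sorting may still be wanted, since nothing here guarantees the order in which values arrive.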

Well, I actually don't understand why you're not using an array of time/value pairs instead of a pair of arrays, i.e. emit(<id>, { pairs: [ {time: ts, value: P[1]} ] }) or emit(<id>, { pairs: [ [ts, P[1]] ] }) in the map function, and Array.prototype.push.apply(result.pairs, V.pairs) in the reduce function. That way, you won't even need the finalize function (except maybe to "unwrap" the array from the pairs property: because the reduce function cannot return an array, you have to wrap it in an object that way).
