MongoDB Map/Reduce Array aggregation question

Problem description

I have a MongoDB collection, whose docs use several levels of nesting, from which I would like to extract a multidimensional array compiled from a subset of their fields. I have a solution that works for me right now, but I want to better understand the concept of 'idempotency' and its consequences for the reduce function.

{
  "host_name" : "gateway",
  "service_description" : "PING",
  "last_update" : 1305777787,
  "performance_object" : [
    [ "rta", 0.105, "ms", 100, 500, 0 ],
    [ "pl", 0, "%", 20, 60, 0 ]
  ]
}

And here are the map/reduce functions:

var M = function() {
  var hn = this.host_name, 
      sv = this.service_description, 
      ts = this.last_update;
  this.performance_object.forEach(function(P){
    emit( { 
      host: hn, 
      service: sv, 
      metric: P[0] 
    }, { 
      time: ts, 
      value: P[1] 
    } );
  });
}
var R = function(key,values) {
  var result = { 
    time: [], 
    value: [] 
  };
  values.forEach(function(V){
    result.time.push(V.time);
    result.value.push(V.value);
  });
  return result;
}
db.runCommand({
  mapreduce: <colname>,
  out: <col2name>,
  map: M,
  reduce: R
});

Data is returned in a useful structure, which I reformat/sort with finalize for graphing.

{
  "_id" : {
    "host" : "localhost",
    "service" : "Disk Space",
    "metric" : "/var/bck"
  },
  "value" : {
    "time" : [
      [ 1306719302, 1306719601, 1306719903, ... ],
      [ 1306736404, 1306736703, 1306737002, ... ],
      [ 1306766401, 1306766701, 1306767001, ... ]
    ],
    "value" : [
      [ 122, 23423, 25654, ... ],
      [ 336114, 342511, 349067, ... ],
      [ 551196, 551196, 551196, ... ]
    ]
  }
}

Finally...

 [ [1306719302,122], [1306719601,23423], [1306719903,25654], ... ]

TL;DR: What is the expected behavior when the array results come out in "chunks"?

I understand that the reduce function may be called multiple times on array(s) of emitted values, which is why there are several "chunks" of the complete arrays rather than a single array. The chunks are typically 25-50 items each, and it's easy enough to clean this up in finalize(): I concat() the arrays, interleave them as [time,value] and sort. But what I really want to know is whether this can get more complex:

1) Is the chunking observed because of my code, MongoDB's implementation or the Map/Reduce algorithm itself?

2) Will there ever be deeper (recursive) nesting of array chunks in sharded configurations, or even just because of my hasty implementation? This would break the concat() method.

3) Is there simply a better strategy for getting array results as shown above?

I took Thomas' advice and rewrote it to emit arrays. It absolutely doesn't make any sense to split up the values.

var M = function() {
  var hn = this.host_name, 
      sv = this.service_description, 
      ts = this.last_update;
  this.performance_object.forEach(function(P){
    emit( { 
      host: hn, 
      service: sv, 
      metric: P[0] 
    }, { 
      value: [ ts, P[1] ] 
    } );
  });
}
var R = function(key,values) {
  var result = {
    value: [] 
  };
  values.forEach(function(V){
    result.value.push(V.value);
  });
  return result;
}
db.runCommand({
  mapreduce: <colname>,
  out: <col2name>,
  map: M,
  reduce: R
});

Now the output looks like this:

{
  "_id" : {
    "host" : "localhost",
    "service" : "Disk Space",
    "metric" : "/var/bck"
  },
  "value" : {
    "value" : [
      [ [1306736404,336114],[1306736703,342511],[1306737002,349067], ... ],
      [ [1306766401,551196],[1306766701,551196],[1306767001,551196], ... ],
      [ [1306719302,122],[1306719601,122],[1306719903,122], ... ]
    ]
  }
}

And I used this finalize function to concatenate the array chunks and sort them.

...
var F = function(key,values) {
  return (Array.concat.apply([],values.value)).sort(function(a,b){ 
    if (a[0] < b[0]) return -1;
    if (a[0] > b[0]) return 1;
    return 0;
  });
}
db.runCommand({
  mapreduce: <colname>,
  out: <col2name>,
  map: M,
  reduce: R,
  finalize: F
});

Which works nicely:

{
  "_id" : {
    "host" : "localhost",
    "service" : "Disk Space",
    "metric" : "/mnt/bck"
  },
  "value" : [ [1306719302,122],[1306719601,122],[1306719903,122], ... ]
}

I guess the only question that's gnawing at me is whether this Array.concat.apply([],values.value) can be trusted to clean up the output of reduce all of the time.
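One way to probe this is to run the flattening step on hand-built inputs outside MongoDB. The sketch below uses the standard `Array.prototype.concat.apply` instead of the `Array.concat` generic, which is a non-standard SpiderMonkey extension; the helper name `flattenOneLevel` and the sample data are made up for illustration. It suggests that the call removes exactly one level of nesting, so it is only safe when every element of `values.value` is itself an array of pairs.

```javascript
// Hypothetical helper: flatten one level, as the finalize function above does.
function flattenOneLevel(chunks) {
  return Array.prototype.concat.apply([], chunks);
}

// Every element is a chunk (an array of [time, value] pairs): works as intended.
var uniform = [
  [[1, 10], [2, 20]],
  [[3, 30]]
];
var flatUniform = flattenOneLevel(uniform);

// A chunk from an earlier reduce call mixed with a raw pair emitted by map:
// the raw pair is itself flattened into bare numbers.
var mixed = [
  [[1, 10], [2, 20]],
  [3, 30]
];
var flatMixed = flattenOneLevel(mixed);

console.log(JSON.stringify(flatUniform)); // [[1,10],[2,20],[3,30]]
console.log(JSON.stringify(flatMixed));   // [[1,10],[2,20],3,30]
```

The second case is the one to worry about: a reduce that pushes raw pairs and earlier chunks into the same array can produce it.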

I have modified the document structure since the original example given above, but this only changes the example by making the map function really simple.

I'm still trying to wrap my brain around why Array.prototype.push.apply(result, V.data) works so differently from result.push(V.data)... but it works.

var M = function() {
  emit( { 
    host: this.host, 
    service: this.service, 
    metric: this.metric
  } , { 
    data: [ [ this.timestamp, this.data ] ] 
  } );
}
var R = function(key,values) {
  var result = [];
  values.forEach(function(V){
    Array.prototype.push.apply(result, V.data);
  });
  return { data: result };
}
var F = function(key,values) {
  return values.data.sort(function(a,b){
    return (a[0]<b[0]) ? -1 : (a[0]>b[0]) ? 1 : 0;
  });
}

It produces the same output as the finalized result shown above.
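The difference between the two calls can be seen in isolation. `result.push(x)` appends `x` as one element, whereas `Array.prototype.push.apply(result, x)` calls `push` with the elements of `x` passed as separate arguments, so each one is appended individually. A minimal sketch with made-up data:

```javascript
var chunk = [[1, 10], [2, 20]];

// push(chunk): the whole array becomes a single element (one more nesting level).
var nested = [];
nested.push(chunk);

// push.apply(flat, chunk): equivalent to flat.push(chunk[0], chunk[1]).
var flat = [];
Array.prototype.push.apply(flat, chunk);

console.log(nested.length); // 1
console.log(flat.length);   // 2
console.log(JSON.stringify(nested)); // [[[1,10],[2,20]]]
console.log(JSON.stringify(flat));   // [[1,10],[2,20]]
```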

Thanks Thomas!

Recommended answer

  1. The "chunking" comes from your code: your reduce function's values parameter can contain either {time:<timestamp>,value:<value>} emitted from your map function, or {time:[<timestamps>],value:[<values>]} returned from a previous call to your reduce function.

  2. I don't know if it will happen in practice, but it can happen in theory.
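The theoretical case can be reproduced by calling the question's first reduce function by hand, outside MongoDB. The sketch below copies `R` from the question and feeds the result of one call back in as an input to the next, which is something MongoDB is allowed to do; the key and sample values are made up.

```javascript
// R as written in the question's first version.
var R = function (key, values) {
  var result = { time: [], value: [] };
  values.forEach(function (V) {
    result.time.push(V.time);
    result.value.push(V.value);
  });
  return result;
};

var key = { host: "gateway", service: "PING", metric: "rta" };

// First pass: three raw values emitted by map.
var first = R(key, [
  { time: 1, value: 0.1 },
  { time: 2, value: 0.2 },
  { time: 3, value: 0.3 }
]);

// Re-reduce: an earlier result combined with one more raw value.
var second = R(key, [first, { time: 4, value: 0.4 }]);

// Re-reduce again: the nesting grows one level deeper.
var third = R(key, [second, { time: 5, value: 0.5 }]);

console.log(JSON.stringify(first.time));  // [1,2,3]
console.log(JSON.stringify(second.time)); // [[1,2,3],4]
console.log(JSON.stringify(third.time));  // [[[1,2,3],4],5]
```

Each extra pass adds a level, so a single concat() in finalize would not be enough in that situation.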

  3. Simply have your map function emit the same kind of objects that your reduce function returns, i.e. emit(<id>, {time: [ts], value: [P[1]]}), and change your reduce function accordingly, i.e. Array.push.apply(result.time, V.time) and similarly for result.value.
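A minimal sketch of that change, runnable outside MongoDB. It uses `Array.prototype.push.apply`, which is standard JavaScript, where the text writes `Array.push.apply`; the key and sample data are assumptions for illustration. Because the inputs and the output of `R` now have the same shape, reducing in one pass or in several should give the same result.

```javascript
// Reduce whose output has the same shape as the values emitted by map:
// { time: [...], value: [...] }.
var R = function (key, values) {
  var result = { time: [], value: [] };
  values.forEach(function (V) {
    Array.prototype.push.apply(result.time, V.time);
    Array.prototype.push.apply(result.value, V.value);
  });
  return result;
};

var key = { host: "gateway", service: "PING", metric: "rta" };

// What map would emit: each value already wrapped in a single-element array.
var a = { time: [1], value: [0.1] };
var b = { time: [2], value: [0.2] };
var c = { time: [3], value: [0.3] };

var onePass = R(key, [a, b, c]);
var twoPasses = R(key, [R(key, [a, b]), c]);

console.log(JSON.stringify(onePass));
console.log(JSON.stringify(onePass) === JSON.stringify(twoPasses)); // true
```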

Well, I actually don't understand why you're not using an array of time/value pairs instead of a pair of arrays, i.e. emit(<id>, { pairs: [ {time: ts, value: P[1]} ] }) or emit(<id>, { pairs: [ [ts, P[1]] ] }) in the map function, and Array.push.apply(result.pairs, V.pairs) in the reduce function. That way, you won't even need the finalize function (except maybe to "unwrap" the array from the pairs property: because the reduce function cannot return an array, you have to wrap it that way in an object).
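A sketch of the pairs variant, again runnable outside MongoDB with made-up data. `R` and `F` here are illustrative, not the asker's final code, and `Array.prototype.push.apply` stands in for `Array.push.apply`. The sort in `F` is optional; it is included on the assumption that the order in which values reach reduce should not be relied on.

```javascript
// Reduce: inputs and output share the shape { pairs: [[time, value], ...] }.
var R = function (key, values) {
  var result = { pairs: [] };
  values.forEach(function (V) {
    Array.prototype.push.apply(result.pairs, V.pairs);
  });
  return result;
};

// Finalize: unwrap the array from the wrapper object and sort by timestamp.
var F = function (key, reduced) {
  return reduced.pairs.sort(function (a, b) {
    return a[0] - b[0];
  });
};

var key = { host: "gateway", service: "PING", metric: "rta" };

// Two passes, with values arriving out of order.
var partial = R(key, [{ pairs: [[3, 0.3]] }, { pairs: [[1, 0.1]] }]);
var output = F(key, R(key, [partial, { pairs: [[2, 0.2]] }]));

console.log(JSON.stringify(output)); // [[1,0.1],[2,0.2],[3,0.3]]
```

The same `F` also handles a key with a single emitted value, where reduce is never called and finalize receives the map output directly.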
