MongoDB选择索引列上的count(distinct x)-计算大型数据集的唯一结果 [英] MongoDB select count(distinct x) on an indexed column - count unique results for large data sets

查看:90
本文介绍了MongoDB选择索引列上的count(distinct x)-计算大型数据集的唯一结果的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经阅读了几篇文章和示例,但还没有找到在MongoDB(那里有成千上万的文档)中进行此SQL查询的有效方法

I have gone through several articles and examples, and have yet to find an efficient way to do this SQL query in MongoDB (where there are millions of rows documents)

首次尝试

(例如,来自这个几乎重复的问题- Mongo是否等同于SQL的SELECT DISTINCT?)

(e.g. from this almost duplicate question - Mongo equivalent of SQL's SELECT DISTINCT?)

db.myCollection.distinct("myIndexedNonUniqueField").length

很明显,由于我的数据集很大,我遇到了这个错误

Obviously I got this error as my dataset is huge

Thu Aug 02 12:55:24 uncaught exception: distinct failed: {
        "errmsg" : "exception: distinct too big, 16mb cap",
        "code" : 10044,
        "ok" : 0
}

第二次尝试

我决定尝试做一个小组

db.myCollection.group({key: {myIndexedNonUniqueField: 1},
                initial: {count: 0}, 
                 reduce: function (obj, prev) { prev.count++;} } );

但我却收到此错误消息:

But I got this error message instead:

exception: group() can't handle more than 20000 unique keys

第三次尝试

我还没有尝试过,但是有一些涉及mapReduce

I haven't tried yet but there are several suggestions that involve mapReduce

例如

  • this one how to do distinct and group in mongodb? (not accepted, answer author / OP didn't test it)
  • this one MongoDB group by Functionalities (seems similar to Second Attempt)
  • this one http://blog.emmettshear.com/post/2010/02/12/Counting-Uniques-With-MongoDB
  • this one https://groups.google.com/forum/?fromgroups#!topic/mongodb-user/trDn3jJjqtE
  • this one http://cookbook.mongodb.org/patterns/unique_items_map_reduce/

似乎在GitHub上有一个拉取请求,修复了.distinct方法,提到它只应返回一个计数,但它仍处于打开状态:

It seems there is a pull request on GitHub fixing the .distinct method to mention it should only return a count, but it's still open: https://github.com/mongodb/mongo/pull/34

但是在这一点上,我认为值得在这里问一下,关于这个问题的最新情况是什么?我应该转移到SQL或另一个NoSQL DB以获得不同的计数吗?还是有一种有效的方法?

But at this point I thought it's worth to ask here, what is the latest on the subject? Should I move to SQL or another NoSQL DB for distinct counts? or is there an efficient way?

更新:

对MongoDB官方文档的评论并不令人鼓舞,这是正确的吗?

This comment on the MongoDB official docs is not encouraging, is this accurate?

http://www.mongodb.org/display/DOCS/Aggregation# comment-430445808

Update2:

似乎新的Aggregation Framework回答了以上评论...(MongoDB 2.1/2.2及更高版本,提供了开发预览,不适用于生产)

Seems the new Aggregation Framework answers the above comment... (MongoDB 2.1/2.2 and above, development preview available, not for production)

http://docs.mongodb.org/manual/applications/aggregation/

推荐答案

1)最简单的方法是通过聚合框架.这需要两个"$ group"命令:第一个按不同的值分组,第二个对所有不同的值进行计数

1) The easiest way to do this is via the aggregation framework. This takes two "$group" commands: the first one groups by distinct values, the second one counts all of the distinct values

pipeline = [ 
    { $group: { _id: "$myIndexedNonUniqueField"}  },
    { $group: { _id: 1, count: { $sum: 1 } } }
];

//
// Run the aggregation command
//
R = db.runCommand( 
    {
    "aggregate": "myCollection" , 
    "pipeline": pipeline
    }
);
printjson(R);

2)如果要使用映射/缩小"执行此操作,则可以.这也是一个分为两个阶段的过程:在第一阶段,我们将构建一个新集合,其中包含键的每个不同值的列表.在第二个步骤中,我们对新集合执行count().

2) If you want to do this with Map/Reduce you can. This is also a two-phase process: in the first phase we build a new collection with a list of every distinct value for the key. In the second we do a count() on the new collection.

var SOURCE = db.myCollection;
var DEST = db.distinct
DEST.drop();


map = function() {
  emit( this.myIndexedNonUniqueField , {count: 1});
}

reduce = function(key, values) {
  var count = 0;

  values.forEach(function(v) {
    count += v['count'];        // count each distinct value for lagniappe
  });

  return {count: count};
};

//
// run map/reduce
//
res = SOURCE.mapReduce( map, reduce, 
    { out: 'distinct', 
     verbose: true
    }
    );

print( "distinct count= " + res.counts.output );
print( "distinct count=", DEST.count() );

请注意,您无法返回map/reduce内联的结果,因为这可能会超出16MB的文档大小限制.您可以 将计算结果保存在一个集合中,然后对集合的大小进行count()运算,也可以从mapReduce()的返回值中获取结果数.

Note that you cannot return the result of the map/reduce inline, because that will potentially overrun the 16MB document size limit. You can save the calculation in a collection and then count() the size of the collection, or you can get the number of results from the return value of mapReduce().

这篇关于MongoDB选择索引列上的count(distinct x)-计算大型数据集的唯一结果的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆