MongoDB select count(distinct x) on an indexed column - 计算大数据集的唯一结果 [英] MongoDB select count(distinct x) on an indexed column - count unique results for large data sets

查看:17
本文介绍了MongoDB select count(distinct x) on an indexed column - 计算大数据集的唯一结果的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经浏览了几篇文章和示例,但还没有找到一种在 MongoDB 中执行此 SQL 查询的有效方法(其中有数百万个 rows 文档)

I have gone through several articles and examples, and have yet to find an efficient way to do this SQL query in MongoDB (where there are millions of rows documents)

第一次尝试

(例如,来自这个几乎重复的问题 - Mongo 相当于 SQL 的 SELECT DISTINCT?)

(e.g. from this almost duplicate question - Mongo equivalent of SQL's SELECT DISTINCT?)

db.myCollection.distinct("myIndexedNonUniqueField").length

显然我收到了这个错误,因为我的数据集很大

Obviously I got this error as my dataset is huge

Thu Aug 02 12:55:24 uncaught exception: distinct failed: {
        "errmsg" : "exception: distinct too big, 16mb cap",
        "code" : 10044,
        "ok" : 0
}

第二次尝试

我决定尝试做一个小组

db.myCollection.group({key: {myIndexedNonUniqueField: 1},
                initial: {count: 0}, 
                 reduce: function (obj, prev) { prev.count++;} } );

但我收到了这个错误信息:

But I got this error message instead:

exception: group() can't handle more than 20000 unique keys

第三次尝试

我还没试过,但有几个建议涉及mapReduce

I haven't tried yet but there are several suggestions that involve mapReduce

例如

  • this one how to do distinct and group in mongodb? (not accepted, answer author / OP didn't test it)
  • this one MongoDB group by Functionalities (seems similar to Second Attempt)
  • this one http://blog.emmettshear.com/post/2010/02/12/Counting-Uniques-With-MongoDB
  • this one https://groups.google.com/forum/?fromgroups#!topic/mongodb-user/trDn3jJjqtE
  • this one http://cookbook.mongodb.org/patterns/unique_items_map_reduce/

还有

似乎在 GitHub 上有一个修复 .distinct 方法的拉取请求,提到它应该只返回一个计数,但它仍然是开放的:https://github.com/mongodb/mongo/pull/34

It seems there is a pull request on GitHub fixing the .distinct method to mention it should only return a count, but it's still open: https://github.com/mongodb/mongo/pull/34

但此时我认为值得在这里问一下,关于这个主题的最新消息是什么?我应该转移到 SQL 或另一个 NoSQL DB 以获得不同的计数吗?或者有什么有效的方法?

But at this point I thought it's worth to ask here, what is the latest on the subject? Should I move to SQL or another NoSQL DB for distinct counts? or is there an efficient way?

更新:

MongoDB 官方文档上的这个评论并不令人鼓舞,这是否准确?

This comment on the MongoDB official docs is not encouraging, is this accurate?

http://www.mongodb.org/display/DOCS/Aggregation#评论-430445808

更新 2:

似乎新的聚合框架回答了上述评论......(MongoDB 2.1/2.2 及更高版本,开发预览可用,不适用于生产)

Seems the new Aggregation Framework answers the above comment... (MongoDB 2.1/2.2 and above, development preview available, not for production)

http://docs.mongodb.org/manual/applications/aggregation/

推荐答案

1) 最简单的方法是通过聚合框架.这需要两个$group"命令:第一个按不同的值分组,第二个对所有不同的值进行计数

1) The easiest way to do this is via the aggregation framework. This takes two "$group" commands: the first one groups by distinct values, the second one counts all of the distinct values

pipeline = [ 
    { $group: { _id: "$myIndexedNonUniqueField"}  },
    { $group: { _id: 1, count: { $sum: 1 } } }
];

//
// Run the aggregation command
//
R = db.runCommand( 
    {
    "aggregate": "myCollection" , 
    "pipeline": pipeline
    }
);
printjson(R);

2) 如果你想用 Map/Reduce 做到这一点,你可以.这也是一个两阶段的过程:在第一阶段,我们构建一个新集合,其中包含键的每个不同值的列表.在第二个中,我们对新集合执行 count().

2) If you want to do this with Map/Reduce you can. This is also a two-phase process: in the first phase we build a new collection with a list of every distinct value for the key. In the second we do a count() on the new collection.

var SOURCE = db.myCollection;
var DEST = db.distinct
DEST.drop();


map = function() {
  emit( this.myIndexedNonUniqueField , {count: 1});
}

reduce = function(key, values) {
  var count = 0;

  values.forEach(function(v) {
    count += v['count'];        // count each distinct value for lagniappe
  });

  return {count: count};
};

//
// run map/reduce
//
res = SOURCE.mapReduce( map, reduce, 
    { out: 'distinct', 
     verbose: true
    }
    );

print( "distinct count= " + res.counts.output );
print( "distinct count=", DEST.count() );

请注意,您不能内联返回 map/reduce 的结果,因为这可能会超出 16MB 的文档大小限制.你可以将计算保存在一个集合中然后count()这个集合的大小,或者你可以从mapReduce()的返回值中得到结果的数量.

Note that you cannot return the result of the map/reduce inline, because that will potentially overrun the 16MB document size limit. You can save the calculation in a collection and then count() the size of the collection, or you can get the number of results from the return value of mapReduce().

这篇关于MongoDB select count(distinct x) on an indexed column - 计算大数据集的唯一结果的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆