MongoDB选择索引列上的count(distinct x)-计算大型数据集的唯一结果 [英] MongoDB select count(distinct x) on an indexed column - count unique results for large data sets

查看：90 发布时间：2020/5/10 22:03:27 mongodb

本文介绍了MongoDB选择索引列上的count(distinct x)-计算大型数据集的唯一结果的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我已经阅读了几篇文章和示例，但还没有找到在MongoDB(那里有成千上万的行文档)中进行此SQL查询的有效方法

I have gone through several articles and examples, and have yet to find an efficient way to do this SQL query in MongoDB (where there are millions of ~~rows~~ documents)

首次尝试

(例如，来自这个几乎重复的问题- Mongo是否等同于SQL的SELECT DISTINCT?)

(e.g. from this almost duplicate question - Mongo equivalent of SQL's SELECT DISTINCT?)

db.myCollection.distinct("myIndexedNonUniqueField").length

很明显，由于我的数据集很大，我遇到了这个错误

Obviously I got this error as my dataset is huge

Thu Aug 02 12:55:24 uncaught exception: distinct failed: {
        "errmsg" : "exception: distinct too big, 16mb cap",
        "code" : 10044,
        "ok" : 0
}

第二次尝试

我决定尝试做一个小组

db.myCollection.group({key: {myIndexedNonUniqueField: 1},
                initial: {count: 0}, 
                 reduce: function (obj, prev) { prev.count++;} } );

但我却收到此错误消息:

But I got this error message instead:

exception: group() can't handle more than 20000 unique keys

第三次尝试

我还没有尝试过，但是有一些涉及mapReduce

I haven't tried yet but there are several suggestions that involve mapReduce

例如

这一个如何在mongodb中进行区分和分组?(不接受，回答作者/OP未对其进行测试)

这一个 MongoDB按功能分组(似乎类似于第二次尝试")
这一个 http://blog.emmettshear. com/post/2010/02/12/Counting-Uniques-With-MongoDB
这一个 https://groups.google.com/论坛/?fromgroups#！topic/mongodb-user/trDn3jJjqtE
这一个 http://cookbook.mongodb.org/patterns/unique_items_map_reduce/

this one how to do distinct and group in mongodb? (not accepted, answer author / OP didn't test it)
this one MongoDB group by Functionalities (seems similar to Second Attempt)
this one http://blog.emmettshear.com/post/2010/02/12/Counting-Uniques-With-MongoDB
this one https://groups.google.com/forum/?fromgroups#!topic/mongodb-user/trDn3jJjqtE
this one http://cookbook.mongodb.org/patterns/unique_items_map_reduce/

也

似乎在GitHub上有一个拉取请求，修复了.distinct方法，提到它只应返回一个计数，但它仍处于打开状态:

It seems there is a pull request on GitHub fixing the .distinct method to mention it should only return a count, but it's still open: https://github.com/mongodb/mongo/pull/34

但是在这一点上，我认为值得在这里问一下，关于这个问题的最新情况是什么?我应该转移到SQL或另一个NoSQL DB以获得不同的计数吗?还是有一种有效的方法?

But at this point I thought it's worth to ask here, what is the latest on the subject? Should I move to SQL or another NoSQL DB for distinct counts? or is there an efficient way?

更新:

对MongoDB官方文档的评论并不令人鼓舞，这是正确的吗?

This comment on the MongoDB official docs is not encouraging, is this accurate?

http://www.mongodb.org/display/DOCS/Aggregation# comment-430445808

Update2:

似乎新的Aggregation Framework回答了以上评论...(MongoDB 2.1/2.2及更高版本，提供了开发预览，不适用于生产)

Seems the new Aggregation Framework answers the above comment... (MongoDB 2.1/2.2 and above, development preview available, not for production)

http://docs.mongodb.org/manual/applications/aggregation/

推荐答案

1)最简单的方法是通过聚合框架.这需要两个"$ group"命令:第一个按不同的值分组，第二个对所有不同的值进行计数

1) The easiest way to do this is via the aggregation framework. This takes two "$group" commands: the first one groups by distinct values, the second one counts all of the distinct values

pipeline = [ 
    { $group: { _id: "$myIndexedNonUniqueField"}  },
    { $group: { _id: 1, count: { $sum: 1 } } }
];

//
// Run the aggregation command
//
R = db.runCommand( 
    {
    "aggregate": "myCollection" , 
    "pipeline": pipeline
    }
);
printjson(R);

2)如果要使用映射/缩小"执行此操作，则可以.这也是一个分为两个阶段的过程:在第一阶段，我们将构建一个新集合，其中包含键的每个不同值的列表.在第二个步骤中，我们对新集合执行count().

2) If you want to do this with Map/Reduce you can. This is also a two-phase process: in the first phase we build a new collection with a list of every distinct value for the key. In the second we do a count() on the new collection.

var SOURCE = db.myCollection;
var DEST = db.distinct
DEST.drop();


map = function() {
  emit( this.myIndexedNonUniqueField , {count: 1});
}

reduce = function(key, values) {
  var count = 0;

  values.forEach(function(v) {
    count += v['count'];        // count each distinct value for lagniappe
  });

  return {count: count};
};

//
// run map/reduce
//
res = SOURCE.mapReduce( map, reduce, 
    { out: 'distinct', 
     verbose: true
    }
    );

print( "distinct count= " + res.counts.output );
print( "distinct count=", DEST.count() );

请注意，您无法返回map/reduce内联的结果，因为这可能会超出16MB的文档大小限制.您可以将计算结果保存在一个集合中，然后对集合的大小进行count()运算，也可以从mapReduce()的返回值中获取结果数.

Note that you cannot return the result of the map/reduce inline, because that will potentially overrun the 16MB document size limit. You can save the calculation in a collection and then count() the size of the collection, or you can get the number of results from the return value of mapReduce().

这篇关于MongoDB选择索引列上的count(distinct x)-计算大型数据集的唯一结果的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

MongoDB选择索引列上的count(distinct x)-计算大型数据集的唯一结果 [英] MongoDB select count(distinct x) on an indexed column - count unique results for large data sets

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

MongoDB选择索引列上的count(distinct x)-计算大型数据集的唯一结果 [英] MongoDB select count(distinct x) on an indexed column - count unique results for large data sets

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭