Mongo query to sort by distinct count


Question

I have two fields, 'company' and 'url'. I want to sort by the number of times each distinct 'company' occurs and then display three 'url' values for each company. The data is stored like this:

{
    "_id" : ObjectId("56c4f73664af6f7305f3670f"),
    "title" : "Full Stack Software Developer",
    "url" : "http://www.indeed.com/cmp/Upside-Commerce,-Inc./jobs/Full-Stack-Software-Developer-6e93e36ea5d0e57e?sjdu=QwrRXKrqZ3CNX5W-O9jEvRQls7y2xdBHzhqWkvhd5FFfs8wS9wesfMWXjNNFaUXen2pO-kyc_Qbr7-_3Gf40AvyEQT3jn6IRxIwvw9-aFy8",
    "company" : "Upside Commerce, Inc."
}

The following query counts the number of documents for each distinct company:

db.Books.aggregate({$group : { _id : '$company', count : {$sum : 1}}})

This is the output:

{ "_id" : "Microsoft", "count" : 14 }
{ "_id" : "Tableau", "count" : 64 }
{ "_id" : "Amazon", "count" : 64 }
{ "_id" : "Dropbox", "count" : 64 }
{ "_id" : "Amazon Corporate LLC", "count" : 64 }
{ "_id" : "Electronic Arts", "count" : 64 }
{ "_id" : "CDK Global", "count" : 65 }
{ "_id" : "IDC Technologies", "count" : 64 }
{ "_id" : "Concur", "count" : 64 }

However, I want the results sorted by that count (limited to the 10 most frequently occurring companies), and then to display three urls for each company (if the company has at least three). Something like:

{for microsoft:
    {"url" : "https://careers.microsoft.com/jobdetails.aspx?jid=216571&memid=1071484607&utm_source=Indeed"}
    {"url" : "https://careers.microsoft.com/jobdetails.aspx?jid=216571&memid=1695844082&utm_source=Indeed" }
    { "url" : "https://careers.microsoft.com/jobdetails.aspx?jid=216571&memid=932148152&utm_source=Indeed"}}

And the same for the other companies.

Recommended answer

This is (still) best handled by multiple queries, since MongoDB does not yet have efficient operators for it.

You can do something like this with MongoDB 3.2, but there are obvious catches:

db.Books.aggregate([
    { "$group": {
        "_id": "$company",
        "count": { "$sum": 1 },
        "urls": {
            "$push": "$url"
        }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 10 },
    { "$project": {
        "count": 1,
        "urls": { "$slice": ["$urls",0, 3] }
    }}
])

The obvious problem is that, no matter what, you are still adding all of the "url" content into the grouped array. This has the potential to exceed the BSON limit of 16MB. It might not, but it is still wasteful to add "all" of the content when you only want "three" items.

So even then it is probably more practical to query for the "urls" separately for each of the top 10 results.

Here is a Node.js listing that demonstrates:

var async = require('async'),
    mongodb = require('mongodb'),
    MongoClient = mongodb.MongoClient;

MongoClient.connect("mongodb://localhost/test",function(err,db) {

    if (err) throw err;

    // Get the top 10
    db.collection("Books").aggregate(
        [
            { "$group": {
                "_id": "$company",
                "count": { "$sum": 1 }
             }},
             { "$sort": { "count": -1 } },
             { "$limit": 10 }
        ],function(err,results) {
            if (err) throw err;

            // Query for each result and map query response as urls
            async.map(
                results,
                function(result,callback) {
                    // The $group stage stores each company name under "_id"
                    db.collection("Books").find({
                       "company": result._id
                    }).limit(3).toArray(function(err,items) {
                        if (err) return callback(err);
                        result.urls = items.map(function(item) {
                            return item.url;
                        });
                        callback(null,result);
                    })
                },
                function(err,results) {
                    if (err) throw err;
                    // each result entry now has up to 3 urls
                }
            );
        }
     )

});

Yes, that is more calls to the database, but only ten of them, so it is not really an issue.
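With a driver version that returns promises, the same two-step approach can be written without the async library. This is only a sketch: the function name `topCompaniesWithUrls` and its parameters are made up for illustration, and `collection` is assumed to be a driver Collection such as `db.collection("Books")`.

```javascript
// Sketch only: `collection` is assumed to be a MongoDB driver Collection;
// the function name and parameters are hypothetical.
async function topCompaniesWithUrls(collection, topN = 10, urlsPerCompany = 3) {
    // Step 1: count documents per company and keep the most frequent ones.
    const top = await collection.aggregate([
        { "$group": { "_id": "$company", "count": { "$sum": 1 } } },
        { "$sort": { "count": -1 } },
        { "$limit": topN }
    ]).toArray();

    // Step 2: one small query per company, fetching only the "url" field.
    return Promise.all(top.map(async function(result) {
        const items = await collection
            .find({ "company": result._id }, { "projection": { "url": 1, "_id": 0 } })
            .limit(urlsPerCompany)
            .toArray();
        return {
            "company": result._id,
            "count": result.count,
            "urls": items.map(function(item) { return item.url; })
        };
    }));
}
```

The per-company queries run concurrently through `Promise.all`, and the projection keeps each response down to the "url" values that are actually needed.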

The real resolution for this is covered in SERVER-9377 - Extend $push or $max to allow collecting "top" N values per _id key in $group phase. It has the promising "In Progress" status, so it is actively being worked on.

Once that is resolved, a single aggregation statement becomes viable, since you would then be able to "limit" the resulting "urls" in the initial $push to just three entries, rather than removing all but three after the fact.
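For reference, MongoDB 5.2 and later include a `$firstN` accumulator for `$group`, which caps the array while it is being built. Below is a minimal sketch of the single-statement form, assuming such a server version. The `pipeline` variable name is illustrative, and which three urls each company gets depends on document order unless a `$sort` stage precedes the `$group`.

```javascript
// Assumes MongoDB 5.2 or later, where $firstN is available as a $group accumulator.
const pipeline = [
    { "$group": {
        "_id": "$company",
        "count": { "$sum": 1 },
        // Keep at most three urls per company while grouping
        "urls": { "$firstN": { "input": "$url", "n": 3 } }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 10 }
];

// In the mongo shell: db.Books.aggregate(pipeline)
```

Compared with the `$push` plus `$slice` version, the grouped array never grows beyond three entries, so the 16MB concern does not arise from the "urls" field.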
