Mongo查询按不同计数排序 [英] Mongo query to sort by distinct count

查看:41
本文介绍了Mongo查询按不同计数排序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个字段公司"和网址".我想按不同公司"发生的次数对其进行排序,然后显示与该特定公司相对应的三个网址".数据是这样存储的:

I have two fields 'company' and 'url'. I want to sort it by number of times distinct 'company' is occurring and then display three 'url' corresponding to that particular company. Data is stored like this:

{
    "_id" : ObjectId("56c4f73664af6f7305f3670f"),
    "title" : "Full Stack Software Developer",
    "url" : "http://www.indeed.com/cmp/Upside-Commerce,-Inc./jobs/Full-Stack-Software-Developer-6e93e36ea5d0e57e?sjdu=QwrRXKrqZ3CNX5W-O9jEvRQls7y2xdBHzhqWkvhd5FFfs8wS9wesfMWXjNNFaUXen2pO-kyc_Qbr7-_3Gf40AvyEQT3jn6IRxIwvw9-aFy8",
    "company" : "Upside Commerce, Inc."
}

以下查询计算不同公司的数量.

following query counts the number of distinct companies.

db.Books.aggregate({$group : { _id : '$company', count : {$sum : 1}}})

以下是输出:

{ "_id" : "Microsoft", "count" : 14 }
{ "_id" : "Tableau", "count" : 64 }
{ "_id" : "Amazon", "count" : 64 }
{ "_id" : "Dropbox", "count" : 64 }
{ "_id" : "Amazon Corporate LLC", "count" : 64 }
{ "_id" : "Electronic Arts", "count" : 64 }
{ "_id" : "CDK Global", "count" : 65 }
{ "_id" : "IDC Technologies", "count" : 64 }
{ "_id" : "Concur", "count" : 64 }
{ "_id" : "Microsoft", "count" : 14 }
{ "_id" : "Tableau", "count" : 64 }
{ "_id" : "Amazon", "count" : 64 }
{ "_id" : "Dropbox", "count" : 64 }
{ "_id" : "Amazon Corporate LLC", "count" : 64 }
{ "_id" : "Electronic Arts", "count" : 64 }
{ "_id" : "CDK Global", "count" : 65 }
{ "_id" : "IDC Technologies", "count" : 64 }
{ "_id" : "Concur", "count" : 64 }

但是我希望它按不同公司的数量排序(将其限制为前 10 名出现率最高的公司),然后显示与不同公司相对应的三个网址(如果不同公司的计数至少为三个).比如:

However I want it sort by count of distinct companies (limit it to Top 10 highest occurring companies) and then display three urls corresponding to distinct company (if count for distinct company is atleast three). Something like:

{for microsoft:
    {"url" : "https://careers.microsoft.com/jobdetails.aspx?jid=216571&memid=1071484607&utm_source=Indeed"}
    {"url" : "https://careers.microsoft.com/jobdetails.aspx?jid=216571&memid=1695844082&utm_source=Indeed" }
    { "url" : "https://careers.microsoft.com/jobdetails.aspx?jid=216571&memid=932148152&utm_source=Indeed"}}

其他公司也一样

推荐答案

这确实(仍然)最好由多个查询处理,因为 MongoDB 确实仍然"没有真正有效的运算符来执行此操作.

This really is (still) best handled by multiple queries, since MongoDB really "still" does not have the really efficient operators to do this yet.

你可以用 MongoDB 3.2 做这样的事情,但是有明显的问题":

You can do something like this with MongoDB 3.2 though, but there are obvious "catches":

db.Books.aggregate([
    { "$group": {
        "_id": "$company",
        "count": { "$sum": 1 },
        "urls": {
            "$push": "$url"
        }
    }},
    { "$sort": { "count": -1 } },
    { "$limit": 10 },
    { "$project": {
        "count": 1,
        "urls": { "$slice": ["$urls",0, 3] }
    }}
])

显而易见的问题是,无论如何,您仍然将 all 的url"内容添加到分组数组中.这有可能超过 16MB 的 BSON 限制.可能不会,但是当您只想要三个"内容时添加所有"内容仍然有点浪费.

And the obvious problem is that no matter what, you are still adding all of the "url" content into the grouped array. This has the potential to exceed the BSON limit of 16MB. It might not, but it's still a bit wasteful to add "all" content when you only want "three" of them.

因此,即便如此,在前 10 个结果中分别实际查询url"可能更实用.

So even then it's probably more practical to just actually query for the "urls" seperately on each of the top 10 results.

这是一个 node.js 的清单,用于演示:

Here's a listing for node.js that demonstrates:

var async = require('async'),
    mongodb = require('mongodb'),
    MongoClient = mongodb.MongoClient;

MongoClient.connect("mongodb://localhost/test",function(err,db) {

    if (err) throw err;

    // Get the top 10
    db.collection("Books").aggregate(
        [
            { "$group": {
                "_id": "$company",
                "count": { "$sum": 1 }
             }},
             { "$sort": { "count": -1 } },
             { "$limit": 10 }
        ],function(err,results) {
            if (err) throw err;

            // Query for each result and map query response as urls
            async.map(
                results,
                function(result,callback) {
                    db.collection("Books").find({ 
                       "company": result.company 
                    }).limit(3).toArray(function(err,items) {
                        result.urls = items.map(function(item) { 
                            return item.url;
                        });
                        callback(err,result);
                    })
                },
                function(err,results) {
                    if (err) throw err;
                    // each result entry has 3 urls
                }
            );
        }
     )

});

是的,它对数据库的调用次数更多,但确实只有十次,因此不是真正的问题.

Yes it's more calls to the database, but it really is only ten and therefore not really an issue.

SERVER-9377 - Extend$push 或 $max 允许在 $group 阶段为每个 _id 键收集前"N 个值.这具有有希望的进行中"状态,因此正在积极进行中.

The real resolution for this is covered in SERVER-9377 - Extend $push or $max to allow collecting "top" N values per _id key in $group phase. This has the promising "In Progress" status, so it is actively being worked on.

一旦解决,单个聚合语句就变得可行,从那时起,您就可以将初始 $push 中的结果url"限制"为三个条目,而不是事后删除除三个以外的所有内容.

Once that is resolved, then a single aggregation statement becomes viable, since then you would be able to "limit" the resulting "urls" in the intial $push to just three entries, rather than remove all but three after the fact.

这篇关于Mongo查询按不同计数排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆