Mongo: count the number of word occurrences in a set of documents


Problem description

I have a set of documents in Mongo. Say:

[
    { summary:"This is good" },
    { summary:"This is bad" },
    { summary:"Something that is neither good nor bad" }
]

I'd like to count the number of occurrences of each word (case insensitive), then sort in descending order. The result should be something like:

[
    "is": 3,
    "bad": 2,
    "good": 2,
    "this": 2,
    "neither": 1,
    "nor": 1,
    "something": 1,
    "that": 1
]

Any idea how to do this? Aggregation framework would be preferred, as I understand it to some degree already :)

Recommended answer

MapReduce might be a good fit: it can process the documents on the server without any manipulation on the client, since at the time of writing there was no feature to split a string on the DB server (an open issue).

Start with the map function. In the example below (which likely needs to be more robust), each document is passed to the map function (as this). The code looks for the summary field and if it's there, lowercases it, splits on a space, and then emits a 1 for each word found.

var map = function() {  
    var summary = this.summary;
    if (summary) { 
        // quick lowercase to normalize per your requirements
        summary = summary.toLowerCase().split(" "); 
        for (var i = summary.length - 1; i >= 0; i--) {
            // might want to remove punctuation, etc. here
            if (summary[i])  {      // make sure there's something
               emit(summary[i], 1); // store a 1 for each word
            }
        }
    }
};
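The comment in the map function notes that punctuation may need to be removed. One way to do that is sketched below, assuming that words consist only of ASCII letters, digits, and apostrophes. `tokenize` is a hypothetical helper, not a MongoDB function, and it would have to be defined inside the map function (or otherwise made available to it) to be used on the server.

```javascript
// Hypothetical helper: lowercases the text and returns its words,
// treating anything other than a-z, 0-9 and apostrophes as a separator.
function tokenize(text) {
    if (typeof text !== "string") {
        return [];                  // missing or non-string field: no words
    }
    return text
        .toLowerCase()
        .split(/[^a-z0-9']+/)       // split on runs of separator characters
        .filter(function (word) {
            return word.length > 0; // drop empty strings left at the edges
        });
}

console.log(tokenize("This is good, isn't it?"));
```

With this in place, the loop in the map function could iterate over `tokenize(this.summary)` instead of splitting on a single space. Non-ASCII text would need a different character class.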

Then, in the reduce function, it sums all of the results found by the map function and returns a discrete total for each word that was emitted above.

var reduce = function( key, values ) {    
    var count = 0;    
    values.forEach(function(v) {            
        count +=v;    
    });
    return count;
}
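To see how the two functions cooperate, here is a small sketch that mimics the map and reduce steps in plain JavaScript, outside MongoDB. `runMapReduce` and its in-memory grouping are assumptions made for illustration, and `emit` is passed in as a parameter here, whereas on the server it is a global. The real server may also call reduce more than once per key on partial results.

```javascript
// Sample documents from the question.
var docs = [
    { summary: "This is good" },
    { summary: "This is bad" },
    { summary: "Something that is neither good nor bad" }
];

// Hypothetical stand-in for the server: collects the emitted values per key,
// reduces each key's values to one total, then sorts by count descending.
function runMapReduce(documents, mapFn, reduceFn) {
    var groups = Object.create(null);
    function emit(key, value) {
        (groups[key] = groups[key] || []).push(value);
    }
    documents.forEach(function (doc) {
        mapFn.call(doc, emit);      // the document is bound to `this`
    });
    return Object.keys(groups).map(function (key) {
        return { _id: key, value: reduceFn(key, groups[key]) };
    }).sort(function (a, b) {
        // highest count first, ties broken alphabetically
        return b.value - a.value || (a._id < b._id ? -1 : 1);
    });
}

// Same logic as the map function above, with emit as a parameter.
function map(emit) {
    var summary = this.summary;
    if (summary) {
        summary.toLowerCase().split(" ").forEach(function (word) {
            if (word) { emit(word, 1); }
        });
    }
}

// Sums the 1s emitted for a single word.
function reduce(key, values) {
    return values.reduce(function (sum, v) { return sum + v; }, 0);
}

var result = runMapReduce(docs, map, reduce);
console.log(result);
```

Because reduce may receive its own earlier output among `values`, it has to return the same kind of value that map emits, which is why a plain number is used for both.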

Finally, execute the mapReduce:

> db.so.mapReduce(map, reduce, {out: "word_count"})

Results with the sample data:

> db.word_count.find().sort({value:-1})
{ "_id" : "is", "value" : 3 }
{ "_id" : "bad", "value" : 2 }
{ "_id" : "good", "value" : 2 }
{ "_id" : "this", "value" : 2 }
{ "_id" : "neither", "value" : 1 }
{ "_id" : "nor", "value" : 1 }
{ "_id" : "something", "value" : 1 }
{ "_id" : "that", "value" : 1 }
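
Since the question prefers the aggregation framework: MongoDB 3.4 and later include a `$split` operator, so the same count can be sketched as a pipeline. This is an alternative to the MapReduce approach above, not part of the original answer. The collection name `so` is carried over from the mapReduce command, and the secondary sort on `_id` is an added assumption to make the order of ties deterministic.

```javascript
// Aggregation pipeline sketch (assumes MongoDB 3.4+ for $split).
var pipeline = [
    // lowercase the summary and split it on single spaces
    { $project: { words: { $split: [{ $toLower: "$summary" }, " "] } } },
    // one document per word
    { $unwind: "$words" },
    // drop empty strings produced by repeated spaces or missing summaries
    { $match: { words: { $ne: "" } } },
    // count the occurrences of each word
    { $group: { _id: "$words", value: { $sum: 1 } } },
    // highest count first, ties in alphabetical order
    { $sort: { value: -1, _id: 1 } }
];

// Only runs inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
    db.so.aggregate(pipeline).forEach(printjson);
}
```

Like the map function, this splits on a single space only, so punctuation stays attached to the words.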
