Mongo: count the number of word occurrences in a set of documents

Problem description

I have a set of documents in Mongo. Say:

[
    { summary:"This is good" },
    { summary:"This is bad" },
    { summary:"Something that is neither good nor bad" }
]

I'd like to count the number of occurrences of each word (case insensitive), then sort the counts in descending order. The result should be something like:

[
    "is": 3,
    "bad": 2,
    "good": 2,
    "this": 2,
    "neither": 1,
    "nor": 1,
    "something": 1,
    "that": 1
]

Any idea how to do this? The aggregation framework would be preferred, as I already understand it to some degree :)

Recommended answer

MapReduce might be a good fit here: it can process the documents on the server without any manipulation on the client. (At the time of writing there isn't a feature to split a string on the DB server; that is an open issue.)

Start with the map function. In the example below (which likely needs to be more robust), each document is passed to the map function (as this). The code looks for the summary field and, if it's there, lowercases it, splits it on spaces, and emits a 1 for each word found.

var map = function() {  
    var summary = this.summary;
    if (summary) { 
        // quick lowercase to normalize per your requirements
        summary = summary.toLowerCase().split(" "); 
        for (var i = summary.length - 1; i >= 0; i--) {
            // might want to remove punctuation, etc. here
            if (summary[i])  {      // make sure there's something
               emit(summary[i], 1); // store a 1 for each word
            }
        }
    }
};
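To see what the map step produces without a running server, here is a minimal sketch in plain JavaScript. The emit function below is a hypothetical stand-in that just collects pairs, and wordsOf is a helper I made up to show one way of handling repeated spaces and punctuation (it assumes ASCII text). In a real mapReduce call the helper logic would need to be inlined into the map function.

```javascript
// Hypothetical stand-in for MongoDB's emit(): collects [key, value] pairs.
const emitted = [];
function emit(key, value) {
    emitted.push([key, value]);
}

// Hypothetical helper: lowercase, split on whitespace runs, strip punctuation.
// Assumes ASCII text; anything outside a-z, 0-9 and ' is dropped.
function wordsOf(summary) {
    return summary
        .toLowerCase()
        .split(/\s+/)
        .map(function (w) { return w.replace(/[^a-z0-9']/g, ""); })
        .filter(function (w) { return w.length > 0; });
}

// Variant of the map function above; `this` is the current document.
function map() {
    if (typeof this.summary !== "string") return; // skip missing/non-string fields
    wordsOf(this.summary).forEach(function (word) { emit(word, 1); });
}

// Sample documents from the question.
const docs = [
    { summary: "This is good" },
    { summary: "This is bad" },
    { summary: "Something that is neither good nor bad" }
];
docs.forEach(function (doc) { map.call(doc); });

console.log(emitted.length);             // 13
console.log(wordsOf("Good,  not BAD!")); // [ 'good', 'not', 'bad' ]
```

Each of the 13 words in the sample data yields one [word, 1] pair; grouping and summing those pairs is the job of the reduce step.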

Then the reduce function sums all of the values emitted by the map function and returns a single total for each word.

var reduce = function(key, values) {
    // values holds every count emitted (or already reduced) for this key
    var count = 0;
    values.forEach(function(v) {
        count += v;
    });
    return count;
};
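One thing worth knowing: MongoDB may call reduce more than once for the same key, feeding earlier results back in as values, so the function has to give the same answer either way. A plain sum does. The sketch below checks that in plain JavaScript; groupAndReduce is a hypothetical helper that imitates the grouping the server does between map and reduce.

```javascript
// Same summing logic as the reduce function above.
function reduce(key, values) {
    let count = 0;
    values.forEach(function (v) { count += v; });
    return count;
}

// Hypothetical helper: group [key, value] pairs by key, then reduce each group.
function groupAndReduce(pairs) {
    const groups = new Map();
    for (const [key, value] of pairs) {
        if (!groups.has(key)) groups.set(key, []);
        groups.get(key).push(value);
    }
    const totals = {};
    for (const [key, values] of groups) {
        totals[key] = reduce(key, values);
    }
    return totals;
}

// Pairs as the map step would emit them for the first two sample documents.
const pairs = [
    ["this", 1], ["is", 1], ["good", 1],
    ["this", 1], ["is", 1], ["bad", 1]
];
const totals = groupAndReduce(pairs);
console.log(totals); // { this: 2, is: 2, good: 1, bad: 1 }

// Re-reducing a partial result gives the same total as a single pass.
const partial = reduce("is", [1, 1]);
console.log(reduce("is", [partial, 1]) === reduce("is", [1, 1, 1])); // true
```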

Finally, execute the mapReduce:

> db.so.mapReduce(map, reduce, {out: "word_count"})

The results for your sample data:

> db.word_count.find().sort({value:-1})
{ "_id" : "is", "value" : 3 }
{ "_id" : "bad", "value" : 2 }
{ "_id" : "good", "value" : 2 }
{ "_id" : "this", "value" : 2 }
{ "_id" : "neither", "value" : 1 }
{ "_id" : "nor", "value" : 1 }
{ "_id" : "something", "value" : 1 }
{ "_id" : "that", "value" : 1 }
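Since the question asked for the aggregation framework: newer servers (MongoDB 3.4 and later) have a $split operator, and mapReduce has since been deprecated, so a pipeline may be the better route today. The following is only a sketch under those assumptions, reusing the so collection name from above. It splits on single spaces just like the map function does and does not strip punctuation.

```javascript
// Assumes MongoDB 3.4+ (for $split) and a collection named "so".
const pipeline = [
    // Keep only documents whose summary is a string ($split errors otherwise).
    { $match: { summary: { $type: "string" } } },
    // Lowercase the summary and split it into an array of words.
    { $project: { words: { $split: [{ $toLower: "$summary" }, " "] } } },
    // One output document per word.
    { $unwind: "$words" },
    // Drop empty strings produced by consecutive spaces.
    { $match: { words: { $ne: "" } } },
    // Count occurrences of each word.
    { $group: { _id: "$words", value: { $sum: 1 } } },
    // Highest counts first; ties in alphabetical order.
    { $sort: { value: -1, _id: 1 } }
];

// Only runs where a `db` handle exists (e.g. the mongo shell).
if (typeof db !== "undefined") {
    db.so.aggregate(pipeline).forEach(printjson);
}
```

The output documents use the same _id/value shape as the word_count collection above, so the two approaches should be easy to compare on the sample data.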
