MongoDB: Terrible MapReduce Performance

Problem Description

I have a long history with relational databases, but I'm new to MongoDB and MapReduce, so I'm almost positive I must be doing something wrong. I'll jump right into the question. Sorry if it's long.

I have a database table in MySQL that tracks the number of member profile views for each day. For testing it has 10,000,000 rows.

CREATE TABLE `profile_views` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `username` varchar(20) NOT NULL,
  `day` date NOT NULL,
  `hits` int(10) unsigned default '0',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `username` (`username`,`day`),
  KEY `day` (`day`)
) ENGINE=InnoDB;

Typical data might look like this.

+--------+----------+------------+------+
| id     | username | day        | hits |
+--------+----------+------------+------+
| 650001 | Joe      | 2010-07-10 |    1 |
| 650002 | Jane     | 2010-07-10 |    2 |
| 650003 | Jack     | 2010-07-10 |    3 |
| 650004 | Jerry    | 2010-07-10 |    4 |
+--------+----------+------------+------+

I use this query to get the top 5 most viewed profiles since 2010-07-16.

SELECT username, SUM(hits) AS hits
FROM profile_views
WHERE day > '2010-07-16'
GROUP BY username
ORDER BY hits DESC
LIMIT 5;

This query completes in under a minute. Not bad!

Now moving on to the world of MongoDB. I set up a sharded environment using 3 servers: M, S1, and S2. I used the following commands to set the rig up (note: I've obscured the IP addresses).

S1 => 127.20.90.1
./mongod --fork --shardsvr --port 10000 --dbpath=/data/db --logpath=/data/log

S2 => 127.20.90.7
./mongod --fork --shardsvr --port 10000 --dbpath=/data/db --logpath=/data/log

M => 127.20.4.1
./mongod --fork --configsvr --dbpath=/data/db --logpath=/data/log
./mongos --fork --configdb 127.20.4.1 --chunkSize 1 --logpath=/data/slog

Once those were up and running, I hopped on server M, and launched mongo. I issued the following commands:

use admin
db.runCommand( { addshard : "127.20.90.1:10000", name: "M1" } );
db.runCommand( { addshard : "127.20.90.7:10000", name: "M2" } );
db.runCommand( { enablesharding : "profiles" } );
db.runCommand( { shardcollection : "profiles.views", key : {day : 1} } );
use profiles
db.views.ensureIndex({ hits: -1 });
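
After issuing those commands, the sharding setup can be sanity-checked from the mongos shell. These two checks are my own suggestion, not part of the original post:

```javascript
// Sanity checks from the mongos shell:
db.printShardingStatus();                              // shards, databases, chunk ranges
db.getSisterDB("admin").runCommand({ listshards: 1 }); // should list shards M1 and M2
```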

I then imported the same 10,000,000 rows from MySQL, which gave me documents that look like this:

{
    "_id" : ObjectId("4cb8fc285582125055295600"),
    "username" : "Joe",
    "day" : "Fri May 21 2010 00:00:00 GMT-0400 (EDT)",
    "hits" : 16
}
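
An aside worth checking (my observation, not from the original post): `day` above is rendered as a quoted string rather than an ISODate. If the import really stored it as a string, a range query like `{ day: { $gt: <Date> } }` won't match those documents, because BSON orders strings and Dates as different types. A sketch of converting the field in place, assuming the string parses with `new Date()`:

```javascript
// Convert string-typed `day` values to BSON Dates so Date range queries match.
// BSON $type 2 = string.
db.views.find({ day: { $type: 2 } }).forEach(function(doc) {
    doc.day = new Date(doc.day);
    db.views.save(doc);
});
```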

Now comes the real meat and potatoes here... my map and reduce functions. Back on server M in the shell, I set up the query and execute it like this.

use profiles;
// Note: JavaScript months are 0-indexed, so July is 6.
// new Date(2010, 7, 16) would be August 16, not July 16.
var start = new Date(2010, 6, 16);
var map = function() {
    emit(this.username, this.hits);
};
var reduce = function(key, values) {
    var sum = 0;
    for (var i = 0; i < values.length; i++) sum += values[i];
    return sum;
};
res = db.views.mapReduce(
    map,
    reduce,
    { query: { day: { $gt: start } } }
);
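
It's worth noting how a sharded MapReduce treats the reduce function: each shard reduces its own chunk of emits, and mongos then reduces the partial results again, so reduce must be associative and safe to re-apply to its own output. The summing reduce above satisfies that. A plain-JavaScript sketch (runnable outside mongo) of the re-reduce behavior:

```javascript
// Same reduce logic as the mongo shell version above.
function reduce(key, values) {
    var sum = 0;
    for (var i = 0; i < values.length; i++) sum += values[i];
    return sum;
}

// Simulate a sharded run: each shard reduces its own emitted values,
// then mongos re-reduces the partial sums into the final answer.
var shard1 = reduce("Joe", [1, 2, 3]);            // partial sum from shard 1
var shard2 = reduce("Joe", [4, 5]);               // partial sum from shard 2
var rereduced = reduce("Joe", [shard1, shard2]);  // what mongos computes
var direct = reduce("Joe", [1, 2, 3, 4, 5]);      // single-pass equivalent

console.log(rereduced, direct);  // both 15, because the reduce is associative
```

If reduce were not safe to re-apply (say, it counted calls instead of summing values), the sharded result would silently diverge from the single-server one.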

And here's where I run into problems. This query took over 15 minutes to complete! The MySQL query took under a minute. Here's the output:

{
        "result" : "tmp.mr.mapreduce_1287207199_6",
        "shardCounts" : {
                "127.20.90.7:10000" : {
                        "input" : 4917653,
                        "emit" : 4917653,
                        "output" : 1105648
                },
                "127.20.90.1:10000" : {
                        "input" : 5082347,
                        "emit" : 5082347,
                        "output" : 1150547
                }
        },
        "counts" : {
                "emit" : NumberLong(10000000),
                "input" : NumberLong(10000000),
                "output" : NumberLong(2256195)
        },
        "ok" : 1,
        "timeMillis" : 811207,
        "timing" : {
                "shards" : 651467,
                "final" : 159740
        }
}

Not only did it take forever to run, but the results don't even seem to be correct.

db[res.result].find().sort({ hits: -1 }).limit(5);
{ "_id" : "Joe", "value" : 128 }
{ "_id" : "Jane", "value" : 2 }
{ "_id" : "Jerry", "value" : 2 }
{ "_id" : "Jack", "value" : 2 }
{ "_id" : "Jessy", "value" : 3 }

I know those value numbers should be much higher.
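
A side note on the listing above (my observation, not from the original post): mapReduce writes result documents of the form { _id, value }, so sort({ hits: -1 }) sorts on a field the result collection doesn't have, which is why the rows appear unordered. Sorting on value gives the intended top 5:

```javascript
// The output collection holds { _id: <username>, value: <sum> } documents,
// so sort on `value`, not `hits`.
db[res.result].find().sort({ value: -1 }).limit(5);
```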

My understanding of the whole MapReduce paradigm is that the task of performing this query should be split across all shard members, which should improve performance. I waited until Mongo finished distributing the documents between the two shard servers after the import. Each had almost exactly 5,000,000 documents when I started this query.

So I must be doing something wrong. Can anyone give me any pointers?

Someone on IRC mentioned adding an index on the day field, but as far as I can tell that was done automatically by MongoDB.
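
That's right for the shard key: sharding profiles.views on { day: 1 } creates an index on day. It can be confirmed directly from the shell (a quick check, not from the original post):

```javascript
// List indexes on the sharded collection; an index on { day: 1 } should appear
// alongside the default _id_ index and the { hits: -1 } index created earlier.
db.views.getIndexes();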

Recommended Answer

Excerpts from O'Reilly's MongoDB: The Definitive Guide:

The price of using MapReduce is speed: group is not particularly speedy, but MapReduce is slower and is not supposed to be used in "real time." You run MapReduce as a background job, it creates a collection of results, and then you can query that collection in real time.

Options for map/reduce:

"keeptemp" : boolean
If the temporary result collection should be saved when the connection is closed.

"output" : string
Name for the output collection. Setting this option implies keeptemp : true.
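
Putting the book's advice into practice, the output option can be used to run MapReduce once as a background job that writes to a named collection, with real-time reads served from that collection afterwards. A sketch reusing the map/reduce functions from the question (the collection name top_profiles is my own):

```javascript
// Run as a background/batch job; results land in a permanent collection
// instead of a temporary tmp.mr.* one.
db.views.mapReduce(map, reduce, {
    query: { day: { $gt: start } },
    out: "top_profiles"   // newer shells use `out`; the book's option is `output`
});

// Real-time reads then hit the precomputed collection.
db.top_profiles.find().sort({ value: -1 }).limit(5);
```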
