MongoDB: Terrible MapReduce Performance

Question

I have a long history with relational databases, but I'm new to MongoDB and MapReduce, so I'm almost positive I must be doing something wrong. I'll jump right into the question. Sorry if it's long.

I have a database table in MySQL that tracks the number of member profile views for each day. For testing it has 10,000,000 rows.

CREATE TABLE `profile_views` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `username` varchar(20) NOT NULL,
  `day` date NOT NULL,
  `hits` int(10) unsigned default '0',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `username` (`username`,`day`),
  KEY `day` (`day`)
) ENGINE=InnoDB;

Typical data might look like this.

+--------+----------+------------+------+
| id     | username | day        | hits |
+--------+----------+------------+------+
| 650001 | Joe      | 2010-07-10 |    1 |
| 650002 | Jane     | 2010-07-10 |    2 |
| 650003 | Jack     | 2010-07-10 |    3 |
| 650004 | Jerry    | 2010-07-10 |    4 |
+--------+----------+------------+------+

I use this query to get the top 5 most viewed profiles since 2010-07-16.

SELECT username, SUM(hits) AS total_hits
FROM profile_views
WHERE day > '2010-07-16'
GROUP BY username
ORDER BY total_hits DESC
LIMIT 5\G

This query completes in under a minute. Not bad!

Now moving on to the world of MongoDB. I set up a sharded environment using 3 servers: M, S1, and S2. I used the following commands to set the rig up (note: I've obscured the IP addresses).

S1 => 127.20.90.1
./mongod --fork --shardsvr --port 10000 --dbpath=/data/db --logpath=/data/log

S2 => 127.20.90.7
./mongod --fork --shardsvr --port 10000 --dbpath=/data/db --logpath=/data/log

M => 127.20.4.1
./mongod --fork --configsvr --dbpath=/data/db --logpath=/data/log
./mongos --fork --configdb 127.20.4.1 --chunkSize 1 --logpath=/data/slog

Once those were up and running, I hopped on server M, and launched mongo. I issued the following commands:

use admin
db.runCommand( { addshard : "127.20.90.1:10000", name: "M1" } );
db.runCommand( { addshard : "127.20.90.7:10000", name: "M2" } );
db.runCommand( { enablesharding : "profiles" } );
db.runCommand( { shardcollection : "profiles.views", key : {day : 1} } );
use profiles
db.views.ensureIndex({ hits: -1 });

I then imported the same 10,000,000 rows from MySQL, which gave me documents that look like this:

{
    "_id" : ObjectId("4cb8fc285582125055295600"),
    "username" : "Joe",
    "day" : "Fri May 21 2010 00:00:00 GMT-0400 (EDT)",
    "hits" : 16
}

Now comes the real meat and potatoes: my map and reduce functions. Back on server M, in the shell, I set up the query and executed it like this.

use profiles;
// JavaScript months are zero-based: 6 is July (7 would be August).
var start = new Date(2010, 6, 16);
var map = function() {
    emit(this.username, this.hits);
}
var reduce = function(key, values) {
    var sum = 0;
    // Indexed loop; for...in also walks any enumerable non-index properties.
    for (var i = 0; i < values.length; i++) sum += values[i];
    return sum;
}
res = db.views.mapReduce(
    map,
    reduce,
    {
        query : { day: { $gt: start }}
    }
);
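
To make the map/reduce flow above concrete, here is a minimal in-memory sketch in plain JavaScript. It does not touch MongoDB. The `runMapReduce` helper, its `batchSize` parameter, and the sample documents are assumptions made up for illustration, and `emit` is passed in as an argument rather than being a global as it is in the shell. The point it shows is that reduce may be applied to partial results more than once for the same key, so it has to return something it can consume again (a sum qualifies).

```javascript
// Hypothetical stand-in for the engine: groups emitted values by key and
// re-reduces a key as soon as it has accumulated `batchSize` values.
function runMapReduce(docs, map, reduce, batchSize) {
  var groups = {}; // key -> array of emitted or partially reduced values
  function emit(key, value) {
    if (!groups[key]) groups[key] = [];
    groups[key].push(value);
    if (groups[key].length >= batchSize) {
      groups[key] = [reduce(key, groups[key])]; // partial reduce
    }
  }
  docs.forEach(function (doc) { map.call(doc, emit); });
  // Final pass: one output document per key, shaped like { _id, value }.
  return Object.keys(groups).map(function (key) {
    return { _id: key, value: reduce(key, groups[key]) };
  });
}

// Same logic as the functions above, with emit supplied by the sketch.
var map = function (emit) { emit(this.username, this.hits); };
var reduce = function (key, values) {
  var sum = 0;
  for (var i = 0; i < values.length; i++) sum += values[i];
  return sum;
};

// Made-up sample documents.
var docs = [
  { username: "Joe", hits: 16 },
  { username: "Jane", hits: 2 },
  { username: "Joe", hits: 5 },
  { username: "Joe", hits: 7 },
  { username: "Jane", hits: 1 }
];

var results = runMapReduce(docs, map, reduce, 2);
console.log(results);
```

Each output document has the key in `_id` and the reduced total in `value`, which matches the shape of the result collection shown further down.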

And here's where I run into problems. This query took over 15 minutes to complete! The MySQL query took under a minute. Here's the output:

{
        "result" : "tmp.mr.mapreduce_1287207199_6",
        "shardCounts" : {
                "127.20.90.7:10000" : {
                        "input" : 4917653,
                        "emit" : 4917653,
                        "output" : 1105648
                },
                "127.20.90.1:10000" : {
                        "input" : 5082347,
                        "emit" : 5082347,
                        "output" : 1150547
                }
        },
        "counts" : {
                "emit" : NumberLong(10000000),
                "input" : NumberLong(10000000),
                "output" : NumberLong(2256195)
        },
        "ok" : 1,
        "timeMillis" : 811207,
        "timing" : {
                "shards" : 651467,
                "final" : 159740
        },
}

Not only did it take forever to run, but the results don't even seem to be correct.

db[res.result].find().sort({ hits: -1 }).limit(5);
{ "_id" : "Joe", "value" : 128 }
{ "_id" : "Jane", "value" : 2 }
{ "_id" : "Jerry", "value" : 2 }
{ "_id" : "Jack", "value" : 2 }
{ "_id" : "Jessy", "value" : 3 }

I know those value numbers should be much higher.
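
One thing worth checking, judging only from the output shown above: the result documents carry `_id` and `value` fields and no `hits` field, so a sort on `hits` has nothing to order by. The plain-JavaScript sketch below reproduces that effect with the five documents listed above. The `sortDesc` helper is hypothetical; it only mimics a descending sort in which a missing field compares as equal, and it is not MongoDB's actual sort implementation.

```javascript
// The five result documents exactly as printed above.
var rows = [
  { _id: "Joe", value: 128 },
  { _id: "Jane", value: 2 },
  { _id: "Jerry", value: 2 },
  { _id: "Jack", value: 2 },
  { _id: "Jessy", value: 3 }
];

// Hypothetical helper: descending sort on one field, treating a missing
// field as null so that all such documents compare as equal.
function sortDesc(docs, field) {
  return docs.slice().sort(function (a, b) {
    var x = a[field] === undefined ? null : a[field];
    var y = b[field] === undefined ? null : b[field];
    if (x === y) return 0;
    return x < y ? 1 : -1;
  });
}

var byHits = sortDesc(rows, "hits");   // no document has "hits"
var byValue = sortDesc(rows, "value"); // the field that holds the totals

console.log(byHits.map(function (d) { return d._id; }));
console.log(byValue.map(function (d) { return d._id; }));
```

If that is what is happening here, the shell equivalent would be to sort the result collection on `value`, as in `db[res.result].find().sort({ value: -1 }).limit(5)`.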

My understanding of the whole MapReduce paradigm is the task of performing this query should be split between all shard members, which should increase performance. I waited till Mongo was done distributing the documents between the two shard servers after the import. Each had almost exactly 5,000,000 documents when I started this query.

So I must be doing something wrong. Can anyone give me any pointers?

Someone on IRC mentioned adding an index on the day field, but as far as I can tell that was done automatically by MongoDB.

Recommended answer

Excerpts from MongoDB: The Definitive Guide (O'Reilly):

The price of using MapReduce is speed: group is not particularly speedy, but MapReduce is slower and is not supposed to be used in "real time." You run MapReduce as a background job, it creates a collection of results, and then you can query that collection in real time.

options for map/reduce:

"keeptemp" : boolean 
If the temporary result collection should be saved when the connection is closed. 

"output" : string 
Name for the output collection. Setting this option implies keeptemp : true. 
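
The background-job pattern described in the excerpt could look roughly like the sketch below. It assumes the shell's `mapReduce` helper accepts an `out` option naming the result collection; the function names `refreshProfileHits` and `topProfiles` and the collection name `profile_hits` are hypothetical. Both functions take the database handle as a parameter so that the slow step and the fast read stay separate.

```javascript
// Slow step: meant to run from a scheduled job, not per request.
// Writes per-user totals into the (hypothetical) profile_hits collection.
function refreshProfileHits(db, since) {
  var map = function () { emit(this.username, this.hits); };
  var reduce = function (key, values) {
    var sum = 0;
    for (var i = 0; i < values.length; i++) sum += values[i];
    return sum;
  };
  return db.views.mapReduce(map, reduce, {
    query: { day: { $gt: since } },
    out: "profile_hits" // assumption: named output collection
  });
}

// Fast step: an ordinary query against the precomputed totals.
// Results are shaped { _id: username, value: total }, so sort on "value".
function topProfiles(db, n) {
  return db.profile_hits.find().sort({ value: -1 }).limit(n);
}
```

In the shell this would be called as `refreshProfileHits(db, new Date(2010, 6, 16))` from the periodic job and `topProfiles(db, 5)` wherever the top five are needed. An index on `value` in the output collection may help the second call, though that is untested here.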
