MongoDB: Terrible MapReduce Performance
Question
I have a long history with relational databases, but I'm new to MongoDB and MapReduce, so I'm almost positive I must be doing something wrong. I'll jump right into the question. Sorry if it's long.
I have a database table in MySQL that tracks the number of member profile views for each day. For testing, it has 10,000,000 rows.
CREATE TABLE `profile_views` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `username` varchar(20) NOT NULL,
  `day` date NOT NULL,
  `hits` int(10) unsigned default '0',
  PRIMARY KEY (`id`),
  UNIQUE KEY `username` (`username`,`day`),
  KEY `day` (`day`)
) ENGINE=InnoDB;
Typical data might look like this:
+--------+----------+------------+------+
| id     | username | day        | hits |
+--------+----------+------------+------+
| 650001 | Joe      | 2010-07-10 |    1 |
| 650002 | Jane     | 2010-07-10 |    2 |
| 650003 | Jack     | 2010-07-10 |    3 |
| 650004 | Jerry    | 2010-07-10 |    4 |
+--------+----------+------------+------+
I use this query to get the top 5 most viewed profiles since 2010-07-16.
SELECT username, SUM(hits) AS hits
FROM profile_views
WHERE day > '2010-07-16'
GROUP BY username
ORDER BY hits DESC
LIMIT 5\G
This query completes in under a minute. Not bad!
Now moving on to the world of MongoDB. I set up a sharded environment using 3 servers: M, S1, and S2. I used the following commands to set the rig up (note: I've obscured the IP addresses).
S1 => 127.20.90.1
./mongod --fork --shardsvr --port 10000 --dbpath=/data/db --logpath=/data/log
S2 => 127.20.90.7
./mongod --fork --shardsvr --port 10000 --dbpath=/data/db --logpath=/data/log
M => 127.20.4.1
./mongod --fork --configsvr --dbpath=/data/db --logpath=/data/log
./mongos --fork --configdb 127.20.4.1 --chunkSize 1 --logpath=/data/slog
Once those were up and running, I hopped on server M, and launched mongo. I issued the following commands:
use admin
db.runCommand( { addshard : "127.20.90.1:10000", name: "M1" } );
db.runCommand( { addshard : "127.20.90.7:10000", name: "M2" } );
db.runCommand( { enablesharding : "profiles" } );
db.runCommand( { shardcollection : "profiles.views", key : {day : 1} } );
use profiles
db.views.ensureIndex({ hits: -1 });
I then imported the same 10,000,000 rows from MySQL, which gave me documents that look like this:
{
    "_id" : ObjectId("4cb8fc285582125055295600"),
    "username" : "Joe",
    "day" : "Fri May 21 2010 00:00:00 GMT-0400 (EDT)",
    "hits" : 16
}
Now comes the real meat and potatoes: my map and reduce functions. Back on server M, in the shell, I set up the query and execute it like this.
use profiles;
var start = new Date(2010, 7, 16);
var map = function() {
    emit(this.username, this.hits);
}
var reduce = function(key, values) {
    var sum = 0;
    for(var i in values) sum += values[i];
    return sum;
}
res = db.views.mapReduce(
    map,
    reduce,
    {
        query : { day: { $gt: start }}
    }
);
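To make the map/reduce semantics concrete outside the mongo shell, here is a minimal Node.js simulation (a sketch of the idea, not MongoDB's actual engine; the sample documents and field names are made up to match the shapes above):

```javascript
// Sample documents with the same field names as the profiles.views collection.
const docs = [
  { username: "Joe",  hits: 3 },
  { username: "Joe",  hits: 5 },
  { username: "Jane", hits: 2 },
];

// map: emit one (key, value) pair per document.
function mapDoc(doc, emit) {
  emit(doc.username, doc.hits);
}

// reduce: fold all values emitted for one key into a single value.
function reduce(key, values) {
  let sum = 0;
  for (const v of values) sum += v;
  return sum;
}

// Group emitted pairs by key, then reduce each group.
function mapReduce(docs) {
  const groups = new Map();
  for (const doc of docs) {
    mapDoc(doc, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  const out = {};
  for (const [key, values] of groups) out[key] = reduce(key, values);
  return out;
}

console.log(mapReduce(docs)); // { Joe: 8, Jane: 2 }
```

The output key/value pairs correspond to the `{ "_id" : ..., "value" : ... }` documents that land in the result collection.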
And here's where I run into problems. This query took over 15 minutes to complete! The MySQL query took under a minute. Here's the output:
{
    "result" : "tmp.mr.mapreduce_1287207199_6",
    "shardCounts" : {
        "127.20.90.7:10000" : {
            "input" : 4917653,
            "emit" : 4917653,
            "output" : 1105648
        },
        "127.20.90.1:10000" : {
            "input" : 5082347,
            "emit" : 5082347,
            "output" : 1150547
        }
    },
    "counts" : {
        "emit" : NumberLong(10000000),
        "input" : NumberLong(10000000),
        "output" : NumberLong(2256195)
    },
    "ok" : 1,
    "timeMillis" : 811207,
    "timing" : {
        "shards" : 651467,
        "final" : 159740
    }
}
Not only did it take forever to run, but the results don't even seem to be correct.
db[res.result].find().sort({ hits: -1 }).limit(5);
{ "_id" : "Joe", "value" : 128 }
{ "_id" : "Jane", "value" : 2 }
{ "_id" : "Jerry", "value" : 2 }
{ "_id" : "Jack", "value" : 2 }
{ "_id" : "Jessy", "value" : 3 }
I know those value numbers should be much higher.
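One thing worth noting about the listing above: as the output documents show, mapReduce stores each reduced result in a `value` field, so `sort({ hits: -1 })` matches no field and does not actually sort anything. A corrected lookup (a mongo-shell fragment, not standalone-runnable) would be:

```javascript
// The result docs have the shape { _id: <username>, value: <sum> },
// so sort on "value", not "hits".
db[res.result].find().sort({ value: -1 }).limit(5);
```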
My understanding of the whole MapReduce paradigm is that the task of performing this query should be split across all shard members, which should increase performance. I waited until Mongo was done distributing the documents between the two shard servers after the import. Each had almost exactly 5,000,000 documents when I started this query.
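The `shards` vs `final` entries in the timing output reflect how this split actually works: each shard runs map and reduce over its own documents, and then mongos runs reduce again over the per-shard partial results. That second pass is why a reduce function must accept its own output as input. A small Node.js sketch (values invented for illustration):

```javascript
// The same reduce function as in the shell code above.
function reduce(key, values) {
  let sum = 0;
  for (const v of values) sum += v;
  return sum;
}

// Each shard reduces the values it emitted for "Joe"...
const shard1Partial = reduce("Joe", [3, 5]); // 8
const shard2Partial = reduce("Joe", [2, 6]); // 8

// ...and mongos re-reduces the partial results into the final value.
const finalValue = reduce("Joe", [shard1Partial, shard2Partial]);
console.log(finalValue); // 16
```

Because `reduce` here just sums, reducing partial sums gives the same answer as reducing all the raw values at once, which is exactly the property the sharded two-phase execution relies on.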
So I must be doing something wrong. Can anyone give me any pointers?
Someone on IRC mentioned adding an index on the day field, but as far as I can tell that was done automatically by MongoDB.
Answer
Excerpts from O'Reilly's MongoDB: The Definitive Guide:
The price of using MapReduce is speed: group is not particularly speedy, but MapReduce is slower and is not supposed to be used in "real time." You run MapReduce as a background job, it creates a collection of results, and then you can query that collection in real time.
Options for map/reduce:

"keeptemp" : boolean
    If the temporary result collection should be saved when the connection is closed.

"output" : string
    Name for the output collection. Setting this option implies keeptemp : true.
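Given those options, the usual pattern is to run the job as a background task into a named collection and then serve "real time" queries from that collection. A mongo-shell sketch (not standalone-runnable; the collection name is invented, and option spellings varied across MongoDB versions of that era, so check your server's docs):

```javascript
// Write the results to a named collection instead of a temp one...
res = db.views.mapReduce(map, reduce, {
    query : { day: { $gt: start }},
    out : "top_profiles"    // named output collection
});
// ...then query that collection directly, like any other collection.
db.top_profiles.find().sort({ value: -1 }).limit(5);
```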