Why does submitting a job to MapReduce take so much time in general?


Problem Description



So usually, for a 20-node cluster, submitting a job to process 3 GB (200 splits) of data takes about 30 seconds, and the actual execution takes about 1 minute. I want to understand what the bottleneck is in the job submission process and to understand the following quote:

Per-MapReduce overhead is significant: Starting/ending MapReduce job costs time

Some of the processes I'm aware of: 1. data splitting, 2. jar file sharing.
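(As an illustration only, not part of the original question: both of those steps happen on the client side, before the first map task ever runs. In a minimal driver sketch like the one below, with a made-up class name, setJarByClass() marks the jar that gets shipped to the cluster, and waitForCompletion() is where the client computes the input splits, uploads the jar and job configuration, and asks the scheduler for resources, which is roughly where those ~30 seconds go.)

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class MyJobDriver {
      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "my-job");
          // this jar is copied to the cluster ("jar file sharing")
          job.setJarByClass(MyJobDriver.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          // waitForCompletion() is where the client computes the input splits
          // ("data splitting"), uploads the jar and job config, and asks the
          // scheduler for resources, all before the first map task runs.
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }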

Solution

A few things to understand about HDFS and M/R help explain this latency:

  1. HDFS stores your files as data chunks distributed across multiple machines called datanodes.
  2. M/R runs multiple programs, called mappers, on each of the data chunks or blocks. The (key, value) output of these mappers is compiled together into a result by reducers. (Think of summing up the various results from multiple mappers.)
  3. Each mapper and reducer is a full-fledged program that is spawned on these distributed systems. It does take time to spawn a full-fledged program, even one that, let us say, does nothing (a no-op map-reduce program; see the sketch after this list).
  4. When the size of the data to be processed becomes very big, these spawn times become insignificant, and that is when Hadoop shines.
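To make point 3 concrete, here is a minimal sketch of such a no-op job (the class names are made up for illustration): a mapper that reads every record and emits nothing, paired with a reducer that does nothing. Even a job built from these still pays the full cost of shipping the jar and spawning a container/JVM for every single task.

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class NoOpJob {

      // Mapper that consumes every input record and emits nothing.
      public static class NoOpMapper
              extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
          @Override
          protected void map(LongWritable key, Text value, Context context) {
              // intentionally empty: this is the "no-op" part
          }
      }

      // Reducer that receives nothing and writes nothing.
      public static class NoOpReducer
              extends Reducer<NullWritable, NullWritable, NullWritable, NullWritable> {
          // no reduce() override needed
      }
  }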

If you were to process a file with 1000 lines of content, then you are better off using a normal file read-and-process program. Using the Hadoop infrastructure to spawn a process on a distributed system will not yield any benefit; it will only contribute the additional overhead of locating the datanodes containing the relevant data chunks, starting the processing programs on them, and tracking and collecting results.
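For comparison, an illustrative plain-Java equivalent (not from the original answer) that reads and processes a small file locally finishes in milliseconds, with none of the overhead listed above:

  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.Arrays;
  import java.util.stream.Stream;

  public class LocalCount {
      public static void main(String[] args) throws Exception {
          // the whole "map + reduce" happens in one local process;
          // word counting here is just a placeholder workload
          try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
              long words = lines.flatMap(line -> Arrays.stream(line.split("\\s+")))
                                .filter(w -> !w.isEmpty())
                                .count();
              System.out.println("word count: " + words);
          }
      }
  }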

Now expand that to hundreds of petabytes of data, and these overheads look completely insignificant compared to the time it would take to process the data. The parallelization of the processors (mappers and reducers) shows its advantage here.

So before analyzing the performance of your M/R jobs, you should first benchmark your cluster so that you understand the overheads better.

How much time does it take to run a no-operation map-reduce program on a cluster?

Use MRBench for this purpose:

  1. MRBench loops a small job a number of times.
  2. It checks whether small job runs are responsive and run efficiently on your cluster.
  3. Its impact on the HDFS layer is very limited.

To run this program, try the following (check the correct invocation for the latest versions):

hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 50
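Note that the path above is from an old Hadoop 0.20 install. On more recent Hadoop 2.x/3.x distributions the mrbench driver typically lives in the MapReduce job client tests jar, so the invocation usually looks more like the following (the exact path varies by distribution):

  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar mrbench -numRuns 50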

Surprisingly, on one of our dev clusters it was 22 seconds.

Another issue is file size.

If the file sizes are less than the HDFS block size, then Map/Reduce programs have significant overhead. Hadoop typically tries to spawn one mapper per block, which means that if you have 30 files of 5 KB each, Hadoop may end up spawning 30 mappers, one per file, even though each file is tiny. This is a real waste, because the per-task overhead is significant compared to the time each mapper would spend processing its small file.
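As an illustrative sketch (the class name is made up, and it uses the newer org.apache.hadoop.mapreduce API), you can ask the input format how many splits, and therefore mappers, a given input directory would produce before actually running a job; with the default behaviour, every small file becomes its own split:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

  public class SplitCounter {
      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "split-counter");
          // e.g. a directory holding 30 files of 5 KB each
          FileInputFormat.addInputPath(job, new Path(args[0]));
          // each split returned here turns into one mapper at run time
          int splits = new TextInputFormat().getSplits(job).size();
          System.out.println("mappers that would be spawned: " + splits);
      }
  }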
