Hadoop多节点集群太慢。如何提高数据处理速度? [英] Hadoop multinode cluster too slow. How do I increase speed of data processing?

查看:604
本文介绍了Hadoop多节点集群太慢。如何提高数据处理速度?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个6节点群集-5 DN和1 NN。全部都有32 GB RAM。所有从站均具有8.7 TB硬盘。 DN具有1.1 TB硬盘。这是我的






纱:




I have a 6 node cluster - 5 DN and 1 NN. All have 32 GB RAM. All slaves have 8.7 TB HDD. DN has 1.1 TB HDD. Here is the link to my core-site.xml , hdfs-site.xml , yarn-site.xml.

After running an MR job, i checked my RAM Usage which is mentioned below:

Namenode

free -g
          total        used        free      shared  buff/cache   available
Mem:      31           7          15           0           8          22
Swap:     31           0          31

Datanode :

Slave1 :

free -g
          total        used        free      shared  buff/cache   available
Mem:      31           6           6           0          18          24
Swap:     31           3          28

Slave2:

          total        used        free      shared  buff/cache   available
Mem:      31           2           4           0          24          28
Swap:     31           1          30

Likewise, other slaves have similar RAM usage. Even if a single job is submitted, the other submitted jobs enter into ACCEPTED state and wait for the first job to finish and then they start.

Here is the output of ps command of the JAR that I submnitted to execute the MR job:

/opt/jdk1.8.0_77//bin/java -Dproc_jar -Xmx1000m 
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs 
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log 
-Dyarn.home.dir= -Dyarn.id.str= -Dhadoop.root.logger=INFO,console 
-Dyarn.root.logger=INFO,console -Dyarn.policy.file=hadoop-policy.xml 
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs 
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log 
-Dyarn.home.dir=/home/hduser/hadoop -Dhadoop.home.dir=/home/hduser/hadoop 
-Dhadoop.root.logger=INFO,console -Dyarn.root.logger=INFO,console 
-classpath --classpath of jars
 org.apache.hadoop.util.RunJar abc.jar abc.mydriver2 /raw_data /mr_output/02

Is there any settings that I can change/add to allow multiple jobs to run simultaneously and speed up current data processing ? I am using hadoop 2.5.2. The cluster is in PROD environment and I can not take it down for updating hadoop version.

EDIT 1 : I started a new MR job with 362 GB of data and still the RAM usage is around 8 GB and 22 GB of RAM is free. Here is my job submission command -

nohup yarn jar abc.jar def.mydriver1 /raw_data /mr_output/01 &

Here is some more information :

18/11/22 14:09:07 INFO input.FileInputFormat: Total input paths to process : 130363
18/11/22 14:09:10 INFO mapreduce.JobSubmitter: number of splits:130372

Is there some additional memory parameters that we can use to submit the job to have efficient memory usage ?

解决方案

To give you an idea of how a 4 node 32 core 128GB RAM per node cluster is setup:

For Tez: Divide RAM/CORES = Max TEZ Container size So in my case: 128/32 = 4GB

TEZ:


YARN:

这篇关于Hadoop多节点集群太慢。如何提高数据处理速度?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆