Spark local vs HDFS performance


Problem description


I have a Spark cluster and HDFS on the same machines. I've copied a single text file of about 3 GB to each machine's local filesystem and to the distributed HDFS filesystem.

I have a simple word-count PySpark program.

If I submit the program reading the file from the local filesystem, it takes about 33 seconds. If I submit the program reading the file from HDFS, it takes about 46 seconds.

Why? I expected exactly the opposite result.

Added after sgvd's request:

16 slaves, 1 master

Spark standalone, no particular settings (HDFS replication factor 3)

Spark version 1.5.2

import sys
import os

# Make the PySpark libraries importable and point at the local Spark and Java installs
sys.path.insert(0, '/usr/local/spark/python/')
sys.path.insert(0, '/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip')
os.environ['SPARK_HOME'] = '/usr/local/spark'
os.environ['JAVA_HOME'] = '/usr/local/java'

from pyspark import SparkContext
#conf = pyspark.SparkConf().set<conf settings>


# Read the input either from the local filesystem or from HDFS,
# depending on the first command-line argument
if sys.argv[1] == 'local':
    print('Running in local-file mode')
    sc = SparkContext('spark://192.168.2.11:7077', 'Test Local file')
    rdd = sc.textFile('/root/test2')
else:
    print('Running in HDFS mode')
    sc = SparkContext('spark://192.168.2.11:7077', 'Test HDFS file')
    rdd = sc.textFile('hdfs://192.168.2.11:9000/data/test2')


# Word count: split each line into words, count every word, then take the five most frequent
rdd1 = rdd.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
topFive = rdd1.takeOrdered(5, key=lambda x: -x[1])
print(topFive)
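
For what it's worth, here is a minimal sketch of timing the job from inside the script itself (reusing the rdd defined above), so that SparkContext start-up is excluded from the 33-second vs 46-second comparison; the timing wrapper is an addition, not part of the original program:

import time

start = time.time()                                 # start the clock only once the SparkContext is up
topFive = rdd.flatMap(lambda x: x.split(' ')) \
             .map(lambda x: (x, 1)) \
             .reduceByKey(lambda x, y: x + y) \
             .takeOrdered(5, key=lambda x: -x[1])
print(topFive)
print('job time: %.1f s' % (time.time() - start))   # compare this figure between the two modes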

Solution

What are the parameters specific to the executor, the driver, and the RDD (with respect to spilling and storage level)?

From the Spark documentation:

Performance Impact

The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations.

Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate these on the reduce side. When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.
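
To make the map-side/reduce-side split concrete for a word count like the one in the question, here is a minimal, self-contained sketch; the tiny in-memory RDD and the local[*] master are assumptions made purely for illustration:

from pyspark import SparkContext

sc = SparkContext('local[*]', 'Shuffle illustration')

words = sc.parallelize(['spark', 'hdfs', 'spark'])   # tiny stand-in for the 3 GB text file
pairs = words.map(lambda w: (w, 1))                  # map side: narrow transformation, no shuffle yet
counts = pairs.reduceByKey(lambda a, b: a + b)       # the shuffle happens here, between the map and reduce tasks
print(counts.collect())                              # e.g. [('spark', 2), ('hdfs', 1)] (order not guaranteed)
sc.stop()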

I am interested in the memory/CPU core limits for the Spark job vs. the memory/CPU core limits for the map and reduce tasks.

Key parameters to benchmark on the Hadoop side (see the sketch after this list):

yarn.nodemanager.resource.cpu-vcores
mapreduce.map.cpu.vcores
mapreduce.reduce.cpu.vcores
mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
mapreduce.reduce.shuffle.memory.limit.percent
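
As a rough way to see which of these values are actually in effect, the Hadoop Configuration that Spark carries can be queried from PySpark. This sketch goes through the internal _jsc handle and reuses the sc from the word-count script, so treat it as a debugging aid rather than a stable API:

hadoop_conf = sc._jsc.hadoopConfiguration()
for key in ['yarn.nodemanager.resource.cpu-vcores',
            'mapreduce.map.memory.mb',
            'mapreduce.reduce.memory.mb']:
    print('%s = %s' % (key, hadoop_conf.get(key)))   # None means the key is not set in the loaded configuration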

The equivalent key Spark parameters to benchmark against Hadoop (see the sketch after this list):

spark.driver.memory
spark.driver.cores
spark.executor.memory
spark.executor.cores
spark.memory.fraction
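
On the Spark side, here is a minimal sketch of pinning some of these down explicitly when the context is created; the concrete values are placeholders, not recommendations:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster('spark://192.168.2.11:7077')
        .setAppName('Word count benchmark')
        .set('spark.executor.memory', '4g')          # placeholder values: size them to match the
        .set('spark.executor.cores', '2')            # YARN/MapReduce limits being compared against
        .set('spark.memory.fraction', '0.6'))
sc = SparkContext(conf=conf)
# Note: spark.driver.memory and spark.driver.cores normally have to be passed via spark-submit or
# spark-defaults.conf, because the driver JVM is already running by the time this code executes.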

These are just some of the key parameters. Have a look at the detailed sets of settings for Spark (the Spark configuration documentation) and for MapReduce (https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml).

Without the right set of parameters, we can't compare job performance across two different technologies.
