Spark: Inconsistent performance number in scaling number of cores


Problem Description

I am doing a simple scaling test on Spark using a sort benchmark -- from 1 core up to 8 cores. I notice that 8 cores is slower than 1 core.

//run spark using 1 core
spark-submit --master local[1] --class john.sort sort.jar data_800MB.txt data_800MB_output

//run spark using 8 cores
spark-submit --master local[8] --class john.sort sort.jar data_800MB.txt data_800MB_output

The input and output directories in each case are in HDFS.


1 core: 80 secs

8 cores: 160 secs


I would expect the 8-core run to show some amount of speedup.

Solution

Theoretical limitations

I assume you are already familiar with Amdahl's law, but here is a quick reminder. The theoretical speedup is defined as

    S(s) = 1 / ((1 - p) + p / s)

where

  • s - is the speedup of the parallel part.
  • p - is the fraction of the program that can be parallelized.

In practice, the theoretical speedup is always limited by the part that cannot be parallelized, and even if p is relatively high (0.95) the theoretical limit is quite low:



(Figure: Amdahl's law speedup curves. This file is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license. Attribution: Daniels220 at English Wikipedia.)
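
To make the limit concrete, here is a small worked example in plain Scala; the value p = 0.95 is assumed purely for illustration and simply plugs numbers into the formula above:

    // Worked example of Amdahl's law; p = 0.95 is an assumed value.
    object AmdahlSketch {
      // Theoretical speedup S(s) = 1 / ((1 - p) + p / s)
      def speedup(p: Double, s: Double): Double = 1.0 / ((1.0 - p) + p / s)

      def main(args: Array[String]): Unit = {
        val p = 0.95
        println(f"2 cores:     ${speedup(p, 2)}%.2fx")   // ~1.90x
        println(f"8 cores:     ${speedup(p, 8)}%.2fx")   // ~5.93x
        println(f"upper bound: ${1.0 / (1.0 - p)}%.2fx") // 20.00x as s grows without bound
      }
    }

Even at p = 0.95, eight cores can deliver at most about a 5.9x speedup, and no amount of parallel hardware can push past 20x.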



Effectively, this sets a theoretical bound on how fast you can get. You can expect p to be relatively high for embarrassingly parallel jobs, but I wouldn't dream of anything close to 0.95 or higher. This is because

Spark is a high cost abstraction

Spark is designed to work on commodity hardware at datacenter scale. Its core design focuses on making the whole system robust and immune to hardware failures. That is a great feature when you work with hundreds of nodes and execute long-running jobs, but it does not scale down very well.


Spark is not focused on parallel computing

In practice, Spark and similar systems are focused on two problems:

  • Reducing overall IO latency by distributing IO operations between multiple nodes.

  • Increasing the amount of available memory without increasing the cost per unit.



These are fundamental problems for large-scale, data-intensive systems.

Parallel processing is more a side effect of that particular solution than the main goal. Spark is distributed first, parallel second. The main point is to keep processing time constant as the amount of data grows by scaling out, not to speed up existing computations.

With modern coprocessors and GPGPUs you can achieve much higher parallelism on a single machine than on a typical Spark cluster, but that does not necessarily help with data-intensive jobs because of IO and memory limitations. The problem is how to load the data fast enough, not how to process it.



Practical implications




  • Spark is not a replacement for multiprocessing or multithreading on a single machine.

  • Increasing parallelism on a single machine is unlikely to bring any improvement and will typically decrease performance due to the overhead of the components involved.



In this context:



Assuming that the class and the jar are meaningful and this is indeed a sort, it is simply cheaper to read the data (single partition in, single partition out) and sort it in memory on a single partition than to run the whole Spark sorting machinery with shuffle files and data exchange.
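
For illustration, here is a minimal sketch of what such a job might look like. It is not the asker's john.sort (that code is not shown in the question); the object name and the single-partition variant are assumptions, used only to show where the shuffle machinery enters the picture:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical sketch of a sort job; not the asker's john.sort.
    object SortSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sort-sketch"))
        val input  = args(0)   // e.g. data_800MB.txt
        val output = args(1)   // e.g. data_800MB_output

        // Distributed sort: sortBy range-partitions the data and shuffles it
        // between tasks. With --master local[8] the parallel tasks all share one
        // machine's disk and memory bandwidth, so the shuffle is mostly overhead.
        sc.textFile(input)
          .sortBy(x => x)
          .saveAsTextFile(output)

        // The cheaper plan the answer describes: pull everything into a single
        // partition and sort it in memory inside one task -- no shuffle files,
        // no data exchange. Assumes the 800 MB input fits in executor memory.
        sc.textFile(input)
          .coalesce(1)
          .mapPartitions(lines => lines.toArray.sorted.iterator)
          .saveAsTextFile(output + "_single_partition")

        sc.stop()
      }
    }

Either way, on a single machine the extra tasks mostly buy extra shuffle and scheduling overhead, which matches the observed 80 s (local[1]) versus 160 s (local[8]) numbers.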

