Spark: Inconsistent performance number in scaling number of cores


Problem description


I am doing a simple scaling test on Spark using a sort benchmark -- from 1 core up to 8 cores. I notice that 8 cores is slower than 1 core.

//run spark using 1 core
spark-submit --master local[1] --class john.sort sort.jar data_800MB.txt data_800MB_output

//run spark using 8 cores
spark-submit --master local[8] --class john.sort sort.jar data_800MB.txt data_800MB_output  

The input and output directories in each case are in HDFS.

1 core: 80 secs

8 cores: 160 secs

I would expect the 8-core run to show some amount of speedup.

Solution

Theoretical limitations

I assume you are familiar with Amdahl's law, but here is a quick reminder. The theoretical speedup is defined as follows:

\[
S = \frac{1}{(1 - p) + \frac{p}{s}}
\]

where:

  • s - is the speedup of the parallel part.
  • p - is the fraction of the program that can be parallelized.

In practice, the theoretical speedup is always limited by the part that cannot be parallelized, and even if p is relatively high (0.95), the theoretical limit is quite low:

(Figure: Amdahl's law speedup curves for different parallel fractions p. Image by Daniels220 at English Wikipedia, licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.)
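Plugging p = 0.95 into the formula above makes this concrete: even with an unlimited number of cores the speedup can never exceed 20x, and with s = 8 it is only about 5.9x:

\[
S_{\max} = \frac{1}{1 - 0.95} = 20,
\qquad
S(8) = \frac{1}{0.05 + 0.95/8} \approx 5.9
\]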

Effectively this sets a theoretical bound on how fast you can get. You can expect p to be relatively high in the case of embarrassingly parallel jobs, but I wouldn't dream about anything close to 0.95 or higher. This is because

Spark is a high-cost abstraction

Spark is designed to work on commodity hardware at datacenter scale. Its core design focuses on making the whole system robust and immune to hardware failures. That is a great feature when you work with hundreds of nodes and execute long-running jobs, but it doesn't scale down very well.

Spark is not focused on parallel computing

In practice, Spark and similar systems focus on two problems:

  • Reducing overall IO latency by distributing IO operations between multiple nodes.
  • Increasing the amount of available memory without increasing the cost per unit.

which are fundamental problems for large-scale, data-intensive systems.

Parallel processing is more a side effect of that particular solution than the main goal. Spark is distributed first, parallel second. The main point is to keep processing time constant as the amount of data grows by scaling out, not to speed up existing computations.

With modern coprocessors and GPGPUs you can achieve much higher parallelism on a single machine than in a typical Spark cluster, but that doesn't necessarily help with data-intensive jobs because of IO and memory limitations. The problem is how to load the data fast enough, not how to process it.

Practical implications

  • Spark is not a replacement for multiprocessing or multithreading on a single machine (a sketch of a plain single-machine sort follows this list).
  • Increasing parallelism on a single machine is unlikely to bring any improvements and will typically decrease performance due to the overhead of the components.
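A minimal sketch of that single-machine alternative, assuming the data fits in memory as it does here (the object name LocalSort and the args-based file paths are illustrative, not from the original post):

// Hypothetical single-machine version: read all lines into memory,
// sort them with the JVM's multi-threaded sort, and write them back out.
import java.io.PrintWriter
import scala.io.Source

object LocalSort {
  def main(args: Array[String]): Unit = {
    val source = Source.fromFile(args(0))
    val lines  = try source.getLines().toArray finally source.close()

    // parallelSort splits the work across the common ForkJoinPool,
    // so it already uses all available cores without any cluster machinery.
    java.util.Arrays.parallelSort(lines, Ordering.String)

    val out = new PrintWriter(args(1))
    try lines.foreach(out.println) finally out.close()
  }
}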

In this context:

Assuming that the class and jar are meaningful and that it is indeed a sort, it is simply cheaper to read the data (single partition in, single partition out) and sort it in memory on a single partition than to run the whole Spark sorting machinery, with its shuffle files and data exchange.
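The question does not show what john.sort does, but assuming it is a plain RDD-based sort, it might look roughly like the sketch below (the package/object layout matching --class john.sort and the use of sortBy are assumptions):

// Hypothetical reconstruction of the submitted job -- the real sort.jar is not shown.
package john

import org.apache.spark.{SparkConf, SparkContext}

object sort {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sort"))

    val lines  = sc.textFile(args(0))     // partitioning follows the HDFS block layout
    val sorted = lines.sortBy(identity)   // sortBy range-partitions the lines, i.e. a full shuffle
    sorted.saveAsTextFile(args(1))        // writes one part file per partition back to HDFS

    sc.stop()
  }
}

The sortBy step is where the shuffle files and data exchange mentioned above come from; for an 800 MB input, that machinery plus the extra task scheduling that comes with more local threads can easily cost more than the additional cores save.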
