Spark cluster does not scale to small data


Problem description

I am currently evaluating Spark 2.1.0 on a small cluster (3 nodes, each with 32 CPUs and 128 GB RAM) with a linear regression benchmark (Spark ML). I only measured the time for the parameter calculation (not including startup, data loading, etc.) and observed the following behavior: for small datasets of 0.1 Mio – 3 Mio data points the measured time does not really increase and stays at about 40 seconds; only with larger datasets, such as 300 Mio data points, does the processing time go up to about 200 seconds. So it seems the cluster does not scale to small datasets at all.
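(The actual Benchmark class is not shown in the question. As a rough sketch of how such a measurement is typically set up, assuming LibSVM-formatted input, placeholder settings, and timing only the fit() call, it might look something like this:)

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object Benchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("Benchmark").getOrCreate()

    // Load and materialize the data before starting the clock, so that only
    // the parameter calculation (the fit) is measured, not start-up or I/O.
    val data = spark.read.format("libsvm").load(args(0)).cache()
    data.count()

    val lr = new LinearRegression().setMaxIter(100)  // placeholder settings

    val start = System.nanoTime()
    val model = lr.fit(data)
    val elapsedSeconds = (System.nanoTime() - start) / 1e9

    println(s"fit: $elapsedSeconds s, coefficients: ${model.coefficients}")
    spark.stop()
  }
}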

I also compared the small dataset on my local PC against the cluster using only 10 workers and 16 GB of RAM. The processing time on the cluster was larger by a factor of 3. So is this considered normal behavior for Spark, explainable by communication overhead, or am I doing something wrong (or is linear regression just not really representative)?

The cluster is a standalone cluster (without YARN or Mesos), and the benchmark was submitted with 90 workers, each with 1 core and 4 GB of RAM.
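(Just spelling out the numbers in that description: 3 nodes × 32 CPUs gives 96 cores and 3 × 128 GB gives 384 GB in total, so 90 single-core executors with 4 GB each occupy 90 of the 96 cores and roughly 360 GB of memory, i.e. essentially the whole cluster, even for the 0.1 Mio-point runs.)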

Spark submit: ./spark-submit --master spark://server:7077 --class Benchmark --deploy-mode client --total-executor-cores 90 --executor-memory 4g --num-executors 90 .../Benchmark.jar pathToData

Solution

The optimum cluster size and configuration vary with the data and the nature of the job. In this case, I think your intuition is correct: the job seems to take disproportionately long to complete on the smaller datasets because of the excess overhead for the given cluster size (cores and executors).

Notice that increasing the amount of data by two orders of magnitude increases the processing time only 5-fold. You are increasing the data toward an optimum size for your cluster setup.
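To make that concrete with the reported numbers (a crude back-of-the-envelope fit, assuming a simple fixed-overhead-plus-linear-cost model):

time(n) ≈ c + k·n
40 s ≈ c + k·3 Mio    and    200 s ≈ c + k·300 Mio
⇒ k ≈ (200 − 40) / (300 − 3) ≈ 0.54 s per Mio points,  c ≈ 38 s

On that reading, roughly 38 of the 40 seconds seen on the small datasets are fixed scheduling and communication overhead that does not depend on the data size, which is why the curve looks flat below a few million points.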

Spark is a great tool for processing lots of data, but it is not going to be competitive with running a single process on a single machine if the data fits there. However, it can be much faster than other disk-based distributed processing tools when the data does not fit on a single machine.
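For the single-machine comparison, one way to keep everything else equal (illustrative only; local[*] is a standard Spark master URL, and the rest mirrors the submit command above) is to rerun the same jar in local mode on one box and compare the measured fit time:

./spark-submit --master local[*] --class Benchmark .../Benchmark.jar pathToData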

I was at a talk a couple of years ago where the speaker gave the analogy that Spark is like a locomotive racing a bicycle: the bike will win if the load is light, because it is quicker to accelerate and more agile, but with a heavy load the locomotive might take a while to get up to speed, yet it will be faster in the end. (I'm afraid I forget the speaker's name, but it was at a Cassandra meetup in London, and the speaker was from a company in the energy sector.)

