Performance bottleneck of Spark

Question

The paper "Making Sense of Performance in Data Analytics Frameworks", published at NSDI 2015, concludes that CPU (not I/O or network) is the performance bottleneck of Spark. In the paper, Kay ran experiments on Spark covering BDbench, TPC-DS, and a production workload (was only Spark SQL used?). I wonder whether this conclusion also holds for frameworks built on Spark, such as Spark Streaming, where a continuous data stream is received over the network and both network I/O and disk would come under heavy pressure.

Recommended answer

Network and disk may be under less pressure in Spark Streaming because streams are usually checkpointed, meaning that not all data is kept around forever.

But ultimately, this is a research question: the only way to settle it is to benchmark. Kay's code is open source.
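As a rough illustration of what such a benchmark could compute, here is a minimal Python sketch loosely in the spirit of the blocked-time idea from the paper: given per-task timings, estimate the best-case reduction in job runtime if tasks never blocked on disk or on the network. Everything here is an assumption made for illustration: `TaskMetrics`, `simulate_makespan`, and `best_case_speedup` are hypothetical names, not part of Spark or of Kay's code, the timings are made up, and the greedy slot scheduler is a simplification of how Spark actually schedules tasks.

```python
import heapq
from dataclasses import dataclass


@dataclass
class TaskMetrics:
    """Hypothetical per-task timings in seconds (not a Spark API)."""
    run_time: float         # total wall-clock time of the task
    disk_blocked: float     # time the task spent waiting on disk
    network_blocked: float  # time the task spent waiting on the network


def simulate_makespan(durations, slots):
    """Greedy simulation: each task runs on the slot that frees up first."""
    finish_times = [0.0] * slots
    heapq.heapify(finish_times)
    for duration in durations:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + duration)
    return max(finish_times)


def best_case_speedup(tasks, slots, resource):
    """Fraction of job runtime saved if `resource` never blocked any task.

    `resource` is the name of a TaskMetrics field, e.g. "disk_blocked".
    """
    original = simulate_makespan([t.run_time for t in tasks], slots)
    unblocked = simulate_makespan(
        [t.run_time - getattr(t, resource) for t in tasks], slots
    )
    return 1.0 - unblocked / original


if __name__ == "__main__":
    # Made-up timings, only to show how the functions are called.
    tasks = [TaskMetrics(10.0, 2.0, 1.0), TaskMetrics(10.0, 3.0, 1.0)]
    for resource in ("disk_blocked", "network_blocked"):
        print(resource, best_case_speedup(tasks, slots=1, resource=resource))
```

If the best-case savings for both disk and network are small on a streaming workload, that would suggest the CPU-bound conclusion carries over; large savings would suggest it does not. Real timings would have to come from an actual instrumented run.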
