Is CPU usage in Apache Spark limited?
Problem description
I recently discovered that adding parallel computing (e.g. using parallel collections) inside UDFs increases performance considerably, even when running Spark in local[1] mode or using Yarn with 1 executor and 1 core.
E.g. in local[1] mode, the Spark job consumes as much CPU as possible (i.e. 800% if I have 8 cores, measured using top).
This seems strange, because I thought Spark (or Yarn) limits the CPU usage per Spark application?
So I wonder why that is, and whether it is recommended to use parallel processing/multi-threading in Spark, or whether I should stick to Spark's parallelization pattern?
Here is an example to play with (times measured in yarn-client mode with 1 instance and 1 core):
case class MyRow(id: Int, data: Seq[Double])

// create a small DataFrame (assumes spark-shell, where sc, udf and the
// $-interpolator from spark.implicits._ are already in scope)
val rows = 10
val points = 10000
import scala.util.Random.nextDouble
val data = (1 to rows).map(i => MyRow(i, Stream.continually(nextDouble()).take(points)))
val df = sc.parallelize(data).toDF().repartition($"id").cache()
df.show() // trigger computation and caching

// some expensive dummy computation for each array element
val expensive = (d: Double) => (1 to 10000).foldLeft(0.0) { case (a, b) => a * b } * d

val serialUDF = udf((in: Seq[Double]) => in.map(expensive).sum)
val parallelUDF = udf((in: Seq[Double]) => in.par.map(expensive).sum)

df.withColumn("sum", serialUDF($"data")).show()   // takes ~10 seconds
df.withColumn("sum", parallelUDF($"data")).show() // takes ~2.5 seconds
Recommended answer
Spark does not limit CPU usage directly; instead, it defines the number of concurrent threads it creates. So for local[1] it basically runs one task at a time. When you call in.par.map{expensive}, you create threads that Spark does not manage, so they are not covered by this limit. In other words, you told Spark to limit itself to a single thread and then created other threads without Spark knowing about it.
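To see where the extra CPU comes from: a Scala parallel collection schedules its work on a JVM-wide fork-join pool that Spark neither creates nor counts. A minimal sketch (assuming Scala 2.12, where parallel collections ship with the standard library; the pool size of 2 is an arbitrary illustration) of how to cap that pool explicitly if you do use .par:

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

// By default, .par runs on a shared fork-join pool sized to the number
// of available cores -- which is why local[1] can still burn 8 cores.
val xs = (1 to 1000).par

// Attaching an explicit TaskSupport caps the parallelism of this
// collection to the given pool (2 threads here, chosen arbitrarily).
xs.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(2))
val total = xs.map(x => math.sqrt(x.toDouble)).sum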
In general, spawning parallel threads inside a Spark operation is not a good idea. Instead, it is better to tell Spark how many threads it may use and to make sure you have enough partitions for parallelism, as sketched below.
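Here is one way that partition-based alternative could look for the example above (the column names and the partition count of 8 are illustrative, and it assumes the application is actually granted several cores, e.g. local[8]): explode the array into one row per element so Spark's own tasks carry the parallelism, then aggregate back per id.

import org.apache.spark.sql.functions.{col, explode, sum, udf}

// Same expensive per-element computation as above, applied per row.
val expensiveUDF = udf((d: Double) => (1 to 10000).foldLeft(0.0) { case (a, b) => a * b } * d)

val result = df
  .withColumn("d", explode(col("data")))     // one row per array element
  .repartition(8)                            // enough partitions for 8 concurrent tasks
  .withColumn("exp", expensiveUDF(col("d")))
  .groupBy("id")
  .agg(sum("exp").as("sum"))

result.show()

The groupBy introduces a shuffle, so this only pays off when the per-element work dominates, as it does with the dummy computation here.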