What are ThreadPoolExecutors jobs in web UI's Spark Jobs?
Question
I'm using Spark SQL 1.6.1 and am performing a few joins.
Looking at the Spark web UI, I see that there are some jobs with the description "run at ThreadPoolExecutor.java:1142".
I was wondering why some Spark jobs get that description.
Answer
After some investigation I found out that "run at ThreadPoolExecutor.java:1142" Spark jobs are related to queries with join operators that fit the definition of BroadcastHashJoin, where one side of the join is broadcast to the executors for the join.
The BroadcastHashJoin operator uses a ThreadPool for this asynchronous broadcasting.
scala> spark.version
res16: String = 2.1.0-SNAPSHOT
scala> val left = spark.range(1)
left: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> val right = spark.range(1)
right: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> left.join(right, Seq("id")).show
+---+
| id|
+---+
| 0|
+---+
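To see that the query above actually uses a broadcast join, you can inspect its physical plan in the same spark-shell session. This is a minimal sketch; the exact plan text varies by Spark version, but for tables smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default) it should contain a BroadcastHashJoin node:

scala> left.join(right, Seq("id")).explain

The physical plan printed by explain should include BroadcastHashJoin (and a BroadcastExchange), which is what triggers the extra "run at ThreadPoolExecutor.java:1142" job.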
When you switch to the SQL tab, you should see the Completed Queries section and their Jobs (on the right).
In my case, the Spark jobs described as "run at ThreadPoolExecutor.java:1142" were ids 12 and 16. They both correspond to join queries.
If you wonder, "It makes sense that one of my joins is causing this job to appear, but as far as I know join is a shuffle transformation and not an action, so why is the job described with ThreadPoolExecutor and not with my action (as is the case with the rest of my jobs)?", then my answer is usually along these lines:
Spark SQL is an extension of Spark with its own abstractions (Datasets, to name just the one that springs to mind first) that have their own operators for execution. One "simple" SQL operation can run one or more Spark jobs. It is at the discretion of Spark SQL's execution engine how many Spark jobs to run or submit (though they do use RDDs under the covers). You don't have to know such low-level details, given how high-level you operate when using Spark SQL's SQL or Query DSL.