What are ThreadPoolExecutors jobs in web UI's Spark Jobs?
Question
I'm using Spark SQL 1.6.1 and am performing a few joins.
Looking at the Spark UI I see that there are some jobs with the description "run at ThreadPoolExecutor.java:1142".
I was wondering why some Spark jobs get that description?
After some investigation I found out that the "run at ThreadPoolExecutor.java:1142" Spark jobs are related to queries with join operators that fit the definition of BroadcastHashJoin, where one join side is broadcast to the executors for the join. That BroadcastHashJoin operator uses a ThreadPool for this asynchronous broadcasting (see this and this).
scala> spark.version
res16: String = 2.1.0-SNAPSHOT
scala> val left = spark.range(1)
left: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> val right = spark.range(1)
right: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> left.join(right, Seq("id")).show
+---+
| id|
+---+
| 0|
+---+
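To confirm that the broadcast variant is really what gets planned, you can print the physical plan for the same query. This is a sketch for the same spark-shell session as above (the exact plan text varies between Spark versions, so the comment shows only the kind of line to look for, not guaranteed output):

```scala
// Both sides are tiny, so they fall well under the default
// spark.sql.autoBroadcastJoinThreshold (10 MB) and the planner
// chooses a broadcast hash join.
left.join(right, Seq("id")).explain
// The printed physical plan should contain a BroadcastHashJoin node,
// e.g. something like: BroadcastHashJoin [id#0L], [id#3L], Inner, BuildRight
```

It is the broadcast exchange feeding that node which is submitted on the ThreadPool and therefore shows up as a "run at ThreadPoolExecutor.java:1142" job.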
When you switch to the SQL tab you should see the Completed Queries section and their jobs (on the right).
In my case the Spark jobs described as "run at ThreadPoolExecutor.java:1142" had ids 12 and 16.
They both correspond to the join queries.
If you wonder "it makes sense that one of my joins is causing this job to appear, but as far as I know join is a shuffle transformation and not an action, so why is the job described with ThreadPoolExecutor and not with my action (as is the case with the rest of my jobs)?", then my answer is usually along these lines:
Spark SQL is an extension of Spark with its own abstractions (Datasets, to name just the one that quickly springs to mind) that have their own operators for execution. One "simple" SQL operation can run one or more Spark jobs. It's at the discretion of Spark SQL's execution engine how many Spark jobs to run or submit (but they do use RDDs under the covers) -- you don't have to know such low-level details as it's...well...too low-level...given you are so high-level by using Spark SQL's SQL or Query DSL.