What are ThreadPoolExecutors jobs in web UI's Spark Jobs?

Question

I'm using Spark SQL 1.6.1 and am performing a few joins.

Looking at the Spark UI, I see that there are some jobs with the description "run at ThreadPoolExecutor.java:1142".

I was wondering why some Spark jobs get that description.

Solution

After some investigation, I found out that the run at ThreadPoolExecutor.java:1142 Spark jobs are related to queries with join operators that fit the definition of BroadcastHashJoin, where one join side is broadcast to the executors for the join.

That BroadcastHashJoin operator uses a ThreadPool for this asynchronous broadcasting (see this and this).

scala> spark.version
res16: String = 2.1.0-SNAPSHOT

scala> val left = spark.range(1)
left: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val right = spark.range(1)
right: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> left.join(right, Seq("id")).show
+---+
| id|
+---+
|  0|
+---+
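
You can see the broadcast in the physical plan, too. The plan below is a sketch reconstructed from a similar session (expression ids, the number of splits, and the exact formatting vary across Spark versions):

scala> left.join(right, Seq("id")).explain
== Physical Plan ==
*Project [id#0L]
+- *BroadcastHashJoin [id#0L], [id#4L], Inner, BuildRight
   :- *Range (0, 1, step=1, splits=Some(8))
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
      +- *Range (0, 1, step=1, splits=Some(8))

The BroadcastExchange subtree is the part executed asynchronously on the thread pool, and that is the call site the job description points at.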

When you switch to the SQL tab, you should see the Completed Queries section and their Jobs (on the right).

In my case, the Spark jobs running at "run at ThreadPoolExecutor.java:1142" were ids 12 and 16.

They both correspond to join queries.
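
One way to confirm the link between these jobs and broadcasting (a sketch; spark.sql.autoBroadcastJoinThreshold is a standard setting whose default is 10 MB) is to disable automatic broadcast joins and re-run the query. The plan then falls back to a sort-merge join and the "run at ThreadPoolExecutor.java:1142" jobs should disappear; the broadcast function forces them back:

scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)  // disable automatic broadcast joins
scala> left.join(right, Seq("id")).show   // now planned as a SortMergeJoin

scala> import org.apache.spark.sql.functions.broadcast
scala> left.join(broadcast(right), Seq("id")).show   // explicit hint: BroadcastHashJoin again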

If you wonder, "It makes sense that one of my joins is causing this job to appear, but as far as I know join is a shuffle transformation and not an action, so why is the job described with the ThreadPoolExecutor and not with my action (as is the case with the rest of my jobs)?", then my answer is usually along these lines:

Spark SQL is an extension of Spark with its own abstractions (Datasets, to name just the one that quickly springs to mind) that have their own operators for execution. One "simple" SQL operation can run one or more Spark jobs. It's at the discretion of Spark SQL's execution engine how many Spark jobs to run or submit (but they do use RDDs under the covers) -- you don't have to know such low-level details as it's...well...too low-level...given how high-level you are working with Spark SQL's SQL or Query DSL.
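
To make the "one operation, several jobs" point concrete, here is a minimal sketch (assuming the same REPL session; jobCount is a name I made up) that counts job submissions with the public SparkListener API; the exact count depends on the Spark version and the plan:

scala> import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
scala> val jobCount = new java.util.concurrent.atomic.AtomicInteger(0)
scala> spark.sparkContext.addSparkListener(new SparkListener {
     |   // called once for every Spark job submitted, regardless of what triggered it
     |   override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
     |     jobCount.incrementAndGet()
     |   }
     | })
scala> left.join(right, Seq("id")).show
scala> jobCount.get   // usually more than 1: the broadcast job plus the job(s) behind show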
