What are ThreadPoolExecutors jobs in web UI's Spark Jobs?

Question

I'm using Spark SQL 1.6.1 and am performing a few joins.

Looking at the spark UI I see that there are some jobs with description "run at ThreadPoolExecutor.java:1142"

I was wondering why do some Spark jobs get that description?

Answer

After some investigation I found out that the "run at ThreadPoolExecutor.java:1142" Spark jobs are related to queries with join operators that fit the definition of BroadcastHashJoin, where one side of the join is broadcast to the executors for the join.

That BroadcastHashJoin operator uses a ThreadPool for this asynchronous broadcasting (see this and this).

scala> spark.version
res16: String = 2.1.0-SNAPSHOT

scala> val left = spark.range(1)
left: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val right = spark.range(1)
right: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> left.join(right, Seq("id")).show
+---+
| id|
+---+
|  0|
+---+
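
To confirm that the broadcast strategy was indeed picked for this query (my own quick check, not part of the original answer; the exact operator names and plan layout vary across Spark versions), print the physical plan and look for a BroadcastHashJoin operator with a BroadcastExchange child:

scala> left.join(right, Seq("id")).explain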

When you switch to the SQL tab, you should see the Completed Queries section and their jobs (on the right).

In my case the Spark job(s) running at "run at ThreadPoolExecutor.java:1142" were ids 12 and 16.

They both correspond to join queries.

If you wonder "it makes sense that one of my joins is causing this job to appear, but as far as I know join is a shuffle transformation and not an action, so why is the job described with the ThreadPoolExecutor and not with my action (as is the case with the rest of my jobs)?", then my answer is usually along these lines:

Spark SQL is an extension of Spark with its own abstractions (Datasets, to name just the one that quickly springs to mind) that have their own operators for execution. One "simple" SQL operation can run one or more Spark jobs. It's at the discretion of Spark SQL's execution engine how many Spark jobs to run or submit (but they do use RDDs under the covers) -- you don't have to know such low-level details as it's... well... too low-level, given you are so high-level by using Spark SQL's SQL or Query DSL.
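
As a side note (my own sketch, not from the original answer): you can verify that these jobs come from broadcasting by disabling automatic broadcast joins via the spark.sql.autoBroadcastJoinThreshold configuration property. With the threshold set to -1, the planner falls back to a non-broadcast join (e.g. SortMergeJoin), no asynchronous broadcast is scheduled, and the "run at ThreadPoolExecutor.java:1142" jobs should no longer appear for the join:

scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

scala> left.join(right, Seq("id")).explain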
