AWS EMR: no workers added to Spark job

Question
I want to run a very simple PySpark app via spark-submit. I launch the app by adding a step in the AWS EMR web console: I select the app from S3, select deploy mode cluster, and leave the rest blank.
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

mylist = [1, 2, 3, 4]
df = spark.createDataFrame(mylist, IntegerType())
df.write.parquet('s3://path/to/save', mode='overwrite')
Now when I do this, the Spark job starts up correctly, but no worker gets added. This is what YARN looks like; I have a worker there:
And this is how the Spark job view looks; the worker node is not assigned:
Before, when I used my "homebrew" clusters on EC2, I always needed to pass a config to SparkSession.builder.getOrCreate(), like this:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().setAppName('EMR_test').setMaster('spark://MASTERDNS:7077')
spark = SparkSession.builder.config(conf=conf).getOrCreate()
But when I do this, I just get a 19/07/31 10:19:28 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master MASTERDNS:7077
I also tried spark-submit --master yarn and SparkConf().setAppName('EMR_test').setMaster('yarn-cluster'), to no avail. In both cases I don't get any executors for my Spark app.
So how do I do this properly? When I start either a pyspark console or a Livy notebook, I get a working Spark session with assigned worker nodes.
Answer
Okay, I solved it. By default, the Amazon EMR web UI passes this:
spark-submit --deploy-mode cluster s3://mybucket/EMR_test.py
which does not work. By chance I removed --deploy-mode cluster, and everything works like a charm; my jobs get executors. That's it ...
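In other words, the step command that actually gets executors is just the submit without the deploy-mode flag, so the default (client mode) applies. A sketch, reusing the s3://mybucket/EMR_test.py path from above:

```shell
# Works: omitting --deploy-mode falls back to the default client mode,
# so the driver runs on the master node and executors are allocated on the workers.
spark-submit s3://mybucket/EMR_test.py
```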
To make it extra annoying, when you first use the EMR web UI you get two options in a drop-down menu for deploy-mode: either cluster or client. You obviously want cluster, because client would just run the script on the master. But cluster will never work.
Addendum:
I worked myself through it a bit more, and the issue has to do with Spark's Dynamic Resource Allocation option. If it is on (which it is by default on AWS EMR), --deploy-mode cluster will not work; you have to use --deploy-mode client or nothing instead. If Dynamic Resource Allocation is switched off, --deploy-mode cluster works.
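A sketch of switching Dynamic Resource Allocation off for a single step so that cluster mode can be used: spark.dynamicAllocation.enabled and spark.executor.instances are standard Spark properties, but the executor count of 2 is an arbitrary example and the S3 path is the hypothetical one from above.

```shell
# Disable dynamic allocation and pin a fixed number of executors;
# with this set, --deploy-mode cluster should allocate workers again.
spark-submit --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=2 \
  s3://mybucket/EMR_test.py
```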