aws emr no workers added to spark job


Problem description

I want to run a very simple pyspark app via spark-submit. I launch the app by adding a step in the AWS EMR web console: I select the app from S3, select deploy mode "cluster", and leave the rest blank.

from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a one-column DataFrame from a plain Python list
mylist = [1, 2, 3, 4]
df = spark.createDataFrame(mylist, IntegerType())

# Write the result to S3 (note the s3:// scheme; the path is a placeholder)
df.write.parquet('s3://path/to/save', mode='overwrite')
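For reference, a step like the one I create in the console could also be added from the AWS CLI; this is only a rough sketch, and the cluster id and bucket name are placeholders, not values from my setup:

# Add the same Spark step to a running EMR cluster (placeholders: cluster id, bucket)
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXX \
  --steps 'Type=Spark,Name=EMR_test,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://mybucket/EMR_test.py]'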

Now when I do this, the Spark job starts up correctly, but it does not get a worker added. This is what YARN looks like; I have a worker there:

[screenshot: YARN ResourceManager UI showing one worker node]

And this is how the Spark job view looks; the worker node is not assigned:

[screenshot: Spark UI showing no assigned worker node]

Before, when I used my "homebrew" clusters on EC2, I always needed to add config to SparkSession.builder.getOrCreate(), like this:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Point the session at a standalone Spark master (as on my EC2 clusters)
conf = SparkConf().setAppName('EMR_test').setMaster('spark://MASTERDNS:7077')
spark = SparkSession.builder.config(conf=conf).getOrCreate()

But when I do this, I just get:

19/07/31 10:19:28 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master MASTERDNS:7077

I also tried spark-submit --master yarn and SparkConf().setAppName('EMR_test').setMaster('yarn-cluster'), to no avail. In both cases I don't get any executors for my Spark app.

So how do I do this properly? When I start either a pyspark console or a Livy notebook, I get a working Spark session with assigned worker nodes.

Recommended answer

Okay, I solved it. By default, the Amazon EMR web UI passes this:

spark-submit --deploy-mode cluster s3://mybucket/EMR_test.py

which does not work. By chance I removed --deploy-mode cluster, and everything works like a charm; my jobs get executors. That's it ...
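So the step arguments that worked for me are just the script path, with no deploy-mode flag at all (spark-submit then defaults to client mode; the bucket path is the same placeholder as above):

# Working step: no --deploy-mode flag, so client mode is used
spark-submit s3://mybucket/EMR_test.py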

To make it extra annoying: when you first use the EMR web UI, you have two options in a drop-down menu for the deploy mode, either cluster or client. You obviously want cluster, because client would just run the script on the master. But cluster will never work.

Addendum:

I worked my way through it a bit more, and the issue has to do with Spark's Dynamic Resource Allocation option. If it is on (which it is by default on AWS EMR), --deploy-mode cluster will not work; instead you have to use --deploy-mode client or nothing at all. If Dynamic Resource Allocation is switched off, --deploy-mode cluster works.
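Based on that, if you want to keep cluster mode, a minimal sketch would be to disable dynamic allocation for the one submit via the standard Spark property spark.dynamicAllocation.enabled (with dynamic allocation off you generally have to request a fixed executor count yourself; the count of 2 here is just an example, not a recommendation):

# Keep cluster deploy mode, but turn off dynamic allocation for this job
spark-submit --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=2 \
  s3://mybucket/EMR_test.py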
