airflow spark-submit operator - No such file or directory: 'spark-submit': 'spark-submit'


Problem description

I am new to Airflow and I am trying to schedule a PySpark job in an Airflow instance deployed in Docker containers. Here is my DAG:

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from datetime import datetime, timedelta

spark_master = "spark://spark:7077"
spark_app_name = "Spark Hello World"


now = datetime.now()

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(now.year, now.month, now.day),
    "email": ["airflow@airflow.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=1)
}

dag = DAG(
    dag_id="spark-test",
    description="This DAG runs a simple Pyspark app.",
    default_args=default_args,
    schedule_interval=timedelta(1)
)

t1 = DummyOperator(task_id="start", dag=dag)

# Task 2: check that the file exists
t2 = BashOperator(task_id='check_file_exists', bash_command='shasum /usr/local/spark/app/first.py',
                  retries=2, retry_delay=timedelta(seconds=15), dag=dag)

t3 = SparkSubmitOperator(task_id="spark_job",
                         application='/usr/local/spark/app/first.py',
                         name=spark_app_name,
                         conn_id="spark_default",
                         conf={"spark.master": spark_master},
                         dag=dag)


t1 >> t2 >> t3

My Python script is first.py:

from pyspark import SparkContext, SparkConf

if __name__ == '__main__':
    conf = SparkConf().setAppName("app")
    sc = SparkContext(conf=conf)

    text_file = sc.textFile("/usr/local/spark/resources/data/Loren.txt")
    counts = text_file.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)

    counts.saveAsTextFile("/usr/local/spark/resources/data/loren_counts_task4")

The error I am getting is: FileNotFoundError: [Errno 2] No such file or directory: 'spark-submit': 'spark-submit'

Reading local file: /usr/local/airflow/logs/spark-test/spark_job/2021-07-09T20:46:19.130980+00:00/2.log
[2021-07-09 20:47:50,119] {{taskinstance.py:655}} INFO - Dependencies all met for <TaskInstance: spark-test.spark_job 2021-07-09T20:46:19.130980+00:00 [queued]>
[2021-07-09 20:47:50,151] {{taskinstance.py:655}} INFO - Dependencies all met for <TaskInstance: spark-test.spark_job 2021-07-09T20:46:19.130980+00:00 [queued]>
[2021-07-09 20:47:50,152] {{taskinstance.py:866}} INFO - 
--------------------------------------------------------------------------------
[2021-07-09 20:47:50,152] {{taskinstance.py:867}} INFO - Starting attempt 2 of 2
[2021-07-09 20:47:50,152] {{taskinstance.py:868}} INFO - 
--------------------------------------------------------------------------------
[2021-07-09 20:47:50,165] {{taskinstance.py:887}} INFO - Executing <Task(SparkSubmitOperator): spark_job> on 2021-07-09T20:46:19.130980+00:00
[2021-07-09 20:47:50,169] {{standard_task_runner.py:53}} INFO - Started process 19335 to run task
[2021-07-09 20:47:50,249] {{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance: spark-test.spark_job 2021-07-09T20:46:19.130980+00:00 [running]> 9b6d4f74ee93
[2021-07-09 20:47:50,293] {{logging_mixin.py:112}} INFO - [2021-07-09 20:47:50,292] {{base_hook.py:84}} INFO - Using connection to: id: spark_default. Host: yarn, Port: None, Schema: None, Login: None, Password: None, extra: XXXXXXXX
[2021-07-09 20:47:50,294] {{logging_mixin.py:112}} INFO - [2021-07-09 20:47:50,294] {{spark_submit_hook.py:323}} INFO - Spark-Submit cmd: spark-submit --master yarn --conf spark.master=spark://spark:7077 --name Spark Hello World --queue root.default usr/local/spark/app/first.py
[2021-07-09 20:47:50,301] {{taskinstance.py:1128}} ERROR - [Errno 2] No such file or directory: 'spark-submit': 'spark-submit'
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 966, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/operators/spark_submit_operator.py", line 187, in execute
    self._hook.submit(self._application)
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/hooks/spark_submit_hook.py", line 393, in submit
    **kwargs)
  File "/usr/local/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/usr/local/lib/python3.7/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'spark-submit': 'spark-submit'
[2021-07-09 20:47:50,304] {{taskinstance.py:1170}} INFO - All retries failed; marking task as FAILED.dag_id=spark-test, task_id=spark_job, execution_date=20210709T204619, start_date=20210709T204750, end_date=20210709T204750
[2021-07-09 20:48:00,096] {{logging_mixin.py:112}} INFO - [2021-07-09 20:48:00,095] {{local_task_job.py:103}} INFO - Task exited with return code 1

I ran spark-submit on the Spark container directly and it works perfectly. I am not sure what is wrong.

Recommended answer

You should look at this link: "Apache Spark and Apache Airflow connection in a Docker-based solution".

From the error, the command

spark-submit --master yarn --conf spark.master=spark://spark:7077 --name Spark Hello World --queue root.default

must instead be

spark-submit --master spark://spark:7077 --conf spark.master=spark://spark:7077 --name Spark Hello World --queue root.default

You get this by setting the master in your Airflow connection for this Spark conn id (spark_default):

Conn Type: Spark (if there is no Spark conn type available, you need to install apache-airflow-providers-apache-spark in the Airflow Docker image)
Host: spark://spark
Port: 7077
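
For reference, the same connection can also be created from the command line instead of the Airflow web UI. This is only a sketch: the container name is a placeholder, and the flag style assumes the Airflow 1.10 CLI that ships with the puckel image (Airflow 2.x uses "airflow connections add" instead).

# Placeholder container name; spark_default already exists (pointing at yarn, as the log shows), so remove it first.
docker exec -it <airflow_webserver> airflow connections --delete --conn_id spark_default
docker exec -it <airflow_webserver> airflow connections --add --conn_id spark_default \
    --conn_type spark --conn_host spark://spark --conn_port 7077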

I am not sure whether this is your docker-compose file or not:
https://github.com/puckel/docker-airflow/blob/master/docker-compose-LocalExecutor.yml

If you want to install the package in the container, you should change the second line of the webserver service in the docker-compose file, from:

   webserver:
        image: puckel/docker-airflow:1.10.9
        restart: always

to:

   webserver:
        build: ./airflow
        restart: always

Here is the airflow directory layout:

  • airflow
    • Dockerfile
    • requirements.txt

    Dockerfile

    FROM puckel/docker-airflow:1.10.9
    COPY requirements.txt ./
    
    RUN pip install --no-cache-dir -r requirements.txt
    RUN rm -rf requirements.txt
    

    requirements.txt

    apache-airflow-providers-apache-spark == X.X.X (the version compatible with your Airflow version)
    

    You can find the version compatible with your Airflow version here: https://pypi.org/project/apache-airflow-providers-apache-spark/
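
    After switching the compose file to build from ./airflow, you still need to rebuild the image so the package actually gets installed. A minimal sketch, assuming the service is called webserver as in the compose snippet above:

    docker-compose build webserver
    docker-compose up -d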

    Maybe you should also run the spark-submit command yourself (in the container) to see what is going on and fix the error there. I hope you can fix it.
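
    For example, you can first check from inside the Airflow container whether the binary is on the PATH at all. A sketch; the container name is a placeholder for whatever "docker ps" shows for your Airflow webserver/worker:

    docker exec -it <airflow_webserver> which spark-submit
    docker exec -it <airflow_webserver> spark-submit --version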

