airflow spark-submit operator - No such file or directory: 'spark-submit': 'spark-submit'
Question
I am new to airflow and I am trying to schedule a pyspark job in airflow deployed in docker containers, here is my dag,
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from datetime import datetime, timedelta
spark_master = "spark://spark:7077"
spark_app_name = "Spark Hello World"
now = datetime.now()
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": datetime(now.year, now.month, now.day),
    "email": ["airflow@airflow.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=1)
}

dag = DAG(
    dag_id="spark-test",
    description="This DAG runs a simple Pyspark app.",
    default_args=default_args,
    schedule_interval=timedelta(1)
)

t1 = DummyOperator(task_id="start", dag=dag)

# Task 2: check that the file exists
t2 = BashOperator(
    task_id='check_file_exists',
    bash_command='shasum /usr/local/spark/app/first.py',
    retries=2,
    retry_delay=timedelta(seconds=15),
    dag=dag
)

t3 = SparkSubmitOperator(
    task_id="spark_job",
    application='/usr/local/spark/app/first.py',
    name=spark_app_name,
    conn_id="spark_default",
    conf={"spark.master": spark_master},
    dag=dag
)

t1 >> t2 >> t3
My python script is: first.py
from pyspark import SparkContext, SparkConf
if __name__ == '__main__':
    conf = SparkConf().setAppName("app")
    sc = SparkContext(conf=conf)
    text_file = sc.textFile("/usr/local/spark/resources/data/Loren.txt")
    counts = text_file.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("/usr/local/spark/resources/data/loren_counts_task4")
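As a sanity check of the word-count logic itself, independent of Spark, the same flatMap/map/reduceByKey pipeline can be reproduced in plain Python (the input lines here are hypothetical, not the actual contents of Loren.txt):

```python
from collections import Counter

# Hypothetical input standing in for the lines of Loren.txt
lines = ["hello world", "hello spark"]

# Equivalent of flatMap(split) -> map((word, 1)) -> reduceByKey(+)
counts = Counter(word for line in lines for word in line.split(" "))
print(dict(counts))  # {'hello': 2, 'world': 1, 'spark': 1}
```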
The error I receive is: FileNotFoundError: [Errno 2] No such file or directory: 'spark-submit': 'spark-submit'
Reading local file: /usr/local/airflow/logs/spark-test/spark_job/2021-07-
09T20:46:19.130980+00:00/2.log
[2021-07-09 20:47:50,119] {{taskinstance.py:655}} INFO - Dependencies all met for
<TaskInstance: spark-test.spark_job 2021-07-09T20:46:19.130980+00:00 [queued]>
[2021-07-09 20:47:50,151] {{taskinstance.py:655}} INFO - Dependencies all met for
<TaskInstance: spark-test.spark_job 2021-07-09T20:46:19.130980+00:00 [queued]>
[2021-07-09 20:47:50,152] {{taskinstance.py:866}} INFO -
--------------------------------------------------------------------------------
[2021-07-09 20:47:50,152] {{taskinstance.py:867}} INFO - Starting attempt 2 of 2
[2021-07-09 20:47:50,152] {{taskinstance.py:868}} INFO -
--------------------------------------------------------------------------------
[2021-07-09 20:47:50,165] {{taskinstance.py:887}} INFO - Executing <Task(SparkSubmitOperator):
spark_job> on 2021-07-09T20:46:19.130980+00:00
[2021-07-09 20:47:50,169] {{standard_task_runner.py:53}} INFO - Started process 19335 to run
task
[2021-07-09 20:47:50,249] {{logging_mixin.py:112}} INFO - Running %s on host %s <TaskInstance:
spark-test.spark_job 2021-07-09T20:46:19.130980+00:00 [running]> 9b6d4f74ee93
[2021-07-09 20:47:50,293] {{logging_mixin.py:112}} INFO - [2021-07-09 20:47:50,292]
{{base_hook.py:84}} INFO - Using connection to: id: spark_default. Host: yarn, Port: None,
Schema: None, Login: None, Password: None, extra: XXXXXXXX
[2021-07-09 20:47:50,294] {{logging_mixin.py:112}} INFO - [2021-07-09 20:47:50,294]
{{spark_submit_hook.py:323}} INFO - Spark-Submit cmd: spark-submit --master yarn --conf
spark.master=spark://spark:7077 --name Spark Hello World --queue root.default
usr/local/spark/app/first.py
[2021-07-09 20:47:50,301] {{taskinstance.py:1128}} ERROR - [Errno 2] No such file or
directory: 'spark-submit': 'spark-submit'
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 966, in
_run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.7/site-
packages/airflow/contrib/operators/spark_submit_operator.py", line 187, in execute
self._hook.submit(self._application)
File "/usr/local/lib/python3.7/site-packages/airflow/contrib/hooks/spark_submit_hook.py", line
393, in submit
**kwargs)
File "/usr/local/lib/python3.7/subprocess.py", line 800, in __init__
restore_signals, start_new_session)
File "/usr/local/lib/python3.7/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'spark-submit': 'spark-submit'
[2021-07-09 20:47:50,304] {{taskinstance.py:1170}} INFO - All retries failed; marking task as
FAILED.dag_id=spark-test, task_id=spark_job, execution_date=20210709T204619,
start_date=20210709T204750, end_date=20210709T204750
[2021-07-09 20:48:00,096] {{logging_mixin.py:112}} INFO - [2021-07-09 20:48:00,095]
{{local_task_job.py:103}} INFO - Task exited with return code 1
I ran spark-submit directly on the spark container and it works perfectly. I am not sure what is wrong.
Answer
You should look at this link: Apache Spark and Apache Airflow connection in a Docker-based solution.
From the error, this:
spark-submit --master yarn --conf
spark.master=spark://spark:7077 --name Spark Hello World --queue root.default
must be:
spark-submit --master spark://spark:7077 --conf
spark.master=spark://spark:7077 --name Spark Hello World --queue root.default
You do this by setting the master in your connection for this Spark conn id (spark_default):
Conn Type: Spark (if the Spark conn type is missing, you should install apache-airflow-providers-apache-spark in the Airflow docker image.)
Host: spark://spark
Port: 7077
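Besides the UI, Airflow can also pick a connection up from an environment variable named AIRFLOW_CONN_&lt;CONN_ID&gt;. A minimal sketch, assuming the Spark master is reachable at the compose service name spark:

```shell
# Airflow reads connections from AIRFLOW_CONN_<CONN_ID> (upper-cased conn id),
# so this defines spark_default without touching the UI or the metadata DB.
export AIRFLOW_CONN_SPARK_DEFAULT='spark://spark:7077'
echo "$AIRFLOW_CONN_SPARK_DEFAULT"
```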
I am not sure whether this is your docker-compose file:
https://github.com/puckel/docker-airflow/blob/master/docker-compose-LocalExecutor.yml
If you want to install the package in the container, you should edit the image line, from:
webserver:
  image: puckel/docker-airflow:1.10.9
  restart: always
to:
webserver:
  build: ./airflow
  restart: always
Here is the airflow directory:
- airflow
  - Dockerfile
  - requirements.txt
Dockerfile
FROM puckel/docker-airflow:1.10.9
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
RUN rm -rf requirements.txt
requirements.txt
apache-airflow-providers-apache-spark==X.X.X  (the version compatible with your Airflow version)
You can find the version compatible with your Airflow version here: https://pypi.org/project/apache-airflow-providers-apache-spark/
You could also run spark-submit inside the Airflow container to see what is going on and fix the error there. I hope you can fix it.
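The FileNotFoundError simply means the worker's PATH has no spark-submit executable. A quick way to check from Python inside the Airflow container (a hypothetical diagnostic, not part of the original post):

```python
import shutil

# shutil.which returns the full path of the executable, or None if it is
# not on PATH -- exactly the condition that raises FileNotFoundError.
path = shutil.which("spark-submit")
if path is None:
    print("spark-submit not found on PATH; install the Spark binaries or "
          "apache-airflow-providers-apache-spark in the Airflow image")
else:
    print("spark-submit found at", path)
```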