Passing command line arguments to Spark-shell
Question
I have a Spark job written in Scala. I use
spark-shell -i <file-name>
to run the job. I need to pass a command-line argument to the job. Right now, I invoke the script through a Linux task, where I do
export INPUT_DATE=2015/04/27
and access the value from the environment variable inside the job using:
System.getenv("INPUT_DATE")
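For reference, a minimal sketch of how the Scala script run via spark-shell -i can pick that variable up (the fallback error message is illustrative, not part of the original job):

// Read the exported environment variable inside the script.
val inputDate = Option(System.getenv("INPUT_DATE"))
  .getOrElse(sys.error("INPUT_DATE is not set"))
println(s"Running job for $inputDate")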
Is there a better way to handle command-line arguments in Spark-shell?
Answer
spark-shell -i <(echo val theDate = $INPUT_DATE ; cat <file-name>)
This solution causes the following line to be added at the beginning of the file before it is passed to spark-submit:
val theDate = ...
thereby defining a new variable. The way this is done (the <( ... ) syntax) is called process substitution. It is available in Bash. See this question for more on this, and for alternatives (e.g. mkfifo) for non-Bash environments.
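For illustration, with INPUT_DATE=2015/04/27 the file that spark-shell actually reads would begin like this (a sketch; note that the raw value 2015/04/27 is not a valid Scala expression, so in practice you would add quotes in the echo to inject it as a String):

val theDate = "2015/04/27" // injected line, quoted here so it compiles as a String
// ... the original contents of <file-name> follow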
Put the code below in a script (e.g. spark-script.sh), and then you can simply use:
./spark-script.sh your_file.scala first_arg second_arg third_arg
and have an Array[String] named args with your arguments.
File spark-script.sh:
# The first argument is the Scala file; the remaining ones become the job's arguments.
scala_file=$1
shift 1
arguments="$@"
#set +o posix # uncomment to enable process substitution when Bash runs in POSIX mode
spark-shell --master yarn --deploy-mode client \
  --queue default \
  --driver-memory 2G --executor-memory 4G \
  --num-executors 10 \
  -i <(echo 'val args = "'"$arguments"'".split("\\s+")' ; cat "$scala_file")
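For the invocation above, the line injected ahead of your_file.scala is effectively the following, so the script body can treat args like the argument array of a regular main method (a minimal sketch; the variable name inputDate is illustrative):

val args = "first_arg second_arg third_arg".split("\\s+")
// Inside your_file.scala, e.g.:
val inputDate = args(0) // "first_arg"
println(s"Got ${args.length} arguments: ${args.mkString(", ")}")

One caveat of this approach: because the arguments are joined with spaces and re-split on whitespace, an argument that itself contains spaces will not survive the round trip.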