Connecting to a remote Spark master - Java / Scala
Problem Description
I created a 3 node (1 master, 2 workers) Apache Spark cluster in AWS. I'm able to submit jobs to the cluster from the master, however I cannot get it to work remotely.
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/usr/local/spark/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://ec2-54-245-111-320.compute-1.amazonaws.com:7077")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    sc.stop()
  }
}
I can see from the master:
Spark Master at spark://ip-171-13-22-125.ec2.internal:7077
URL: spark://ip-171-13-22-125.ec2.internal:7077
REST URL: spark://ip-171-13-22-125.ec2.internal:6066 (cluster mode)
So when I execute SimpleApp.scala from my local machine, it fails to connect to the Spark Master:
2017-02-04 19:59:44,074 INFO [appclient-register-master-threadpool-0] client.StandaloneAppClient$ClientEndpoint (Logging.scala:54) [] - Connecting to master spark://ec2-54-245-111-320.compute-1.amazonaws.com:7077...
2017-02-04 19:59:44,166 WARN [appclient-register-master-threadpool-0] client.StandaloneAppClient$ClientEndpoint (Logging.scala:87) [] - Failed to connect to spark://ec2-54-245-111-320.compute-1.amazonaws.com:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) ~[spark-core_2.10-2.0.2.jar:2.0.2]
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) ~[spark-core_2.10-2.0.2.jar:2.0.2]
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) ~[scala-library-2.10.0.jar:?]
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) ~[spark-core_2.10-2.0.2.jar:2.0.2]
However, I know it would have worked if I had set the master to local, because then it would run locally. But I want my client to connect to this remote master. How can I accomplish that? The Apache configuration looks fine. I can even telnet to that public DNS and port, and I also configured /etc/hosts with the public DNS and hostname of each of the EC2 instances.
I want to be able to submit jobs to this remote master; what am I missing?
Recommended Answer
To bind the master host name/IP, go to your Spark installation's conf directory (spark-2.0.2-bin-hadoop2.7/conf) and create the spark-env.sh file using the command below.
cp spark-env.sh.template spark-env.sh
Open the spark-env.sh file in an editor and add the line below with the host name/IP of your master.
SPARK_MASTER_HOST=ec2-54-245-111-320.compute-1.amazonaws.com
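The resulting spark-env.sh might look like the following sketch. Note that SPARK_MASTER_PORT is optional and shown here only for illustration; 7077 is the default standalone master port.

```shell
# spark-env.sh -- bind the standalone master to the public host name
# so remote clients can reach it (not the internal ip-*.ec2.internal name)
SPARK_MASTER_HOST=ec2-54-245-111-320.compute-1.amazonaws.com
SPARK_MASTER_PORT=7077   # default standalone master port; optional
```

On EC2, the security group must also allow inbound traffic on this port from the client machine, otherwise the connection will still time out.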
Stop and start Spark using stop-all.sh and start-all.sh. Now you can connect to the remote master using:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSample")
  .master("spark://ec2-54-245-111-320.compute-1.amazonaws.com:7077")
  .getOrCreate()
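Putting it together, the original SimpleApp can be rewritten with SparkSession once the master is bound this way. This is a sketch: the file path and host name are taken from the question and may differ on your cluster, and the file must exist on the worker nodes (or on shared storage) since that is where it is read.

```scala
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "/usr/local/spark/README.md" // must be readable by the workers

    // Connect to the remote standalone master by its public DNS name
    val spark = SparkSession.builder()
      .appName("SparkSample")
      .master("spark://ec2-54-245-111-320.compute-1.amazonaws.com:7077")
      .getOrCreate()

    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")

    spark.stop()
  }
}
```

Running this requires a Spark cluster to be reachable, so it cannot be verified standalone; the structure mirrors the SparkContext version from the question.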
For more information on setting environment variables please check http://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts