Spark workers unable to find JAR on EC2 cluster


Problem Description

I'm using spark-ec2 to run some Spark code. When I set the master to "local", it runs fine. However, when I set the master to $MASTER, the workers immediately fail with java.lang.NoClassDefFoundError for the classes. The workers connect to the master, show up in the UI, and try to run the task, but raise that exception as soon as they load their first dependency class (which is in the assembly jar).

I've used sbt-assembly to make a jar with the classes, confirmed using jar tvf that the classes are there, and set SparkConf to distribute the classes. The Spark Web UI indeed shows the assembly jar as added to the classpath: http://172.x.x.x:47441/jars/myjar-assembly-1.0.jar
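
For reference, that kind of setup would look roughly like the sketch below (a minimal example, not the asker's actual code; the app name, the MASTER environment variable, and the local jar path are assumptions):

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MyApp")                        // hypothetical app name
      .setMaster(sys.env("MASTER"))               // the spark://... URL from spark-ec2
      // Ship the assembly jar to the cluster; this is what makes it appear
      // under /jars/ in the driver's web UI
      .setJars(Seq("target/scala-2.10/myjar-assembly-1.0.jar"))  // assumed local path
    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}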

It seems that, even though myjar-assembly contains the classes and is being added to the cluster, it's not reaching the workers. How do I fix this? (Do I need to manually copy the jar file? If so, to which directory? I thought the point of the SparkConf add-jars mechanism was to do this automatically.)

My debugging attempts show that:

  1. The assembly jar is being copied to /root/spark/work/app-xxxxxx/1/ (determined by ssh-ing into a worker and searching for the jar)
  2. However, that path does not appear on the worker's classpath (determined from the logs, which show the java -cp command but omit that file)

So it seems I need to tell Spark to add the path of the assembly jar to the workers' classpath. How do I do that? Or is there another culprit? (I've spent hours trying to debug this, to no avail!)

Recommended Answer

NOTE: EC2-specific answer, not a general Spark answer. Just trying to round out an answer to a question asked a year ago, one that has the same symptom but often different causes and that trips up a lot of people.

If I am understanding the question correctly, you are asking, "Do I need to manually copy the jar file? If so, to which dir?" You say "and set SparkConf to distribute the classes", but it's not clear whether this is done via spark-env.sh or spark-defaults.conf. So, making some assumptions, the main one being that you are running in cluster mode, meaning your driver runs on one of the workers and you don't know which one in advance... then...

The answer is yes, to the dir named in the classpath. On EC2 the only persistent data storage is /root/persistent-hdfs, but I don't know if that's a good idea.

In the Spark docs on EC2, I see this line:

To deploy code or data within your cluster, you can log in and use
the provided script ~/spark-ec2/copy-dir, which, given a directory 
path, RSYNCs it to the same location on all the slaves.

SPARK_CLASSPATH

I wouldn't use SPARK_CLASSPATH because it's deprecated as of Spark 1.0, so a good idea is to use its replacement in $SPARK_HOME/conf/spark-defaults.conf:

spark.executor.extraClassPath /path/to/jar/on/worker

This should be the option that works. If you need to do this on the fly, rather than in a conf file, the recommendation is "./spark-submit with --driver-class-path to augment the driver classpath" (from the Spark docs about spark.executor.extraClassPath; see the end of the answer for another source on that).

BUT ... you are not using spark-submit ... I don't know how that works in EC2; looking at the script, I didn't figure out where EC2 lets you supply these parameters on the command line. You mention you already do this in setting up your SparkConf object, so stick with that if it works for you.
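
If you do stay with programmatic configuration rather than spark-defaults.conf or spark-submit, the same property can be set on the SparkConf directly. Here is a sketch under the assumption that the assembly jar has already been copied to the same path on every worker (for example with ~/spark-ec2/copy-dir); the on-worker path is hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // Assumed location on each worker after rsyncing the jar out with copy-dir
    val jarOnWorkers = "/root/myjar/myjar-assembly-1.0.jar"

    val conf = new SparkConf()
      .setAppName("MyApp")
      .setMaster(sys.env("MASTER"))
      .setJars(Seq(jarOnWorkers))                          // serve the jar from the driver
      .set("spark.executor.extraClassPath", jarOnWorkers)  // put it on each executor's classpath
    val sc = new SparkContext(conf)
    // ... job code ...
    sc.stop()
  }
}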

I see that in Spark-years this is a very old question, so I wonder how you resolved it? I hope this helps someone; I learned a lot researching the specifics of EC2.

I must admit, as a limitation on this, it confuses me that the Spark docs for spark.executor.extraClassPath say:

Users typically should not need to set this option

I assume they mean most people will get the classpath out through a driver config option. I know most of the docs for spark-submit make it sound like the script handles moving your code around the cluster, but I think that's only in "standalone client mode", which I assume you are not using; I assume EC2 must be in "standalone cluster mode."

MORE / BACKGROUND ON SPARK_CLASSPATH deprecation:

More background that leads me to think SPARK_CLASSPATH is deprecated is this archived thread, and this one, crossing the other thread, and this one about a WARN message when using SPARK_CLASSPATH:

14/07/09 13:37:36 WARN spark.SparkConf:
SPARK_CLASSPATH was detected (set to 'path-to-proprietary-hadoop-lib/*:
/path-to-proprietary-hadoop-lib/lib/*').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath
