pyspark: ship jar dependency with spark-submit
Question
I wrote a pyspark script that reads two JSON files, coGroups them, and sends the result to an elasticsearch cluster; everything works (mostly) as expected when I run it locally. I downloaded the elasticsearch-hadoop jar file for the org.elasticsearch.hadoop.mr.EsOutputFormat and org.elasticsearch.hadoop.mr.LinkedMapWritable classes, then run my job with pyspark using the --jars argument, and I can see documents appearing in my elasticsearch cluster.
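For context, the write path described here follows the usual elasticsearch-hadoop pattern for saveAsNewAPIHadoopFile. Below is a minimal sketch of what such a script might contain; the index name, node address, and the to_es_pair helper are placeholders for illustration (es_write_conf is the name that appears in the traceback), not the exact code from the question:

```python
import json

# Configuration dict passed as conf= to saveAsNewAPIHadoopFile.
# Index name and node address are placeholders.
es_write_conf = {
    "es.resource": "testindex/docs",  # target index/type (placeholder)
    "es.nodes": "localhost",          # elasticsearch node (placeholder)
    "es.port": "9200",
    "es.input.json": "yes",           # values are pre-serialized JSON strings
}

# elasticsearch-hadoop expects an RDD of (key, value) pairs; with
# es.input.json enabled, the value is a JSON string and the key is ignored.
def to_es_pair(doc):
    return (None, json.dumps(doc))

# With a real SparkContext, the save call would look roughly like:
# rdd.map(to_es_pair).saveAsNewAPIHadoopFile(
#     path="-",  # unused by EsOutputFormat, but the API requires a path
#     outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
#     keyClass="org.apache.hadoop.io.NullWritable",
#     valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
#     conf=es_write_conf,
# )
```

Both class names in that call come from the elasticsearch-hadoop jar, which is why the workers fail with ClassNotFoundException when the jar is not shipped to them.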
When I try to run it on a Spark cluster, however, I get this error:
Traceback (most recent call last):
File "/root/spark/spark_test.py", line 141, in <module>
conf=es_write_conf
File "/root/spark/python/pyspark/rdd.py", line 1302, in saveAsNewAPIHadoopFile
keyConverter, valueConverter, jconf)
File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.
: java.lang.ClassNotFoundException: org.elasticsearch.hadoop.mr.LinkedMapWritable
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at org.apache.spark.util.Utils$.classForName(Utils.scala:157)
at org.apache.spark.api.python.PythonRDD$$anonfun$getKeyValueTypes$1$$anonfun$apply$9.apply(PythonRDD.scala:611)
at org.apache.spark.api.python.PythonRDD$$anonfun$getKeyValueTypes$1$$anonfun$apply$9.apply(PythonRDD.scala:610)
at scala.Option.map(Option.scala:145)
at org.apache.spark.api.python.PythonRDD$$anonfun$getKeyValueTypes$1.apply(PythonRDD.scala:610)
at org.apache.spark.api.python.PythonRDD$$anonfun$getKeyValueTypes$1.apply(PythonRDD.scala:609)
at scala.Option.flatMap(Option.scala:170)
at org.apache.spark.api.python.PythonRDD$.getKeyValueTypes(PythonRDD.scala:609)
at org.apache.spark.api.python.PythonRDD$.saveAsNewAPIHadoopFile(PythonRDD.scala:701)
at org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
This seems pretty clear to me: the elasticsearch-hadoop jar is not available on the workers. So the question is: how do I ship it along with my app? I could use sc.addPyFile for a Python dependency, but it won't work with jars, and using the --jars parameter of spark-submit doesn't help.
Answer
--jars just works; the problem was how I was running the spark-submit job in the first place. The correct way to execute it is:
./bin/spark-submit <options> scriptname
Therefore the --jars
option must be placed before the script:
./bin/spark-submit --jars /path/to/my.jar myscript.py
This is obvious if you consider that this is the only way to pass arguments to the script itself: everything after the script name is used as input arguments for the script:
./bin/spark-submit --jars /path/to/my.jar myscript.py --do-magic=true
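Since everything after the script name reaches the script, the script reads those flags itself via the normal Python mechanisms. A small sketch, assuming the --do-magic flag from the example above is parsed with argparse:

```python
import argparse

# Everything spark-submit sees after myscript.py arrives in sys.argv,
# so the script parses its own flags as usual.
parser = argparse.ArgumentParser()
parser.add_argument("--do-magic", dest="do_magic", default="false")

# Simulate what the script would receive from:
#   ./bin/spark-submit --jars /path/to/my.jar myscript.py --do-magic=true
args = parser.parse_args(["--do-magic=true"])
print(args.do_magic)  # -> true
```

Spark-level options like --jars, by contrast, must appear before the script name or spark-submit will hand them to the script instead of consuming them.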