如何在Virtualenv中为pyspark运行spark-submit? [英] How to run spark-submit in virtualenv for pyspark?

查看:135
本文介绍了如何在Virtualenv中为pyspark运行spark-submit?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在virtualenv中,是否可以运行 spark-submit (来自HDP 3.1.0的spark v2.3.2)?出现以下情况:在virtualenv中具有使用python3(和某些特定的lib)的python文件(以将lib版本与系统其余部分隔离).我想使用/bin/spark-submit 运行此文件,但尝试这样做我得到...

Is there a way to run spark-submit (spark v2.3.2 from HDP 3.1.0) while in a virtualenv? Have situation where have python file that uses python3 (and some specific libs) in a virtualenv (to isolate lib versions from rest of system). I would like to run this file with /bin/spark-submit, but attempting to do so I get...

[me@airflowetl tests]$ source ../venv/bin/activate; /bin/spark-submit sparksubmit.test.py 
  File "/bin/hdp-select", line 255
    print "ERROR: Invalid package - " + name
                                    ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: Invalid package - " + name)?
ls: cannot access /usr/hdp//hadoop/lib: No such file or directory
Exception in thread "main" java.lang.IllegalStateException: hdp.version is not set while running Spark under HDP, please set through HDP_VERSION in spark-env.sh or add a java-opts file in conf with -Dhdp.version=xxx
    at org.apache.spark.launcher.Main.main(Main.java:118)

也尝试过...

(venv) [me@airflowetl tests]$ export HADOOP_CONF_DIR=/etc/hadoop/conf; spark-submit --master yarn --deploy-mode cluster sparksubmit.test.py 
19/12/12 13:50:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/12/12 13:50:20 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Exception in thread "main" java.lang.NoClassDefFoundError: com/sun/jersey/api/client/config/ClientConfig
    at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:55)
    ....
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: com.sun.jersey.api.client.config.ClientConfig

...或(从此处 https://www.hackingnote.com/en/spark/trouble-shooting/NoClassDefFoundError-ClientConfig )...

(venv) [airflow@airflowetl tests]$ spark-submit --master yarn --deploy-mode client --conf spark.hadoop.yarn.timeline-service.enabled=false sparksubmit.test.py 
19/12/12 15:22:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/12/12 15:22:49 INFO spark.SparkContext: Running Spark version 2.4.4
19/12/12 15:22:49 INFO spark.SparkContext: Submitted application: hph_etl_TEST
19/12/12 15:22:49 INFO spark.SecurityManager: Changing view acls to: airflow
19/12/12 15:22:49 INFO spark.SecurityManager: Changing modify acls to: airflow
19/12/12 15:22:49 INFO spark.SecurityManager: Changing view acls groups to: 
19/12/12 15:22:49 INFO spark.SecurityManager: Changing modify acls groups to: 
19/12/12 15:22:49 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(airflow); groups with view permissions: Set(); users  with modify permissions: Set(airflow); groups with modify permissions: Set()
19/12/12 15:22:49 INFO util.Utils: Successfully started service 'sparkDriver' on port 45232.
19/12/12 15:22:50 INFO spark.SparkEnv: Registering MapOutputTracker
19/12/12 15:22:50 INFO spark.SparkEnv: Registering BlockManagerMaster
19/12/12 15:22:50 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/12/12 15:22:50 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/12/12 15:22:50 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-320366b6-609a-497b-ac40-119d11682044
19/12/12 15:22:50 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
19/12/12 15:22:50 INFO spark.SparkEnv: Registering OutputCommitCoordinator
19/12/12 15:22:50 INFO util.log: Logging initialized @2663ms
19/12/12 15:22:50 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
19/12/12 15:22:50 INFO server.Server: Started @2763ms
19/12/12 15:22:50 INFO server.AbstractConnector: Started ServerConnector@50a3c656{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
19/12/12 15:22:50 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@306c15f1{/jobs,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2b566f8d{/jobs/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1b5ef515{/jobs/job,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@59f7a5e2{/jobs/job/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@41c58356{/stages,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2d5f2026{/stages/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@324ca89a{/stages/stage,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6f487c61{/stages/stage/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3897116a{/stages/pool,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@68ab090f{/stages/pool/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@42ea3278{/storage,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6eedf530{/storage/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6e71a5c6{/storage/rdd,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5e222a76{/storage/rdd/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4dc8aa38{/environment,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4c8d82c4{/environment/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2fb15106{/executors,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@608faf1c{/executors/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@689e405f{/executors/threadDump,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@48a5742a{/executors/threadDump/json,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6db93559{/static,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4d7ed508{/,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5510f12d{/api,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6d87de7{/jobs/job/kill,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@62595660{/stages/stage/kill,null,AVAILABLE,@Spark}
19/12/12 15:22:50 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://airflowetl.local:4040
19/12/12 15:22:51 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
19/12/12 15:22:51 INFO client.RMProxy: Connecting to ResourceManager at hw001.local/172.18.4.46:8050
19/12/12 15:22:51 INFO yarn.Client: Requesting a new application from cluster with 4 NodeManagers
19/12/12 15:22:51 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (15360 MB per container)
19/12/12 15:22:51 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
19/12/12 15:22:51 INFO yarn.Client: Setting up container launch context for our AM
19/12/12 15:22:51 INFO yarn.Client: Setting up the launch environment for our AM container
19/12/12 15:22:51 INFO yarn.Client: Preparing resources for our AM container
19/12/12 15:22:51 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
19/12/12 15:22:53 INFO yarn.Client: Uploading resource file:/tmp/spark-4e600acd-2d34-4271-b01c-25f312906f93/__spark_libs__8368679994314392346.zip -> hdfs://hw001.local:8020/user/airflow/.sparkStaging/application_1572898343646_0029/__spark_libs__8368679994314392346.zip
19/12/12 15:22:54 INFO yarn.Client: Uploading resource file:/home/airflow/projects/hph_etl_airflow/venv/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip -> hdfs://hw001.local:8020/user/airflow/.sparkStaging/application_1572898343646_0029/pyspark.zip
19/12/12 15:22:55 INFO yarn.Client: Uploading resource file:/home/airflow/projects/hph_etl_airflow/venv/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip -> hdfs://hw001.local:8020/user/airflow/.sparkStaging/application_1572898343646_0029/py4j-0.10.7-src.zip
19/12/12 15:22:55 INFO yarn.Client: Uploading resource file:/tmp/spark-4e600acd-2d34-4271-b01c-25f312906f93/__spark_conf__5403285055443058510.zip -> hdfs://hw001.local:8020/user/airflow/.sparkStaging/application_1572898343646_0029/__spark_conf__.zip
19/12/12 15:22:55 INFO spark.SecurityManager: Changing view acls to: airflow
19/12/12 15:22:55 INFO spark.SecurityManager: Changing modify acls to: airflow
19/12/12 15:22:55 INFO spark.SecurityManager: Changing view acls groups to: 
19/12/12 15:22:55 INFO spark.SecurityManager: Changing modify acls groups to: 
19/12/12 15:22:55 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(airflow); groups with view permissions: Set(); users  with modify permissions: Set(airflow); groups with modify permissions: Set()
19/12/12 15:22:56 INFO yarn.Client: Submitting application application_1572898343646_0029 to ResourceManager
19/12/12 15:22:56 INFO impl.YarnClientImpl: Submitted application application_1572898343646_0029
19/12/12 15:22:56 INFO cluster.SchedulerExtensionServices: Starting Yarn extension services with app application_1572898343646_0029 and attemptId None
19/12/12 15:22:57 INFO yarn.Client: Application report for application_1572898343646_0029 (state: ACCEPTED)
19/12/12 15:22:57 INFO yarn.Client: 
     client token: N/A
     diagnostics: AM container is launched, waiting for AM container to Register with RM
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1576200176385
     final status: UNDEFINED
     tracking URL: http://hw001.local:8088/proxy/application_1572898343646_0029/
     user: airflow
19/12/12 15:22:58 INFO yarn.Client: Application report for application_1572898343646_0029 (state: FAILED)
19/12/12 15:22:58 INFO yarn.Client: 
     client token: N/A
     diagnostics: Application application_1572898343646_0029 failed 2 times due to AM Container for appattempt_1572898343646_0029_000002 exited with  exitCode: 1
Failing this attempt.Diagnostics: [2019-12-12 15:22:58.214]Exception from container-launch.
Container id: container_e02_1572898343646_0029_02_000001
Exit code: 1

[2019-12-12 15:22:58.215]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/hadoop/yarn/local/usercache/airflow/appcache/application_1572898343646_0029/container_e02_1572898343646_0029_02_000001/launch_container.sh: line 38: $PWD:$PWD/__spark_conf__:$PWD/__spark_libs__/*:$HADOOP_CONF_DIR:/usr/hdp/3.1.0.0-78/hadoop/*:/usr/hdp/3.1.0.0-78/hadoop/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/*:$PWD/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:$PWD/mr-framework/hadoop/share/hadoop/common/*:$PWD/mr-framework/hadoop/share/hadoop/common/lib/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/*:$PWD/mr-framework/hadoop/share/hadoop/yarn/lib/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/*:$PWD/mr-framework/hadoop/share/hadoop/hdfs/lib/*:$PWD/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/${hdp.version}/hadoop/lib/hadoop-lzo-0.6.0.${hdp.version}.jar:/etc/hadoop/conf/secure:$PWD/__spark_conf__/__hadoop_conf__: bad substitution
....

不确定该怎么做或如何进一步进行操作,并且在对其进行谷歌搜索之后还不能完全理解错误消息.

Not sure what to make of this or how to proceed further and did not totally understand the error message after googling it.

有更多经验的人对此有进一步的调试提示或修补程序吗?

Anyone with more experience have any further debugging tips for this or fixes?

推荐答案

spark-submit是一个bash脚本,使用Java类来运行,因此使用virtualenv不一定有帮助(尽管您可以在日志中看到文件是从环境上传的).

spark-submit is a bash script, and uses Java classes to run, so using a virtualenv wouldn't necessarily help (although, you can see in the logs that files were uploaded from the environment).

第一个错误是因为hdp-select需要Python2,但它似乎与Python3一起运行(可能是由于您的venv)

The first error is because hdp-select requires Python2, but it looks like it ran with Python3 (probably due to your venv)

如果要将Python环境带给执行者和驱动程序,则可能需要使用-pyfiles 选项,或者在每个Spark节点上设置相同的python环境

If you want to carry your Python environment to the executors and driver, you'd probably want to use the --pyfiles option instead, or setup the same python environment on each Spark node

此外,您似乎拥有Spark 2.4.4,而不是您所说的2.3.2,如果您混合使用Spark版本,这可以解释NoClassDef(特别是pip的pyspark不会下载任何调度程序特定的程序包,就像YARN时间轴一样)

Also, you seem to have Spark 2.4.4, not 2.3.2, like you say, which could explain the NoClassDef if you're mixing Spark versions (in particular pyspark from pip doesn't download any scheduler specific packages, like the YARN timeline)

但是您运行的很好,您可以在

But you ran the code fine and you can find the real exception under

http://hw001.local:8088/proxy/application_1572898343646_0029

这篇关于如何在Virtualenv中为pyspark运行spark-submit?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆