Can't get pyspark job to run on all nodes of hadoop cluster

Problem description

Summary: I can't get my Python Spark job to run on all nodes of my Hadoop cluster. I've installed Spark for Hadoop, 'spark-1.5.2-bin-hadoop2.6'. When launching a Java Spark job, the load gets distributed over all nodes; when launching a Python Spark job, only one node takes the load.

Setup:

  • hdfs and yarn configured for 4 nodes: nk01 (namenode), nk02, nk03, nk04, running on xen virtual servers
  • versions: jdk1.8.0_66, hadoop-2.7.1, spark-1.5.2-bin-hadoop2.6
  • hadoop installed on all 4 nodes
  • spark only installed on nk01

I copied a bunch of Gutenberg files (thank you, Johannes!) onto hdfs, and tried doing a wordcount using Java and Python on a subset of the files (the files that start with an 'e'):

Python:

Using a homebrew python script for doing wordcount:

/opt/spark/bin/spark-submit wordcount.py --master yarn-cluster \
    --num-executors 4 --executor-cores 1

The Python code assigns 4 partitions:

tt=sc.textFile('/user/me/gutenberg/text/e*.txt',4)
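
For reference, a minimal version of such a wordcount.py might look like the following (a sketch of the standard RDD word-count pattern; the actual script is not shown in the question, and the app name and the sampling at the end are illustrative):

from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

# Read the Gutenberg subset into 4 partitions, as above.
tt = sc.textFile('/user/me/gutenberg/text/e*.txt', 4)

# Classic word count: split lines into words, pair each with 1, sum per word.
counts = (tt.flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(add))

# Bring a small sample back to the driver just to have some output.
print(counts.take(10))

sc.stop()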

Load on the 4 nodes during 60 seconds:

Java:

Using the JavaWordCount found in the spark distribution:

/opt/spark/bin/spark-submit --class JavaWordCount --master yarn-cluster \
    --num-executors 4 jwc.jar '/user/me/gutenberg/text/e*.txt'

Conclusion: the Java version distributes its load across the cluster; the Python version just runs on 1 node.

Question: how do I get the python version also to distribute the load across all nodes?

Solution

Spark-submit

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

The difference from the Scala/Java submit lies in the parameter position.

For Python applications, simply pass a .py file in the place of application-jar instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

You should use the command below instead:

/opt/spark/bin/spark-submit --master yarn-cluster wordcount.py \
    --num-executors 4 --executor-cores 1
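
Why the ordering matters: spark-submit stops reading its own options at the application file, so with the original command everything after wordcount.py was handed to the script as plain arguments and the --master and executor settings were never applied. A quick way to see this from inside the script (an illustrative sketch, not part of the original answer):

import sys
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

# With the mis-ordered submit, the flags end up here instead of in spark-submit:
print("script arguments: %s" % sys.argv[1:])   # e.g. ['--master', 'yarn-cluster', ...]
print("effective master: %s" % sc.master)      # not 'yarn-cluster' in that case

sc.stop()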
