Use an external library in a pyspark job in a Spark cluster from google-dataproc
Problem Description
I have a spark cluster I created via google dataproc. I want to be able to use the csv library from databricks (see https://github.com/databricks/spark-csv). So I first tested it like this:
I started an SSH session with the master node of my cluster, then I entered:
pyspark --packages com.databricks:spark-csv_2.11:1.2.0
Then it launched a pyspark shell, in which I entered:
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('gs://xxxx/foo.csv')
df.show()
And it worked.
My next step is to launch this job from my main machine using the command:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> my_job.py
But here it does not work and I get an error. I think it is because I did not give --packages com.databricks:spark-csv_2.11:1.2.0 as an argument, but I tried 10 different ways to pass it and none of them worked.
My questions are:
- Was the databricks csv library installed after I typed pyspark --packages com.databricks:spark-csv_2.11:1.2.0?
- Can I write a line in my job.py in order to import it?
- Or what params should I give to my gcloud command to import it or install it?
Short Answer
There are quirks in the ordering of arguments where --packages isn't accepted by spark-submit if it comes after the my_job.py argument. To work around this, you can do the following when submitting from Dataproc's CLI:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
--properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py
Basically, just add --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 before the .py file in your command.
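As a side note, since the question asks whether a line in my_job.py can import the library: the package still has to be supplied at submit time, but unlike the interactive shell, a standalone script also needs to create its own SparkContext and SQLContext before using the spark-csv data source. The job file isn't shown in the question, so here is only a minimal sketch of what it might look like (the bucket path and option values are just the placeholders from the question):

# my_job.py - minimal sketch; assumes spark-csv was supplied via --properties/--packages at submit time
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('gs://xxxx/foo.csv')
df.show()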
Long Answer
So, this is actually a different issue from the known lack of support for --jars in gcloud beta dataproc jobs submit pyspark; because Dataproc doesn't explicitly recognize --packages as a special spark-submit-level flag, it tries to pass it after the application arguments, so spark-submit lets --packages fall through as an application argument rather than properly parsing it as a submission-level option. Indeed, in an SSH session, the following does not work:
# Doesn't work if job.py depends on that package.
spark-submit job.py --packages com.databricks:spark-csv_2.11:1.2.0
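To see why, note that anything spark-submit doesn't recognize after the primary resource simply shows up in the job's own argument list. A quick hypothetical check (not part of the original job) is to print sys.argv at the top of the script; with the ordering above, the --packages flag and its value land there instead of being consumed by spark-submit:

# check_args.py - hypothetical snippet showing where trailing flags end up
import sys

# With: spark-submit check_args.py --packages com.databricks:spark-csv_2.11:1.2.0
# this prints something like:
# ['check_args.py', '--packages', 'com.databricks:spark-csv_2.11:1.2.0']
print(sys.argv)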
But switching the order of the arguments does work, even though in the pyspark case, both orderings work:
# Works with dependencies on that package.
spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py
pyspark job.py --packages com.databricks:spark-csv_2.11:1.2.0
pyspark --packages com.databricks:spark-csv_2.11:1.2.0 job.py
So even though spark-submit job.py is supposed to be a drop-in replacement for everything that previously called pyspark job.py, the difference in parse ordering for things like --packages means it's not actually a 100% compatible migration. This might be something to follow up on with the Spark side.
Anyhow, fortunately there's a workaround, since --packages is just another alias for the Spark property spark.jars.packages, and Dataproc's CLI supports properties just fine. So you can just do the following:
gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
--properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py
Note that the --properties must come before the my_job.py, otherwise it gets sent as an application argument rather than as a configuration flag. Hope that works for you! Note that the equivalent in an SSH session would be spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py.
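And because --packages is just shorthand for the spark.jars.packages property, the property form should also be equivalent when using spark-submit directly, if you prefer to keep everything expressed as properties (untested here, but it is the same mechanism the gcloud workaround relies on):

spark-submit --conf spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 job.py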