Use an external library in a pyspark job in a Spark cluster from google-dataproc


Problem description

I have a Spark cluster I created via Google Dataproc. I want to be able to use the CSV library from Databricks (see https://github.com/databricks/spark-csv). So I first tested it like this:

I started an SSH session with the master node of my cluster, then I typed:

pyspark --packages com.databricks:spark-csv_2.11:1.2.0

Then it launched a pyspark shell, in which I entered:

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('gs://xxxx/foo.csv')
df.show()

It worked.

My next step was to launch this job from my main machine using the command:

gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> my_job.py

But here it does not work and I get an error. I think this is because I did not pass --packages com.databricks:spark-csv_2.11:1.2.0 as an argument, but I tried passing it in 10 different ways and none of them worked.

My questions are:

  1. Was the Databricks CSV library installed after I typed pyspark --packages com.databricks:spark-csv_2.11:1.2.0?
  2. Can I write a line in my job.py in order to import it?
  3. Or what params should I give to my gcloud command to import it or install it?
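
For reference, the question does not show my_job.py itself; a minimal sketch consistent with the shell test above might look like the following (hypothetical, and the bucket path is just a placeholder):

# my_job.py -- hypothetical sketch; the actual script is not shown in the question
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

# Uses the spark-csv data source, so the com.databricks:spark-csv package
# must be made available to the job at submit time.
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('gs://xxxx/foo.csv')
df.show()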

Recommended answer

Short answer

There are quirks in the ordering of arguments where --packages isn't accepted by spark-submit if it comes after the my_job.py argument. To work around this, you can do the following when submitting from Dataproc's CLI:

gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
    --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py

Basically, just add --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 before the .py file in your command.

Long answer

So this is actually a different issue than the known lack of support for --jars in gcloud beta dataproc jobs submit pyspark; it appears that without Dataproc explicitly recognizing --packages as a special spark-submit-level flag, it tries to pass it after the application arguments, so spark-submit lets --packages fall through as an application argument rather than parsing it properly as a submission-level option. Indeed, in an SSH session, the following does not work:

# Doesn't work if job.py depends on that package.
spark-submit job.py --packages com.databricks:spark-csv_2.11:1.2.0
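
One way to see what is happening here (a hypothetical probe, not from the original answer): when the flags come after the script name, spark-submit hands them to the application itself, so a script that simply prints its arguments receives them verbatim:

# probe.py -- hypothetical helper showing where the flags end up
import sys

# Submitted as: spark-submit probe.py --packages com.databricks:spark-csv_2.11:1.2.0
# this prints something like ['/path/to/probe.py', '--packages', 'com.databricks:spark-csv_2.11:1.2.0'],
# i.e. the flag reaches the application untouched and the package is never resolved.
print(sys.argv)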

But switching the order of the arguments does work, even though in the pyspark case both orderings work:

# Works with dependencies on that package.
spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py
pyspark job.py --packages com.databricks:spark-csv_2.11:1.2.0
pyspark --packages com.databricks:spark-csv_2.11:1.2.0 job.py

So even though spark-submit job.py is supposed to be a drop-in replacement for everything that previously called pyspark job.py, the difference in parse ordering for things like --packages means it's not actually a 100% compatible migration. This might be something to follow up on with the Spark side.

Anyhow, fortunately there's a workaround, since --packages is just another alias for the Spark property spark.jars.packages, and Dataproc's CLI supports properties just fine. So you can just do the following:

gcloud beta dataproc jobs submit pyspark --cluster <my-dataproc-cluster> \
    --properties spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0 my_job.py

Note that --properties must come before my_job.py, otherwise it gets sent as an application argument rather than as a configuration flag. Hope that works for you! Note that the equivalent in an SSH session would be spark-submit --packages com.databricks:spark-csv_2.11:1.2.0 job.py.
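
If you want to confirm that the package coordinate actually reached the job (a hypothetical check, not part of the original answer), you can print the property from inside my_job.py:

from pyspark import SparkContext

sc = SparkContext()
# Prints the Maven coordinate when it was supplied via --properties (or via --packages
# in the working orderings above); prints 'not set' if the flag fell through as an
# application argument instead.
print(sc.getConf().get("spark.jars.packages", "not set"))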
