What's the difference between --archives, --files, py-files in pyspark job arguments


Question

--archives, --files, --py-files, sc.addFile and sc.addPyFile are quite confusing. Can someone explain these clearly?

Answer

These options are truly scattered all over the place.

In general, add your data files via --files or --archives and your code files via --py-files. The latter will be added to the classpath (cf. here) so you can import and use them.
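As an illustration, here is a minimal pyspark sketch of how each kind of shipped file is used at runtime; the submit command and all file and module names (deps.zip, lookup.csv, helpers) are assumptions for the example:

    # Hypothetical submission:
    #   spark-submit --py-files deps.zip --files lookup.csv app.py
    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="files-vs-py-files")

    # deps.zip was shipped with --py-files, so modules inside it are importable.
    import helpers  # assumed to live in deps.zip

    # lookup.csv was shipped with --files; SparkFiles.get() resolves its local
    # path in the working directory of the driver or executor.
    with open(SparkFiles.get("lookup.csv")) as f:
        header = f.readline()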

As you can imagine, the CLI arguments are actually handled by the addFile and addPyFile functions (cf. here).

Behind the scenes, pyspark invokes the more general spark-submit script.

You can add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.

• From http://spark.apache.org/docs/latest/running-on-yarn.html:

  The --files and --archives options support specifying file names with the # similar to Hadoop. For example you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.

• From http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=addpyfile#pyspark.SparkContext.addPyFile:

  addFile(path): Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.

  addPyFile(path): Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
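For comparison, a minimal sketch of this programmatic route, where the paths (hdfs:///data/lookup.csv, /tmp/deps.zip) and the helpers module are hypothetical:

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="addFile-addPyFile")

    # Runtime equivalent of --files: ship a data file to every node.
    # Local paths, HDFS paths, and HTTP/HTTPS/FTP URIs are all accepted.
    sc.addFile("hdfs:///data/lookup.csv")

    # Runtime equivalent of --py-files: ship a code dependency, then import it.
    sc.addPyFile("/tmp/deps.zip")  # assumed to contain helpers.py
    import helpers

    # Tasks locate the shipped file through SparkFiles.get().
    def first_line(_):
        with open(SparkFiles.get("lookup.csv")) as f:
            return f.readline()

    print(sc.parallelize([0]).map(first_line).first())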

