how to use spark_apply_bundle


Question


I am trying to use spark_apply_bundle to limit the number of packages/data transferred to the worker nodes on a YARN-managed cluster. As mentioned here, I must pass the path of the tarball to spark_apply as the packages argument, and I must also make it available via "sparklyr.shell.files" in the Spark config.
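The tarball referenced below is assumed to have been created beforehand with sparklyr::spark_apply_bundle(); a minimal sketch of that step, assuming the default arguments write the .tar into the current working directory (the file name shown is illustrative):

library(sparklyr)

# Bundle the locally installed R packages into a single .tar so they can be
# shipped to the workers once, instead of being re-packaged on every
# spark_apply() call; the function returns the path of the tarball it creates.
bundle <- spark_apply_bundle()
# e.g. "/home/user/project/packages-abc123.tar" (illustrative)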

My questions are:

  • Can the path to the tarball be relative to the project's working directory? If not, should it be stored on HDFS or somewhere else?
  • What should be passed to "sparklyr.shell.files"? Is it a duplicate of the path passed to spark_apply?

Currently, my unsuccessful script looks something like this:

bundle <- paste(getwd(), list.files()[grep("\\.tar$",list.files())][1], sep = "/")

...

config$sparklyr.shell.files <- bundle
sc <- spark_connect(master = "yarn-client", config = config)

...

spark_apply(sdf, f, packages = bundle)

Solution

The Spark job succeeded once the tarball was copied to HDFS. Other approaches are probably possible as well (e.g. copying the file to each worker node), but this seems to be the simplest solution.

The updated script looks as follows:

# Locate the package bundle (.tar) in the project's working directory
bundle <- paste(getwd(), list.files()[grep("\\.tar$", list.files())][1], sep = "/")

...

# Stage the bundle on HDFS so YARN can distribute it to the worker nodes
hdfs_path <- "hdfs://nn.example.com/some/directory/"
hdfs_bundle <- paste0(hdfs_path, basename(bundle))
system(paste("hdfs dfs -put", bundle, hdfs_path))

# Ship the HDFS copy to the executors via sparklyr.shell.files (spark-submit --files),
# while spark_apply() is still given the bundle path through its packages argument
config$sparklyr.shell.files <- hdfs_bundle
sc <- spark_connect(master = "yarn-client", config = config)

...

spark_apply(sdf, f, packages = bundle)

