Loading shared libraries (.so) distributed by --files argument with spark


Problem Description

I'm trying to work with an external native library (.so file) when running a Spark job. First of all, I submit the file using the --files argument.
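For reference, the submit command looks roughly like this (a minimal sketch; the main class and jar name are hypothetical placeholders):

# hypothetical names: com.example.MyJob and my-job.jar stand in for the real job
spark-submit \
    --class com.example.MyJob \
    --master yarn \
    --files /local/path/to/libname.so \
    my-job.jar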

To load the library I'm using System.load(SparkFiles.get(libname)) after creating the SparkContext (to make sure SparkFiles are populated). The problem is that the library is only loaded by the driver node, and when tasks try to access the native methods I get:

WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, 13.0.0.206, executor 0): java.lang.UnsatisfiedLinkError

The only thing that worked for me was copying the .so file to all the workers before running the Spark app, and creating a Scala object that loads the library before each task (this can be optimized with mapPartitions).

I tried using

--conf "spark.executor.extraLibraryPath=/local/path/to/so" \
--conf "spark.driver.extraLibraryPath=/local/path/to/so"

to try to avoid that, but without success.

Now, since I'm using EMR to run Spark jobs rather than a persistent cluster, I would like to avoid copying the files to all the nodes before running the job.

Any suggestions?

Solution

The solution was simpler than I thought: all I need is for the library to be loaded once per JVM.

So basically, what I need is to add the library file using --files and to create a Loader object:

import org.apache.spark.SparkFiles

object LibraryLoader {
    // Lazy, so the load happens after SparkFiles is populated, and at most once per JVM
    lazy val load = System.load(SparkFiles.get("libname"))
}

and use it before each task (map, filter, etc.)

For example:

rdd.map { x =>
    LibraryLoader.load
    // do some stuff with x
}

The laziness ensures that the object is created only after SparkFiles has been populated, and also that the library is loaded just once per JVM.
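If calling the loader for every record is a concern, the same pattern can be moved into mapPartitions, as mentioned above, so the call happens once per partition instead of once per element. A minimal sketch under the same assumptions:

rdd.mapPartitions { iter =>
    // The lazy val body runs only on first access, so the native library is
    // still loaded at most once per executor JVM.
    LibraryLoader.load
    iter.map { x =>
        // do some stuff with x
        x
    }
}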
