Spark - How to create a variable that is different for each executor context?


Question

My Spark application launches several executors. I have several partitions that are spread across those executors.

When using map() on these partitions, I want to use a MongoDB connection (MongoDB Java Driver) to query additional data, process that data, and return it as the output of the map() function.

I want to create one connection per executor. Each partition should then access this executor-local variable and use it to query the data.

Establishing a connection for each partition is probably not a good idea. Broadcasting the connection won't work either, because it is not serializable (I think?).

To summarize:

  • How can I create a variable that is different for each executor context?

Recommended answer

You should use the MongoConnector.

It will handle creating a connection and is backed by a cache that efficiently handles the shutdown of any MongoClients. It is serialisable, so it can be broadcast, and it can take options, a readConfig or the Spark context to configure where to connect to.
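
To illustrate the general idea of a serialisable handle backed by a per-JVM cache, here is a minimal plain-Java sketch. It is not the MongoConnector implementation: `FakeClient`, `ConnectionHandle` and `ExecutorLocalDemo` are hypothetical names, and `FakeClient` merely stands in for a real, non-serializable client. The handle carries only the connection URI, so it can be shipped to executors, and each JVM creates its own client on first use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for a real, non-serializable client such as MongoClient.
final class FakeClient {
    final String uri;

    FakeClient(String uri) {
        this.uri = uri;
    }
}

// Serializable handle (assumed design): it ships only the URI, never the client.
final class ConnectionHandle implements Serializable {
    private static final long serialVersionUID = 1L;

    // Static fields are not serialized, so every JVM (executor) has its own cache.
    private static final ConcurrentMap<String, FakeClient> CACHE = new ConcurrentHashMap<>();

    private final String uri;

    ConnectionHandle(String uri) {
        this.uri = uri;
    }

    // Creates the client on first use in this JVM, then reuses it.
    FakeClient client() {
        return CACHE.computeIfAbsent(uri, FakeClient::new);
    }

    // Simulates shipping the handle to an executor via Java serialization.
    static ConnectionHandle roundTrip(ConnectionHandle handle) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(handle);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (ConnectionHandle) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }
}

final class ExecutorLocalDemo {
    public static void main(String[] args) {
        ConnectionHandle original = new ConnectionHandle("mongodb://localhost:27017");
        ConnectionHandle shipped = ConnectionHandle.roundTrip(original);
        // Within one JVM, both handles resolve to the same cached client.
        System.out.println(original.client() == shipped.client());
    }
}
```

In a Spark job, a handle like this would be captured by the map() closure or broadcast, and every task running in the same executor JVM would end up sharing one client. This sketch never closes clients; a real cache also has to decide when to shut them down.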

MongoConnector uses the loan pattern to handle reference management of the underlying connection to MongoDB and allows access at the MongoClient, MongoDatabase or MongoCollection level.
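
The loan pattern itself can be sketched in plain Java as follows. This is an assumption-laden illustration, not the connector's actual code: `DemoClient`, `ClientLender` and `withClient` are hypothetical names. The lender hands a shared client to a callback, counts outstanding loans, and closes the client once the last loan is returned, so callers never manage the connection lifecycle themselves.

```java
import java.util.function.Function;

// Hypothetical stand-in for a real client such as MongoClient.
final class DemoClient implements AutoCloseable {
    private boolean open = true;

    String query(String key) {
        if (!open) {
            throw new IllegalStateException("client is closed");
        }
        return "value-for-" + key;
    }

    @Override
    public void close() {
        open = false;
    }
}

// Lends one shared client to callers and tracks outstanding loans.
final class ClientLender {
    private DemoClient client;
    private int activeLoans = 0;
    private int clientsCreated = 0;

    private synchronized DemoClient acquire() {
        if (client == null) {
            client = new DemoClient();
            clientsCreated++;
        }
        activeLoans++;
        return client;
    }

    private synchronized void release() {
        activeLoans--;
        if (activeLoans == 0) {
            // Last loan returned: close now. A real cache might keep the
            // client alive for a while instead of closing immediately.
            client.close();
            client = null;
        }
    }

    // The loan: the caller borrows the client only for the duration of `work`.
    <T> T withClient(Function<DemoClient, T> work) {
        DemoClient lent = acquire();
        try {
            return work.apply(lent);
        } finally {
            release();
        }
    }

    synchronized int activeLoans() {
        return activeLoans;
    }

    synchronized int clientsCreated() {
        return clientsCreated;
    }
}

final class LoanPatternDemo {
    public static void main(String[] args) {
        ClientLender lender = new ClientLender();
        String result = lender.withClient(c -> c.query("a"));
        System.out.println(result);
        System.out.println(lender.activeLoans());
    }
}
```

Because the client is released in a `finally` block, the reference count stays correct even when the callback throws.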
