Spark - How to create a variable that is different for each executor context?
Question
My Spark application launches several executors. I have several partitions that get spread over my executors.
When using map() on these partitions, I want to use a MongoDB connection (MongoDB Java Driver) to query more data, process it, and return it as the output of the map() function.
I want to create one connection per executor. Each partition should then access this executor-local variable and use it to query the data.
Establishing a connection for each partition is probably not a good idea. Broadcasting the connection won't work either because it is not serializable (I think?).
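One common workaround for the serialization problem is to broadcast not the connection itself but a small serializable holder whose client field is marked transient, so it is dropped during serialization and rebuilt lazily on each executor JVM. A minimal sketch of that idea, where FakeClient is a stand-in for a real com.mongodb.MongoClient and all names are illustrative:

```java
import java.io.Serializable;

// Sketch of an executor-local lazy connection. The holder is serializable,
// but the client field is transient, so it never travels over the wire:
// each executor JVM rebuilds its own client on first use.
class LazyClientHolder implements Serializable {
    private final String uri;
    private transient FakeClient client; // null after deserialization

    LazyClientHolder(String uri) { this.uri = uri; }

    // One client per holder per JVM, created lazily on first access.
    synchronized FakeClient get() {
        if (client == null) {
            client = new FakeClient(uri);
        }
        return client;
    }

    // Stand-in for com.mongodb.MongoClient, keeping the sketch self-contained.
    static class FakeClient {
        final String uri;
        FakeClient(String uri) { this.uri = uri; }
    }
}
```

On the driver you would broadcast the holder; inside map() each task calls get(), and all tasks running in the same executor JVM share the single lazily-created client.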
To summarize:
- How can I create a variable that is different for each executor context?
Answer
You should use the MongoConnector.
It handles creating collections and is backed by a cache that efficiently handles the shutdown of any MongoClients. It is serializable, so it can be broadcast, and it can take options, a ReadConfig, or the Spark context to configure where to connect to.
MongoConnector uses the loan pattern to handle reference management of the underlying connection to MongoDB, and allows access at the MongoClient, MongoDatabase, or MongoCollection level.
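The loan pattern the answer refers to can be sketched as follows. Note that withClientDo, Connector, and FakeClient here are illustrative stand-ins, not the real mongo-spark-connector API: the point is that the connector acquires the client, "loans" it to the caller's function, and guarantees release afterwards.

```java
import java.util.function.Function;

// Illustrative loan pattern: the connector owns the client's lifecycle
// and hands it to the caller's code only for the duration of the call.
class Connector {
    private final String uri;

    Connector(String uri) { this.uri = uri; }

    <T> T withClientDo(Function<FakeClient, T> code) {
        FakeClient client = new FakeClient(uri); // acquire
        try {
            return code.apply(client);           // loan to the caller
        } finally {
            client.close();                      // always released
        }
    }

    // Stand-in for a real MongoClient, keeping the sketch self-contained.
    static class FakeClient {
        final String uri;
        boolean closed = false;
        FakeClient(String uri) { this.uri = uri; }
        void close() { closed = true; }
    }
}
```

Because the caller never holds the client outside the loaned scope, the connector can cache and shut down clients safely behind the scenes, which is what makes it safe to broadcast.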