How to perform initialization in Spark?
Problem description
I want to perform GeoIP lookups on my data in Spark. To do that, I'm using MaxMind's GeoIP database.
What I want to do is initialize a GeoIP database object once on each partition, and later use it to look up the city associated with an IP address.
Does Spark have an initialization phase for each node, or should I instead check whether an instance variable is undefined and, if so, initialize it before continuing? For example, something like this (this is Python, but I want a Scala solution):
    class IPLookup(object):
        database = None

        def getCity(self, ip):
            # Initialise the database lazily, on first use
            if not self.database:
                self.database = self.initialise(geoipPath)
            ...
Of course, doing this requires that Spark serialise the whole object, something the docs caution against.
Recommended answer
This seems like a good use of a broadcast variable. Have you looked at the documentation for that functionality, and if so, does it fail to meet your requirements in some way?
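For comparison, the once-per-partition initialization described in the question can be sketched without a broadcast variable. This is only a minimal illustration in Python (the language of the snippet in the question), not a definitive implementation: `FakeGeoDatabase`, `lookup_partition`, the default path, and the sample city are all hypothetical stand-ins, and a real job would open the MaxMind database file where the fake reader is constructed.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple


class FakeGeoDatabase:
    """Hypothetical stand-in for a GeoIP reader (not the MaxMind API)."""

    opened = 0  # counts how many times a reader has been created

    def __init__(self, path: str) -> None:
        FakeGeoDatabase.opened += 1
        self.path = path
        # Made-up sample data; a real reader would load the file at `path`.
        self._cities: Dict[str, str] = {"192.0.2.1": "Exampleville"}

    def city(self, ip: str) -> Optional[str]:
        # Returns None when the address is not in the sample data.
        return self._cities.get(ip)


def lookup_partition(
    ips: Iterable[str], geoip_path: str = "/path/to/geoip.db"
) -> Iterator[Tuple[str, Optional[str]]]:
    """Look up every IP in one partition with a single database object."""
    # Created once per call, i.e. once per partition, on the worker itself,
    # so the reader object is never serialized and shipped from the driver.
    database = FakeGeoDatabase(geoip_path)
    for ip in ips:
        yield ip, database.city(ip)


if __name__ == "__main__":
    # Simulates a single partition locally, without Spark.
    for ip, city in lookup_partition(["192.0.2.1", "198.51.100.7"]):
        print(ip, city)
```

In a Spark job, a function of this shape would typically be passed to `mapPartitions` on an RDD of IP strings, so that only the function, and not an initialized database object, is sent to the workers.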