How to perform initialization in Spark?


Question

I want to perform GeoIP lookups on my data in Spark. To do that, I'm using MaxMind's GeoIP database.

What I want to do is initialize a GeoIP database object once on each partition, and later use it to look up the city related to an IP address.

Does Spark have an initialization phase for each node, or should I instead check whether an instance variable is undefined and, if so, initialize it before continuing? E.g. something like this (this is Python, but I want a Scala solution):

class IPLookup(object):
    database = None

    def getCity(self, ip):
        # Initialise the database lazily, on the first lookup only
        if not self.database:
            self.database = self.initialise(geoipPath)
        ...

Of course, doing this requires that Spark serialise the whole object, something the docs caution against.

Recommended answer

This seems like a good use of a broadcast variable. Have you looked at the documentation for that functionality, and if you have, does it fail to meet your requirements in some way?
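For comparison, the once-per-partition initialization described in the question can be sketched as follows. This is only a minimal illustration, not the answerer's method: `GeoDatabase`, `make_partition_lookup`, and `open_db` are hypothetical names, and `GeoDatabase` is a stand-in for a real GeoIP reader rather than MaxMind's API. The idea is that only the small `open_db` function is shipped to the workers, and the database itself is opened on the worker.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple


class GeoDatabase:
    """Hypothetical stand-in for a real GeoIP reader (not MaxMind's API)."""

    def __init__(self, table: Dict[str, str]) -> None:
        self._table = table

    def city(self, ip: str) -> Optional[str]:
        # Return the city for an IP, or None if the IP is unknown.
        return self._table.get(ip)


def make_partition_lookup(
    open_db: Callable[[], GeoDatabase],
) -> Callable[[Iterable[str]], Iterator[Tuple[str, Optional[str]]]]:
    """Build a function that opens the database once per partition."""

    def lookup_partition(ips: Iterable[str]) -> Iterator[Tuple[str, Optional[str]]]:
        # Opened once per call, i.e. once per partition, when iteration starts.
        db = open_db()
        for ip in ips:
            yield ip, db.city(ip)

    return lookup_partition
```

Assuming an RDD of IP strings named `rdd`, the returned function could be passed to PySpark's `rdd.mapPartitions(...)`, which calls it once per partition with an iterator over that partition's elements. The Scala `mapPartitions` has the same shape.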
