Batch lookup data for Spark streaming


Question

I need to look up some data in a Spark Streaming job from a file on HDFS. This data is fetched once a day by a batch job.
Is there a "design pattern" for such a task?

  • how can I reload the data in memory (a hashmap) immediately after a daily update?
  • how to serve the streaming job continuously while this lookup data is being fetched?

Answer

One possible approach is to drop local data structures and use a stateful stream instead. Let's assume you have a main data stream called mainStream:

val mainStream: DStream[T] = ???
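
For illustration only, mainStream could come from any receiver; a minimal sketch, assuming a StreamingContext named ssc, an existing SparkConf called sparkConf, and a plain-text socket source (all hypothetical):

// Hypothetical setup: 10-second batches and a text socket source.
// Replace with your real source (Kafka, Flume, ...), in which case T is your event type.
val ssc = new StreamingContext(sparkConf, Seconds(10))
val mainStream: DStream[String] = ssc.socketTextStream("localhost", 9999)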

Next, you can create another stream which reads the lookup data:

val lookupStream: DStream[(K, V)] = ???
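
Since the lookup data is dropped on HDFS once a day by a batch job, one concrete (and hypothetical) way to build lookupStream is to watch that output directory with textFileStream, which picks up new files as they appear; the path and the line format below are placeholders:

// Watch the HDFS directory the daily batch job writes into.
// textFileStream only picks up files created after the streaming job starts.
val lookupStream: DStream[(String, String)] = ssc
  .textFileStream("hdfs:///path/to/daily/lookup")   // placeholder path
  .map { line =>
    val Array(k, v) = line.split("\t", 2)            // assumed format: key<TAB>value
    (k, v)
  }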

and a simple function which can be used to update the state:

def update(
  current: Seq[V],  // A sequence of values for a given key in the current batch
  prev: Option[V]   // Value for a given key from the previous state
): Option[V] = {
  current
    .headOption    // If the current batch is not empty, take the first element
    .orElse(prev)  // If it is empty (None), keep the previous state
}
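
For example, update(Seq("v2"), Some("v1")) returns Some("v2"), while update(Nil, Some("v1")) keeps Some("v1"), so the state always holds the most recently seen value for each key.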

These two pieces can be used to create the state:

val state = lookupStream.updateStateByKey(update)
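
Note that updateStateByKey requires checkpointing to be enabled on the StreamingContext, since the state is kept in periodically checkpointed RDDs; a one-line sketch with a placeholder directory:

// Required by updateStateByKey; the directory is a placeholder.
ssc.checkpoint("hdfs:///path/to/checkpoints")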

All that's left is to key mainStream by K and join the data:

def toPair(t: T): (K, T) = ???  // extract the join key from an event

mainStream.map(toPair).leftOuterJoin(state)
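
The join yields a DStream[(K, (T, Option[V]))]; a minimal, hypothetical way to consume it is to print each event together with its lookup value, if one has already been loaded:

// Hypothetical consumer of the joined stream; the closure runs on the executors.
val joined = mainStream.map(toPair).leftOuterJoin(state)

joined.foreachRDD { rdd =>
  rdd.foreach { case (key, (event, lookupValue)) =>
    println(s"$key -> $event (lookup: ${lookupValue.getOrElse("n/a")})")
  }
}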

While this is probably less than optimal from a performance point of view, it leverages an architecture which is already in place and frees you from manually dealing with invalidation or failure recovery.

