Batch lookup data for Spark streaming
Problem description
I need to look up some data in a Spark Streaming job from a file on HDFS.
This data is fetched once a day by a batch job.
Is there a "design pattern" for such a task?
- how can I reload the data in memory (a hashmap) immediately after a daily update?
- how can I serve the streaming job continuously while this lookup data is being fetched?
Recommended answer
One possible approach is to drop local data structures and use a stateful stream instead. Let's assume you have a main data stream called mainStream:
val mainStream: DStream[T] = ???
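For concreteness, here is a minimal sketch of how the streaming context and such a main stream might be set up; the socket source, host, port and batch interval are all assumptions made purely for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val conf = new SparkConf().setAppName("lookup-join-example")
// Batch interval picked arbitrarily for the sketch
val ssc = new StreamingContext(conf, Seconds(60))

// Hypothetical main stream: text records arriving over a socket
val mainStream: DStream[String] = ssc.socketTextStream("localhost", 9999)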
Next you can create another stream which reads the lookup data:
val lookupStream: DStream[(K, V)] = ???
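One way to realize lookupStream, assuming the daily batch job drops its output as tab-separated key/value text files into a fixed HDFS directory (the path and record format are assumptions): textFileStream monitors that directory and emits newly written files, so the state is refreshed in the batch that follows the daily update.

// Picks up files written to the directory after the streaming job starts
val lookupStream: DStream[(String, String)] =
  ssc.textFileStream("hdfs:///data/lookup/daily")
    .map { line =>
      // Assumed record format: key<TAB>value
      val Array(k, v) = line.split("\t", 2)
      (k, v)
    }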
and a simple function which can be used to update the state:
def update(
current: Seq[V], // A sequence of values for a given key in the current batch
prev: Option[V] // Value for a given key from the previous state
): Option[V] = {
current
.headOption // If current batch is not empty take first element
.orElse(prev) // If it is empty (None) take previous state
}
These two pieces can be used to create the state:
val state = lookupStream.updateStateByKey(update)
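Note that updateStateByKey requires checkpointing to be enabled on the streaming context, otherwise the job fails at startup; an HDFS directory is the usual choice (the path below is an assumption):

// Required for stateful transformations such as updateStateByKey
ssc.checkpoint("hdfs:///checkpoints/lookup-join-example")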
All that's left is to key mainStream and join it with the state:
def toPair(t: T): (K, T) = ???
mainStream.map(toPair).leftOuterJoin(state)
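Continuing the hypothetical String-typed sketch from above, toPair would extract the key from each record, and the joined stream then carries the optional lookup value alongside every element (the comma-separated record format is an assumption):

// Key each main-stream record by its first field (assumed format: key,payload)
def toPair(line: String): (String, String) = {
  val Array(k, payload) = line.split(",", 2)
  (k, payload)
}

// DStream[(key, (payload, Option[lookupValue]))]
val enriched = mainStream.map(toPair).leftOuterJoin(state)
enriched.print()

ssc.start()
ssc.awaitTermination()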
While this is probably less than optimal from a performance point of view, it leverages an architecture which is already in place and frees you from manually dealing with invalidation or failure recovery.