Batch lookup data for Spark streaming
Problem description
I need to look up some data in a Spark Streaming job from a file on HDFS.
This data is fetched once a day by a batch job.
Is there a "design pattern" for such a task?
- how can I reload the data in memory (a hashmap) immediately after a daily update?
- how can I serve the streaming job continuously while this lookup data is being fetched?
Recommended answer
One possible approach is to drop local data structures and use a stateful stream instead. Let's assume you have a main data stream called mainStream:
val mainStream: DStream[T] = ???
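For concreteness, here is a minimal sketch of how the streaming context and such a main stream might be set up; the socket source, host, port and batch interval are all assumptions made purely for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

val conf = new SparkConf().setAppName("lookup-join-example")
// Batch interval picked arbitrarily for the sketch
val ssc = new StreamingContext(conf, Seconds(60))

// Hypothetical main stream: text records arriving over a socket
val mainStream: DStream[String] = ssc.socketTextStream("localhost", 9999)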
Next you can create another stream which reads the lookup data:
val lookupStream: DStream[(K, V)] = ???
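One way to realize lookupStream, assuming the daily batch job drops its output as tab-separated key/value text files into a fixed HDFS directory (the path and record format are assumptions): textFileStream monitors that directory and emits newly written files, so the state is refreshed in the batch that follows the daily update.

// Picks up files written to the directory after the streaming job starts
val lookupStream: DStream[(String, String)] =
  ssc.textFileStream("hdfs:///data/lookup/daily")
    .map { line =>
      // Assumed record format: key<TAB>value
      val Array(k, v) = line.split("\t", 2)
      (k, v)
    }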
and a simple function which can be used to update the state:
def update(
current: Seq[V], // A sequence of values for a given key in the current batch
prev: Option[V] // Value for a given key from the previous state
): Option[V] = {
current
.headOption // If current batch is not empty take first element
.orElse(prev) // If it is empty (None) take previous state
}
These two pieces can be used to create the state:
val state = lookupStream.updateStateByKey(update)
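Note that updateStateByKey requires checkpointing to be enabled on the streaming context, otherwise the job fails at startup; an HDFS directory is the usual choice (the path below is an assumption):

// Required for stateful transformations such as updateStateByKey
ssc.checkpoint("hdfs:///checkpoints/lookup-join-example")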
All that's left is to key mainStream and join it with the state:
def toPair(t: T): (K, T) = ???
mainStream.map(toPair).leftOuterJoin(state)
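Continuing the hypothetical String-typed sketch from above, toPair would extract the key from each record, and the joined stream then carries the optional lookup value alongside every element (the comma-separated record format is an assumption):

// Key each main-stream record by its first field (assumed format: key,payload)
def toPair(line: String): (String, String) = {
  val Array(k, payload) = line.split(",", 2)
  (k, payload)
}

// DStream[(key, (payload, Option[lookupValue]))]
val enriched = mainStream.map(toPair).leftOuterJoin(state)
enriched.print()

ssc.start()
ssc.awaitTermination()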
While this is probably less than optimal from a performance point of view, it leverages an architecture which is already in place and frees you from manually dealing with invalidation or failure recovery.