在不停止进程的情况下刷新 Spark 实时流中的数据帧 [英] Refresh Dataframe in Spark real-time Streaming without stopping process

查看：26 发布时间：2021/11/14 23:03:27 apache-spark amazon-s3 spark-streaming spark-dataframe snappydata

本文介绍了在不停止进程的情况下刷新 Spark 实时流中的数据帧的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

在我的应用程序中，我从 Kafka 队列获得了一个帐户流(使用 Spark 流和 kafka)

in my application i get a stream of accounts from Kafka queue (using Spark streaming with kafka)

而且我需要从 S3 获取与这些帐户相关的属性，因此我计划缓存 S3 结果数据帧，因为 S3 数据目前至少不会更新一天，将来可能会更改为 1 小时或 10 分钟.所以问题是如何在不停止进程的情况下定期刷新缓存的数据帧.

And i need to fetch attributes related to these accounts from S3 so im planning to cache S3 resultant dataframe as the S3 data will not updated atleast for a day for now, it might change to 1hr or 10 mins very soon in future .So the question is how can i refresh the cached dataframe periodically without stopping process.

**更新:我计划在 S3 中有更新时将事件发布到 kafka，使用 SNS 和 AWS lambda，我的流应用程序将订阅该事件并基于此事件刷新缓存的数据帧(基本上是不持久的() 缓存并从 S3 重新加载)这是一个好方法吗?

**Update:Im planning to publish an event into kafka whenever there is an update in S3, using SNS and AWS lambda and my streaming application will subscribe to the event and refresh the cached dataframe based on this event (basically unpersist()cache and reload from S3) Is this a good approach ?

在不停止进程的情况下刷新 Spark 实时流中的数据帧 [英] Refresh Dataframe in Spark real-time Streaming without stopping process

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

在不停止进程的情况下刷新 Spark 实时流中的数据帧 [英] Refresh Dataframe in Spark real-time Streaming without stopping process

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭