Refresh DataFrame in Spark real-time streaming without stopping the process

Problem Description

In my application I get a stream of accounts from a Kafka queue (using Spark Streaming with Kafka).

I need to fetch attributes related to these accounts from S3, so I'm planning to cache the resulting S3 DataFrame, since the S3 data will not be updated more than once a day for now (though that might soon change to every hour or every 10 minutes). So the question is: how can I refresh the cached DataFrame periodically without stopping the process?

Update: I'm planning to publish an event to Kafka whenever there is an update in S3, using SNS and AWS Lambda. My streaming application will subscribe to that event and refresh the cached DataFrame based on it (basically unpersist() the cache and reload from S3). Is this a good approach?
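
Below is a minimal sketch of that unpersist()-and-reload idea on the driver side, assuming the S3 attributes are read with Spark SQL and cached as a DataFrame. The object name, S3 path, and Parquet format are hypothetical placeholders, not details from the original post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RefreshableCache {
  private val spark = SparkSession.builder().appName("account-enrichment").getOrCreate()

  // Mutable reference to the currently cached S3 attribute data.
  @volatile private var s3Attributes: DataFrame = loadFromS3()

  // Hypothetical S3 path; the real bucket/prefix would come from configuration.
  private def loadFromS3(): DataFrame =
    spark.read.parquet("s3a://my-bucket/account-attributes/").cache()

  // Call this when an "S3 updated" event arrives on the SNS -> Lambda -> Kafka topic.
  def refresh(): Unit = synchronized {
    val old = s3Attributes
    s3Attributes = loadFromS3() // cache the new snapshot first
    old.unpersist()             // then release the stale cached data
  }

  // Each micro-batch joins against this so it always sees the latest snapshot.
  def current: DataFrame = s3Attributes
}
```

In this sketch, each micro-batch would join against RefreshableCache.current, and the consumer of the update topic would call RefreshableCache.refresh() whenever it sees an event.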

Recommended Answer

As far as I know, the only way to do what you're asking is to reload the DataFrame from S3 when new data arrives, which means you also have to recreate the streaming DF and restart the query. This is because DataFrames are fundamentally immutable.
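
As a rough illustration of "recreate the streaming DF and restart the query", the sketch below stops a Structured Streaming query and starts a new one after re-reading the static S3 DataFrame. The Kafka topic, broker address, join key, and S3 paths are all assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

val spark = SparkSession.builder().appName("account-enrichment").getOrCreate()

def startQuery(): StreamingQuery = {
  // Re-read the S3 attributes so the recreated query sees the latest snapshot.
  val attributes = spark.read.parquet("s3a://my-bucket/account-attributes/")

  val accounts = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "accounts")
    .load()
    .selectExpr("CAST(value AS STRING) AS account_id") // simplified record parsing

  // Stream-static inner join; "account_id" is a hypothetical join key.
  accounts.join(attributes, Seq("account_id"))
    .writeStream
    .format("parquet")
    .option("path", "s3a://my-bucket/enriched/")
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/enriched/")
    .start()
}

var query: StreamingQuery = startQuery()

// When S3 is known to have changed, stop the running query and start a new
// one so the freshly read static DataFrame replaces the old one.
def restartOnS3Update(): Unit = {
  query.stop()
  query = startQuery()
}
```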

If you want to update (mutate) data in a DataFrame without reloading it, you need to try one of the datastores that integrate with or connect to Spark and allow mutations. One that I'm aware of is SnappyData.
