How to update a static DataFrame with a streaming DataFrame in Spark Structured Streaming


Problem description

I have a static DataFrame with millions of rows, as follows.

Static DataFrame:

+---+----------+
| id|time_stamp|
+---+----------+
|  1|1540527851|
|  2|1540525602|
|  3|1530529187|
|  4|1520529185|
|  5|1510529182|
|  6|1578945709|
+---+----------+

Now in every batch, a streaming DataFrame is formed that contains id and an updated time_stamp after some operations, like below.

First batch:

+---+----------+
| id|time_stamp|
+---+----------+
|  1|1540527888|
|  2|1540525999|
|  3|1530529784|
+---+----------+

Now in every batch, I want to update the static DataFrame with the updated values from the streaming DataFrame, as follows. How can I do that?

Static DF after first batch:

+---+----------+
| id|time_stamp|
+---+----------+
|  1|1540527888|
|  2|1540525999|
|  3|1530529784|
|  4|1520529185|
|  5|1510529182|
|  6|1578945709|
+---+----------+

I've already tried except(), union(), and a 'left_anti' join, but it seems structured streaming doesn't support such operations.
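
To make the failure mode concrete, here is a minimal sketch of that kind of attempt (staticDF and updates are hypothetical stand-ins; staticDF is a batch DataFrame, updates a streaming one). Spark's analyzer rejects the plan when the streaming query is started, because operations such as except() and a left_anti join that mix a static Dataset with a streaming one are unsupported in a streaming query.

import org.apache.spark.sql.DataFrame

// staticDF: a plain batch DataFrame(id, time_stamp); updates: a streaming
// DataFrame with the same schema. Both names are hypothetical.
def naiveMerge(staticDF: DataFrame, updates: DataFrame): DataFrame =
  staticDF
    .join(updates, Seq("id"), "left_anti") // left_anti with the streaming side on the right: rejected
    .union(updates)                        // union of a batch and a streaming Dataset: rejected
// Starting a streaming query over this plan throws an AnalysisException.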

Recommended answer

So I resolved this issue with the Spark 2.4.0 AddBatch method, which converts the streaming DataFrame into mini-batch DataFrames. But for versions below 2.4.0 it's still a headache.
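
The answer calls this "AddBatch"; in the Spark 2.4.0 API, the mechanism that hands each micro-batch to user code as a plain batch DataFrame is the foreachBatch sink, so the sketch below uses it. Everything here (the socket source, the column parsing, the variable names) is illustrative, not taken from the original answer:

import org.apache.spark.sql.{DataFrame, SparkSession}

object UpdateStaticWithStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("update-static-with-stream")
      .getOrCreate()
    import spark.implicits._

    // Initial "static" DataFrame (id, time_stamp); held in a var so each
    // micro-batch can swap in an updated version.
    var staticDF: DataFrame = Seq(
      (1L, 1540527851L), (2L, 1540525602L), (3L, 1530529187L),
      (4L, 1520529185L), (5L, 1510529182L), (6L, 1578945709L)
    ).toDF("id", "time_stamp")

    // Hypothetical streaming source; any source producing (id, time_stamp) rows works.
    val updates = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
      .selectExpr(
        "cast(split(value, ',')[0] as long) as id",
        "cast(split(value, ',')[1] as long) as time_stamp")

    // foreachBatch exposes each micro-batch as a plain batch DataFrame, so
    // batch-only operations (left_anti join, union) are legal inside the callback.
    val query = updates.writeStream
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        // Drop static rows whose id appears in this batch, then append the batch.
        staticDF = staticDF
          .join(batchDF, Seq("id"), "left_anti")
          .union(batchDF)
          .persist() // cut off the growing lineage between batches
      }
      .start()

    query.awaitTermination()
  }
}

Because the callback runs on the driver with an ordinary batch DataFrame, the operations rejected in the streaming plan above become legal inside it. Note that reassigning staticDF every batch grows its lineage, which is why the sketch persists each new version; unpersisting the previous copy would also keep memory bounded.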
