How to update a Static DataFrame with a Streaming DataFrame in Spark Structured Streaming
Question
I have a Static DataFrame with millions of rows as follows.
Static DataFrame:
+---+----------+
| id|time_stamp|
+---+----------+
|  1|1540527851|
|  2|1540525602|
|  3|1530529187|
|  4|1520529185|
|  5|1510529182|
|  6|1578945709|
+---+----------+
Now in every batch a Streaming DataFrame is formed, which contains the id and an updated time_stamp after some operations, like below.
First batch:
+---+----------+
| id|time_stamp|
+---+----------+
|  1|1540527888|
|  2|1540525999|
|  3|1530529784|
+---+----------+
Now in every batch I want to update the Static DataFrame with the updated values from the Streaming DataFrame, as follows. How can I do that?
Static DF after the first batch:
+---+----------+
| id|time_stamp|
+---+----------+
|  1|1540527888|
|  2|1540525999|
|  3|1530529784|
|  4|1520529185|
|  5|1510529182|
|  6|1578945709|
+---+----------+
I've already tried except(), union(), and a 'left_anti' join, but it seems Structured Streaming doesn't support such operations on a streaming DataFrame.
Answer
I resolved this issue with the foreachBatch sink introduced in Spark 2.4.0, which hands the streaming DataFrame to you as a series of mini-batch (static) DataFrames. For versions earlier than 2.4.0, it's still a headache.