How to update a Static Dataframe with Streaming Dataframe in Spark structured streaming
Question
I have a Static DataFrame with millions of rows, as follows.
Static DataFrame:
--------------
|id|time_stamp|
--------------
|1|1540527851|
|2|1540525602|
|3|1530529187|
|4|1520529185|
|5|1510529182|
|6|1578945709|
--------------
Now in every batch, a Streaming DataFrame is being formed which contains id and an updated time_stamp after some operations, like below.
First batch:
--------------
|id|time_stamp|
--------------
|1|1540527888|
|2|1540525999|
|3|1530529784|
--------------
Now, in every batch, I want to update the Static DataFrame with the updated values from the Streaming DataFrame, as follows. How can I do that?
Static DF after first batch:
--------------
|id|time_stamp|
--------------
|1|1540527888|
|2|1540525999|
|3|1530529784|
|4|1520529185|
|5|1510529182|
|6|1578945709|
--------------
I've already tried except(), union(), and a 'left_anti' join, but it seems structured streaming doesn't support such operations.
Answer
So I resolved this issue with the foreachBatch sink introduced in Spark 2.4.0, which exposes the streaming DataFrame as mini-batch DataFrames. But for versions below 2.4.0 it's still a headache.
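The merge that has to run per micro-batch is a keyed upsert: ids present in the batch take the batch's time_stamp, and all other rows are kept. In Spark this would typically be something like `batch_df.union(static_df.join(batch_df, "id", "left_anti"))` inside the foreachBatch callback. A minimal pure-Python sketch of just that merge logic, using the sample data above (variable names are illustrative, not from any Spark API):

```python
# Static table from the question, keyed by id.
static_rows = {
    1: 1540527851,
    2: 1540525602,
    3: 1530529187,
    4: 1520529185,
    5: 1510529182,
    6: 1578945709,
}

# First micro-batch: updated time_stamps for ids 1-3.
batch_rows = {
    1: 1540527888,
    2: 1540525999,
    3: 1530529784,
}

def upsert(static, batch):
    """Rows whose id appears in the batch overwrite the static ones;
    everything else is kept unchanged (left_anti join + union)."""
    merged = dict(static)   # start from the rows NOT touched by the batch
    merged.update(batch)    # batch rows win on matching ids
    return merged

static_rows = upsert(static_rows, batch_rows)
# ids 1-3 now carry the batch time_stamps; ids 4-6 are unchanged.
```

In actual Spark code this function body would live inside the `foreachBatch` callback, which receives each micro-batch as an ordinary batch DataFrame, so the usual join/union operators are available there.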