How to update a static DataFrame with a streaming DataFrame in Spark Structured Streaming


Problem Description

I have a static DataFrame with millions of rows, as follows.

Static DataFrame:

---------------
|id|time_stamp|
---------------
| 1|1540527851|
| 2|1540525602|
| 3|1530529187|
| 4|1520529185|
| 5|1510529182|
| 6|1578945709|
---------------

Now in every batch, a streaming DataFrame is formed that contains id and an updated time_stamp after some operations, like below.

First batch:

---------------
|id|time_stamp|
---------------
| 1|1540527888|
| 2|1540525999|
| 3|1530529784|
---------------

Now, in every batch, I want to update the static DataFrame with the updated values from the streaming DataFrame, as follows. How can I do that?

Static DF after the first batch:

---------------
|id|time_stamp|
---------------
| 1|1540527888|
| 2|1540525999|
| 3|1530529784|
| 4|1520529185|
| 5|1510529182|
| 6|1578945709|
---------------

I've already tried except(), union(), and a 'left_anti' join, but it seems Structured Streaming doesn't support such operations. The sketch below spells out the update I'm after.
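To make the intent concrete: on two static frames, the update is just a left-anti join followed by a union. A minimal sketch in Scala, where staticDF and updatesDF are assumed names matching the tables above:

// Keep every static row whose id does NOT appear in the updates,
// then append the updated rows. Both frames have columns (id, time_stamp).
val refreshed = staticDF
  .join(updatesDF, Seq("id"), "left_anti") // drop rows about to be replaced
  .union(updatesDF)                        // append the new versions

// With updatesDF as a *streaming* DataFrame, Spark's planner rejects both
// steps with an AnalysisException, which is the failure described above.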

Recommended Answer

So I resolved this issue with the Spark 2.4.0 foreachBatch method, which converts the streaming DataFrame into mini-batch DataFrames. For versions below 2.4.0, though, it's still a headache.
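For anyone on 2.4.0 or later, below is a minimal sketch of that approach: foreachBatch hands each micro-batch to a callback as an ordinary, non-streaming DataFrame, so the left-anti join and union from the question become legal inside the callback. The source paths, the JSON file source, and the checkpoint location are hypothetical stand-ins, not from the original post:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val spark = SparkSession.builder().appName("UpdateStaticWithStream").getOrCreate()

// Schema from the question: (id, time_stamp).
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("time_stamp", LongType)))

// Hypothetical sources; substitute your own.
var staticDF: DataFrame = spark.read.schema(schema).parquet("/path/to/static")
val updates: DataFrame  = spark.readStream.schema(schema).json("/path/to/updates")

updates.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Inside foreachBatch the micro-batch is a plain DataFrame, so the
    // operations the streaming planner rejects are allowed here.
    val previous = staticDF
    staticDF = staticDF
      .join(batchDF, Seq("id"), "left_anti") // drop the ids being updated
      .union(batchDF)                        // append their new versions
      .cache()
    staticDF.count()     // materialize the new version before releasing the old one
    previous.unpersist() // free the previously cached version
  }
  .option("checkpointLocation", "/path/to/checkpoint") // hypothetical path
  .start()
  .awaitTermination()

Note that staticDF's lineage still grows by one join per batch even with caching, so for a long-running job it is more robust to periodically write the refreshed frame out (for example, to a table) and reload it.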
