在Spark结构化流中保留给定密钥的最后一行 [英] Retain last row for given key in spark structured streaming

查看:56
本文介绍了在Spark结构化流中保留给定密钥的最后一行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

类似于Kafka的日志压缩,有很多用例,其中需要仅保留给定键的最后一次更新并使用结果(例如,用于加入数据).

Similar to Kafka's log compaction there are quite a few use cases where it is required to keep only the last update on a given key and use the result for example for joining data.

如何将其存储在Spark结构化流中(最好使用PySpark)?

How can this be archived in spark structured streaming (preferably using PySpark)?

例如,假设我有桌子

key    | time   | value
----------------------------
A      | 1      | foo
B      | 2      | foobar
A      | 2      | bar
A      | 15     | foobeedoo

现在,我想将每个键的最后一个值保留为状态(带有水印),即可以访问数据框

Now I would like to retain the last values for each key as state (with watermarking), i.e. to have access to a the dataframe

key    | time   | value
----------------------------
B      | 2      | foobar
A      | 15     | foobeedoo

我想加入另一个视频流.

that I might like to join against another stream.

最好做到这一点,不要浪费一个受支持的聚合步骤.我想我需要一种具有相反顺序的dropDuplicates()函数.

Preferably this should be done without wasting the one supported aggregation step. I suppose I would need kind of a dropDuplicates() function with reverse order.

请注意,这个问题是关于结构化流以及如何在不浪费聚合步骤的构造的情况下解决该问题的(因此,所有具有窗口函数或最大聚合的方法都不是一个好答案). (如果您不知道:链接聚合现在为

Please note that this question is explicily about structured streaming and how to solve the problem without constructs that waste the aggregation step (hence, everything with window functions or max aggregation is not a good answer). (In case you do not know: Chaining Aggregations is right now unsupported in structured streaming.)

推荐答案

使用flatMapGroupsWithStatemapGroupsWithState,按键分组,然后在flatMapGroupsWithState函数中按时间对值进行排序,将最后一行存储到GroupState.

Using flatMapGroupsWithState or mapGroupsWithState, group by key, and sort the value by time in the flatMapGroupsWithState function, store the last line into the GroupState.

这篇关于在Spark结构化流中保留给定密钥的最后一行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆