Retain last row for given key in spark structured streaming
Problem Description
Similar to Kafka's log compaction, there are quite a few use cases where it is necessary to keep only the last update for a given key and use the result, for example, for joining data.
How can this be achieved in Spark Structured Streaming (preferably using PySpark)?
For example, suppose I have the table
key | time | value
----------------------------
A | 1 | foo
B | 2 | foobar
A | 2 | bar
A | 15 | foobeedoo
Now I would like to retain the last value for each key as state (with watermarking), i.e. to have access to the dataframe
key | time | value
----------------------------
B | 2 | foobar
A | 15 | foobeedoo
that I might like to join against another stream.
Preferably this should be done without wasting the one supported aggregation step. I suppose I would need a kind of dropDuplicates() function with reverse ordering.
Please note that this question is explicitly about Structured Streaming and how to solve the problem without constructs that waste the aggregation step (hence, anything using window functions or a max aggregation is not a good answer). (In case you do not know: chaining aggregations is currently unsupported in Structured Streaming.)
Answer
Use flatMapGroupsWithState or mapGroupsWithState: group by key, sort the values by time within the flatMapGroupsWithState function, and store the last row in the GroupState.
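The per-key state update that flatMapGroupsWithState would perform can be sketched in plain Python (a non-Spark illustration of the logic only; the function and variable names are illustrative, and in Spark the `state` dict would correspond to the per-group GroupState):

```python
# Minimal sketch of the state update behind "keep the last row per key":
# for each incoming batch of (key, time, value) rows, retain only the
# row with the greatest `time` for each key. The `state` dict stands in
# for Spark's per-group GroupState; names are illustrative.

def update_last_row(state, rows):
    """Merge a batch of (key, time, value) rows into `state`,
    keeping only the most recent row per key."""
    for key, time, value in rows:
        prev = state.get(key)
        if prev is None or time > prev[0]:
            state[key] = (time, value)
    return state

state = {}
batch = [("A", 1, "foo"), ("B", 2, "foobar"),
         ("A", 2, "bar"), ("A", 15, "foobeedoo")]
update_last_row(state, batch)
# state now holds the last row per key, matching the desired table:
# B -> (2, 'foobar'), A -> (15, 'foobeedoo')
```

In the real streaming job this update would run inside the flatMapGroupsWithState callback for each key group, with the retained row written to GroupState so it survives across micro-batches.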