Retain last row for given key in spark structured streaming
Problem Description
Similar to Kafka's log compaction, there are quite a few use cases where it is necessary to keep only the last update for a given key and use the result, for example, for joining data.
How can this be achieved in Spark Structured Streaming (preferably using PySpark)?
For example, suppose I have the table
key | time | value
----------------------------
A | 1 | foo
B | 2 | foobar
A | 2 | bar
A | 15 | foobeedoo
Now I would like to retain the last value for each key as state (with watermarking), i.e. to have access to the dataframe
key | time | value
----------------------------
B | 2 | foobar
A | 15 | foobeedoo
that I might like to join against another stream.
Preferably this should be done without wasting the one supported aggregation step. I suppose I would need a kind of dropDuplicates() function with reverse ordering.
Please note that this question is explicitly about Structured Streaming and how to solve the problem without constructs that waste the aggregation step (hence, anything using window functions or a max aggregation is not a good answer). (In case you do not know: chaining aggregations is currently unsupported in Structured Streaming.)
Answer
Use flatMapGroupsWithState or mapGroupsWithState: group by key, sort the values by time within the flatMapGroupsWithState function, and store the last row in the GroupState.
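The per-key state update that flatMapGroupsWithState would perform can be sketched in plain Python (a non-Spark illustration of the logic only; the function and variable names are illustrative, and in Spark the `state` dict would correspond to the per-group GroupState):

```python
# Minimal sketch of the state update behind "keep the last row per key":
# for each incoming batch of (key, time, value) rows, retain only the
# row with the greatest `time` for each key. The `state` dict stands in
# for Spark's per-group GroupState; names are illustrative.

def update_last_row(state, rows):
    """Merge a batch of (key, time, value) rows into `state`,
    keeping only the most recent row per key."""
    for key, time, value in rows:
        prev = state.get(key)
        if prev is None or time > prev[0]:
            state[key] = (time, value)
    return state

state = {}
batch = [("A", 1, "foo"), ("B", 2, "foobar"),
         ("A", 2, "bar"), ("A", 15, "foobeedoo")]
update_last_row(state, batch)
# state now holds the last row per key, matching the desired table:
# B -> (2, 'foobar'), A -> (15, 'foobeedoo')
```

In the real streaming job this update would run inside the flatMapGroupsWithState callback for each key group, with the retained row written to GroupState so it survives across micro-batches.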