Is it possible to let Spark Structured Streaming (update mode) write to a DB?


Problem Description

I use Spark (3.0.0) Structured Streaming to read a topic from Kafka.

I used joins and then mapGroupsWithState to get my stream data, so I have to use update mode, based on my understanding of the Spark official guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
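For context, here is a minimal sketch of the kind of query described above. The topic name, bootstrap servers, the CSV payload format, and the `Event` record are all assumptions for illustration, not details from the original question; the point is that keeping the latest row per key via mapGroupsWithState forces update output mode.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(id: String, ts: Timestamp, value: String)

val spark = SparkSession.builder.appName("stateful-dedup").getOrCreate()
import spark.implicits._

// Read the raw stream from Kafka (topic and servers are placeholders).
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// Assume a simple "id,epochMillis,value" payload for illustration.
val events = raw.selectExpr("CAST(value AS STRING) AS line").as[String].map { line =>
  val Array(id, ts, v) = line.split(",", 3)
  Event(id, new Timestamp(ts.toLong), v)
}

// Keep only the newest event per id, carrying it across micro-batches as state.
def keepLatest(id: String, rows: Iterator[Event], state: GroupState[Event]): Event = {
  val newest = (rows ++ state.getOption.iterator).maxBy(_.ts.getTime)
  state.update(newest)
  newest
}

val deduped = events
  .groupByKey(_.id)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout)(keepLatest)

// mapGroupsWithState only supports update output mode, hence the console sink here.
deduped.writeStream
  .outputMode("update")
  .format("console")
  .start()
```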

The output sinks section of the official guide says nothing about a DB sink, and the file sink does not support update mode either: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks

Currently I output it to the console, and I would like to store the data in files or a DB instead.

So my question is: how can I write the stream data to a DB or to files in my situation? Do I have to write the data back to Kafka and then use Kafka Connect to move it into files/a DB?

P.S. I followed these articles to get the aggregated streaming query:

- https://stackoverflow.com/questions/62738727/how-to-deduplicate-and-keep-latest-based-on-timestamp-field-in-spark-structured
- https://databricks.com/blog/2017/10/17/arbitrary-stateful-processing-in-apache-sparks-structured-streaming.html
- https://stackoverflow.com/questions/50933606/spark-streaming-select-record-with-max-timestamp-for-each-id-in-dataframe-pysp (will also try this one more time using the Java API)

Recommended Answer

I got confused between OUTPUT and WRITE. I had also wrongly assumed that DB and file sinks were parallel terms in the output sinks section of the doc, which is why no DB sink appears in that section of the guide: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks

I just realized that the output mode (append/update/complete) only constrains the streaming query itself; it has nothing to do with how the results are WRITTEN to the SINK. I also realized that writing to a DB can be achieved with the foreach sink (initially I understood it only as a hook for extra transformations).
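To make that concrete, here is a hedged sketch of the foreach sink writing each updated row over plain JDBC, building on the `deduped` / `Event` sketch above. The Postgres URL, credentials, and the `latest_events` table are assumptions for illustration only.

```scala
import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.ForeachWriter

// Writes each updated row to a (hypothetical) Postgres table via plain JDBC.
class JdbcForeachWriter extends ForeachWriter[Event] {
  private var conn: Connection = _
  private var stmt: PreparedStatement = _

  // Called once per partition per epoch; return true to process this partition.
  override def open(partitionId: Long, epochId: Long): Boolean = {
    conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/mydb", "user", "password")
    stmt = conn.prepareStatement(
      "INSERT INTO latest_events (id, ts, value) VALUES (?, ?, ?)")
    true
  }

  override def process(e: Event): Unit = {
    stmt.setString(1, e.id)
    stmt.setTimestamp(2, e.ts)
    stmt.setString(3, e.value)
    stmt.executeUpdate()
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (conn != null) conn.close()
  }
}

deduped.writeStream
  .outputMode("update")
  .foreach(new JdbcForeachWriter)
  .start()
```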

I found these articles/discussions useful.

So later on I read the official guide again and confirmed that foreachBatch can also run custom logic when WRITING to a STORAGE.
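As a sketch of that approach: foreachBatch hands you each micro-batch as an ordinary Dataset, so any batch writer works, including the built-in JDBC data source. The connection details and table name below are again placeholders, and it reuses `deduped` from the earlier sketch.

```scala
import org.apache.spark.sql.Dataset

deduped.writeStream
  .outputMode("update")
  .foreachBatch { (batch: Dataset[Event], batchId: Long) =>
    // Run the ordinary batch JDBC writer on every micro-batch.
    batch.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb")
      .option("dbtable", "latest_events")
      .option("user", "user")
      .option("password", "password")
      .mode("append")
      .save()
  }
  .start()
```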
