Spark Structured streaming: multiple sinks

Question

1. We are consuming from Kafka using Structured Streaming and writing the processed data set to S3. We also want to write the processed data to Kafka going forward. Is it possible to do this from the same streaming query? (Spark version 2.1.1)

2. In the logs, I see the streaming query progress output, and I have a sample durations JSON from the log. Can someone please clarify the difference between addBatch and getBatch? Also, triggerExecution: is it the time taken to both process the fetched data and write to the sink?

"durationMs" : {
    "addBatch" : 2263426,
    "getBatch" : 12,
    "getOffset" : 273,
   "queryPlanning" : 13,
    "triggerExecution" : 2264288,
    "walCommit" : 552
},

Answer

1. Yes. In Spark 2.1.1, you can use writeStream.foreach to write your data into Kafka. There is an example in this blog: https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
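
For example, here is a minimal ForeachWriter sketch (assuming the processed rows carry string key and value columns; the topic name, broker address, and the processedDF variable are illustrative placeholders, not taken from the blog):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}

// Opens one Kafka producer per partition/epoch, sends each row, then closes.
class KafkaSink(topic: String, servers: String) extends ForeachWriter[Row] {
  var producer: KafkaProducer[String, String] = _

  def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", servers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  def process(row: Row): Unit =
    producer.send(new ProducerRecord(topic, row.getAs[String]("key"), row.getAs[String]("value")))

  def close(errorOrNull: Throwable): Unit = producer.close()
}

// Hypothetical usage: processedDF is the already-processed streaming DataFrame,
// so the same pipeline can feed both the S3 write and this Kafka writer.
val kafkaQuery = processedDF.writeStream
  .foreach(new KafkaSink("output-topic", "broker1:9092"))
  .outputMode("append")
  .start()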

Alternatively, you can use Spark 2.2.0, which adds a Kafka sink with official support for writing to Kafka.
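
With the built-in sink, the write could look like this (a sketch following the pattern in the Structured Streaming + Kafka integration docs; the broker address, topic, checkpoint path, and processedDF variable are placeholders):

// The Kafka sink expects a value column (and optionally a key column),
// both castable to string or binary.
val kafkaQuery = processedDF
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/path/to/checkpoint")
  .start()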

2. getBatch measures how long it takes to create a DataFrame from the source. This is usually pretty fast. addBatch measures how long it takes to run the DataFrame in the sink.

triggerExecution measures how long it takes to run one trigger execution, and is usually almost the same as getOffset + getBatch + addBatch.
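
As a rough sanity check on the sample above: getOffset (273) + getBatch (12) + addBatch (2263426) = 2263711 ms, and adding queryPlanning (13) and walCommit (552) brings it to 2264276 ms, within a few milliseconds of the reported triggerExecution of 2264288 ms.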
