Spark batch write to Kafka topic from multi-column DataFrame


Problem description

After the batch Spark ETL, I need to write the resulting DataFrame, which contains multiple different columns, to a Kafka topic.

According to the following Spark documentation https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html, the DataFrame being written to Kafka must have the following mandatory column in its schema:

value (required) string or binary

As I mentioned previously, I have many more columns with values, so I have a question - how do I properly send the whole DataFrame row as a single message to the Kafka topic from my Spark application? Do I need to join all of the values from all columns into a new DataFrame with a single value column (which will contain the joined value), or is there a more proper way to achieve it?

Answer

The proper way to do that is already hinted at by the docs, and doesn't really differ from what you'd do with any Kafka client - you have to serialize the payload before sending it to Kafka.

How you'll do that (to_json, to_csv, Apache Avro) depends on your business requirements - nobody can answer this but you (or your team).
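For example, here is a minimal sketch in Scala of the to_json approach, packing every column of the result into a JSON string and writing it as the single value column; the source path, broker address and topic name are placeholders, not values from the original question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

val spark = SparkSession.builder().appName("BatchToKafka").getOrCreate()

// Placeholder: the multi-column result of the batch ETL.
val resultDf = spark.read.parquet("/path/to/etl/output")

resultDf
  // Wrap all columns in a struct and serialize it to one JSON string
  // named "value", which is the mandatory column for the Kafka sink.
  .select(to_json(struct(resultDf.columns.map(col): _*)).alias("value"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092") // placeholder broker
  .option("topic", "my-output-topic")              // placeholder topic
  .save()
```

Swapping to_json for to_csv, or using an Avro serializer, only changes how the value column is produced; the batch write to the Kafka sink stays the same.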
