Spark batch write to Kafka topic from multi-column DataFrame
Question
After the batch Spark ETL, I need to write the resulting DataFrame, which contains multiple different columns, to a Kafka topic.
According to the Spark documentation (https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html), a DataFrame being written to Kafka must have the following mandatory column in its schema:
value (required) — string or binary
As I mentioned, I have many more columns with values, so my question is: how do I properly send a whole DataFrame row as a single message to a Kafka topic from my Spark application? Do I need to combine all of the values from all columns into a new DataFrame with a single value column (that contains the combined value), or is there a more proper way to achieve this?
Answer
The proper way to do that is already hinted at by the docs, and it doesn't really differ from what you'd do with any Kafka client - you have to serialize the payload before sending it to Kafka.
How you do that (to_json, to_csv, Apache Avro) depends on your business requirements - nobody can answer this but you (or your team).