Writing large DataFrame from PySpark to Kafka runs into timeout

Views: 49 Published: 2021/11/12 1:57:22 Tags: azure apache-spark pyspark apache-kafka databricks

Problem description

I'm trying to write a data frame with about 230 million records to Kafka. More specifically, to a Kafka-enabled Azure Event Hub, but I'm not sure whether that is actually the source of my issue.

EH_SASL = 'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://myeventhub.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=****";'

dfKafka \
.write  \
.format("kafka") \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.jaas.config", EH_SASL) \
.option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
.option("topic", "mytopic") \
.option("checkpointLocation", "/mnt/telemetry/cp.txt") \
.save()

This starts up fine and writes about 3-4 million records to the queue successfully (and quite fast). But after a couple of minutes the job stops with messages like these:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 7.0 failed 4 times, most recent failure: Lost task 6.3 in stage 7.0 (TID 248, 10.139.64.5, executor 1): kafkashaded.org.apache.kafka.common.errors.TimeoutException: Expiring 61 record(s) for mytopic-18: 32839 ms has passed since last append

or

org.apache.spark.SparkException: Job aborted due to stage failure: Task 13 in stage 8.0 failed 4 times, most recent failure: Lost task 13.3 in stage 8.0 (TID 348, 10.139.64.5, executor 1): kafkashaded.org.apache.kafka.common.errors.TimeoutException: The request timed out.

Also, I never see the checkpoint file being created or written to. I also experimented with .option("kafka.delivery.timeout.ms", 30000) and other values, but that did not seem to have any effect.

I'm running this on an Azure Databricks cluster, version 5.0 (includes Apache Spark 2.4.0, Scala 2.11).

I don't see any errors such as throttling on my Event Hub, so that side should be OK.

Recommended answer

Finally figured it out (mostly):

It turned out that the default batch size of about 16000 messages was too large for the endpoint. After I set the batch.size parameter to 5000, the job worked and now writes about 700k messages per minute to the Event Hub. (Note that the Kafka producer measures batch.size in bytes, with a default of 16384, rather than in messages.) Also, the timeout parameter above was wrong and was simply being ignored. The correct one is kafka.request.timeout.ms.
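
As a rough sanity check, the throughput reported in the answer implies a job that runs for several hours. The sketch below only does the arithmetic. It assumes the rate of about 700k messages per minute stays constant for the whole run, which the answer does not claim, and the function name is made up for illustration.

```python
# Back-of-the-envelope estimate; both inputs are the approximate
# figures quoted in the post, not measured values.
TOTAL_RECORDS = 230_000_000     # "about 230 million records"
RECORDS_PER_MINUTE = 700_000    # "about 700k messages per minute"


def estimated_minutes(total_records: int, records_per_minute: int) -> float:
    """Return the run time in minutes at a constant write rate."""
    return total_records / records_per_minute


minutes = estimated_minutes(TOTAL_RECORDS, RECORDS_PER_MINUTE)
print(f"{minutes:.0f} minutes, about {minutes / 60:.1f} hours")
```

At that rate, any retry that restarts the write from the beginning costs hours, which is why the duplicate issue mentioned in the answer matters for a data set of this size.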

The only remaining issue is that it still times out at random and apparently starts from the beginning again, so I end up with duplicates. I will open another question for that.

dfKafka \
.write  \
.format("kafka") \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.jaas.config", EH_SASL) \
.option("kafka.batch.size", 5000) \
.option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
.option("kafka.request.timeout.ms", 120000) \
.option("topic", "raw") \
.option("checkpointLocation", "/mnt/telemetry/cp.txt") \
.save()
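
The producer settings above can also be collected in one place and passed with DataFrameWriter.options(**...). This is a minimal sketch, not part of the original answer: the helper name kafka_write_options and its default values are hypothetical and simply mirror the values the answer reports as working. Spark forwards options prefixed with "kafka." to the underlying Kafka producer.

```python
from typing import Dict


def kafka_write_options(bootstrap_servers: str, topic: str, jaas_config: str,
                        batch_size: int = 5000,
                        request_timeout_ms: int = 120000) -> Dict[str, str]:
    """Build the option map for a Kafka write (hypothetical helper)."""
    if batch_size <= 0 or request_timeout_ms <= 0:
        raise ValueError("batch_size and request_timeout_ms must be positive")
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": jaas_config,
        # Producer batch.size; the value the answer found to work.
        "kafka.batch.size": str(batch_size),
        # The timeout setting the answer reports as the one that is honoured.
        "kafka.request.timeout.ms": str(request_timeout_ms),
        "topic": topic,
    }


# Placeholder JAAS string; substitute the real EH_SASL value.
opts = kafka_write_options("myeventhub.servicebus.windows.net:9093",
                           "raw", "<JAAS config string>")
# With a real DataFrame: dfKafka.write.format("kafka").options(**opts).save()
```

Keeping the tunables as function arguments makes it easier to try other batch sizes or timeouts without editing the write chain itself.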
