Sort RDD in Spark before publishing it to Kafka?


Problem Description

In my code, I first subscribe to a Kafka stream, process each RDD to create instances of my class People, and then publish the result set (Dataset[People]) to a specific Kafka topic. It is important to note that not every incoming message received from Kafka maps to an instance of People. Moreover, instances of People should be sent to Kafka in exactly the same order as they were received from Kafka.

However, I am not sure whether sorting is really necessary, or whether the instances of People maintain the same order when the respective code runs on the executors (in which case I could publish my Dataset to Kafka directly). As far as I understand, sorting is necessary because the code inside foreachRDD can be executed on different nodes in the cluster. Is this correct?

Here is my code:

val myStream = KafkaUtils.createDirectStream[K, V](
  streamingContext, PreferConsistent, Subscribe[K, V](topics, consumerConfig))

def process(record: (RDD[ConsumerRecord[String, String]], Time)): Unit = record match {
  case (rdd, time) if !rdd.isEmpty =>
    // More Code...
    // In the end, I have: Dataset[People]
  case _ =>
}

myStream.foreachRDD((x, y) => process((x, y))) // Do I have to replace this call with map, sort the RDD and then publish it to Kafka?

Answer

Moreover, instances of people should be sent to Kafka in exactly the same order as received from Kafka.

Unless you have a single partition (and then you wouldn't be using Spark, would you?), the order in which data is received is not deterministic, and similarly the order in which data is sent won't be. Sorting doesn't make any difference here.
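To make the partition point concrete, here is a minimal sketch in plain Scala (no Spark or Kafka dependencies; Rec and its fields are illustrative stand-ins for ConsumerRecord metadata): records carry a (partition, offset) pair, and sorting by offset only recovers order within one partition, never a global order across partitions.

```scala
// Illustrative stand-in for a Kafka record's (partition, offset, value) metadata.
case class Rec(partition: Int, offset: Long, value: String)

// Records as they might arrive interleaved from two partitions.
val received = Seq(
  Rec(0, 2L, "b"), Rec(1, 1L, "x"), Rec(0, 1L, "a"), Rec(1, 2L, "y"))

// A per-partition sort by offset recovers each partition's internal order...
val perPartition = received.groupBy(_.partition)
  .map { case (p, rs) => p -> rs.sortBy(_.offset).map(_.value) }

// ...but it says nothing about how the two partitions interleave globally.
println(perPartition(0)) // Seq(a, b)
println(perPartition(1)) // Seq(x, y)
```

Kafka itself only guarantees ordering per partition, which is why a global sort of the received data cannot reconstruct a single "true" arrival order across partitions.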

If you need a very specific processing order (typically a design mistake in data-intensive applications), you need a sequential application, or a system with much more fine-grained control than Spark.
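For completeness, a minimal sketch of what the sequential alternative looks like in plain Scala (parsePerson is a hypothetical stand-in for the People-mapping step; messages that don't map are skipped): processing an ordered sequence one element at a time preserves input order exactly, which is the guarantee distributed execution gives up.

```scala
// Hypothetical stand-in for the step that maps raw messages to People;
// messages that don't match are dropped, as in the question.
def parsePerson(msg: String): Option[String] =
  if (msg.startsWith("person:")) Some(msg.stripPrefix("person:")) else None

val incoming = Seq("person:alice", "noise", "person:bob")

// flatMap over the ordered sequence keeps output order identical to input order.
val published = incoming.flatMap(parsePerson)
println(published) // Seq(alice, bob)
```

The price is that all processing happens on a single consumer, so this only scales as far as one machine can keep up with the topic.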
