Spark to process rdd chunk by chunk from json files and post to Kafka topic


Question

I am new to Spark and Scala. I have a requirement to process a number of JSON files from an S3 location. These are basically batch data that will be kept for reprocessing sometime later. My Spark job should process these files in such a way that it picks 5 raw JSON records at a time and sends them as one message to a Kafka topic. The reason for picking only 5 records is that the Kafka topic handles both real-time and batch data simultaneously on the same topic, so the batch processing should not delay the real-time processing.

I need to process each JSON file sequentially, so I would pick only 5 records at a time, post a message to Kafka, then pick the next 5 records of the file, and so on...

I have written a piece of code which reads from the JSON files and posts the records to the Kafka topic.

    val jsonRDD = sc.textFile(s3Location)

    var count = 0
    val buf = new StringBuilder

    jsonRDD.collect().foreach { line =>
      count += 1
      buf ++= line
      if (count >= 5) {
        println(s"Printing 5 jsons $buf")
        SendMessagetoKakfaTopic(buf) // pseudo code for sending the message to the Kafka topic
        count = 0
        buf.setLength(0)
        Thread.sleep(10000)
      }
    }
    if (buf.nonEmpty) {
      println(s"Printing remaining jsons $buf")
      SendMessagetoKakfaTopic(buf)
    }

I believe there is a more efficient way of processing JSONs in Spark.

I should also be looking at other parameters like memory, resources, etc., since the data might grow beyond hundreds of gigabytes.

Answer

That looks like a case for Spark Streaming or (recommended) Spark Structured Streaming.

In either case, you monitor a directory and process new files every batch interval (configurable).

You could handle it using SparkContext.textFile (with wildcards) or SparkContext.wholeTextFiles. Either way you end up with an RDD of the JSON records: textFile gives an RDD[String] with one element per line, while wholeTextFiles gives one element per file as a (path, content) pair.
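For reference, a minimal sketch of the two read options, reusing sc and s3Location from the question (the wildcard path is an assumption, e.g. s3a://bucket/batch/*.json):

    val linesRDD = sc.textFile(s3Location)        // RDD[String]: one element per line
    val filesRDD = sc.wholeTextFiles(s3Location)  // RDD[(String, String)]: (file path, whole file content)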

If your requirement is to process the files sequentially, 5-line chunk by 5-line chunk, you could make the transformation pipeline slightly more efficient by using RDD.toLocalIterator:

toLocalIterator: Iterator[T]

Return an iterator that contains all of the elements in this RDD. The iterator will consume as much memory as the largest partition in this RDD.

See the RDD API.

With that Iterator of JSON records, you would take them 5 at a time, e.g. with grouped(5) (equivalent to sliding with a window and step of 5).
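As a rough sketch of that idea, reusing sc, s3Location and the pseudo-code SendMessagetoKakfaTopic from the question:

    val jsonRDD = sc.textFile(s3Location)

    jsonRDD.toLocalIterator          // pulls one partition at a time to the driver
      .grouped(5)                    // non-overlapping chunks of at most 5 records
      .foreach { chunk =>
        val buf = new StringBuilder
        chunk.foreach(buf ++= _)
        SendMessagetoKakfaTopic(buf) // pseudo code, as in the question
        Thread.sleep(10000)          // throttle so batch data does not starve the real-time traffic
      }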

That would give you a pretty efficient pipeline.

Once again, I strongly recommend reading up on Structured Streaming in the Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher); it focuses on reading from Kafka, but writing to Kafka is also supported.
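A minimal sketch in that direction, assuming the spark-sql-kafka-0-10 package is on the classpath; the input path, broker address, topic and checkpoint location below are placeholders, not values from the question:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder().appName("batch-json-to-kafka").getOrCreate()

    // The text source yields one row per input line in a single "value" column,
    // which is exactly the column the Kafka sink expects.
    val rawJson = spark.readStream
      .option("maxFilesPerTrigger", "1")       // throttle: pick up files gradually
      .text("s3a://my-bucket/batch-json/")     // placeholder input directory

    rawJson.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")              // placeholder
      .option("topic", "my-topic")                                     // placeholder
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/")    // placeholder
      .trigger(Trigger.ProcessingTime("10 seconds"))                   // configurable batch interval
      .start()
      .awaitTermination()

With maxFilesPerTrigger and the processing-time trigger you can limit how much of the batch backlog enters the topic per micro-batch, which is one way to keep it from delaying the real-time traffic.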
