Spark-Streaming Kafka Direct Streaming API & Parallelism


Question


I understand the automatic mapping that exists between a Kafka partition, a Spark RDD partition, and ultimately a Spark task. However, in order to properly size my executors (in number of cores), and therefore ultimately my nodes and cluster, I need to understand something that seems to be glossed over in the documentation.

In Spark Streaming, how exactly do data consumption, data processing, and task allocation work? In other words:

  1. Does the Spark task corresponding to a Kafka partition both read and process the data altogether?

  • The rationale behind this question is that in the previous, receiver-based API, a TASK was dedicated to receiving the data, meaning some of your executors' task slots were reserved for data ingestion while the others were there for processing. This had an impact on how you size your executors in terms of cores.

  • Take for example the advice on how to launch spark-streaming with
    --master local. Everyone would tell you that in the case of Spark Streaming, one should use local[2] at minimum, because one of the cores will be dedicated to running the long receiving task that never ends, and the other core will do the data processing.

  • So if the answer is that in this case the task does both the reading and the processing at once, then the question that follows is: is that
    really smart? I mean, this sounds like it should be asynchronous. We want to be
    able to fetch while we process, so that on the next processing round the data is already there. However, if there is only one core, or more precisely one task, to
    both read the data and process it, how can both be done in
    parallel, and how does that make things faster in general?

  • My original understanding was that things would have remained somehow the same, in the sense that a task would be launched to read but that the
    processing would be done in another task. That would mean that, if
    the processing task is not done yet, we could still keep reading, up to a certain memory limit.
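The receiver-based model described in the bullets above can be sketched as a toy threading program (plain Python, not Spark code): one long-running "receiver" thread keeps a core busy filling a buffer while a separate thread drains and processes it. This is exactly why local[2] is the advised minimum for the receiver API; with a single core, the never-ending receiver would starve the processing.

```python
import queue
import threading

def receiver(buffer, records, stop):
    # Simulates the never-ending receiving task: it keeps running
    # (occupying its "core") even after all records have been buffered.
    for r in records:
        buffer.put(r)
    stop.wait()

def process(buffer, out, n):
    # Simulates the processing task running on the other core.
    for _ in range(n):
        out.append(buffer.get() * 2)

buffer = queue.Queue()
out = []
stop = threading.Event()
t1 = threading.Thread(target=receiver, args=(buffer, range(5), stop))
t2 = threading.Thread(target=process, args=(buffer, out, 5))
t1.start()
t2.start()
t2.join()   # processing finished...
stop.set()  # ...but the receiver only stops when told to
t1.join()
print(out)  # → [0, 2, 4, 6, 8]
```

With local[1], `t2` would never get a turn in this model: the ingestion slot is permanently reserved, which is the sizing concern the question raises.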

Can someone outline clearly what exactly is going on here?

EDIT1

We don't even have to have this memory-limit control. Just the mere fact of being able to fetch while the processing is going on, and stopping right there. In other words, the two processes should be asynchronous, and the limit is simply to stay one step ahead. To me, if somehow this is not happening, it would be extremely strange for Spark to implement something that breaks performance like that.
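The "one step ahead" behaviour the edit describes can be illustrated with a bounded prefetching iterator (a sketch of the idea only, not how Spark's cached Kafka consumers are actually implemented): a background thread fetches the next element while the caller is still processing the current one, and a bounded buffer stops it from running further ahead.

```python
import queue
import threading

def prefetched(source, depth=1):
    # Bounded buffer: the fetcher may be at most `depth` steps ahead
    # of the consumer, then it blocks -- no memory-limit bookkeeping needed.
    buf = queue.Queue(maxsize=depth)
    DONE = object()

    def fill():
        for item in source:
            buf.put(item)  # blocks once the buffer is full
        buf.put(DONE)

    threading.Thread(target=fill, daemon=True).start()
    while True:
        item = buf.get()
        if item is DONE:
            return
        yield item

fetched = list(prefetched(iter(range(4))))
print(fetched)  # → [0, 1, 2, 3]
```

The back-pressure falls out of `maxsize` for free: fetching and processing overlap, but the fetcher can never outrun the processor by more than one step.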

Solution

Does the Spark task corresponding to a Kafka partition both read and process the data altogether?

The relationship is very close to what you describe, if by a task we're referring to the part of the graph that reads from Kafka up until a shuffle operation. The flow of execution is as follows:

  1. The driver reads the offsets from all Kafka topics and partitions.
  2. The driver assigns each executor a topic and partition to be read and processed.
  3. Unless there is a shuffle boundary operation, it is likely that Spark will optimize the entire execution of the partition to happen on the same executor.

This means that a single executor will read a given TopicPartition and process the entire execution graph on it, unless we need to shuffle. Since a Kafka partition maps to a partition inside the RDD, we get that guarantee.
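The flow above can be sketched in plain Python (not Spark code; names like `committed`, `latest_offsets`, and `run_task` are illustrative): the driver builds one offset range per TopicPartition, and a single task both reads its range and runs the whole pre-shuffle graph on it.

```python
# Offsets the driver already processed, and the latest ones Kafka reports.
committed = {("events", 0): 100, ("events", 1): 250}
latest_offsets = {("events", 0): 130, ("events", 1): 260}

# Steps 1-2: the driver computes one (partition, from, until) range per
# TopicPartition and hands each range to exactly one task/executor.
offset_ranges = [(tp, committed[tp], latest_offsets[tp]) for tp in committed]

def run_task(tp, start, end):
    # Step 3: the same task both reads its range and processes it end to
    # end, because there is no shuffle boundary in this toy job.
    records = list(range(start, end))      # stand-in for the Kafka read
    processed = [r * 2 for r in records]   # stand-in for the map-side graph
    return len(processed)

counts = {tp: run_task(tp, s, e) for tp, s, e in offset_ranges}
print(counts)  # → {('events', 0): 30, ('events', 1): 10}
```

Note there is no separate "receiver" anywhere: reading and processing are fused into one task per partition, which is the key difference from the receiver-based API.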

Structured Streaming takes this even further. In Structured Streaming, there is stickiness between the TopicPartition and the worker/executor. Meaning, if a given worker was assigned a TopicPartition, it is likely to continue processing it for the entire lifetime of the application.
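The stickiness described above can be modelled with a few lines of plain Python (a toy sketch, not Spark's actual scheduler): keep a map from TopicPartition to executor and, on each micro-batch, reuse the previous assignment whenever that executor is still alive.

```python
def assign(partitions, executors, previous):
    # Sticky assignment: a partition stays on the executor it had in the
    # previous batch; only unassigned (or orphaned) partitions get a new one.
    assignment = {}
    for i, tp in enumerate(sorted(partitions)):
        prev = previous.get(tp)
        if prev in executors:
            assignment[tp] = prev                           # reuse
        else:
            assignment[tp] = executors[i % len(executors)]  # fresh pick
    return assignment

parts = [("events", 0), ("events", 1)]
first = assign(parts, ["exec-A", "exec-B"], previous={})
# Next batch: executor list arrives in a different order, but the
# assignment is unchanged because both executors are still alive.
second = assign(parts, ["exec-B", "exec-A"], previous=first)
print(first == second)  # → True
```

Keeping a partition on the same executor lets per-partition state (such as a cached Kafka consumer) be reused across batches instead of rebuilt each time.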
