Spark-Streaming Kafka Direct Streaming API & Parallelism


Question

I understand the automatic mapping that exists between a Kafka partition, a Spark RDD partition, and ultimately a Spark task. However, in order to properly size my executors (in number of cores), and therefore ultimately my nodes and cluster, I need to understand something that seems to be glossed over in the documentation.
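
For reference, this is roughly what the direct-stream setup looks like with the spark-streaming-kafka-0-10 integration; the broker address, group id, topic name, and batch interval below are placeholder values:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-stream-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder broker address, group id, and deserializers.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "sketch-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    // Each batch RDD has exactly as many partitions as the subscribed
    // Kafka topic(s): one Kafka partition -> one RDD partition -> one task.
    stream.foreachRDD { rdd =>
      println(s"partitions in this batch: ${rdd.getNumPartitions}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```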

In Spark Streaming, how exactly do data consumption, data processing, and task allocation work together? In other words:

  1. Does the Spark task corresponding to a Kafka partition both read and process the data?

  • The rationale behind this question is that in the previous, receiver-based API, a task was dedicated to receiving the data, meaning that a number of your executors' task slots were reserved for data ingestion while the others were there for processing. This had an impact on how you sized your executors in terms of cores.

Take, for example, the advice on how to launch spark-streaming with --master local. Everyone will tell you that in the case of Spark Streaming one should use at least local[2], because one of the cores will be dedicated to running the long receiving task that never ends, and the other core will do the data processing.
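
To make that old constraint concrete, here is a minimal sketch of a receiver-based job, using a socket source on a hypothetical localhost:9999 as the receiver; the point is only that local[1] would leave no core free for batch processing:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Receiver-based API: the receiver occupies one task slot for the whole
// lifetime of the application, so at least two local cores are needed.
val conf = new SparkConf()
  .setMaster("local[2]") // one core for the receiver, one for processing
  .setAppName("receiver-sketch")
val ssc = new StreamingContext(conf, Seconds(5))

// socketTextStream is a receiver-based source; host/port are placeholders.
val lines = ssc.socketTextStream("localhost", 9999)
lines.count().print()

ssc.start()
ssc.awaitTermination()
```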

So if the answer is that, in this case, the task does both the reading and the processing at once, then the question that follows is: is that really smart? It does not sound asynchronous. We want to be able to fetch while we process, so that for the next processing step the data is already there. However, if there is only one core, or more precisely only one task, to both read the data and process it, how can both be done in parallel, and how does that make things faster in general?

My original understanding was that things would have remained somewhat the same, in the sense that one task would be launched to read, but the processing would be done in another task. That would mean that, if the processing task is not done yet, we could still keep reading, up to a certain memory limit.

Can someone outline clearly what exactly is going on here?

EDIT 1

We don't even have to have this memory-limit control. Just the mere fact of being able to fetch while processing is going on, and stopping right there, would do. In other words, the two processes should be asynchronous, the limit being simply to stay one step ahead. To me, if somehow this is not happening, it would be extremely strange for Spark to implement something that breaks performance like that.
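
The direct API does not dedicate a task to fetching, but ingestion can still be bounded so that a slow batch does not snowball into ever-larger subsequent batches. The two settings below are documented Spark configuration keys; the values are illustrative only:

```scala
import org.apache.spark.SparkConf

// Bound how much each micro-batch is allowed to pull from Kafka so that
// processing does not fall ever further behind ingestion.
val conf = new SparkConf()
  .setAppName("rate-control-sketch")
  // Cap on records fetched per Kafka partition per second (illustrative value).
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")
  // Let Spark adapt the ingestion rate to recent batch processing times.
  .set("spark.streaming.backpressure.enabled", "true")
```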

Answer

> Does the Spark task corresponding to a Kafka partition both read and process the data?

The relationship is very close to what you describe, if by "a task" we mean the part of the graph that reads from Kafka up until a shuffle operation. The flow of execution is as follows:

1. The driver reads the offsets of all Kafka topics and partitions.
2. The driver assigns each executor a topic and partition to read and process.
3. Unless there is a shuffle-boundary operation, Spark will likely optimize the entire execution of a partition onto the same executor.

This means that a single executor will read a given TopicPartition and process the entire execution graph on it, unless we need to shuffle. Since a Kafka partition maps to a partition inside the RDD, we get that guarantee.
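
You can observe the per-partition ranges the driver computed for each batch through the direct API's HasOffsetRanges interface; `stream` below is assumed to be the direct stream from the earlier sketch:

```scala
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

// Each OffsetRange corresponds to one Kafka partition, one RDD partition,
// and therefore one read-and-process task for this batch.
stream.foreachRDD { rdd =>
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"${r.topic}-${r.partition}: offsets ${r.fromOffset} to ${r.untilOffset}")
  }
}
```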

Structured Streaming takes this even further. In Structured Streaming there is stickiness between the TopicPartition and the worker/executor: if a given worker was assigned a TopicPartition, it is likely to continue processing it for the entire lifetime of the application.
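
For comparison, a minimal Structured Streaming equivalent, where Spark plans the Kafka partitions itself and tends to keep a TopicPartition on the same executor across batches; the broker address and topic name are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("structured-sketch").getOrCreate()

// The kafka source handles partition planning; no receivers are involved.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// Print the decoded record values to the console for illustration.
val query = df.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("console")
  .start()

query.awaitTermination()
```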

