How do you set up multiple Spark Streaming jobs with different batch durations?


Problem description

We are in the beginning phases of transforming the current data architecture of a large enterprise and I am currently building a Spark Streaming ETL framework in which we would connect all of our sources to destinations (source/destinations could be Kafka topics, Flume, HDFS, etc.) through transformations. This would look something like:

SparkStreamingEtlManager.addEtl(Source, Transformation*, Destination)
SparkStreamingEtlManager.streamEtl()
streamingContext.start()
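
For context, here is a minimal sketch of what such a manager could look like on top of a single shared StreamingContext. The Source, Transformation and Destination abstractions and all names below are hypothetical, for illustration only, not from an existing library:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object EtlSketch {
  // Hypothetical abstractions, for illustration only.
  trait Source      { def read(ssc: StreamingContext): DStream[String] }
  trait Destination { def write(stream: DStream[String]): Unit }
  type Transformation = DStream[String] => DStream[String]

  class SparkStreamingEtlManager(ssc: StreamingContext) {
    // Register one ETL pipeline: source -> transformations -> destination.
    def addEtl(source: Source, transformations: Seq[Transformation], destination: Destination): Unit = {
      val input       = source.read(ssc)
      val transformed = transformations.foldLeft(input)((stream, t) => t(stream))
      destination.write(transformed)
    }

    // Starting the shared context starts every registered pipeline at once.
    def streamEtl(): Unit = ssc.start()
  }

  def main(args: Array[String]): Unit = {
    // One StreamingContext per application, hence one batch duration for all pipelines.
    val ssc     = new StreamingContext(new SparkConf().setAppName("spark-etl"), Seconds(10))
    val manager = new SparkStreamingEtlManager(ssc)
    // manager.addEtl(kafkaSource, Seq(parse, enrich), hdfsSink)
    manager.streamEtl()
    ssc.awaitTermination()
  }
}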

The assumption is that, since we should only have one SparkContext, we would deploy all of the ETL pipelines in one application/jar.

The problem with this is that the batchDuration is an attribute of the context itself and not of the ReceiverInputDStream (why is this?). Do we therefore need to have multiple Spark clusters, or allow for multiple SparkContexts and deploy multiple applications? Is there any other way to control the batch duration per receiver?
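
For reference, the batch duration is fixed when the StreamingContext is constructed, so every DStream created from that context inherits it; a minimal spark-shell style illustration (host names are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The batch duration is an argument of the context, not of any individual DStream.
val conf = new SparkConf().setAppName("batch-duration-demo")
val ssc  = new StreamingContext(conf, Seconds(5)) // 5-second batches for everything below

// Both receivers are driven by the same 5-second batch interval.
val streamA = ssc.socketTextStream("hostA", 9999)
val streamB = ssc.socketTextStream("hostB", 9999)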

Please let me know if any of my assumptions are naive or need to be rephrased. Thanks!

Answer

In my experience, different streams have different tuning requirements. Throughput, latency, capacity of the receiving side, SLAs to be respected, etc.

To cater for that multiplicity, we need to configure each Spark Streaming job to address its specific requirements: not only the batch interval but also resources such as memory and CPU, data partitioning, and the number of executor nodes (when the load is network bound).
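
As an illustration, each job would then carry its own SparkConf (or equivalent spark-submit flags); the application name and values below are placeholders chosen for the example:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each Spark Streaming application gets its own resource and tuning profile.
val conf = new SparkConf()
  .setAppName("kafka-to-hdfs-etl")                      // placeholder job name
  .set("spark.executor.memory", "4g")                   // sized for this stream's volume
  .set("spark.executor.cores", "2")
  .set("spark.executor.instances", "6")                 // executors dedicated to this job
  .set("spark.streaming.backpressure.enabled", "true")  // throttle intake to the processing rate
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

// ...and its own batch interval, chosen for this pipeline's latency/throughput trade-off.
val ssc = new StreamingContext(conf, Seconds(30))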

It follows that each Spark Streaming job becomes a separate job deployment on a Spark cluster. That also allows the separate pipelines to be monitored and managed independently of each other, and helps with further fine-tuning of each process.

In our case, we use Mesos + Marathon to manage our set of Spark Streaming jobs running 3600x24x7.

