ETL in Java Spring Batch vs Apache Spark Benchmarking


Problem description

I have been working with Apache Spark + Scala for over 5 years now (academic and professional experience). I have always found Spark/Scala to be one of the most robust combinations for building any kind of batch or streaming ETL/ELT application.

But lately, my client decided to use Java Spring Batch for two of our major pipelines:

  1. Read from MongoDB -> business logic -> write to a JSON file (~2 GB | 600K rows); a minimal sketch of this pipeline shape follows the list
  2. Read from Cassandra -> business logic -> write to a JSON file (~4 GB | 2M rows)
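For concreteness, here is a minimal, hypothetical sketch of what pipeline 1 could look like as a single chunk-oriented Spring Batch step. It assumes Spring Batch 4.1+ with Spring Data MongoDB on the classpath; the collection name, output path, chunk size, and bean names are illustrative, and the business logic is reduced to a pass-through processor.

```java
// Hypothetical sketch of pipeline 1: MongoDB -> business logic -> JSON file.
// Assumes Spring Batch 4.1+ and Spring Data MongoDB; names are illustrative only.
import java.util.Collections;

import org.bson.Document;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.data.MongoItemReader;
import org.springframework.batch.item.data.builder.MongoItemReaderBuilder;
import org.springframework.batch.item.json.JacksonJsonObjectMarshaller;
import org.springframework.batch.item.json.JsonFileItemWriter;
import org.springframework.batch.item.json.builder.JsonFileItemWriterBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;
import org.springframework.data.domain.Sort;
import org.springframework.data.mongodb.core.MongoTemplate;

@Configuration
@EnableBatchProcessing
public class MongoToJsonJobConfig {

    @Bean
    public MongoItemReader<Document> mongoReader(MongoTemplate mongoTemplate) {
        return new MongoItemReaderBuilder<Document>()
                .name("mongoReader")
                .template(mongoTemplate)
                .collection("source_collection")                 // hypothetical collection
                .jsonQuery("{}")                                 // read every document
                .sorts(Collections.singletonMap("_id", Sort.Direction.ASC))
                .targetType(Document.class)
                .build();
    }

    @Bean
    public ItemProcessor<Document, Document> businessLogicProcessor() {
        return item -> item;                                     // placeholder for real business logic
    }

    @Bean
    public JsonFileItemWriter<Document> jsonWriter() {
        return new JsonFileItemWriterBuilder<Document>()
                .name("jsonWriter")
                .resource(new FileSystemResource("output/pipeline1.json"))
                .jsonObjectMarshaller(new JacksonJsonObjectMarshaller<>())
                .build();
    }

    @Bean
    public Step mongoToJsonStep(StepBuilderFactory steps,
                                MongoItemReader<Document> mongoReader,
                                ItemProcessor<Document, Document> businessLogicProcessor,
                                JsonFileItemWriter<Document> jsonWriter) {
        return steps.get("mongoToJsonStep")
                .<Document, Document>chunk(1000)                 // one transaction per 1000 items
                .reader(mongoReader)
                .processor(businessLogicProcessor)
                .writer(jsonWriter)
                .build();
    }

    @Bean
    public Job mongoToJsonJob(JobBuilderFactory jobs, Step mongoToJsonStep) {
        return jobs.get("mongoToJsonJob").start(mongoToJsonStep).build();
    }
}
```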

I was pretty baffled by this enterprise-level decision. I agree there are greater minds than mine in the industry, but I was unable to comprehend the need for this move.

My questions here are:

  1. Has anybody compared the performance between Apache Spark and Java Spring Batch?
  2. What could be the advantages of using Spring Batch over Spark?
  3. Is Spring Batch "truly distributed" when compared to Apache Spark? I came across methods like chunk() and partitioning in the official docs, but I was not convinced of its true distributedness. After all, Spring Batch runs on a single JVM instance. Doesn't it?

I'm unable to wrap my head around these. So, I want to use this platform for an open discussion between Spring Batch and Apache Spark.

Answer

As the lead of the Spring Batch project, I'm sure you'll understand that I have a specific perspective. However, before beginning, I should call out that the frameworks we are talking about were designed for two very different use cases. Spring Batch was designed to handle traditional, enterprise batch processing on the JVM. It was designed to apply well-understood patterns that are commonplace in enterprise batch processing and make them convenient in a framework for the JVM. Spark, on the other hand, was designed for big data and machine learning use cases. Those use cases have different patterns, challenges, and goals than a traditional enterprise batch system, and that is reflected in the design of the framework. That being said, here are my answers to your specific questions.

Has anybody compared the performance between Apache Spark and Java Spring Batch?

No one can really answer this question for you. Performance benchmarks are a very specific thing. Use cases matter. Hardware matters. I encourage you to do your own benchmarks and performance profiling to determine what works best for your use cases in your deployment topologies.

What could be the advantages of using Spring Batch over Spark?

Programming model similar to other enterprise workloads
Enterprises need to be aware of the resources they have on hand when making architectural decisions. Is adopting new technology X worth the retraining or hiring overhead compared to sticking with technology Y? In the case of Spark vs Spring Batch, the ramp-up for an existing Spring developer on Spring Batch is very minimal. I can take any developer that is comfortable with Spring and make them fully productive with Spring Batch very quickly. Spark has a steeper learning curve for the average enterprise developer, not only because of the overhead of learning the Spark framework itself but also because of all the related technologies needed to productionize a Spark job in that ecosystem (HDFS, Oozie, etc.).

No dedicated infrastructure required
When running in a distributed environment, you need to configure a cluster using YARN, Mesos, or Spark's own cluster installation (there is an experimental Kubernetes option available at the time of this writing, but, as noted, it is labeled as experimental). This requires dedicated infrastructure for specific use cases. Spring Batch can be deployed on any infrastructure. You can execute it via Spring Boot with executable JAR files, you can deploy it into servlet containers or application servers, and you can run Spring Batch jobs via YARN or any cloud provider. Moreover, if you use Spring Boot's executable JAR concept, there is nothing to set up in advance, even when running a distributed application on the same cloud-based infrastructure that runs your other workloads.
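As an illustration of that point, a Spring Batch job packaged with Spring Boot needs little more than a main class and the spring-boot-starter-batch dependency. This is only a sketch of the pattern, not a complete application, and the class name is hypothetical.

```java
// Minimal Spring Boot entry point for a batch application (sketch).
// Assumes spring-boot-starter-batch is on the classpath and at least one Job bean is defined.
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@EnableBatchProcessing
public class BatchApplication {

    public static void main(String[] args) {
        // By default, Spring Boot runs the configured Job(s) on startup; mapping the
        // application exit code lets the process exit code reflect the batch outcome.
        System.exit(SpringApplication.exit(SpringApplication.run(BatchApplication.class, args)));
    }
}
```

Packaged as an executable JAR, this runs anywhere a JVM runs (java -jar batch-app.jar); no resource manager or cluster installation has to exist first.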

More out of the box readers/writers simplify job creation
The Spark ecosystem is focused around big data use cases. Because of that, the components it provides out of the box for reading and writing are focused on those use cases. Things like different serialization options for reading files commonly used in big data use cases are handled natively. However, processing things like chunks of records within a transaction is not.

Spring Batch, on the other hand, provides a complete suite of components for declarative input and output. Reading and writing flat files, XML files, databases, NoSQL stores, messaging queues, sending emails...the list goes on. Spring Batch provides all of those out of the box.
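As a hypothetical example of one of those declarative components (my own sketch, not code from the answer above), a delimited flat-file reader maps each CSV line to a domain object without any hand-written parsing code. It assumes Spring Batch 4.x; the file path, column names, and Customer type are invented for the example.

```java
// Sketch of an out-of-the-box Spring Batch reader: a delimited flat-file reader.
// The file layout and Customer type are illustrative assumptions.
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.core.io.FileSystemResource;

public class CustomerReaderConfig {

    // Simple illustrative domain type.
    public static class Customer {
        final long id;
        final String firstName;
        final String lastName;

        Customer(long id, String firstName, String lastName) {
            this.id = id;
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    public static FlatFileItemReader<Customer> customerReader() {
        return new FlatFileItemReaderBuilder<Customer>()
                .name("customerReader")
                .resource(new FileSystemResource("input/customers.csv")) // hypothetical file
                .linesToSkip(1)                                          // skip the header row
                .delimited()
                .names(new String[] {"id", "firstName", "lastName"})
                .fieldSetMapper(fieldSet -> new Customer(                // map each line to a Customer
                        fieldSet.readLong("id"),
                        fieldSet.readString("firstName"),
                        fieldSet.readString("lastName")))
                .build();
    }
}
```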

Spark was built for big data...not all use cases are big data use cases
In short, Spark’s features are specific for the domain it was built for: big data and machine learning. Things like transaction management (or transactions at all) do not exist in Spark. The idea of rolling back when an error occurs doesn’t exist (to my knowledge) without custom code. More robust error handling use cases like skip/retry are not provided at the level of the framework. State management for things like restarting is much heavier in Spark than Spring Batch (persisting the entire RDD vs storing trivial state for specific components). All of these features are native features of Spring Batch.
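To make that last point concrete, here is a hypothetical sketch of how skip and retry are declared at the framework level in a Spring Batch step, using the Spring Batch 4.x builders. The bean names, chunk size, limits, and exception types are illustrative, and the reader/writer beans are assumed to be defined elsewhere.

```java
// Sketch of framework-level skip/retry in a fault-tolerant, chunk-oriented step.
// Assumes @EnableBatchProcessing is configured elsewhere; names and limits are illustrative.
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.FlatFileParseException;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.dao.TransientDataAccessException;

@Configuration
public class FaultTolerantStepConfig {

    @Bean
    public Step faultTolerantStep(StepBuilderFactory steps,
                                  ItemReader<String> reader,      // assumed to exist elsewhere
                                  ItemWriter<String> writer) {    // assumed to exist elsewhere
        return steps.get("faultTolerantStep")
                .<String, String>chunk(500)                       // one transaction per chunk
                .reader(reader)
                .writer(writer)
                .faultTolerant()
                .retry(TransientDataAccessException.class)        // retry transient failures...
                .retryLimit(3)                                    // ...up to 3 attempts per item
                .skip(FlatFileParseException.class)               // skip unparseable records...
                .skipLimit(100)                                   // ...failing the job after 100 skips
                .build();
    }
}
```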

Is Spring Batch "truly distributed"?

One of the advantages of Spring Batch is the ability to evolve a batch process from a simple sequentially executed, single JVM process to a fully distributed, clustered solution with minimal changes. Spring Batch supports two main distributed modes:

  1. Remote Partitioning - Here Spring Batch runs in a master/worker configuration. The master delegates work to workers based on the orchestration mechanism (there are many options here). Full restartability, error handling, etc. are all available with this approach, with minimal network overhead (only the metadata describing each partition is transmitted to the remote JVMs). Spring Cloud Task also provides extensions to Spring Batch that allow cloud-native mechanisms to dynamically deploy the workers. A minimal Partitioner sketch follows this list.
  2. Remote Chunking - Remote chunking delegates only the processing and writing phases of a step to a remote JVM. Still using a master/worker configuration, the master is responsible for providing the data to the workers for processing and writing. In this topology, the data travels over the wire, causing a heavier network load. It is typically used only when the processing advantages outweigh the overhead of the added network traffic.
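As a hypothetical illustration of the remote partitioning model in item 1, a Partitioner only has to describe the work, for example as ID ranges; each worker step then reads its own slice of the data locally, so only this metadata crosses the network. The range bounds and names below are invented for the example (Spring Batch 4.x API assumed).

```java
// Sketch of a Partitioner that splits an ID range into per-worker ExecutionContexts.
// Only this metadata is sent to remote workers; the data itself stays where it is.
import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class IdRangePartitioner implements Partitioner {

    private final long minId = 1;
    private final long maxId = 2_000_000;   // e.g. the 2M Cassandra rows from the question

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        long rangeSize = (maxId - minId + 1) / gridSize + 1;
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            context.putLong("minId", minId + i * rangeSize);
            context.putLong("maxId", Math.min(maxId, minId + (i + 1) * rangeSize - 1));
            partitions.put("partition" + i, context);   // each worker step receives one context
        }
        return partitions;
    }
}
```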

There are other Stack Overflow answers that discuss these features in further detail (as does the documentation):

Advantages of spring batch
Difference between spring batch remote chunking and remote partitioning
Spring Batch Documentation
