Spring Batch: difference between multithreading and partitioning


Question

I cannot understand the difference between multithreading and partitioning in Spring Batch. The implementation is of course different: in partitioning you need to prepare the partitions and then process them. I want to know what the difference is, and which one is the more efficient way to process when the bottleneck is the item processor.

Recommended answer

TL;DR
Neither approach is intended to help when the bottleneck is in the processor. You will see some gains from having multiple items going through a processor at the same time, but both of the options you point out deliver their full benefit only in processes that are I/O bound. The AsyncItemProcessor/AsyncItemWriter may be a better option.

Overview of Spring Batch Scalability
There are five options for scaling Spring Batch jobs:

  1. Multithreaded step
  2. Parallel steps
  3. Partitioning
  4. Remote chunking
  5. AsyncItemProcessor/AsyncItemWriter

Each has its own benefits and disadvantages. Let's walk through each:

Multithreaded step
A multithreaded step takes a single step and executes each chunk within that step on a separate thread. This means that the same instances of each of the batch components (readers, writers, etc.) are shared across the threads. This can increase performance by adding some parallelism to the step, in most cases at the cost of restartability. You sacrifice restartability because the ability to restart is usually based on the state maintained within the reader, writer, and so on. With multiple threads updating that state, it becomes invalid and useless for restart. Because of this, you typically need to turn save state off on the individual components and set the restartable flag to false on the job.
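To make the shared-state problem concrete, here is a minimal sketch in plain java.util.concurrent. It is not the Spring Batch API: `SharedReader`, `run`, and the doubling "processor" are hypothetical stand-ins. Several threads pull chunks from one reader instance, so every item is handled, but chunks finish in no particular order and the reader's single `position` field no longer says which items were actually written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class MultithreadedStepSketch {

    /** Hypothetical reader shared by all threads; hands out one chunk at a time. */
    static final class SharedReader {
        private final List<Integer> items;
        private int position = 0; // single piece of state, advanced by every thread

        SharedReader(List<Integer> items) {
            this.items = items;
        }

        /** Returns the next chunk, or an empty list when the input is exhausted. */
        synchronized List<Integer> readChunk(int chunkSize) {
            int end = Math.min(position + chunkSize, items.size());
            List<Integer> chunk = new ArrayList<>(items.subList(position, end));
            position = end;
            return chunk;
        }
    }

    /** Processes every item using `threads` workers that share one reader instance. */
    static List<Integer> run(List<Integer> input, int chunkSize, int threads) {
        SharedReader reader = new SharedReader(input);
        ConcurrentLinkedQueue<Integer> written = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                List<Integer> chunk;
                while (!(chunk = reader.readChunk(chunkSize)).isEmpty()) {
                    for (int item : chunk) {
                        written.add(item * 2); // stand-in for processor + writer
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(written);
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            input.add(i);
        }
        // The write order can differ from run to run.
        System.out.println(run(input, 3, 4));
    }
}
```

If the job died halfway through, `position` could already be past items whose chunks had not been written yet, which is why saved reader state is not a safe restart point here.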

Parallel steps
Parallel steps are achieved via a split. A split allows you to execute multiple independent steps in parallel via threads. This does not sacrifice restartability, but it does not help improve the performance of a single step or piece of business logic.

Partitioning
Partitioning is the dividing of data, in advance, into smaller chunks (called partitions) by a master step, and then having slaves work independently on the partitions. In Spring Batch, the master and each slave are independent steps, so you can get the benefits of parallelism within a single step without sacrificing restartability. Partitioning also provides the ability to scale beyond a single JVM, in that the slaves do not have to be local (you can use various communication mechanisms to communicate with remote slaves).

An important note about partitioning is that the only communication between the master and a slave is a description of the data, not the data itself. For example, the master may tell slave1 to process records 1-100, slave2 to process records 101-200, and so on. The master does not send the actual data, only the information the slave needs to obtain the data it is supposed to process. Because of this, the data must be local to the slave processes, while the master can be located anywhere.
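The "description, not data" idea can be sketched in plain Java as follows. This is an illustration under assumptions, not Spring Batch's partitioning API: `Range`, `partition`, `work`, and `runAll` are hypothetical names, records are assumed to have contiguous ids starting at 1, and summing ids stands in for loading and processing each record. The workers here play the role of the slaves described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PartitioningSketch {

    /** A partition is only a description of the data: an inclusive id range. */
    static final class Range {
        final int minId;
        final int maxId;

        Range(int minId, int maxId) {
            this.minId = minId;
            this.maxId = maxId;
        }
    }

    /** Master side: split ids 1..totalRecords into at most gridSize ranges. */
    static List<Range> partition(int totalRecords, int gridSize) {
        List<Range> ranges = new ArrayList<>();
        int size = (totalRecords + gridSize - 1) / gridSize; // ceiling division
        for (int start = 1; start <= totalRecords; start += size) {
            ranges.add(new Range(start, Math.min(start + size - 1, totalRecords)));
        }
        return ranges;
    }

    /** Worker side: obtain the records for the given range itself and process them. */
    static long work(Range range) {
        long sum = 0;
        for (int id = range.minId; id <= range.maxId; id++) {
            sum += id; // stand-in for "load record by id, then process it"
        }
        return sum;
    }

    /** Runs one worker per partition and combines the per-partition results. */
    static long runAll(int totalRecords, int gridSize) {
        ExecutorService pool = Executors.newFixedThreadPool(gridSize);
        try {
            List<Future<Long>> results = new ArrayList<>();
            for (Range range : partition(totalRecords, gridSize)) {
                Callable<Long> task = () -> work(range);
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get();
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        for (Range r : partition(200, 2)) {
            System.out.println("process records " + r.minId + "-" + r.maxId);
        }
        System.out.println("total = " + runAll(200, 2));
    }
}
```

Only two integers per partition cross the master/worker boundary, regardless of how many records each range covers.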

Remote chunking
Remote chunking allows you to scale the processing, and optionally the write logic, across JVMs. In this use case, the master reads the data and then sends it over the wire to the slaves, where it is processed and then either written locally by the slave or returned to the master to be written there.

The important difference between partitioning and remote chunking is that instead of a description going over the wire, remote chunking sends the actual data over the wire. So instead of a single packet saying "process records 1-100", remote chunking sends the actual records 1-100. This can have a large impact on the I/O profile of a step, but if the processor is enough of a bottleneck, it can be useful.
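For contrast with partitioning, the sketch below shows the data flow of remote chunking inside a single JVM. It is a simplification under stated assumptions: an in-memory `BlockingQueue` named `wire` stands in for the real transport between JVMs, the empty `STOP` list is a made-up shutdown signal, and upper-casing a string stands in for the processor and writer. None of these names come from Spring Batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class RemoteChunkingSketch {

    /** Sentinel chunk telling a worker to stop (an assumption of this sketch). */
    private static final List<String> STOP = new ArrayList<>();

    /** The master reads all records and ships them, chunk by chunk, to the workers. */
    static List<String> run(List<String> records, int chunkSize, int workers) {
        BlockingQueue<List<String>> wire = new LinkedBlockingQueue<>(); // stand-in for the network
        ConcurrentLinkedQueue<String> written = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            for (int w = 0; w < workers; w++) {
                pool.submit(() -> {
                    try {
                        List<String> chunk;
                        while ((chunk = wire.take()) != STOP) {
                            for (String item : chunk) {
                                // process, then write on the worker side
                                written.add(item.toUpperCase(Locale.ROOT));
                            }
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Master side: the actual records travel over the wire, not a description.
            for (int i = 0; i < records.size(); i += chunkSize) {
                int end = Math.min(i + chunkSize, records.size());
                wire.put(new ArrayList<>(records.subList(i, end)));
            }
            for (int w = 0; w < workers; w++) {
                wire.put(STOP); // one stop signal per worker
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(written);
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 1; i <= 6; i++) {
            records.add("record-" + i);
        }
        // The write order can differ from run to run.
        System.out.println(run(records, 2, 3));
    }
}
```

Every record passes through `wire`, so the transport cost grows with the data volume. That trade is only worthwhile when processing an item costs much more than moving it.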

AsyncItemProcessor/AsyncItemWriter
The final option for scaling Spring Batch processes is the AsyncItemProcessor/AsyncItemWriter combination. In this case, the AsyncItemProcessor wraps your ItemProcessor implementation and executes the call to your implementation on a separate thread. The AsyncItemProcessor then returns a Future that is passed to the AsyncItemWriter, where it is unwrapped and passed to the delegate ItemWriter implementation.

Because of how data flows through this option, certain listener scenarios are not supported (we don't know the outcome of the ItemProcessor call until we are inside the ItemWriter). Overall, though, it is a useful tool for parallelizing just the ItemProcessor logic in a single JVM without sacrificing restartability.
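The wrap-and-unwrap flow can be mimicked with standard Java types. The sketch below is an approximation rather than Spring Batch's own classes: `asyncProcessor`, `asyncWriter`, and `runChunk` are hypothetical helpers, with `Function` and `Consumer` standing in for ItemProcessor and ItemWriter, and squaring a number standing in for expensive processing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Function;

final class AsyncProcessorSketch {

    /** Wraps a delegate processor: each call is submitted to the pool and returns a Future at once. */
    static <I, O> Function<I, Future<O>> asyncProcessor(Function<I, O> delegate, ExecutorService pool) {
        return item -> {
            Callable<O> task = () -> delegate.apply(item);
            return pool.submit(task);
        };
    }

    /** Unwraps each Future, blocking until it is done, then hands the results to the delegate writer. */
    static <O> Consumer<List<Future<O>>> asyncWriter(Consumer<List<O>> delegate) {
        return futures -> {
            List<O> items = new ArrayList<>();
            for (Future<O> future : futures) {
                try {
                    items.add(future.get()); // a processor failure only surfaces here, in the writer
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            delegate.accept(items);
        };
    }

    /** Runs one chunk through the async processor and writer; returns what was written. */
    static List<Integer> runChunk(List<Integer> chunk) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            Function<Integer, Future<Integer>> processor = asyncProcessor(n -> n * n, pool);
            List<Integer> written = new ArrayList<>();
            Consumer<List<Future<Integer>>> writer = asyncWriter(written::addAll);

            List<Future<Integer>> futures = new ArrayList<>();
            for (Integer item : chunk) {
                futures.add(processor.apply(item)); // items are now processed concurrently
            }
            writer.accept(futures);
            return written;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> chunk = new ArrayList<>();
        for (int i = 1; i <= 5; i++) {
            chunk.add(i);
        }
        System.out.println(runChunk(chunk));
    }
}
```

The reading thread stays single, and the futures are unwrapped in read order, so the written items keep the order in which they were read. The `future.get()` call is also where a processing error first becomes visible, which matches the listener limitation noted above.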
