Spring Batch: Which ItemReader implementation to use for high volume & low latency

Problem description

Use case: Read 10 million rows [10 columns] from database and write to a file (csv format).

  1. Which ItemReader implementation among JdbcCursorItemReader & JdbcPagingItemReader would be suggested? What would be the reason?
  2. Which would be better performing (fast) in the above use case?
  3. Would the selection be different in case of a single-process vs multi-process approach?
  4. In case of a multi-threaded approach using TaskExecutor, which one would be better & simple?
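For reference, a minimal single-threaded baseline for this use case could look roughly like the sketch below, using the Spring Batch 4.x builder APIs. The person table, its columns, the Person POJO, and the output path are hypothetical placeholders chosen only for illustration.

    import javax.sql.DataSource;

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.database.JdbcCursorItemReader;
    import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
    import org.springframework.batch.item.file.FlatFileItemWriter;
    import org.springframework.batch.item.file.builder.FlatFileItemWriterBuilder;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.core.io.FileSystemResource;
    import org.springframework.jdbc.core.BeanPropertyRowMapper;

    // Assumes @EnableBatchProcessing is configured elsewhere.
    // Person is assumed to be a simple POJO with one property per column (id, firstName, lastName, ...).
    @Configuration
    public class ExportJobConfiguration {

        // Streams the result set over a single cursor; fetchSize is only a hint to the driver.
        @Bean
        public JdbcCursorItemReader<Person> personReader(DataSource dataSource) {
            return new JdbcCursorItemReaderBuilder<Person>()
                    .name("personReader")
                    .dataSource(dataSource)
                    .sql("SELECT id, first_name, last_name FROM person")  // hypothetical table/columns
                    .rowMapper(new BeanPropertyRowMapper<>(Person.class))
                    .fetchSize(1000)
                    .build();
        }

        // Writes each row as a comma-delimited line.
        @Bean
        public FlatFileItemWriter<Person> personWriter() {
            return new FlatFileItemWriterBuilder<Person>()
                    .name("personWriter")
                    .resource(new FileSystemResource("output/persons.csv"))
                    .delimited()
                    .delimiter(",")
                    .names(new String[] {"id", "firstName", "lastName"})
                    .build();
        }

        // Chunk-oriented step: reads and writes 1000 rows per transaction.
        @Bean
        public Step exportStep(StepBuilderFactory stepBuilderFactory,
                               JdbcCursorItemReader<Person> personReader,
                               FlatFileItemWriter<Person> personWriter) {
            return stepBuilderFactory.get("exportStep")
                    .<Person, Person>chunk(1000)
                    .reader(personReader)
                    .writer(personWriter)
                    .build();
        }
    }

The sketches accompanying the answer below build on these same hypothetical beans.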

Answer

To process that kind of data, you're probably going to want to parallelize it if that is possible (the only thing preventing it would be if the output file needed to retain an order from the input). Assuming you are going to parallelize your processing, you are then left with two main options for this type of use case (from what you have provided):

  1. Multi-threaded step - This processes each chunk on its own thread until the work is complete. It allows parallelization in a very simple way (just add a TaskExecutor to your step definition; a minimal sketch follows this list). With this you do lose restartability out of the box, because you need to turn off state persistence on either of the ItemReaders you mentioned (there are ways around this, such as flagging records in the database as processed, etc.).
  2. Partitioning - This breaks your input data into partitions that are processed in parallel by step instances (master/slave configuration). Partitioning can be executed locally via threads (through a TaskExecutor) or remotely via remote partitioning. In either case you get restartability along with the parallelization (each step processes its own data, so one partition does not step on the state of another).
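As a rough sketch of option 1, reusing the hypothetical personReader/personWriter beans from the baseline above inside the same configuration class, wiring a TaskExecutor into the step is all that changes; the reader would also need saveState(false), which is where the out-of-the-box restartability is lost:

    // Additional imports assumed: org.springframework.core.task.TaskExecutor,
    // org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor,
    // org.springframework.batch.item.ItemReader, org.springframework.batch.item.ItemWriter.

    @Bean
    public TaskExecutor stepTaskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(4);              // tune to the environment and the database
        executor.setMaxPoolSize(4);
        executor.setThreadNamePrefix("export-");
        executor.initialize();
        return executor;
    }

    @Bean
    public Step multiThreadedExportStep(StepBuilderFactory stepBuilderFactory,
                                        ItemReader<Person> personReader,
                                        ItemWriter<Person> personWriter,
                                        TaskExecutor stepTaskExecutor) {
        // The reader must be thread-safe here (e.g. a paging reader, or a cursor reader
        // wrapped in a SynchronizedItemStreamReader) and configured with saveState(false).
        return stepBuilderFactory.get("multiThreadedExportStep")
                .<Person, Person>chunk(1000)
                .reader(personReader)
                .writer(personWriter)             // FlatFileItemWriter is not thread-safe (see the final note below)
                .taskExecutor(stepTaskExecutor)
                .throttleLimit(4)                 // cap the number of concurrent chunks
                .build();
    }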

I did a talk on processing data in parallel with Spring Batch. Specifically, the example I present is a remote partitioned job. You can view it here: https://www.youtube.com/watch?v=CYTj5YT7CZU

For your specific questions:

  1. Which ItemReader implementation among JdbcCursorItemReader & JdbcPagingItemReader would be suggested? What would be the reason? - Either of these two options can be tuned to meet many performance needs. It really depends on the database you're using, driver options available as well as processing models you can support. Another consideration is, do you need restartability? (For a concrete comparison, a paging-reader configuration is sketched after this list.)
  2. Which would be better performing (fast) in the above use case? - Again it depends on your processing model chosen.
  3. Would the selection be different in case of a single-process vs multi-process approach? - This goes to how you manage jobs more so than what Spring Batch can handle. The question is, do you want to manage partitioning external to the job (passing in the data description to the job as parameters) or do you want the job to manage it (via partitioning).
  4. In case of a multi-threaded approach using TaskExecutor, which one would be better & simple? - I won't deny that remote partitioning adds a level of complexity that local partitioning and multithreaded steps don't have.
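To make the comparison in point 1 concrete, a paging-based reader for the same hypothetical person table might be configured as below (Spring Batch 4.x builder API; the cursor-based reader was sketched with the question above). Unlike the cursor reader, which streams one ResultSet over a single connection, the paging reader issues page-sized queries against a required unique sort key, which also makes it safer to use from multiple threads or partitions:

    // Additional imports assumed: java.util.HashMap, java.util.Map,
    // org.springframework.batch.item.database.JdbcPagingItemReader,
    // org.springframework.batch.item.database.Order,
    // org.springframework.batch.item.database.builder.JdbcPagingItemReaderBuilder.

    @Bean
    public JdbcPagingItemReader<Person> pagingPersonReader(DataSource dataSource) {
        // A unique sort key is required so every page picks up exactly where the previous one ended.
        Map<String, Order> sortKeys = new HashMap<>();
        sortKeys.put("id", Order.ASCENDING);

        return new JdbcPagingItemReaderBuilder<Person>()
                .name("pagingPersonReader")
                .dataSource(dataSource)
                .selectClause("SELECT id, first_name, last_name")   // hypothetical columns
                .fromClause("FROM person")                          // hypothetical table
                .sortKeys(sortKeys)
                .pageSize(1000)          // rows fetched per query; tune together with the chunk size
                .rowMapper(new BeanPropertyRowMapper<>(Person.class))
                .build();
    }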

I'd start with the basic step definition. Then try a multithreaded step. If that doesn't meet your needs, then move to local partitioning, and finally remote partitioning if needed. Keep in mind that Spring Batch was designed to make that progression as painless as possible. You can go from a regular step to a multithreaded step with only configuration updates. To go to partitioning, you need to add a single new class (a Partitioner implementation) and some configuration updates.
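For illustration, that single new class could look roughly like the column-range partitioner used in the Spring Batch samples, adapted here to the hypothetical person table: it splits the id range into gridSize slices, and each worker step reads only its own slice (for example via a where clause bound to the minId/maxId values placed in the step execution context):

    import java.util.HashMap;
    import java.util.Map;

    import javax.sql.DataSource;

    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.batch.item.ExecutionContext;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class ColumnRangePartitioner implements Partitioner {

        private final JdbcTemplate jdbcTemplate;

        public ColumnRangePartitioner(DataSource dataSource) {
            this.jdbcTemplate = new JdbcTemplate(dataSource);
        }

        @Override
        public Map<String, ExecutionContext> partition(int gridSize) {
            // Hypothetical table/key; each partition receives a contiguous id range.
            long min = jdbcTemplate.queryForObject("SELECT MIN(id) FROM person", Long.class);
            long max = jdbcTemplate.queryForObject("SELECT MAX(id) FROM person", Long.class);
            long targetSize = (max - min) / gridSize + 1;

            Map<String, ExecutionContext> partitions = new HashMap<>();
            long start = min;
            int partitionNumber = 0;
            while (start <= max) {
                ExecutionContext context = new ExecutionContext();
                context.putLong("minId", start);
                context.putLong("maxId", Math.min(start + targetSize - 1, max));
                partitions.put("partition" + partitionNumber++, context);
                start += targetSize;
            }
            return partitions;
        }
    }

The master step would then combine this partitioner with the worker step and either a TaskExecutor (local partitioning) or a remote worker setup, as shown in the talk linked above.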

One final note. Most of this has talked about parallelizing the processing of this data. Spring Batch's FlatFileItemWriter is not thread safe. Your best bet would be to write to multiple files in parallel, then aggregate them afterwards if speed is your number one concern.
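One hedged way to act on that suggestion, continuing the hypothetical partitioned setup sketched above, is to make the writer step-scoped so every partition gets its own FlatFileItemWriter and its own output file (named here from the partition's minId value), and then concatenate the per-partition files in a final step or a simple script once the job completes:

    // Additional imports assumed: org.springframework.batch.core.configuration.annotation.StepScope,
    // org.springframework.beans.factory.annotation.Value.

    // One writer instance per partition, so no two threads ever share a FlatFileItemWriter.
    @Bean
    @StepScope
    public FlatFileItemWriter<Person> partitionedPersonWriter(
            @Value("#{stepExecutionContext['minId']}") Long minId) {
        return new FlatFileItemWriterBuilder<Person>()
                .name("partitionedPersonWriter")
                .resource(new FileSystemResource("output/persons-" + minId + ".csv"))
                .delimited()
                .names(new String[] {"id", "firstName", "lastName"})
                .build();
    }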
