Spring Batch: Which ItemReader implementation to use for high volume & low latency


Question

Use case: Read 10 million rows [10 columns] from database and write to a file (csv format).

  1. Which ItemReader implementation, JdbcCursorItemReader or JdbcPagingItemReader, would be suggested, and what would be the reason?
  2. Which would perform better (faster) in the above use case?
  3. Would the selection be different for a single-process vs a multi-process approach?
  4. With a multi-threaded approach using TaskExecutor, which one would be better and simpler?

Recommended answer

To process that kind of data, you're probably going to want to parallelize it if that is possible (the only thing preventing it would be if the output file needed to retain an order from the input). Assuming you are going to parallelize your processing, you are then left with two main options for this type of use case (from what you have provided):

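  1. Multithreaded step - This processes one chunk per thread until the input is exhausted. It is a very easy way to parallelize (simply add a TaskExecutor to your step definition). With this, you do lose restartability out of the box, because you need to turn off state persistence on either of the ItemReaders you mentioned (there are ways around this, such as flagging records in the database as processed).
  2. Partitioning - This breaks your input data into partitions that are processed in parallel by step instances (a master/slave configuration). The partitions can be executed locally on threads (via a TaskExecutor) or remotely via remote partitioning. In either case you get restartability together with parallelization, since each step processes its own data and no state is shared from partition to partition.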
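
As a rough illustration of the first option, here is a minimal configuration sketch of a multithreaded step. It assumes Spring Batch 4.x Java configuration (StepBuilderFactory); the table name, column names, bean names, and chunk/page sizes are hypothetical, and the writer bean is assumed to be defined elsewhere and safe to call from several threads. It uses JdbcPagingItemReader because its Javadoc describes it as thread-safe when saveState is false, which is exactly the restartability trade-off described above.

```java
import java.util.Collections;
import java.util.Map;
import javax.sql.DataSource;

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.JdbcPagingItemReader;
import org.springframework.batch.item.database.Order;
import org.springframework.batch.item.database.builder.JdbcPagingItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;
import org.springframework.jdbc.core.ColumnMapRowMapper;

// Sketch only: assumes Spring Batch 4.x; all names below are hypothetical.
@Configuration
@EnableBatchProcessing
public class ExportStepConfig {

    @Bean
    public JdbcPagingItemReader<Map<String, Object>> exportReader(DataSource dataSource) {
        return new JdbcPagingItemReaderBuilder<Map<String, Object>>()
                .name("exportReader")
                .dataSource(dataSource)
                .selectClause("SELECT id, col1, col2")   // hypothetical columns
                .fromClause("FROM export_source")        // hypothetical table
                // Paging needs a unique sort key to produce stable pages.
                .sortKeys(Collections.singletonMap("id", Order.ASCENDING))
                .rowMapper(new ColumnMapRowMapper())
                .pageSize(1000)
                // State persistence is turned off for multithreaded use,
                // so the step is no longer restartable out of the box.
                .saveState(false)
                .build();
    }

    @Bean
    public Step exportStep(StepBuilderFactory steps,
                           JdbcPagingItemReader<Map<String, Object>> exportReader,
                           ItemWriter<Map<String, Object>> writer) {
        return steps.get("exportStep")
                .<Map<String, Object>, Map<String, Object>>chunk(1000)
                .reader(exportReader)
                .writer(writer) // assumed to tolerate concurrent calls
                // Adding a TaskExecutor is what makes the step multithreaded:
                // each chunk is processed on its own thread.
                .taskExecutor(new SimpleAsyncTaskExecutor("export-"))
                .build();
    }
}
```

Removing the taskExecutor line turns this back into a regular single-threaded step, which makes it easy to compare the two on your own data.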

I did a talk on processing data in parallel with Spring Batch. Specifically, the example I present is a remote partitioned job. You can view it here: https://www.youtube.com/watch?v=CYTj5YT7CZU

To your specific questions:

  1. Which ItemReader implementation among JdbcCursorItemReader & JdbcPagingItemReader would be suggested? What would be the reason? - Either of these two options can be tuned to meet many performance needs. It really depends on the database you're using, driver options available as well as processing models you can support. Another consideration is, do you need restartability?
  2. Which would be better performing (fast) in the above use case? - Again it depends on your processing model chosen.
  3. Would the selection be different in case of a single-process vs multi-process approach? - This goes to how you manage jobs more so than what Spring Batch can handle. The question is, do you want to manage partitioning external to the job (passing in the data description to the job as parameters) or do you want the job to manage it (via partitioning)?
  4. In case of a multi-threaded approach using TaskExecutor, which one would be better & simple? - I won't deny that remote partitioning adds a level of complexity that local partitioning and multithreaded steps don't have.

I'd start with the basic step definition. Then try a multithreaded step. If that doesn't meet your needs, then move to local partitioning, and finally remote partitioning if needed. Keep in mind that Spring Batch was designed to make that progression as painless as possible. You can go from a regular step to a multithreaded step with only configuration updates. To go to partitioning, you need to add a single new class (a Partitioner implementation) and some configuration updates.
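
Most of the work in a Partitioner implementation is deciding how to split the input. A minimal sketch of that splitting logic in plain Java follows, assuming the source table has a numeric primary key whose minimum and maximum values are known; PartitionRanges, Range, and split are hypothetical names, not part of Spring Batch. In a real job, each range would be stored in an ExecutionContext and returned from Partitioner.partition(int gridSize).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits an inclusive id range into contiguous sub-ranges.
final class PartitionRanges {

    /** Inclusive id range handled by one partition. */
    static final class Range {
        final long minId;
        final long maxId;

        Range(long minId, long maxId) {
            this.minId = minId;
            this.maxId = maxId;
        }
    }

    /** Returns at most gridSize ranges that together cover [minId, maxId]. */
    static List<Range> split(long minId, long maxId, int gridSize) {
        if (gridSize <= 0 || maxId < minId) {
            throw new IllegalArgumentException("invalid range or grid size");
        }
        long total = maxId - minId + 1;
        // Ceiling division so the last range absorbs any remainder.
        long size = (total + gridSize - 1) / gridSize;
        List<Range> ranges = new ArrayList<>();
        for (long start = minId; start <= maxId; start += size) {
            long end = Math.min(start + size - 1, maxId);
            ranges.add(new Range(start, end));
        }
        return ranges;
    }
}
```

Each partition's reader would then restrict its query to its own range, for example with a WHERE clause bounded by minId and maxId, so no two partitions read the same rows.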

One final note. Most of this has talked about parallelizing the processing of this data. Spring Batch's FlatFileItemWriter is not thread safe. Your best bet would be to write to multiple files in parallel, then aggregate them afterwards if speed is your number one concern.
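
The aggregation step can be as simple as concatenating the part files. Here is a minimal sketch using only the JDK, assuming each part file has no header row and ends with a newline; CsvPartMerger and merge are hypothetical names.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: concatenates part files into one CSV file.
final class CsvPartMerger {

    /** Appends each part to target in list order; target is created or truncated. */
    static void merge(List<Path> parts, Path target) throws IOException {
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                // Streams the part's bytes without loading the file into memory.
                Files.copy(part, out);
            }
        }
    }
}
```

The order of the parts list determines the order of rows in the merged file, so this only preserves input order if the parts themselves were split in order.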
