Load balancing SQL reads while batch-processing?

Problem description

Given an SQL table with timestamped records: every once in a while, an application App0 does something like foreach record in since(certainTimestamp) do process(record); commitOffset(record.timestamp). That is, it periodically consumes a batch of "fresh" data, processes it sequentially, commits success after each record, and then sleeps for a reasonable time (to accumulate another batch). That works perfectly with a single instance. But how do I load balance multiple instances?

In exactly the same environment, App0 and App1 concurrently compete for the fresh data. The idea is that the read query executed by App0 must not overlap with the same read query executed by App1, so that they never try to process the same item. In other words, I need SQL-based guarantees that concurrent read queries return different data. Is that even possible?

P.S. Postgres is the preferred option.

Recommended answer

The problem description is rather vague on what App1 should do while App0 is processing the previously selected records.
In this answer, I make the following assumptions:

  • All Apps somehow know what the last certainTimestamp is, and it is the same for all Apps whenever they start a DB query.
  • While App0 is processing, say, the 10 records it found when it started working, new records come in. That means the pile of new records with respect to certainTimestamp grows.
  • When App1 (or any further App) starts, it should process only those new records with respect to certainTimestamp that are not yet being handled by other Apps.
  • Yet, if an App fails or crashes, the unfinished records should be picked up the next time another App runs.

In many SQL databases, this can be achieved by locking records.

One way to go about this is to use

 SELECT ... FOR UPDATE SKIP LOCKED

This statement, combined with the range selection since(certainTimestamp), selects and locks all records that match the condition and are not currently locked. Whenever a new App instance runs this query, it gets only "what's left" to do and can work on that.
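
As a rough illustration of the statement above, here is a minimal Python sketch of one worker pass. It rests on several assumptions that are not part of the original answer: a table named records with columns id, ts and payload, a psycopg2-style DB-API connection with autocommit turned off, and the helper name process_batch. In Postgres the row locks taken by FOR UPDATE last until the transaction ends, so the sketch processes the rows before committing. How progress is recorded (the certainTimestamp update) is deliberately left out.

```python
from typing import Any, Callable

# Hypothetical schema: records(id, ts, payload). Adjust to the real table.
CLAIM_SQL = """
SELECT id, ts, payload
FROM records
WHERE ts > %s
ORDER BY ts
LIMIT %s
FOR UPDATE SKIP LOCKED
"""


def process_batch(
    conn: Any,
    since: Any,
    process: Callable[[Any], None],
    batch_size: int = 100,
) -> int:
    """Lock up to batch_size unlocked rows newer than `since` and process them.

    `conn` is assumed to be a psycopg2-style connection (autocommit off),
    so the SELECT and the processing share one transaction.
    Returns the number of rows handled in this pass.
    """
    cur = conn.cursor()
    try:
        # Rows already locked by another worker are skipped, not waited for.
        cur.execute(CLAIM_SQL, (since, batch_size))
        rows = cur.fetchall()
        for row in rows:
            process(row)
        # Progress would have to be recorded here, before the commit;
        # that part is omitted in this sketch.
        conn.commit()  # ends the transaction and releases the row locks
        return len(rows)
    except Exception:
        # On failure the locks are released and the rows become
        # available to the next worker that runs the query.
        conn.rollback()
        raise
    finally:
        cur.close()
```

If a worker crashes mid-batch, the database ends its transaction and drops the locks, which matches the assumption that unfinished records are picked up by the next App. The LIMIT is an added design choice so that one instance does not lock the whole backlog; it can be dropped to mirror the statement exactly.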

This solves the problem of overlapping reads, i.e. of two Apps working on the same data.

What's left is the definition and update of certainTimestamp. To keep this answer short, I won't go into that here. I'll just point out to the OP that this needs to be thought through properly, to avoid situations where, for example, a single record that cannot be processed for some reason keeps certainTimestamp at a permanent minimum.