Why is an implicit table lock being released prior to end of transaction in RedShift?


Problem description

I have an ETL process that is building dimension tables incrementally in RedShift. It performs actions in the following order:

  1. Begins transaction
  2. Creates a table staging_foo like foo
  3. Copies data from external source into staging_foo
  4. Performs mass insert/update/delete on foo so that it matches staging_foo
  5. Drops staging_foo
  6. Commits transaction
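
The six steps above can be sketched as a sequence of SQL statements. This is only an illustration, not the actual ETL code: the helper `build_refresh_statements`, the key column `id`, the S3 URI, and the IAM role are all hypothetical, and step 4 is reduced to a delete-then-insert merge, which is one common way to apply staged rows (it does not remove rows that are missing from staging_foo).

```python
def build_refresh_statements(target, staging, source_uri, iam_role, key="id"):
    """Return the SQL statements for one incremental refresh, in order.

    All arguments are illustrative; `key` is an assumed join column.
    """
    return [
        "BEGIN;",
        # Step 2: staging table with the same column definitions as the target.
        f"CREATE TABLE {staging} (LIKE {target});",
        # Step 3: bulk load from the external source.
        f"COPY {staging} FROM '{source_uri}' IAM_ROLE '{iam_role}';",
        # Step 4 (simplified): drop the rows being replaced, then insert the new versions.
        f"DELETE FROM {target} USING {staging} "
        f"WHERE {target}.{key} = {staging}.{key};",
        f"INSERT INTO {target} SELECT * FROM {staging};",
        # Steps 5 and 6.
        f"DROP TABLE {staging};",
        "COMMIT;",
    ]


if __name__ == "__main__":
    statements = build_refresh_statements(
        "foo",
        "staging_foo",
        "s3://example-bucket/foo/",                      # hypothetical source
        "arn:aws:iam::123456789012:role/example-role",   # hypothetical role
    )
    for statement in statements:
        print(statement)
```

Every statement between `BEGIN` and `COMMIT` runs in one transaction, so a second instance running the same sequence overlaps with the first unless something serializes them.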

Individually this process works, but in order to achieve continuous streaming refreshes to foo and redundancy in the event of failure, I have several instances of the process running at the same time. When that happens I occasionally get concurrent serialization errors. This is because both processes are replaying some of the same changes to foo from staging_foo in overlapping transactions.

What happens is that the first process creates the staging_foo table, and the second process is blocked when it attempts to create a table with the same name (this is what I want). When the first process commits its transaction (which can take several seconds) I find that the second process gets unblocked before the commit is complete. So it appears to be getting a snapshot of the foo table before the commit is in place, which causes the inserts/updates/deletes (some of which may be redundant) to fail.

I am theorizing based on the documentation http://docs.aws.amazon.com/redshift/latest/dg/c_serial_isolation.html where it says:

Concurrent transactions are invisible to each other; they cannot detect each other's changes. Each concurrent transaction will create a snapshot of the database at the beginning of the transaction. A database snapshot is created within a transaction on the first occurrence of most SELECT statements, DML commands such as COPY, DELETE, INSERT, UPDATE, and TRUNCATE, and the following DDL commands:

ALTER TABLE (to add or drop columns)

CREATE TABLE

DROP TABLE

TRUNCATE TABLE

The documentation quoted above is somewhat confusing to me because it first says a snapshot will be created at the beginning of a transaction, but subsequently says a snapshot will be created only at the first occurrence of some specific DML/DDL operations.

I do not want to do a deep copy where I replace foo instead of incrementally updating it. I have other processes that continually query this table so there is never a time when I can replace it without interruption. Another question asks a similar question for deep copy but it will not work for me: How can I ensure synchronous DDL operations on a table that is being replaced?

Is there a way for me to perform my operations in a way that I can avoid concurrent serialization errors? I need to ensure that read access is available for foo so I can't LOCK that table.

Solution

OK, Postgres (and therefore Redshift [more or less]) uses MVCC (Multi Version Concurrency Control) for transaction isolation instead of a db/table/row/page locking model (as seen in SQL Server, MySQL, etc.). Put simply, every transaction operates on the data as it existed when the transaction started.

So your comment "I have several instances of the process running at the same time" explains the problem. If Process 2 starts while Process 1 is running then Process 2 has no visibility of the results from Process 1.
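
The visibility rule described here can be illustrated with a toy model. This is a deliberately simplified sketch, not how Redshift or Postgres is implemented: the classes `ToyStore` and `ToyTransaction` are hypothetical, the snapshot is taken when the transaction begins (following the simplification above), and there is no conflict detection or serialization check.

```python
class ToyStore:
    """Toy versioned store: holds only the latest committed rows."""

    def __init__(self, rows):
        self.committed = dict(rows)

    def begin(self):
        return ToyTransaction(self)


class ToyTransaction:
    """Toy transaction that reads from a private snapshot."""

    def __init__(self, store):
        self.store = store
        # Copy of the committed state as it existed at transaction start.
        self.snapshot = dict(store.committed)
        self.writes = {}

    def read(self, key):
        # Own writes are visible; otherwise fall back to the snapshot.
        return self.writes.get(key, self.snapshot.get(key))

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Publish the writes; snapshots already taken are not updated.
        self.store.committed = {**self.store.committed, **self.writes}


if __name__ == "__main__":
    store = ToyStore({"a": 1})
    process_1 = store.begin()
    process_2 = store.begin()      # starts while process_1 is still running
    process_1.write("a", 2)
    process_1.commit()
    print(process_2.read("a"))     # 1: process_1's commit is not visible
    print(store.begin().read("a")) # 2: a transaction started afterwards sees it
```

In this model Process 2 keeps working against the state from before Process 1 committed, which is the situation in which both instances try to replay the same changes to foo.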
