Bulk/batch update/upsert in PostgreSQL

Problem description

I'm writing a Django-ORM enhancement that attempts to cache models and postpone model saving until the end of the transaction. It's almost all done; however, I came across an unexpected difficulty in SQL syntax.

I'm not much of a DBA, but from what I understand, databases don't really work efficiently for many small queries; a few bigger queries are much better. For example, it's better to use large batch inserts (say, 100 rows at once) instead of 100 one-liners.
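For instance, 100 one-row INSERTs can be collapsed into a single multi-row INSERT. A minimal sketch of building such a statement (plain Python with psycopg2-style `%s` placeholders; the table and column names here are hypothetical and assumed to come from trusted code):

```python
def build_multirow_insert(table, cols, rows):
    """Build one INSERT covering all rows, with %s placeholders and a
    flat parameter list to pass to the driver.
    Note: table/column names are interpolated directly into the SQL,
    so they must be trusted identifiers, never user input."""
    row_ph = "(" + ", ".join(["%s"] * len(cols)) + ")"
    sql = (
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES "
        + ", ".join([row_ph] * len(rows))
    )
    params = [value for row in rows for value in row]
    return sql, params

# Example: three rows become one statement instead of three round trips.
sql, params = build_multirow_insert(
    "some_table", ["id", "some_col"],
    [(1, "first"), (2, "second"), (3, "third")],
)
```

The `(sql, params)` pair can then be handed to a single `cursor.execute(sql, params)` call.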

Now, from what I can see, SQL doesn't really supply any statement to perform a batch update on a table. The term seems to be confusing, so I'll explain what I mean by that. I have an array of arbitrary data, each entry describing a single row in a table. I'd like to update certain rows in the table, each using data from its corresponding entry in the array. The idea is very similar to a batch insert.

For example: My table could have two columns "id" and "some_col". Now the array describing the data for a batch update consists of three entries (1, 'first updated'), (2, 'second updated'), and (3, 'third updated'). Before the update the table contains rows: (1, 'first'), (2, 'second'), (3, 'third').
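In PostgreSQL, this kind of batch update can be expressed as a single `UPDATE ... FROM (VALUES ...)` join (PostgreSQL-specific syntax, available well before 8.4). A hedged sketch that builds that statement for the `id`/`some_col` example above (identifier arguments assumed trusted):

```python
def build_batch_update(table, key_col, val_col, rows):
    """Build a single UPDATE ... FROM (VALUES ...) statement so that
    every (key, new_value) pair in `rows` is applied in one round trip.
    Identifier arguments are interpolated into the SQL and must come
    from trusted code."""
    values = ", ".join(["(%s, %s)"] * len(rows))
    sql = (
        f"UPDATE {table} AS t SET {val_col} = v.{val_col} "
        f"FROM (VALUES {values}) AS v({key_col}, {val_col}) "
        f"WHERE t.{key_col} = v.{key_col}"
    )
    params = [p for row in rows for p in row]
    return sql, params
```

Depending on the driver and the column types, the placeholders in the VALUES list may need explicit casts (e.g. `%s::int`) so PostgreSQL can infer the types of the constructed columns.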

I've seen this post:

Why are batch inserts/updates faster? How do batch updates work?

It seems to do what I want; however, I can't really figure out the syntax at the end.

I could also delete all the rows that require updating and reinsert them using a batch insert; however, I find it hard to believe that this would actually perform any better.

I work with PostgreSQL 8.4, so some stored procedures are also possible here. However, as I plan to open-source the project eventually, any more portable ideas or ways to do the same thing on a different RDBMS are most welcome.

Follow up question: How to do a batch "insert-or-update"/"upsert" statement?
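On PostgreSQL 8.4 there is no native upsert (`INSERT ... ON CONFLICT` only arrived in 9.5), so one common workaround is a batch UPDATE followed by an `INSERT ... SELECT ... WHERE NOT EXISTS` for the rows the update missed, both run in one transaction. A rough sketch of building that statement pair (hypothetical helper; not race-free without locking or retrying):

```python
def build_batch_upsert(table, key_col, val_col, rows):
    """Return two (sql, params) pairs implementing a pre-9.5 upsert:
    1) update the rows whose key already exists,
    2) insert the rows the update did not touch.
    Run both in one transaction. Concurrent writers can still race
    between the two statements, so real code should lock the table or
    retry on a unique-violation error."""
    values = ", ".join(["(%s, %s)"] * len(rows))
    params = [p for row in rows for p in row]
    update = (
        f"UPDATE {table} AS t SET {val_col} = v.{val_col} "
        f"FROM (VALUES {values}) AS v({key_col}, {val_col}) "
        f"WHERE t.{key_col} = v.{key_col}"
    )
    insert = (
        f"INSERT INTO {table} ({key_col}, {val_col}) "
        f"SELECT v.{key_col}, v.{val_col} "
        f"FROM (VALUES {values}) AS v({key_col}, {val_col}) "
        f"WHERE NOT EXISTS (SELECT 1 FROM {table} t "
        f"WHERE t.{key_col} = v.{key_col})"
    )
    return (update, params), (insert, params)
```

On PostgreSQL 9.5 and later, the same effect is a single `INSERT ... ON CONFLICT (key) DO UPDATE`, which is both shorter and race-free.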

Test results

I've performed 100 passes of 10 insert operations each, spread over 4 different tables (so 1,000 inserts in total). I tested on Django 1.3 with a PostgreSQL 8.4 backend.

Here are the results:

  • All operations done through the Django ORM: ~2.45 seconds per pass,
  • The same operations, but without the Django ORM: ~1.48 seconds per pass,
  • Only insert operations, without querying the database for sequence values: ~0.72 seconds,
  • Only insert operations, executed in blocks of 10 (100 blocks in total): ~0.19 seconds,
  • Only insert operations in one big execution block: ~0.13 seconds,
  • Only insert operations, about 250 statements per block: ~0.12 seconds.

Conclusion: execute as many operations as possible in a single connection.execute(). Django itself introduces a substantial overhead.

Disclaimer: I didn't introduce any indices apart from default primary key indices, so insert operations could possibly run faster because of that.

Answer

I've used 3 strategies for batch transactional work:

  1. Generate SQL statements on the fly, concatenate them with semicolons, and then submit the statements in one shot. I've done up to 100 inserts in this way, and it was quite efficient (done against Postgres).
  2. JDBC has batching capabilities built in, if configured. If you generate transactions, you can flush your JDBC statements so that they transact in one shot. This tactic requires fewer database calls, as the statements are all executed in one batch.
  3. Hibernate also supports JDBC batching along the lines of the previous example, but in this case you execute a flush() method against the Hibernate Session, not the underlying JDBC connection. It accomplishes the same thing as JDBC batching.
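Strategy 1 can be sketched in a few lines (Python here rather than Java; the rows are hypothetical, and values are inlined as literals purely for illustration, since bind parameters cannot span a multi-statement string — real code must escape or quote values properly):

```python
# Hypothetical rows to update; real code must escape values instead of
# formatting them into the SQL (shown this way only for illustration).
rows = [(1, "first updated"), (2, "second updated"), (3, "third updated")]

statements = [
    "UPDATE some_table SET some_col = '%s' WHERE id = %d" % (text, pk)
    for pk, text in rows
]

# One round trip: join with semicolons and submit in a single call.
batch_sql = "; ".join(statements)
```

Most PostgreSQL drivers accept such a multi-statement string in a single execute call, which is what makes this strategy cheap in round trips.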

Incidentally, Hibernate also supports a batching strategy in collection fetching. If you annotate a collection with @BatchSize, when fetching associations, Hibernate will use IN instead of =, leading to fewer SELECT statements to load up the collections.
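The effect of @BatchSize can be illustrated outside Hibernate: instead of one SELECT per key (`... = %s`), keys are grouped and each group is fetched with `IN`. A small sketch (Python, hypothetical table and column names):

```python
def batched_in_selects(table, key_col, keys, batch_size=10):
    """Group keys into chunks and build one SELECT ... IN (...) per
    chunk, instead of one SELECT per key. With 25 keys and a batch
    size of 10, this yields 3 queries rather than 25."""
    queries = []
    for i in range(0, len(keys), batch_size):
        chunk = keys[i:i + batch_size]
        placeholders = ", ".join(["%s"] * len(chunk))
        queries.append(
            (f"SELECT * FROM {table} WHERE {key_col} IN ({placeholders})",
             chunk)
        )
    return queries
```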
