Bulk/batch update/upsert in PostgreSQL


Question


I'm writing a Django-ORM enhancement that attempts to cache models and postpone model saving until the end of the transaction. It's almost all done, however I came across an unexpected difficulty in SQL syntax.

I'm not much of a DBA, but from what I understand, databases don't really work efficiently with many small queries. A few bigger queries are much better. For example, it's better to use large batch inserts (say 100 rows at once) instead of 100 one-liners.

Now, from what I can see, SQL doesn't really supply any statement to perform a batch update on a table. The term seems to be confusing, so I'll explain what I mean by that. I have an array of arbitrary data, each entry describing a single row in a table. I'd like to update certain rows in the table, each using data from its corresponding entry in the array. The idea is very similar to a batch insert.

For example: my table could have two columns "id" and "some_col". Now the array describing the data for a batch update consists of three entries (1, 'first updated'), (2, 'second updated'), and (3, 'third updated'). Before the update the table contains rows: (1, 'first'), (2, 'second'), (3, 'third').
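In PostgreSQL this kind of batch update can be expressed as a single UPDATE joined against a VALUES list. A minimal sketch using the example table above (the psycopg2 wrapper, connection string, and table definition are assumptions for illustration, not part of the question):

```python
# Batch UPDATE driven by an array of (id, some_col) pairs, sketched with
# psycopg2. Assumes a table created as:
#   CREATE TABLE t (id integer PRIMARY KEY, some_col text);
import psycopg2

rows = [(1, 'first updated'), (2, 'second updated'), (3, 'third updated')]

conn = psycopg2.connect("dbname=test")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # One placeholder pair per row: (%s, %s), (%s, %s), ...
    values = ", ".join(["(%s, %s)"] * len(rows))
    cur.execute(
        "UPDATE t SET some_col = v.some_col "
        "FROM (VALUES " + values + ") AS v (id, some_col) "
        "WHERE t.id = v.id",
        [x for row in rows for x in row],  # flattened parameter list
    )
conn.close()
```

The whole array travels to the server in one statement, and the join against the VALUES list pairs each row with its update data.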

I came across this post:

Why are batch inserts/updates faster? How do batch updates work?

which seems to do what I want; however, I can't really figure out the syntax at the end.

I could also delete all the rows that require updating and reinsert them using a batch insert, however I find it hard to believe that this would actually perform any better.

I work with PostgreSQL 8.4, so some stored procedures are also possible here. However, as I plan to open source the project eventually, any more portable ideas or ways to do the same thing on a different RDBMS are most welcome.

Follow-up question: how to do a batch "insert-or-update"/"upsert" statement?
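On PostgreSQL 9.5 and later, a batch upsert can be written with INSERT ... ON CONFLICT; that syntax did not exist on the 8.4 mentioned above, where the usual workaround was an UPDATE-then-INSERT inside a stored function. A minimal sketch for newer servers, reusing the hypothetical table from the earlier example:

```python
# Batch upsert: insert new rows, update existing ones, in one statement.
# Requires PostgreSQL 9.5+ (ON CONFLICT); not available on 8.4.
import psycopg2
from psycopg2.extras import execute_values

rows = [(1, 'first updated'), (2, 'second updated'), (4, 'brand new')]

conn = psycopg2.connect("dbname=test")  # hypothetical connection string
with conn, conn.cursor() as cur:
    execute_values(
        cur,
        "INSERT INTO t (id, some_col) VALUES %s "
        "ON CONFLICT (id) DO UPDATE SET some_col = EXCLUDED.some_col",
        rows,  # execute_values expands %s into one multi-row VALUES list
    )
conn.close()
```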

Test results

I've performed 100 passes of 10 insert operations spread over 4 different tables (so 1000 inserts in total). I tested on Django 1.3 with a PostgreSQL 8.4 backend.

These are the results:

• All operations done through the Django ORM - each pass ~2.45 seconds,
• The same operations, but done without the Django ORM - each pass ~1.48 seconds,
• Only insert operations, without querying the database for sequence values - ~0.72 seconds,
• Only insert operations, executed in blocks of 10 (100 blocks in total) - ~0.19 seconds,
• Only insert operations, one big execution block - ~0.13 seconds,
• Only insert operations, about 250 statements per block - ~0.12 seconds.

Conclusion: execute as many operations as possible in a single connection.execute(). Django itself introduces a substantial overhead.
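To make the two extremes concrete, here is a sketch of row-at-a-time inserts versus one big execution block; the helper names are illustrative, not from the original post:

```python
# The slow and fast patterns from the measurements above, sketched with
# psycopg2 against the hypothetical table t (id, some_col).
import psycopg2

def insert_one_by_one(cur, rows):
    # 1000 rows -> 1000 server round trips: the pattern the slow passes used.
    for r in rows:
        cur.execute("INSERT INTO t (id, some_col) VALUES (%s, %s)", r)

def insert_one_block(cur, rows):
    # 1000 rows -> a single round trip: one big multi-row INSERT.
    values = ", ".join(["(%s, %s)"] * len(rows))
    cur.execute(
        "INSERT INTO t (id, some_col) VALUES " + values,
        [x for r in rows for x in r],  # flattened parameter list
    )
```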

Disclaimer: I didn't introduce any indices apart from the default primary key indices, so insert operations could possibly run faster because of that.

Solution

I've used 3 strategies for batch transactional work:

1. Generate SQL statements on the fly, concatenate them with semicolons, and then submit the statements in one shot. I've done up to 100 inserts this way, and it was quite efficient (done against Postgres). A Python sketch of this idea follows the list.
2. JDBC has batching capabilities built in, if configured. If you generate transactions, you can flush your JDBC statements so that they transact in one shot. This tactic requires fewer database calls, as the statements are all executed in one batch.
3. Hibernate also supports JDBC batching along the lines of the previous example, but in this case you execute a flush() method against the Hibernate Session, not the underlying JDBC connection. It accomplishes the same thing as JDBC batching.
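Not from the answer itself, but a rough Python analogue of strategy 1 (and of JDBC-style batching) is psycopg2.extras.execute_batch, which joins many parameterized statements together and sends them in pages, cutting server round trips:

```python
# Rough analogue of strategy 1 / JDBC batching: execute_batch concatenates
# statements and sends them in pages of `page_size` per round trip.
import psycopg2
from psycopg2.extras import execute_batch

rows = [(i, "value %d" % i) for i in range(1000)]

conn = psycopg2.connect("dbname=test")  # hypothetical connection string
with conn, conn.cursor() as cur:
    execute_batch(
        cur,
        "INSERT INTO t (id, some_col) VALUES (%s, %s)",
        rows,
        page_size=100,  # statements per batch, like JDBC's executeBatch()
    )
conn.close()
```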

Incidentally, Hibernate also supports a batching strategy in collection fetching. If you annotate a collection with @BatchSize, when fetching associations, Hibernate will use IN instead of =, leading to fewer SELECT statements to load up the collections.
