How can I speed up update/replace operations in PostgreSQL?

Question

We have a rather specific application that uses PostgreSQL 8.3 as a storage backend (using Python and psycopg2). The operations we perform to the important tables are in the majority of cases inserts or updates (rarely deletes or selects).

For sanity reasons we have created our own Data Mapper-like layer that works reasonably well, but it has one big bottleneck, the update performance. Of course, I'm not expecting the update/replace scenario to be as speedy as the 'insert to an empty table' one, but it would be nice to get a bit closer.

Note that this system is free from concurrent updates

We always set all the fields of each rows on an update, which can be seen in the terminology where I use the word 'replace' in my tests. I've so far tried two approaches to our update problem:


  1. Create a replace() procedure that takes an array of rows to update (a sketch of invoking it from psycopg2 follows this list):

-- Update each element of the passed item[] array by primary key.
CREATE OR REPLACE FUNCTION replace_item(data item[]) RETURNS VOID AS $$
BEGIN
    FOR i IN COALESCE(array_lower(data,1),0) .. COALESCE(array_upper(data,1),-1) LOOP
        UPDATE item SET a0=data[i].a0, a1=data[i].a1, a2=data[i].a2 WHERE key=data[i].key;
    END LOOP;
END;
$$ LANGUAGE plpgsql;


  2. Create an insert_or_replace rule so that everything but the occasional delete becomes multi-row inserts:

    CREATE RULE "insert_or_replace" AS
        ON INSERT TO "item"
        WHERE EXISTS(SELECT 1 FROM item WHERE key=NEW.key)
        DO INSTEAD
            (UPDATE item SET a0=NEW.a0,a1=NEW.a1,a2=NEW.a2 WHERE key=NEW.key);
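
    For reference, a minimal sketch of how the replace_item() procedure might be
    invoked from psycopg2. The composite literals and connection settings are
    illustrative; in the real test script the argument string would be assembled
    from each batch with adapt():

        import psycopg2

        conn = psycopg2.connect("dbname=test")  # hypothetical connection settings
        cur = conn.cursor()

        # Each array element is a composite text literal cast to the table's row
        # type, so the whole argument has type item[].
        cur.execute(
            "SELECT replace_item(ARRAY["
            "'(00:00:00:00:00:01,v0,v1,v2)'::item,"
            "'(00:00:00:00:00:02,v3,v4,v5)'::item"
            "])"
        )
        conn.commit()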
    


    Both of these speed up the updates a fair bit, although the latter slows down inserts somewhat:

    Multi-row insert           : 50000 items inserted in  1.32 seconds averaging 37807.84 items/s
    executemany() update       : 50000 items updated  in 26.67 seconds averaging  1874.57 items/s
    update_andres              : 50000 items updated  in  3.84 seconds averaging 13028.51 items/s
    update_merlin83 (i/d/i)    : 50000 items updated  in  1.29 seconds averaging 38780.46 items/s
    update_merlin83 (i/u)      : 50000 items updated  in  1.24 seconds averaging 40313.28 items/s
    replace_item() procedure   : 50000 items replaced in  3.10 seconds averaging 16151.42 items/s
    Multi-row insert_or_replace: 50000 items inserted in  2.73 seconds averaging 18296.30 items/s
    Multi-row insert_or_replace: 50000 items replaced in  2.02 seconds averaging 24729.94 items/s
    

    Some random notes about the test runs:

    • All tests are run on the same computer as the database resides; connecting to localhost.
    • Inserts and updates are applied to the database in batches of 500 items, each sent in its own transaction (UPDATED); a sketch of sending such a batch appears after the table definition below.
    • All update/replace tests used the same values as were already in the database.
    • All data was escaped using the psycopg2 adapt() function.
    • All tables are truncated and vacuumed before use (ADDED, in previous runs only truncation happened)
    • The table looks like this:

    CREATE TABLE item (
        key MACADDR PRIMARY KEY,
        a0 VARCHAR,
        a1 VARCHAR,
        a2 VARCHAR
    )
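
    A minimal sketch of how one of those 500-item batches might be turned into a
    single multi-row INSERT with values escaped by psycopg2 (mogrify() is built on
    the same adapt() machinery mentioned above). The helper name is illustrative,
    not the actual perftest.py code:

        def insert_batch(cur, rows):
            # rows: list of (key, a0, a1, a2) tuples, e.g. one batch of 500 items.
            # mogrify() escapes each value; the whole batch becomes one multi-row
            # INSERT. decode() assumes a UTF-8-compatible client encoding.
            values = b",".join(cur.mogrify("(%s,%s,%s,%s)", row) for row in rows).decode()
            cur.execute("INSERT INTO item (key, a0, a1, a2) VALUES " + values)

    Called once per batch, followed by conn.commit(), so that every batch runs in
    its own transaction as described above.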
    


    So, the real question is: How can I speed up update/replace operations a bit more? (I think these findings might be 'good enough', but I don't want to give up without tapping the SO crowd :)

    Also, anyone's hints towards a more elegant replace_item(), or evidence that my tests are completely broken, would be most welcome.

    The test script is available here if you'd like to attempt to reproduce. Remember to check it first though...it WorksForMe, but...

    You will need to edit the db.connect() line to suit your setup.

    EDIT

    Thanks to andres in #postgresql @ freenode I have another test with a single-query update; much like a multi-row insert (listed as update_andres above).

    UPDATE item
    SET a0=i.a0, a1=i.a1, a2=i.a2 
    FROM (VALUES ('00:00:00:00:00:01', 'v0', 'v1', 'v2'), 
                 ('00:00:00:00:00:02', 'v3', 'v4', 'v5'),
                 ...
          ) AS i(key, a0, a1, a2)
    WHERE item.key=i.key::macaddr
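
    A hedged sketch of how that statement might be assembled per batch from
    psycopg2; the helper name is illustrative:

        def update_batch(cur, rows):
            # rows: list of (key, a0, a1, a2) tuples. Values are escaped by
            # mogrify() and joined into one UPDATE ... FROM (VALUES ...) statement;
            # the key column is cast back to macaddr for the join, as in the query
            # above.
            values = b",".join(cur.mogrify("(%s,%s,%s,%s)", row) for row in rows).decode()
            cur.execute(
                "UPDATE item SET a0=i.a0, a1=i.a1, a2=i.a2 "
                "FROM (VALUES " + values + ") AS i(key, a0, a1, a2) "
                "WHERE item.key = i.key::macaddr"
            )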
    

    EDIT

    Thanks to merlin83 in #postgresql @ freenode and jug/jwp below I have another test with an insert-to-temp/delete/insert approach (listed as "update_merlin83 (i/d/i)" above).

    INSERT INTO temp_item (key, a0, a1, a2)
        VALUES ('00:00:00:00:00:01', 'v0', 'v1', 'v2'),
               ('00:00:00:00:00:02', 'v3', 'v4', 'v5'),
               ...;
    
    DELETE FROM item
    USING temp_item
    WHERE item.key=temp_item.key;
    
    INSERT INTO item (key, a0, a1, a2)
        SELECT key, a0, a1, a2
        FROM temp_item;
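
    The question doesn't show how temp_item is created; presumably something along
    these lines, run once up front (the exact definition is an assumption, the
    point being a staging table shaped like item but without constraints):

        # Assumed staging-table setup: same columns as item, no primary key,
        # cleared (e.g. with TRUNCATE) before each batch.
        cur.execute(
            "CREATE TEMP TABLE temp_item ("
            "  key MACADDR,"
            "  a0  VARCHAR,"
            "  a1  VARCHAR,"
            "  a2  VARCHAR"
            ")"
        )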
    

    My gut feeling is that these tests are not very representative of real-world performance, but I think the differences are large enough to give an indication of the most promising approaches for further investigation. The perftest.py script contains all the updates as well, for those of you who want to check it out. It's fairly ugly though, so don't forget your goggles :)

    EDIT

    andres in #postgresql @ freenode pointed out that I should test with an insert-to-temp/update variant (listed as "update_merlin83 (i/u)" above).

    INSERT INTO temp_item (key, a0, a1, a2)
        VALUES ('00:00:00:00:00:01', 'v0', 'v1', 'v2'),
               ('00:00:00:00:00:02', 'v3', 'v4', 'v5'),
               ...;
    
    UPDATE item
    SET a0=temp_item.a0, a1=temp_item.a1, a2=temp_item.a2
    FROM temp_item
    WHERE item.key=temp_item.key
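
    Putting the temp-table variants together, a per-batch driver might look roughly
    like this (helper name and control flow are illustrative, not the actual
    perftest.py code):

        def replace_batch_via_temp(cur, rows, variant="i/u"):
            # Fill the staging table with one batch, merge it into item using
            # either the delete/insert or the update variant shown above, then
            # clear the staging table for the next batch.
            values = b",".join(cur.mogrify("(%s,%s,%s,%s)", row) for row in rows).decode()
            cur.execute("INSERT INTO temp_item (key, a0, a1, a2) VALUES " + values)
            if variant == "i/d/i":
                cur.execute("DELETE FROM item USING temp_item WHERE item.key = temp_item.key")
                cur.execute("INSERT INTO item (key, a0, a1, a2) "
                            "SELECT key, a0, a1, a2 FROM temp_item")
            else:  # "i/u"
                cur.execute("UPDATE item SET a0 = temp_item.a0, a1 = temp_item.a1, "
                            "a2 = temp_item.a2 FROM temp_item WHERE item.key = temp_item.key")
            cur.execute("TRUNCATE temp_item")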
    

    EDIT

    Probably final edit: I changed my script to match our load scenario better, and it seems the numbers hold even when scaling things up a bit and adding some randomness. If anyone gets very different numbers from some other scenario I'd be interested in knowing about it.

    Answer

    The usual way I do these things in pg is: load the raw data matching the target table into a temp table (no constraints) using COPY, merge (the fun part), profit.
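
    A minimal, hedged sketch of that copy/merge flow with psycopg2's copy_from();
    the merge step here is a plain delete-then-insert, just one way to do "the fun
    part", and is not the merge_by_key module linked below:

        import io

        def merge_batch_via_copy(conn, rows):
            # Stage the raw rows with COPY (fast bulk load into the unconstrained
            # temp_item table from the question), then merge them into item.
            # Assumes the values contain no tabs, newlines or backslashes.
            cur = conn.cursor()
            buf = io.StringIO("".join("%s\t%s\t%s\t%s\n" % row for row in rows))
            cur.execute("TRUNCATE temp_item")
            cur.copy_from(buf, "temp_item", columns=("key", "a0", "a1", "a2"))
            cur.execute("DELETE FROM item USING temp_item WHERE item.key = temp_item.key")
            cur.execute("INSERT INTO item (key, a0, a1, a2) "
                        "SELECT key, a0, a1, a2 FROM temp_item")
            conn.commit()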

    I wrote a merge_by_key function specifically for these situations:

    http://mbk.projects.postgresql.org/

    The docs aren't terribly friendly, but I'd suggest giving it a good look.
