Postgresql batch insert or ignore
Question
I have the responsibility of switching our code from sqlite to postgres. One of the queries I am having trouble with is copied below.
INSERT INTO group_phones(group_id, phone_name)
SELECT g.id, p.name
FROM phones AS p, groups as g
WHERE g.id IN ($add_groups) AND p.name IN ($phones);
The problem arises when there is a duplicate record. In this table the combination of both values must be unique. I have used a few plpgsql functions in other places to do update-or-insert operations, but in this case I may insert several rows at once, and I am not sure how to write a stored routine for that. Thanks for all the help from all the SQL gurus out there!
Answer
There are 3 challenges:
1. Your query has no JOIN condition between the tables phones and groups, making this effectively a limited CROSS JOIN - which you most probably do not intend. I.e. every phone that qualifies is combined with every group that qualifies. If you have 100 phones and 100 groups, that's already 10,000 combinations.
2. Insert distinct combinations of (group_id, phone_name).
3. Avoid inserting rows that are already there in table group_phones.
All things considered, it could look like this:
INSERT INTO group_phones (group_id, phone_name)
SELECT i.id, i.name
FROM  (
   SELECT DISTINCT g.id, p.name     -- get distinct combinations
   FROM   phones p
   JOIN   groups g ON ??how are p & g connected??
   WHERE  g.id IN ($add_groups)
   AND    p.name IN ($phones)
   ) i
LEFT  JOIN group_phones gp ON (gp.group_id, gp.phone_name) = (i.id, i.name)
WHERE gp.group_id IS NULL            -- avoid duping existing rows
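As an aside: on PostgreSQL 9.5 or later, the "ignore duplicates" part can be delegated to the database itself with ON CONFLICT DO NOTHING, provided a unique constraint or index exists on (group_id, phone_name). A sketch, reusing the same placeholder join condition from above:

```sql
INSERT INTO group_phones (group_id, phone_name)
SELECT DISTINCT g.id, p.name
FROM   phones p
JOIN   groups g ON ??how are p & g connected??
WHERE  g.id IN ($add_groups)
AND    p.name IN ($phones)
ON     CONFLICT (group_id, phone_name) DO NOTHING;  -- skips rows violating the unique constraint
```

This also sidesteps the race condition discussed below, since the conflict check and the insert happen atomically.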
Concurrency
This form minimizes the chance of a race condition with concurrent write operations. If your table has a heavy concurrent write load, you may want to lock the table exclusively or use serializable transaction isolation. This safeguards against the extremely unlikely case that a row is altered by a concurrent transaction in the tiny time slot between the constraint verification (row isn't there) and the write operation in the query.
BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT ...
COMMIT;
Be prepared to repeat the transaction if it rolls back with a serialization error. For more on that topic, good starting points could be this blog post by @depesz or this related question on SO.
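The retry itself has to happen on the client side. A minimal sketch of the pattern in Python, assuming a hypothetical run_transaction callable that wraps BEGIN ... COMMIT and raises on a serialization failure (SQLSTATE 40001 in a real driver); the demo uses a stub instead of a live database:

```python
class SerializationFailure(Exception):
    """Stand-in for the driver's serialization-failure error (SQLSTATE 40001)."""

def retry_serializable(run_transaction, max_attempts=5):
    """Run a transactional callable, retrying when it raises SerializationFailure.

    run_transaction is a hypothetical callable performing the whole
    BEGIN ISOLATION LEVEL SERIALIZABLE ... COMMIT block.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_transaction()
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # give up after max_attempts tries

# Demo: a stub transaction that fails twice, then succeeds.
attempts = {"n": 0}
def fake_transaction():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SerializationFailure
    return "committed"

print(retry_serializable(fake_transaction))  # prints: committed
```

In a real application the callable would re-run the whole INSERT, not just re-issue COMMIT, since the rolled-back transaction's work is lost.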
Normally, though, you needn't even bother with any of this.
LEFT JOIN tbl ON right_col = left_col WHERE right_col IS NULL
is generally the fastest method when the columns in the right table are distinct. If you have dupes in the column (especially if there are many),
WHERE NOT EXISTS (SELECT 1 FROM tbl WHERE right_col = left_col)
may be faster, because it can stop scanning as soon as the first matching row is found.
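For completeness, a sketch of the same query rewritten with NOT EXISTS instead of the anti-join (same placeholder join condition as before):

```sql
INSERT INTO group_phones (group_id, phone_name)
SELECT DISTINCT g.id, p.name
FROM   phones p
JOIN   groups g ON ??how are p & g connected??
WHERE  g.id IN ($add_groups)
AND    p.name IN ($phones)
AND    NOT EXISTS (
   SELECT 1
   FROM   group_phones gp
   WHERE  gp.group_id = g.id
   AND    gp.phone_name = p.name   -- skip combinations already present
   );
```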
You can also use IN, like @dezso demonstrates, but it is usually slower in PostgreSQL.