Postgresql table with one ID column, sorted index, with duplicate primary key


Question

I want to use a PostgreSQL table as a kind of work queue for documents. Each document has an ID and is stored in another, normal table with lots of additional columns. But this question is about creating the table for the work queue.

I want to create a table for this queue without OIDs with just one column: The ID of the document as integer. If an ID of a document exists in this work queue table, it means that the document with that ID is dirty and some processing has to be done. The extra table shall avoid the VACUUM and dead tuple problems and deadlocks with transactions that would emerge if there was just a dirty bit on each document entry in the main document table.

Many parts of my system would mark documents as dirty and therefore insert IDs to process into that table. These inserts would be for many IDs in one transaction. I don't want to use any kind of nested transactions and there doesn't seem to be any kind of INSERT IF NOT EXISTS command. I'd rather have duplicate IDs in the table. Therefore duplicates must be possible for the only column in that table.
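
As an aside, this question predates it, but PostgreSQL 9.5 added `INSERT ... ON CONFLICT DO NOTHING`, which is exactly the missing `INSERT IF NOT EXISTS`. It does require a unique constraint or primary key on the column, and it still probes that unique index on every insert, so it does not satisfy the "no query on insert" wish below. A sketch, assuming a `dirty_documents(document_id)` table with a primary key:

```sql
-- PostgreSQL 9.5+ only. Requires a unique constraint or primary
-- key on document_id; conflicting rows are silently skipped, so
-- bulk inserts never fail with duplicate-key errors.
INSERT INTO dirty_documents (document_id)
VALUES (42), (43), (44)
ON CONFLICT (document_id) DO NOTHING;
```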

The process which processes the work queue will delete all processed IDs and thereby take care of duplicates. (BTW: There is another queue for the next step, so regarding race conditions the idea should be clean and have no problem)

But also I want the documents to be processed in order: Always shall documents with smaller IDs be processed first.

Therefore I want to have an index which aids LIMIT and ORDER BY on the ID column, the only column in the workqueue table. Ideally given that I have only one column, this should be the primary key. But the primary key must not have duplicates, so it seems I can't do that.
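
With a btree index on the ID column, the consumer's ordered fetch is the textbook case that index serves. A sketch of the kind of query this enables (table and column names assumed):

```sql
-- Reads only the leading index entries instead of scanning and
-- sorting the whole table.
SELECT document_id
FROM dirty_documents
ORDER BY document_id
LIMIT 100;
```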

Without the index, ORDER BY and LIMIT would be slow.

I could add a normal, secondary index on that column. But I fear PostgreSQL would add a second file on disc (PostgreSQL does that for every additional index) and use the double amount of disc operations for that table.

What is the best thing to do? Add a dummy column with something random (like the OID) in order to make the primary key not complain about duplicates? Must I waste that space in my queue table?

Or is adding the second index harmless, would it become kind of the primary index which is directly in the primary tuple btree?

Shall I delete everything above this and just leave the following? The original question is distracting and contains too much unrelated information.

I want to have a table in PostgreSQL with these properties:


  • One column with an integer
  • Allow duplicates
  • Efficient ORDER BY+LIMIT on the column
  • INSERTs should not do any query in that table or in any kind of unique index. INSERTs shall just locate the best page in the main file/main btree for this table and insert the row in between other rows, ordered by ID.
  • INSERTs will happen in bulk and must not fail, except for disc full, etc.
  • There shall not be additional btree files for this table, so no secondary indexes
  • The rows should occupy not much space, e.g. have no OIDs

I cannot think of a solution that solves all of this.

My only solution would compromise on the last bullet point: Add a PRIMARY KEY covering the integer and also a dummy column, like OIDs, a timestamp or a SERIAL.
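
A minimal sketch of that compromise, using a `bigserial` as the dummy tiebreaker (the names are illustrative, not from the original post):

```sql
-- document_id carries the real data; dummy exists only to make the
-- composite primary key unique, at a cost of 8 bytes per row.
CREATE TABLE dirty_documents (
    document_id integer NOT NULL,
    dummy       bigserial,
    PRIMARY KEY (document_id, dummy)
);
```

The composite btree still sorts on document_id first, so `ORDER BY document_id LIMIT n` can use it.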

Another solution would use a hypothetical INSERT IF NOT EXISTS, nested transactions, or a special INSERT with a WHERE. All of these would add a btree lookup on insert, and they might also cause deadlocks.

(Also posted here: https://dba.stackexchange.com/q/45126/7788)

Answer

You said:

Many parts of my system would mark documents as dirty and therefore insert IDs to process into that table. Therefore duplicates must be possible.


5 rows with the same ID mean the same thing as 1 or 10 rows with that same ID: They mean that the document with that ID is dirty.

You don't need duplicates for that. If the only purpose of this table is to identify dirty documents, a single row containing the document's id number is sufficient. There's no compelling reason to allow duplicates.

A single row for each ID number is not sufficient if you need to track which process inserted that row, or order rows by the time they were inserted, but a single column isn't sufficient for that in the first place. So I'm sure a primary key constraint or unique constraint would work fine for you.

Other processes have to ignore duplicate key errors, but that's simple. Those processes have to trap errors anyway--there are a lot of things besides a duplicate key that can prevent an insert statement from succeeding.
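
One way to swallow the duplicate-key error server-side is a small PL/pgSQL wrapper (a sketch; the function name is made up):

```sql
CREATE OR REPLACE FUNCTION mark_dirty(p_id integer) RETURNS void AS $$
BEGIN
    INSERT INTO dirty_documents (document_id) VALUES (p_id);
EXCEPTION
    WHEN unique_violation THEN
        NULL;  -- already marked dirty: nothing to do
END;
$$ LANGUAGE plpgsql;
```

Note that the EXCEPTION clause starts a subtransaction on every call, which adds some overhead.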

An implementation that allows duplicates . . .

create table dirty_documents (
  document_id integer not null
);

create index on dirty_documents (document_id);

Insert 100k ID numbers into that table for testing. This will necessarily require updating the index. (Duh.) Include a bunch of duplicates.

insert into dirty_documents 
select generate_series(1,100000);

insert into dirty_documents
select generate_series(1, 100);

insert into dirty_documents
select generate_series(1, 50);

insert into dirty_documents
select generate_series(88000, 93245);

insert into dirty_documents
select generate_series(83000, 87245);

Took less than a second on my desktop, which isn't anything special, and which is running three different database servers, two web servers, and playing a Rammstein CD.

Pick the first dirty document ID number for cleaning up.

select min(document_id) 
from dirty_documents; 

min
--
1

Took only 0.136 ms. Now let's delete every row that has document ID 1.

delete from dirty_documents
where document_id = 1; 

Took 0.272 ms.

Let's start over.

drop table dirty_documents;
create table dirty_documents (
  document_id integer primary key
);

insert into dirty_documents 
select generate_series(1,100000); 

Took 500 ms. Let's find the first one again.

select min(document_id) 
from dirty_documents; 

Took .054 ms. That's about half the time it took using a table that allowed duplicates.

delete from dirty_documents
where document_id = 1;

Also took .054 ms. That's roughly 50 times faster than the other table.
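
As an aside, the pick-and-delete pair used above can be combined into a single statement (a sketch, not part of the original answer):

```sql
-- Deletes the smallest dirty ID and hands it back to the worker.
DELETE FROM dirty_documents
WHERE document_id = (SELECT min(document_id) FROM dirty_documents)
RETURNING document_id;
```

With several concurrent consumers this alone does not prevent two workers from racing for the same row, but per the question there is a downstream queue that makes that harmless.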

Let's start over again, and try an unindexed table.

drop table dirty_documents;
create table dirty_documents (
  document_id integer not null
);

insert into dirty_documents 
select generate_series(1,100000);

insert into dirty_documents
select generate_series(1, 100);

insert into dirty_documents
select generate_series(1, 50);

insert into dirty_documents
select generate_series(88000, 93245);

insert into dirty_documents
select generate_series(83000, 87245);

Get the first document.

select min(document_id) 
from dirty_documents; 

Took 32.5 ms. Delete those documents . . .

delete from dirty_documents
where document_id = 1;

Took 12 ms.

All of this took me 12 minutes. (I used a stopwatch.) If you want to know what performance will be, build tables and write tests.
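
EXPLAIN ANALYZE is the standard tool for exactly that kind of measurement:

```sql
-- Shows the chosen plan plus actual row counts and timings.
EXPLAIN ANALYZE
SELECT min(document_id)
FROM dirty_documents;
```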
