Adding a multi-column primary key to a table with 40 million records


Problem Description


I'm working on maintaining a database which stores data transfer information between different networks. Essentially, each data transfer is logged and at the end of each month I run a perl script that loads the log files into a table in the database. I did not design the perl script or the database schema. It was done before I started working on this project.


I used this link to retrieve the primary keys of the table (usage_detail is the name of the table) and it gave me nothing. Since there are so many records in the table, it's not very easy to keep track of duplicates. We've had problems where a lot of duplicates were loaded (because of bugs in the script that does the logging for each transfer, but that's for another topic) and ended up having to drop the latest load and reload everything after fixing the log files. As you may have guessed, this is stupid and tedious.


To fix this, I would like to add a primary key to the table. For several reasons, we don't want to add an entirely new column for the primary key. After looking at the fields, I've come up with a multi-column primary key. It consists of: transfer start timestamp, transfer end timestamp, and the name of the file transferred (which includes the entire path). It seems highly unlikely that two records would have all of those fields the same.


Here are my questions: 1) If I add this primary key to the table, what would happen to any duplicates that might already be present in the table?


2) How would I actually add this primary key to the table (we are using PostgreSQL 8.1.22)?


3) After the primary key is added, let's say the load script tries to load a duplicate while it is running. What sort of error would PostgreSQL throw? Would I be able to catch it in the script?


4) I know you don't have much information about the load script, but given the information I have provided, do you foresee anything that might need to be changed in the script?


Any help is greatly appreciated. Thanks.

Answer


Use a serial column

Your plan is to add a needlessly huge index for 40 million (!) rows. And you aren't even sure it's going to be unique. I would strongly advise against that course of action. Add a serial column instead and be done with it:

ALTER TABLE tbl ADD COLUMN tbl_id serial PRIMARY KEY;


That's all you need to do. The rest happens automatically. More in the manual or in these closely related answers:
PostgreSQL primary key auto increment crashes in C++
Auto increment SQL function
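
For context, the serial shorthand above is roughly equivalent to creating a sequence, wiring it up as the column default, and adding the constraint, along these lines (the sequence name here just follows PostgreSQL's usual tablename_column_seq convention; note that ALTER SEQUENCE ... OWNED BY only exists in 8.2 and later):

```sql
-- Roughly what "ADD COLUMN tbl_id serial PRIMARY KEY" does behind the scenes:
CREATE SEQUENCE tbl_tbl_id_seq;                       -- backing sequence

ALTER TABLE tbl ADD COLUMN tbl_id integer NOT NULL
      DEFAULT nextval('tbl_tbl_id_seq');              -- fills existing rows, too

ALTER SEQUENCE tbl_tbl_id_seq OWNED BY tbl.tbl_id;    -- 8.2+; ties sequence to column

ALTER TABLE tbl ADD PRIMARY KEY (tbl_id);             -- unique index + NOT NULL
```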


Adding a serial column is a one-time operation, but expensive. The whole table has to be rewritten, blocking updates for the duration of the operation. Best done without concurrent load at off hours. I quote the manual here:



Adding a column with a non-null default or changing the type of an existing column will require the entire table and indexes to be rewritten. [...] Table and/or index rebuilds may take a significant amount of time for a large table; and will temporarily require as much as double the disk space.


Since this effectively rewrites the whole table, you might as well create a new table with a serial pk column, insert all rows from the old table (letting the serial fill with default values from its sequence), drop the old table and rename the new one. More in these closely related answers:
Updating database rows without locking the table in PostgreSQL 9.2
Add new column without table lock?
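
The copy-and-swap route sketched above could look like this (table and column names are placeholders; wrap it in a transaction so no reader ever sees a half-finished state):

```sql
BEGIN;

-- New table with the same columns plus the surrogate key
CREATE TABLE tbl_new (LIKE tbl);
ALTER TABLE tbl_new ADD COLUMN tbl_id serial PRIMARY KEY;

-- Explicit target list; tbl_id fills itself from the sequence
INSERT INTO tbl_new (col1, col2, col3)
SELECT col1, col2, col3
FROM   tbl;

DROP TABLE tbl;
ALTER TABLE tbl_new RENAME TO tbl;

COMMIT;
```

A bare LIKE does not copy indexes, defaults or privileges from the old table, so those would have to be recreated on the new one afterwards.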


Make sure all your INSERT statements have a target list, then an additional column can't confuse them:

INSERT INTO tbl (col1, col2, ...) VALUES ...

Not:

INSERT INTO tbl VALUES ...


A serial is implemented with an integer column (4 bytes).
A primary key constraint is implemented with a unique index and a NOT NULL constraint on the involved columns.
The contents of an index are stored much like tables. Additional physical storage is needed separately. More about physical storage in this related answer:
Calculating and saving space in PostgreSQL


Your index would include 2 timestamps (2 x 8 bytes) plus a lengthy filename including the path (~50 bytes?). That would make the index around 2.5 GB bigger (40M x 60+ bytes) and all operations slower.


How to deal with "importing duplicates" depends on how you are importing data and how "duplicate" is defined exactly.


If we are talking about COPY statements, one way would be to use a temporary staging table and collapse duplicates with a simple SELECT DISTINCT or DISTINCT ON in the INSERT command:

CREATE TEMP TABLE tbl_tmp AS
SELECT * FROM tbl LIMIT 0;     -- copy structure without data and constraints

COPY tbl_tmp FROM '/path/to/file.csv';

INSERT INTO tbl (col1, col2, col3)
SELECT DISTINCT ON (col1, col2)
       col1, col2, col3 FROM tbl_tmp;


Or, to also prohibit duplicates with already existing rows:

INSERT INTO tbl (col1, col2, col3)
SELECT i.*
FROM  (
   SELECT DISTINCT ON (col1, col2)
          col1, col2, col3
   FROM   tbl_tmp
   ) i
LEFT   JOIN tbl t USING (col1, col2)
WHERE  t.col1 IS NULL;


The temp table is dropped automatically at the end of the session.


But the proper fix would be to deal with the root of the error that produces duplicates in the first place.


1) You could not add the pk at all if there is even a single duplicate across all columns.


2) I would only touch a PostgreSQL database version 8.1 with a five-foot pole. It's hopelessly ancient, outdated and inefficient, not supported any more and probably has a number of unfixed security holes. Official Postgres versioning site.
@David already supplied the SQL statement.


3 & 4) A duplicate key violation. PostgreSQL throwing an error also means the whole transaction is rolled back. Catching that in a perl script cannot make the rest of the transaction go through. You would have to create a server-side script with plpgsql for instance, where you can catch exceptions.
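
As a sketch of that server-side approach (the function and column names below are hypothetical, not from the question's schema, and it assumes a unique constraint on those columns): a PL/pgSQL function can trap the duplicate-key error (condition unique_violation, SQLSTATE 23505) per row, so one duplicate does not roll back the whole load:

```sql
-- Hypothetical per-row insert that swallows duplicates instead of aborting
CREATE OR REPLACE FUNCTION insert_usage(_start timestamp,
                                        _end   timestamp,
                                        _file  text)
  RETURNS boolean AS
$$
BEGIN
   INSERT INTO usage_detail (transfer_start, transfer_end, filepath)
   VALUES (_start, _end, _file);
   RETURN true;                   -- row inserted
EXCEPTION WHEN unique_violation THEN
   RETURN false;                  -- duplicate skipped, transaction survives
END
$$ LANGUAGE plpgsql;
```

The perl script could then issue SELECT insert_usage(...) per row and count the false returns, instead of dying on the first duplicate.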
