PostgreSQL:在大型数据库上定义主键 [英] PostgreSQL: Defining a primary key on a large database

查看:147
本文介绍了PostgreSQL:在大型数据库上定义主键的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在规划一个数据库来存储大量的文本。 (博客帖子,新闻文章等)数据库需要有标题,内容(最多50k个字符),日期,链接和语言字段。相同的内容不能发生在一个链接上。旧内容(例如旧的30天)将被删除。

I am planing a database to store lots of text. (blog posts, news articles, etc.) The database needs to have the title, content (50k characters max), date, link and language fields. The same content can't occur on one link. Old content (older then 30 days, for example) will be deleted.

现在,问题是主键。我可以设置一个自动递增(SERIAL类型)字段并将其用作主键。但是,它似乎愚蠢和浪费光盘空间,因为字段不会用于任何目的,但是作为主键。 (并且字段可能最终用完,或者不是?)并且总是存在其他性能问题:插入的每个新行的内容需要检查重复。所以我提出的主键的另一个解决方案是计算sha256哈希的内容+链接值,然后把它放在一个新的哈希列,并使用它作为主键。两只鸟与一块石头。当然,问题是哈希冲突。这是一个很大的威胁吗?

Now, the problem is the primary key. I could just set a automatically incrementing (SERIAL type) field and use it as a primary key. But, it seems stupid and a waste of disc space, because the field wouldn't serve any purpose but to be a primary key. (and the field could eventually run out, or not?) And there's always the other performance issue: the content of each new row inserted needs to be checked for duplicates. So the other solution for the primary key I've come up with would be to calculate a sha256 hash of content+link value and then put that in a new 'hash' column and use that as a primary key. Two birds with one stone. Of course, the problem with that are hash collisions. Is it a big threat?

我没有任何PostgreSQL的经验,很少有DBMS的经验,所以我想欣赏第二个意见之前创建一个数据库与高速公路上的蜗牛的性能特点(可怕的比较)。

I don't have any experience with PostgreSQL, and very little experience with DBMS's generally, so I would appreciate a second opinion before a create a database with the performance characteristics of a snail on the highway (horrible comparison).

如果你有大数据库的任何经验,请帮助我在这里。设置64个字符的字符串作为主键字段在我的情况下是个好主意? (因为我的印象是,通常这是避免的)

Please help me out here if you have any experience with large databases. Is setting a 64 character string as a primary key field a good idea in my situation? (because I'm under the impression that generally this is avoided)

推荐答案

刚刚做了这个确切的测试, DB(200GB +),bigserial赢得了相当大的幅度。它生成速度更快,加入更快,代码更少,占用空间更小。由于postgres存储它的方式,bigint与正常int相比是可以忽略的。在你不必担心溢出bigint之前,你会从内容中耗尽存储空间。完成了计算哈希vs bigint - 一直代理bigint。

Just did this exact test for a rather medium-large DB (200GB+), bigserial won by quite a large margin. It was faster to generate, faster to join, less code, smaller footprint. Because of the way postgres stores it, a bigint is negligible compared to a normal int. You'll run out of storage space from your content long before you ever have to worry about overflowing the bigint. Having done the computed hash vs bigint - surrogate bigint all the way.

这篇关于PostgreSQL:在大型数据库上定义主键的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆