PostgreSQL 使用 UUID 与文本作为主键 [英] PostgreSQL using UUID vs Text as primary key

查看:231
本文介绍了PostgreSQL 使用 UUID 与文本作为主键的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们当前的 PostgreSQL 数据库使用 GUID 作为主键并将它们存储为文本字段.

Our current PostgreSQL database is using GUID's as primary keys and storing them as a Text field.

我对此的最初反应是,尝试执行任何类型的最小笛卡尔连接对于试图找到所有匹配记录的索引来说都是一场噩梦.然而,也许我对数据库索引的有限理解在这里是错误的.

My initial reaction to this is that trying to perform any kind of minimal cartesian join would be a nightmare of indexing trying to find all the matching records. However, perhaps my limited understanding of database indexing is wrong here.

我认为我们应该使用 UUID,因为它们存储为 GUID 的二进制表示,其中 Text 不是,并且您在 Text 列上获得的索引量最少.

I'm thinking that we should be using UUID as these are stored as a binary representation of the GUID where a Text is not and the amount of indexing that you get on a Text column is minimal.

改变这些将是一个重要的项目,我想知道这是否值得?

It would be a significant project to change these, and I'm wondering if it would be worth it?

推荐答案

正如@Kevin 提到的,确定确切数据的唯一方法是比较和对比两种方法,但是根据您所描述的,我不明白为什么这与字符串是表中的主键或唯一索引的一部分的任何其他情况不同.

As @Kevin mentioned, the only way to know for sure with your exact data would be to compare and contrast both methods, but from what you've described, I don't see why this would be different from any other case where a string was either the primary key in a table or part of a unique index.

预先可以说的是,您的索引可能会更大,因为它们必须存储更大的字符串值,理论上索引的比较将花费更长的时间,但如果要进行过早优化,我不提倡这样做会很痛苦.

What can be said up front is that your indexes will probably larger, since they have to store larger string values, and in theory the comparisons for the index will take a bit longer, but I wouldn't advocate premature optimization if to do so would be painful.

根据我的经验,我在具有数十亿行的表上使用 md5sums 在唯一索引上看到了非常好的性能.我发现它往往是有关查询的其他因素,这些因素往往会导致性能问题.例如,当你最终需要查询一个非常大的表,比如数十万行时,顺序扫描最终是更好的选择,这就是查询规划器的选择,它可能需要更长的时间.

In my experience, I have seen very good performance on a unique index using md5sums on a table with billions of rows. I have found it tends to be other factors about a query which tend to result in performance issues. For example, when you end up needing to query over a very large swath of the table, say hundreds of thousands of rows, a sequential scan ends up being the better choice, so that's what the query planner chooses, and it can take much longer.

对于这种情况还有其他缓解策略,例如对查询进行分块,然后对结果进行UNION处理(例如,手动模拟将在 中完成的那种事情)Hadoop 领域中的 HiveImpala).

There are other mitigating strategies for that type of situation, such as chunking the query and then UNIONing the results (e.g. a manual simulation of the sort of thing that would be done in Hive or Impala in the Hadoop sphere).

Re:您对文本索引的担忧,虽然我确信在某些情况下,数据集会产生一个关键分布,以至于它的表现非常糟糕,但 GUID,很像 md5sums、sha1 等,通常应该索引得很好并且不需要顺序扫描(除非,正如我上面提到的,您查询了大量的表格).

Re: your concern about indexing of text, while I'm sure there are some cases where a dataset produces a key distribution such that it performs terribly, GUIDs, much like md5sums, sha1's, etc. should index quite well in general and not require sequential scans (unless, as I mentioned above, you query a huge swath of the table).

关于索引如何执行的一个重要因素是有多少唯一值.出于这个原因,具有大量行的表上的布尔索引不太可能有帮助,因为它基本上最终会导致任何值(真、假和潜在的)的大量行冲突NULL) 在索引中.另一方面,GUID 索引可能有大量没有冲突的值(理论上定义,因为它们是 GUID).

One of the big factors about how an index would perform is how many unique values there are. For that reason, a boolean index on a table with a large number of rows isn't likely to help, since it basically is going to end up having a huge number of row collisions for any of the values (true, false, and potentially NULL) in the index. A GUID index, on the other hand, is likely to have a huge number of values with no collision (in theory definitionally, since they are GUIDs).

根据 OP 的评论进行

那么您是说就索引而言,UUID guid 与 Text guid 是一回事吗?我们的整个表结构使用带有类似 guid 字符串的文本字段,但我不确定 Postgre 是否将其识别为 Guid.只是一个碰巧唯一的字符串.

So are you saying that a UUID guid is the same thing as a Text guid as far as the indexing goes? Our entire table structure is using Text fields with a guid-like string, but I'm not sure Postgre recognizes it as a Guid. Just a string that happens to be unique.

不完全一样,不.但是,我是说它们在这种特殊情况下应该具有非常相似的性能,而且我不明白为什么预先优化值得做,尤其是考虑到您说这样做会是一项非常复杂的任务.

Not literally the same, no. However, I am saying that they should have very similar performance for this particular case, and I don't see why optimizing up front is worth doing, especially given that you say to do so would be a very involved task.

如果在您的特定环境中遇到性能问题,您可以随时进行更改.但是,正如我之前提到的,我认为如果您遇到这种情况,还有其他事情可能会比更改 PK 数据类型产生更好的性能.

You can always change things later if, in your specific environment, you run into performance problems. However, as I mentioned earlier, I think if you hit that scenario, there are other things that would likely yield better performance than changing the PK data types.

A UUID 是 128 位数据类型(所以,16 个字节),而 text 有 1 或4 字节的开销加上字符串的实际长度.对于 GUID,这意味着 最少 33 个字节,但可能会因使用的编码而有很大差异.

A UUID is a 128-bit data type (so, 16 bytes), whereas text has 1 or 4 bytes of overhead plus the actual length of the string. For a GUID, that would mean a minimum of 33 bytes, but could vary significantly depending on the encoding used.

因此,考虑到这一点,基于文本的 UUID 的索引肯定会更大,因为值更大,并且比较两个字符串与两个数值在理论上效率较低,但不会产生巨大的影响在这种情况下有所不同,至少不是通常情况.

So, with that in mind, certainly indexes of text-based UUIDs will be larger since the values are larger, and comparing two strings versus two numerical values is in theory less efficient, but is not something that's likely to make a huge difference in this case, at least not usual cases.

我不会预先优化何时这样做会产生巨大的成本并且可能永远不需要.如果那个时候真的来了,那座桥是可以跨越的(尽管我会首先进行其他查询优化,正如我上面提到的).

I would not optimize up front when to do so would be a significant cost and is likely to never be needed. That bridge can be crossed if that time does come (although I would persue other query optimizations first, as I mentioned above).

关于 Postgres 是否知道字符串是 GUID,默认情况下它绝对不知道.就它而言,它只是一个唯一的字符串.但这对于大多数情况应该没问题,例如匹配行等.如果您发现自己需要一些特别需要 GUID 的行为(例如,一些基于非等式的比较,其中 GUID 比较可能与纯词法的比较不同),那么您始终可以将字符串转换为 UUID,并且 Postgres 将在该查询期间处理该值.

Regarding whether Postgres knows the string is a GUID, it definitely does not by default. As far as it's concerned, it's just a unique string. But that should be fine for most cases, e.g. matching rows and such. If you find yourself needing some behavior that specifically requires a GUID (for example, some non-equality based comparisons where a GUID comparison may differ from a purely lexical one), then you can always cast the string to a UUID, and Postgres will treat the value as such during that query.

例如对于文本列 foo,您可以执行 foo::uuid 将其转换为 uuid.

e.g. for a text column foo, you can do foo::uuid to cast it to a uuid.

还有一个模块可用于生成 uuids,uuid-ossp.

There's also a module available for generating uuids, uuid-ossp.

这篇关于PostgreSQL 使用 UUID 与文本作为主键的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆