提案:本地唯一的GUID替代方案 [英] Proposal: locally unique GUID alternative

查看:102
本文介绍了提案:本地唯一的GUID替代方案的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

问题

我正在寻找有关探索本地唯一GUID替代方案的反馈 ,具有以下要求:

I'm looking for feedback on this exploration of a locally unique alternative for GUIDs, with the following requirements:


  1. 碰撞的可能性非常低(我们宁愿每年碰撞一次而不是执行检查)

  2. 不泄漏敏感信息,例如存在多少项

  3. 在SQL数据库中具有高性能

  4. 是否可以复制/匹配手动查询(作为查询字符串和查询结果)

  5. 可用作不带编码的URI组件

  1. Has very low chance of collisions (to the point that we'd rather collide once a year than perform checks)
  2. Does not leak sensitive information, such as how many items exist
  3. Has high performance in an SQL database
  4. Is copy/pastable for manual querying (as both query string and query result)
  5. Is usable as a URI component without encoding

为了满足要求,我决定采用 64位无符号整数的形式。它很容易在CPU上使用,对于主键使用很好和很小,半人类可读,仅数字,并且在手动查询时易于复制/粘贴。 (作为反例,BLOB严重阻碍了对大多数SQL数据库的手动查询。)

To meet the requirements, I have decided on the form of a 64-bit unsigned integer. It is easy on the CPU, nice and small for primary key use, semi-human-readable, digits-only, and easy to copy/paste when querying by hand. (As a counter-example, BLOBs severely hinder manual querying on most SQL databases.)

此外,Percona 演示 单调递增值作为主键执行得更好,尤其是插入速度,所以这是一个目标特征。

Additionally, Percona demonstrates that monotonically increasing values perform much better as primary keys, particularly on insertion speed, so this is a trait to aim for.

建议结构

从左到右,最重要的位在左侧


  1. 46位。的时间戳即可。 Unix时间,以毫秒为单位。 (至少在C#中,亚毫秒时间不可用。)这将持续到4199年的某个地方。它给我们单调递增的值。

  2. 8位。 本地IP的一部分。机器内部IP地址的最后一个组件,即最快的可用网络接口。大多数服务器都应该是以太网LAN。

  3. 10位。的 Uniquefier 即可。一个静态计数器,在使用时递增(互锁),带有环绕。

  1. 46 bits. Timestamp. Unix time in milliseconds. (At least in C#, sub-millisecond time is not readily available.) This will last until somewhere in the year 4199. It gives us monotonically increasing values.
  2. 8 bits. Part of local IP. The last component of the machine's internal IP address, of the fastest available network interface. Should be ethernet LAN for most servers.
  3. 10 bits. Uniquefier. A static counter that is incremented (interlocked) on use, with wrap-around.

碰撞

每当有以下情况发生1/1024(~0.1%)的碰撞机会:

There is a 1/1024 (~0.1%) collision chance whenever:


  1. 两个系统共享相同的最后一个IP地址组件以相同的毫秒进行呼叫。 这可以完全避免。

  2. 系统的时钟被转回并且它会在同一毫秒的呼叫中进行呼叫在时间变化之前。 这应该是一种非常罕见的情况,似乎符合要求。

  1. Two systems share the same last IP address component and make a call in the same millisecond. This can be completely avoided.
  2. A system's clock is turned back and it makes a call on the same millisecond of a call from before the time change. This should be a highly infrequent situation that seems within the requirements.

限制

有趣的是,我们似乎满足了要求(#2是一个狡猾的要求)。让我们来看看一些限制。

Interestingly, we seem to be meeting the requirements (#2 being a dodgy one). Let's take a look at some of the limitations.


  1. 必须仔细维护服务器的本地IP地址 - 即使在不同的数据中心之间也是如此(如果适用) 。

  2. 如果存在对IP的其他限制,我们不能支持超过255台服务器 - 可能更少。

  3. 我们泄漏有关创建了哪些标识符的信息由同一台服务器。我相信这也是许多GUID实现的情况。但是。

  4. 通过检查用户自己的请求之间的计数器增量,可以获得有关流量的信息。由于计数器用于各种数据,迅速增加并且难以归因于任何特定类型的数据,因此有效性降低。

  5. 标识符很多比那些有很多随机性的东西更可猜测。蛮力攻击每次尝试毫秒需要大约512次调用(uniquefier)。理想情况下,这种攻击不产生任何结果,即系统报告未授权,无论标识符是否不存在或不属于用户,并且抵抗定时攻击。实际上,让我们假设专门的攻击者会发现泄密。

  1. Servers' local IP addresses must be carefully maintained - even amongst different data centers, if applicable.
  2. We cannot support more than 255 servers - possibly fewer, if other constraints on the IPs exist.
  3. We leak information about which identifiers were created by the same server. I believe this is the case with many GUID implementations also, however.
  4. Information can be obtained about traffic volumes, by checking counter increments between a user's own requests. The effectiveness is diminished by the fact that the counter is used for various kinds of data, increasing rapidly and in a way that is hard to ascribe to any particular type of data.
  5. Identifiers are much more guessable than ones that have plenty of randomness. A brute-force attack would need about 512 calls (uniquefier) per attempted millisecond. Ideally, this attack yields nothing, i.e. the system reports "unauthorized" regardless of whether the identifier is non-existent or does not belong to the user, and is resistant to timing attacks. Realistically, let's assume a dedicated attacker will find a leak.

注意事项


  1. 限制#1和#2必须适合公司。

  1. Limitations #1 and #2 must simply fit the company.

限制# 3在现有的GUID实现中似乎被认为是可接受的,并且是我愿意接受的。

Limitation #3 appears to be deemed acceptable in existing GUID implementations, and is one I am willing to live with.

限制#4是一个棘手的问题。这些信息有多敏感? 因此,我们每分钟进行10K插入,进入未知数量的表格。相对数量确实提供了更多的洞察力:在08:00-09:00之间,活动的时间是一小时前的两倍。不过,这通常是特定领域的常识。意外的峰值可能会泄漏更多信息。 所以系统在早上03:00努力工作。这有多糟糕?从暴露自动增量标识符的公司数量来看,我们可能会说这是一种改进,而不是......但它可能是一个交易破坏者。

Limitation #4 is a tricky one. How sensitive is this information? "So we do a 10K inserts per minute, into an unknown number of tables." Relative volumes do grant more insight: "Between 08:00-09:00 there is twice as much activity as an hour earlier." Still, this will usually be common knowledge in a given field. Unexpected peaks might leak some more info. "So the system works hard at 03:00 in the morning." How bad is all this? Judging by the number of companies that expose auto-increment identifiers, we might say it's an improvement more often than not... But it could be a deal-breaker.

我们可以使用(加密)随机位作为处理限制#4的唯一文件,但这会引入第三次碰撞机会:每当系统在一毫秒内生成多个标识符时。生日悖论在那里特别成问题。

We could use (crypto)random bits as the uniquefier to deal with limitation #4, but that would introduce a third collision opportunity: whenever the system generates multiple identifiers within a millisecond. The birthday paradox is particularly problematic there.

如果我们允许时间戳已经在2527中回绕,我们可以释放2位。自私和对后代不敏感,或傲慢地假设我们的代码会被更长时间使用? : - )

We could free up 2 bits if we allow the timestamp to wrap around in 2527 already. Selfish and insensitive to future generations, or arrogant to assume that our code would be used longer? :-)

还有什么?

我欢迎您错过我的反馈,改进,想法和限制!你会如何解决这个问题?

I welcome your feedback, improvements, ideas, limitations that I have missed! How would you solve this problem?

推荐答案

有可能成为那个回答你为什么要这样做? - 我想知道您的基本业务问题是什么,阻止您使用GUID?

At risk of being that guy that responds with "why would you want to do that?" -I am wondering what your underlying business problem is, that prevents you from using a GUID?

BIGINT,GUID和HashTables ..

我使用 BIGINT 作为主键,它保持所有顺序,非浮动和快速。这适用于所有内部工作,即在我的存储过程中,在SQL连接等上。然后我有一个带有 GUID 的哈希表,它变成了来自外部呼叫者的起点。

I use a BIGINT for primary key which keeps everything sequential, unbloated and fast. This is for all internal work i.e. within my stored procedures, on SQL joins etc. Then I have a hash table with GUID's which becomes the starting point from external callers.

由于我使用表继承, BIGINT ID可以用作顺序主键,因为所有ID在整个数据库中都是唯一的(但仍然是顺序的)。然后进一步,我在哈希表上创建一个复合键,包括 GUID 的最后几位数,然后我对这些值的哈希表进行分区,以便每个是单独存储在磁盘上并且仍然是顺序的,但是给了我一种自然的方式来索引 GUID 我正在查找。

Since I use table inheritance, the BIGINT ID's can be used as a sequential primary key in my hash table as all ID's are unique across the entire database (yet still sequential). Then to take it further I create a composite key on the hash table that includes the last several digits of the GUID, and then I partition the hash table on those values so that each is stored separately on disk and still sequential, yet gives me a natural way to index on the GUID I'm looking up.

当我最初开始这样做时,我在这里发布了一个方法(不包括分区部分):

When I originally started doing it this way, I posted a how-to (excluding the partitioning part) here:

在Sql Server中查找重复uniqueidentifier的最快方法是什么?

初始性能测试对100,000,000条记录的快速减少。

Initial performance tests were lightening fast against 100,000,000 records.

不是您问题的答案,但也许价值2美分给某人。

Not the answer to your question, but perhaps worth 2 cents to somebody.

这篇关于提案:本地唯一的GUID替代方案的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆