为什么这些Datetime值返回相同的Checksum& Checksum_Agg?如何使它更具特色? [英] Why do these Datetime values return the same Checksum & Checksum_Agg? How can I make it more unique?

查看:107
本文介绍了为什么这些Datetime值返回相同的Checksum& Checksum_Agg?如何使它更具特色?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图快速确定两组时间表是否相同,并且还生成一个可以供这些唯一时间表参考的密钥.我最初尝试使用HASHBYTES,但是很快发现您只能对8000个字符进行哈希处理,并且我有大量的日期时间,这些日期时间在连接时会超过8000.

I'm trying to quickly determine if two sets of schedules are identical and also generate a key which these unique schedules can be references by. I originally attempted to use HASHBYTES but quickly discovered that you are only able to hash 8000 characters and I have a significant number of datetimes which when concatenated are longer than 8000.

因此,我尝试使用Checksum和Checksum_Agg,因为它们似乎是为此类事情而设计的.我知道Checksum有更高的机会生成非唯一值.但是我需要将它们相互比较的范围/上下文是如此狭窄,以至于我认为我可以摆脱它.

So, I attempted to use Checksum and Checksum_Agg since they seem to be designed for this sort of thing. I'm aware that Checksum has a higher chance of generating non-unique values. But my scope/context in which these needed to be compared to each other is so narrow I thought I could get away with it.

不幸的是,在进行了一点测试之后,我得知我可以在4行日期时间数据中找到Checksum冲突"!我以为这有点奇怪,并发现了碰撞的一种模式.

Unfortunately after just a little bit of testing I learned I could find Checksum "collisions" in just 4 rows of datetime data! I thought this was a bit strange and discovered a pattern to the collisions.

下面是一个演示问题的示例脚本:

Below is a sample script which demonstrates the problem:


DECLARE @Rows TABLE
(
    [GroupId] INT,
    [StartDate] DATETIME,
    [EndDate] DATETIME
)

--Group1
INSERT INTO @Rows VALUES (1, '2013-01-20 01:00:00.000', '2013-01-20 01:20:00.000')
INSERT INTO @Rows VALUES (1, '2013-01-20 01:20:00.000', '2013-01-20 01:40:00.000')
--INSERT INTO @Rows VALUES (1, '2013-01-20 01:40:00.000', '2013-01-20 02:00:00.000')
--INSERT INTO @Rows VALUES (1, '2013-01-20 02:00:00.000', '2013-01-20 02:20:00.000')
--INSERT INTO @Rows VALUES (1, '2013-01-20 02:20:00.000', '2013-01-20 02:40:00.000')
--INSERT INTO @Rows VALUES (1, '2013-01-20 02:40:00.000', '2013-01-20 03:00:00.000')

--Group2
INSERT INTO @Rows VALUES (2, '2013-01-21 01:00:00.000', '2013-01-21 01:20:00.000')
INSERT INTO @Rows VALUES (2, '2013-01-21 01:20:00.000', '2013-01-21 01:40:00.000')
--INSERT INTO @Rows VALUES (2, '2013-01-21 01:40:00.000', '2013-01-21 02:00:00.000')
--INSERT INTO @Rows VALUES (2, '2013-01-21 02:00:00.000', '2013-01-21 02:20:00.000')
--INSERT INTO @Rows VALUES (2, '2013-01-21 02:20:00.000', '2013-01-21 02:40:00.000')
--INSERT INTO @Rows VALUES (2, '2013-01-21 02:40:00.000', '2013-01-21 03:00:00.000')

SELECT [ChecksumAgg1] = CHECKSUM_AGG([CheckSum])
FROM
(
    SELECT [CheckSum] = CHECKSUM([StartDate], [EndDate])
    FROM @Rows
    WHERE GroupId = 1
) G1

SELECT [ChecksumAgg2] = CHECKSUM_AGG([CheckSum])
FROM
(
    SELECT [CheckSum] = CHECKSUM([StartDate], [EndDate])
    FROM @Rows
    WHERE GroupId = 2
) G2

结果是:

ChecksumAgg1: 5681728

ChecksumAgg1: 5681728

ChecksumAgg2: 5681728

ChecksumAgg2: 5681728

两个日期系列之间的唯一区别是它们相隔1天.但是它们生成相同的校验和!但是只有行数偶数时才可以.如果取消注释来自组1的INSERT和来自组2的INSERT,则将获得两个不同的校验和.但是然后取消注释另一对,您将获得另一场比赛!

The only difference between the two series of dates is that they are 1 day apart. But they generate the same checksum! BUT only if there is an even number of rows. If you un-comment an INSERT from Group 1 and one from Group 2 you will get two difference Checksums. But then un-comment yet another pair and you'll get another match!

最后我有两个问题.我想学习更多有关它如何工作以及为什么这种模式似乎影响可预测的校验和值的更多信息.更重要的是,我想知道是否存在一种更好的方法来本质上创建非常大的数据集的指纹".我了解我无法保证此哈希在全球范围内是唯一的,但显然我需要比Checksum更好的东西.

In the end I have two questions. I would love to learn a bit more about how this works and why this pattern seems to influence a pretty predictable checksum value. And even more importantly I would like to know if there's a better way to essentially create a "fingerprint" of a very large set of data. I understand that I cannot guarantee this hash to be globally unique but I apparently need something better than Checksum.

我能够欺骗校验和计算的一种方法是在Datetime上执行HASHBYTES,然后再将其提供给Checksum函数.这样,就向Checksum提供的值比具有相似外观差异的一组日期看起来更随机.但是就足够了吗?

One way I was able to sort of trick the Checksum calculation was to perform a HASHBYTES on the Datetime beforehand feeding it to the Checksum function. This way the Checksum is fed values which are far more random looking than a set of dates with similar looking differences. But will it be enough?

编辑-这里只是上下文.

基本上,我有一个拥有大量日程表数据的系统,还有一个对特定时间对这些日程表感兴趣的单独系统.例如,多个用户可能会看到此复杂计划的一部分的特定版本,并想要添加一些元数据(可能是他们的批准状态,注释或其他内容).如果某个外部来源更改了单个日期时间,则此链接需要中断,因为它不再是相同的时间表了!

Basically I have one system which has a huge amount of schedule data and a separate system which is interested in these schedules at specific times. For example, multiple users may see a specific version of a portion this complex schedule and want to add some metadata (maybe their approval status, notes or something). If some external source makes a change to any single datetime then this link needs to be broken because it isn't the same schedule anymore!

有许多不同的系统可以对核心调度数据进行更改,这就是为什么我很难将这种担忧鼓吹到代码级别来以某种方式将其管理和规范化"为代表每个快照的实体的原因.某种方式.我将不得不在100万个地方挂上钩子,监听变化,然后清理所有指向时间表的东西.

There are many different systems which can introduce changes to the core schedule data which is why I'm having a hard time bubbling up this concern to the code level to somehow manage and "normalize" this into entities which represent each snapshot in some way. I would have to have hooks in a million places listening for changes and then cleaning up anything pointing to the schedule.

推荐答案

来自此页的评论:

http://msdn.microsoft.com/en-us/library/ms188920.aspx

似乎Checksum_Agg是使用XOR构建的.而且,关于异或的问题是,通过两次包含相同的数字,它们往往很容易被逆转.这就解释了为什么您只有在偶数时才注意到它.

It appears that Checksum_Agg is built by using XORs. And the thing about XORs is, they tend to be easily reversible by including the same number twice. This explains why you've only noticed it when it's even.

只要您知道XOR问题,并以混合所有位的方式预先加扰您要提供的内容,就可以了.

As long as you're aware of the XOR issue and pre-scramble what you feed to it in a way that mixes up all the bits you should be OK.

这篇关于为什么这些Datetime值返回相同的Checksum& Checksum_Agg?如何使它更具特色?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆