使用MD5 Hash作为主键的优点和缺点,在SQL Server中使用int标识作为主键 [英] Pros and cons of using MD5 Hash as the primary key vs. use a int identity as the primary key in SQL Server

查看:928
本文介绍了使用MD5 Hash作为主键的优点和缺点,在SQL Server中使用int标识作为主键的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个应用程序来处理一个文件并将其分段到多个段,然后将结果保存到sql server数据库中。有很多重复的文件(可能有不同的文件路径),所以首先我浏览所有这些文件,并计算每个文件的Md5哈希,并使用[复制]列标记重复的文件。



然后每天,我将运行此应用程序并将结果保存到[Result]表中。
数据库模式如下:

  CREATE TABLE [dbo]。[FilePath] 
b $ b [FilePath] NVARCHAR(256)NOT NULL PRIMARY KEY,
[FileMd5Hash] binay(16)NOT NULL,
[Duplicated] BIT NOT NULL DEFAULT 0,
[LastRunBuild] NVARCHAR (30)NOT NULL DEFAULT 0


CREATE TABLE [dbo]。[result]

[Build] NVARCHAR $ b [FileMd5Hash] binay(16)NOT NULL,
[SegmentId] INT NOT NULL,
[SegmentContent] text NOT NULL
PRIMARY KEY([FileMd5Hash],[Build] ])

我需要在FileMd5Hash上加入这两个表。 / p>

由于[Result]的行数非常大,我想添加一个int标识列来将它们连接到表格,如下所示:

  CREATE TABLE [dbo]。[FilePath] 

[FilePath] NVARCHAR(256)NOT NULL PRIMARY KEY,
[FileMd5Hash] binay(16)NOT NULL,
** [Id] INT NOT NULL IDENTITY,**
[Duplicated] BIT NOT NULL DEFAULT 0,
[LastRunBuild] NVARCHAR (30)NOT NULL DEFAULT 0


CREATE TABLE [dbo]。[result]

[Build] NVARCHAR $ b ** [Id] INT NOT NULL,**
[SegmentId] INT NOT NULL,
[SegmentContent] text NOT NULL
PRIMARY KEY([FileMd5Hash],[Build] SegmentId])

那么这两种方法的利弊

解决方案

int键更容易实现,更易于使用和理解。它也较小(4字节对比16字节),因此索引将适合大约是每个IO页面的条目数量的两倍,意味着更好的性能。表格行也会更小(OK,不会太小),所以再次每页可以容纳更多的行=更少的IO。



哈希总是会产生冲突。虽然极其罕见,但是,作为生日问题显示,冲突变得越来越可能作为记录计数增加。与各种位长度哈希冲突50%的机会所需的项目数如下:

 哈希长度位)50%碰撞概率的项目数
32 77000
64 51亿
128 22亿亿美元b $ b 256 400亿亿亿美元b $ b

还有一个问题是不得不传递非ascii字节 - 更难调试,通过线发送等。



为表使用 int 顺序主键。其他人都是。


I have an application to deal with a file and fragment it to multiple segments, then save the result into sql server database. There are many duplicated file (maybe with different file path), so first I go through all these files and compute the Md5 hash for each file, and mark duplicated file by using the [Duplicated] column.

Then everyday, I'll run this application and save the results into the [Result] table. The db schema is as below:

    CREATE TABLE [dbo].[FilePath]
    (
        [FilePath] NVARCHAR(256) NOT NULL PRIMARY KEY,
        [FileMd5Hash] binay(16) NOT NULL,
        [Duplicated] BIT NOT NULL DEFAULT 0, 
        [LastRunBuild] NVARCHAR(30) NOT NULL DEFAULT 0
    )

    CREATE TABLE [dbo].[Result]
    (
        [Build] NVARCHAR(30) NOT NULL,
        [FileMd5Hash] binay(16) NOT NULL , 
        [SegmentId] INT NOT NULL,
        [SegmentContent] text NOT NULL 
        PRIMARY KEY ([FileMd5Hash], [Build], [SegmentId])
    )

And I have a requirement to join these 2 table on FileMd5Hash.

Since the number of rows of [Result] is very large, I'd like to add an int Identity column to join these to tables as below:

    CREATE TABLE [dbo].[FilePath]
    (
        [FilePath] NVARCHAR(256) NOT NULL PRIMARY KEY,
        [FileMd5Hash] binay(16) NOT NULL,
        **[Id] INT NOT NULL IDENTITY,**
        [Duplicated] BIT NOT NULL DEFAULT 0, 
        [LastRunBuild] NVARCHAR(30) NOT NULL DEFAULT 0
    )

    CREATE TABLE [dbo].[Result]
    (
        [Build] NVARCHAR(30) NOT NULL,
        **[Id] INT NOT NULL,**  
        [SegmentId] INT NOT NULL,
        [SegmentContent] text NOT NULL 
        PRIMARY KEY ([FileMd5Hash], [Build], [SegmentId])
    )

So What's the Pros and cons of these 2 ways?

解决方案

An int key is simpler to implement and easier to use and understand. It's also smaller (4 bytes vs 16 bytes), so indexes will fit about double the number of entries per IO page, meaning better performance. The table rows too will be smaller (OK, not much smaller), so again you'll fit more rows per page = less IO.

Hash can always produce collisions. Although exceedingly rare, nevertheless, as the birthday problem shows, collisions become more and more likely as record count increases. The number of items needed for a 50% chance of a collision with various bit-length hashes is as follows:

Hash length (bits)   Item count for 50% chance of collision
                32   77000
                64   5.1 billion
               128   22 billion billion
               256   400 billion billion billion billion

There's also the issue of having to pass around non-ascii bytes - harder to debug, send over wire, etc.

Use int sequential primary keys for your tables. Everybody else does.

这篇关于使用MD5 Hash作为主键的优点和缺点,在SQL Server中使用int标识作为主键的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆