Binary_Checksum Vs HashBytes 函数 [英] Binary_Checksum Vs HashBytes function

查看:19
本文介绍了Binary_Checksum Vs HashBytes 函数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个复杂的查询,它使用了很多二进制校验和函数,当我用一些测试数据测试它时,它实际上返回了相同的校验和值.请在下面找到我使用的测试数据

SELECT BINARY_CHECKSUM(16 ,'EP30461105',1) AS BinaryCheckSumEx UNION ALLSELECT BINARY_CHECKSUM(21 ,'EP30461155',1) 作为 BinaryCheckSumEx

现在我正在尝试使用带有 'MD5' 算法的 HASHBYTES 函数,我可以肯定得到唯一的记录,但现在让我担心的是,在当前查询中,我使用 'Checksum' 值加入我的 '合并'语句以查找新记录.由于 'HashBytes' 返回给我 Varbinary 数据类型,当我用 'HashByte' 字段替换连接条件时,我可以预期有多少性能开销.

SELECT HASHBYTES('MD5', CONCAT(Col1,Col2,Col3,Col4,..))

此外,我需要为多个列创建散列,在这种情况下,我需要一个额外的 Concat 函数,这会对我的性能产生额外的开销.

解决方案

以下是选项:

  1. 使用哈希索引作为 VARBINARY

    • 这不是个好主意.检查这些链接:
    • 来源:https://blogs.msdn.microsoft.com/sqlsecurity/2011/08/26/data-hashing-in-sql-server/

      对于您的第二个问题,您应该使 Hash 列保持不变,以避免对运行每个查询的影响.

      I have a complex query which uses a lot of binary checksum function, when I was testing it with some test data for two distinct records it actually returned me same checksum value. Please find the test data I used below

      SELECT BINARY_CHECKSUM(16   ,'EP30461105',1) AS BinaryCheckSumEx UNION ALL
      SELECT BINARY_CHECKSUM(21   ,'EP30461155',1) AS BinaryCheckSumEx
      

      Now I am trying to use HASHBYTES function with 'MD5' algorithm for which I can be certain to get unique records, but what concerns me now is that in the current query I use the 'Checksum' value to join in my 'Merge' statements to look for new records. Since 'HashBytes' returns me Varbinary data type how much of a performance overhead I can expect when I replace the join conditions with the 'HashByte' field.

      SELECT HASHBYTES('MD5', CONCAT(Col1,Col2,Col3,Col4,..))
      

      And moreover I need to create hashing for multiple columns in which case I need to have an additional Concat function will this have additional overhead to my performance.

      解决方案

      Here are the options:

      1. Using index on hash as VARBINARY

      2. Using BINARY_CHECKSUM and CHECKSUM

        • It’s good but the problem is there is a high chance of duplication in checksum and as you google you see that many people have a problem with it.

      However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.

      Source: https://msdn.microsoft.com/en-us/library/ms189788(v=SQL.100).aspx

      1. Casting HASBYTES to BIGINT and having index on that
        • It’s not a good idea

      I would also be careful about converting the hashed value to BIGINT given that BIGINT is only 8 bytes yet all hash algorithms -- even MD5 -- are greater than 8 bytes (MD5 = 16 bytes, SHA1 = 20, SHA2_256 = 32, and SHA2_512 = 64). And converting binary values larger than 8 bytes to BIGINTsilently truncates the values. Hence you lose accuracy and increasing occurrences of false positives. The following query shows this behavior:

      SELECT CONVERT(BIGINT, 0xFFFFFFFFFFFFFF),      --  7 bytes = 72057594037927935
             CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFF),    --  8 bytes = -1
             CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFFFF),  --  9 bytes = -1
             CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFFFFFF) -- 10 bytes = -1
      

      Source: https://dba.stackexchange.com/questions/154945/index-maintenance-for-varbinary

      1. Casting HASHBYTES to VARCHAR and having index on that
        • This is the good choice
        • You have two options:

      a) If you're using SQL 2008 or above

      SELECT CONVERT(NVARCHAR(32),HashBytes('MD5', CONTENT),2)
      

      b) If you're using SQL 2005

      SELECT SUBSTRING(master.dbo.fn_varbintohexstr(HashBytes('MD5', CONTENT)), 3, 32)
      

      PS: If you wonder which Hash Algorithm you should use:

      MD5 = 16 bytes
      SHA1 = 20 bytes
      SHA2_256 = 32 bytes
      SHA2_512 = 64 bytes
      

      Source: https://blogs.msdn.microsoft.com/sqlsecurity/2011/08/26/data-hashing-in-sql-server/

      For your second question, you should make Hash columns PERSISTED, to avoid the impact on running each query.

      这篇关于Binary_Checksum Vs HashBytes 函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆