Binary_Checksum Vs HashBytes function
Question
I have a complex query that uses the BINARY_CHECKSUM function heavily. When I tested it with some sample data, it returned the same checksum value for two distinct records. Please find the test data I used below:
SELECT BINARY_CHECKSUM(16 ,'EP30461105',1) AS BinaryCheckSumEx UNION ALL SELECT BINARY_CHECKSUM(21 ,'EP30461155',1) AS BinaryCheckSumEx
Now I am trying to use the HASHBYTES function with the 'MD5' algorithm, which should give me unique values with far greater certainty. What concerns me now is that the current query joins on the 'Checksum' value in my 'MERGE' statements to look for new records. Since HASHBYTES returns the VARBINARY data type, how much of a performance overhead can I expect when I replace the join conditions with the 'HashBytes' field?
SELECT HASHBYTES('MD5', CONCAT(Col1,Col2,Col3,Col4,..))
Moreover, I need to build the hash over multiple columns, in which case I need an additional CONCAT call; will this add further overhead to my performance?
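One pitfall worth noting with the CONCAT approach: without a separator between columns, different column values can concatenate to the same string and therefore hash identically. A small sketch:

```sql
-- ('ab', 'c') and ('a', 'bc') both concatenate to 'abc',
-- so their hashes are identical even though the rows differ:
SELECT HASHBYTES('MD5', CONCAT('ab', 'c')) AS h1,
       HASHBYTES('MD5', CONCAT('a', 'bc')) AS h2;  -- h1 = h2

-- A separator between columns avoids this:
SELECT HASHBYTES('MD5', CONCAT('ab', '|', 'c')) AS h1,
       HASHBYTES('MD5', CONCAT('a', '|', 'bc')) AS h2;  -- h1 <> h2
```

Using a delimiter that cannot appear in the data (here '|', an assumption about the data) keeps distinct rows distinct.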
Solution
Here are the options:
Using index on hash as VARBINARY
- It’s not a good idea.
Using BINARY_CHECKSUM and CHECKSUM
- It’s good, but the problem is that there is a relatively high chance of duplicates in the checksum, and a quick search shows that many people have run into this.
However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.
Source: https://msdn.microsoft.com/en-us/library/ms189788(v=SQL.100).aspx
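The test data from the question shows exactly this: the two distinct rows collide under BINARY_CHECKSUM, while HASHBYTES tells them apart (the CONCAT here mirrors the multi-column pattern from the question):

```sql
-- Both rows produce the same BINARY_CHECKSUM value:
SELECT BINARY_CHECKSUM(16, 'EP30461105', 1) AS cs
UNION ALL
SELECT BINARY_CHECKSUM(21, 'EP30461155', 1);

-- HASHBYTES('MD5', ...) distinguishes them:
SELECT HASHBYTES('MD5', CONCAT(16, 'EP30461105', 1)) AS h
UNION ALL
SELECT HASHBYTES('MD5', CONCAT(21, 'EP30461155', 1));
```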
Casting HASHBYTES to BIGINT and having an index on that
- It’s not a good idea
I would also be careful about converting the hashed value to BIGINT given that BIGINT is only 8 bytes yet all hash algorithms -- even MD5 -- are greater than 8 bytes (MD5 = 16 bytes, SHA1 = 20, SHA2_256 = 32, and SHA2_512 = 64). And converting binary values larger than 8 bytes to BIGINT silently truncates the values. Hence you lose accuracy and increase occurrences of false positives. The following query shows this behavior:
SELECT CONVERT(BIGINT, 0xFFFFFFFFFFFFFF),       -- 7 bytes  = 72057594037927935
       CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFF),     -- 8 bytes  = -1
       CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFFFF),   -- 9 bytes  = -1
       CONVERT(BIGINT, 0xFFFFFFFFFFFFFFFFFFFF)  -- 10 bytes = -1
Source: https://dba.stackexchange.com/questions/154945/index-maintenance-for-varbinary
Casting HASHBYTES to VARCHAR and having an index on that
- This is the better option.
- You have two options:
a) If you're using SQL 2008 or above
SELECT CONVERT(NVARCHAR(32), HashBytes('MD5', CONTENT), 2)
b) If you're using SQL 2005
SELECT SUBSTRING(master.dbo.fn_varbintohexstr(HashBytes('MD5', CONTENT)), 3, 32)
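With option (a), for example, the hash comes back as a 32-character hex string without the 0x prefix (that is what style 2 does):

```sql
SELECT CONVERT(NVARCHAR(32), HASHBYTES('MD5', 'test'), 2);
-- returns '098F6BCD4621D373CADE4E832627B4F6'
```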
PS: If you wonder which Hash Algorithm you should use:
MD5 = 16 bytes
SHA1 = 20 bytes
SHA2_256 = 32 bytes
SHA2_512 = 64 bytes
Source: https://blogs.msdn.microsoft.com/sqlsecurity/2011/08/26/data-hashing-in-sql-server/
For your second question, you should make the hash column a PERSISTED computed column, to avoid the cost of recomputing it on every query.
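A minimal sketch of that advice (table and column names are hypothetical): compute the hash once as a PERSISTED computed column, then index it so the MERGE join reads the stored value instead of recomputing the hash per row:

```sql
-- Hypothetical staging table; the '|' delimiters guard against
-- adjacent column values concatenating to the same string.
ALTER TABLE dbo.SourceTable ADD RowHash AS
    CONVERT(NVARCHAR(32),
            HASHBYTES('MD5', CONCAT(Col1, '|', Col2, '|', Col3)),
            2) PERSISTED;

-- The persisted column can then be indexed and used in the MERGE join:
CREATE INDEX IX_SourceTable_RowHash ON dbo.SourceTable (RowHash);
```

This works because HASHBYTES, CONCAT, and CONVERT with an explicit style are all deterministic, which is what SQL Server requires to persist and index a computed column.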