Compute MD5 hash of a UTF8 string


Problem Description


I have an SQL table in which I store large string values that must be unique. In order to ensure the uniqueness, I have a unique index on a column in which I store a string representation of the MD5 hash of the large string.

The C# app that saves these records uses the following method to do the hashing:

public static string CreateMd5HashString(byte[] input)
{
    var hashBytes = MD5.Create().ComputeHash(input);
    return string.Join("", hashBytes.Select(b => b.ToString("X")));
}
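The method above can be mirrored in Python to see exactly what string gets stored (a sketch, not the app's code). Note that C#'s `"X"` format specifier emits no leading zero, so any digest byte below 0x10 contributes a single hex digit:

```python
import hashlib

def create_md5_hash_string(data: bytes) -> str:
    """Mirror of CreateMd5HashString: concatenate the MD5 digest bytes
    as uppercase hex *without* zero-padding, like C#'s "X" specifier."""
    digest = hashlib.md5(data).digest()
    return "".join(format(b, "X") for b in digest)

print(create_md5_hash_string(b"abc"))  # 90150983CD24FB0D6963F7D28E17F72
```

Because of the missing zero-padding, the result here is 31 characters rather than 32 (the second digest byte is 0x01), which matches the value shown below.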

In order to call this, I first convert the string to byte[] using the UTF-8 encoding:

// this is what I use in my app
CreateMd5HashString(Encoding.UTF8.GetBytes("abc"))
// result: 90150983CD24FB0D6963F7D28E17F72

Now I would like to be able to implement this hashing function in SQL, using the HASHBYTES function, but I get a different value:

print hashbytes('md5', N'abc')
-- result: 0xCE1473CF80C6B3FDA8E3DFC006ADC315

This is because SQL computes the MD5 of the UTF-16 representation of the string. I get the same result in C# if I do CreateMd5HashString(Encoding.Unicode.GetBytes("abc")).
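The encoding difference is easy to reproduce (a Python sketch): hashing the UTF-8 bytes gives the application's value, while hashing the UTF-16LE bytes, which is how SQL Server represents NVARCHAR data, gives the HASHBYTES value:

```python
import hashlib

# MD5 over UTF-8 bytes: matches the C# app's Encoding.UTF8.GetBytes("abc")
print(hashlib.md5("abc".encode("utf-8")).hexdigest())
# 900150983cd24fb0d6963f7d28e17f72

# MD5 over UTF-16LE bytes: matches hashbytes('md5', N'abc')
print(hashlib.md5("abc".encode("utf-16-le")).hexdigest())
# ce1473cf80c6b3fda8e3dfc006adc315
```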

I cannot change the way hashing is done in the application.

Is there a way to get SQL Server to compute the MD5 hash of the UTF-8 bytes of the string?

I looked up similar questions, I tried using collations, but had no luck so far.

Solution

You need to create a UDF to convert the NVARCHAR data to its UTF-8 byte representation. Say it is called dbo.NCharToUTF8Binary; then you can do:

hashbytes('md5', dbo.NCharToUTF8Binary(N'abc', 1))

Here is a UDF which will do that:

create function dbo.NCharToUTF8Binary(@txt NVARCHAR(max), @modified bit)
returns varbinary(max)
as
begin
-- Note: This is not the fastest possible routine. 
-- If you want a fast routine, use SQLCLR
    set @modified = isnull(@modified, 0)
    -- First shred into a table.
    declare @chars table (
    ix int identity primary key,
    codepoint int,
    utf8 varbinary(6)
    )
    declare @ix int
    set @ix = 0
    while @ix < datalength(@txt)/2  -- datalength counts trailing spaces; len() would trim them
    begin
        set @ix = @ix + 1
        insert @chars(codepoint)
        select unicode(substring(@txt, @ix, 1))
    end

    -- Now look for surrogate pairs.
    -- If we find a pair (lead followed by trail) we will pair them
    -- High surrogate is \uD800 to \uDBFF
    -- Low surrogate  is \uDC00 to \uDFFF
    -- Look for high surrogate followed by low surrogate and update the codepoint   
    update c1 set codepoint = ((c1.codepoint & 0x03ff) * 0x0400) + (c2.codepoint & 0x03ff) + 0x10000
    from @chars c1 inner join @chars c2 on c1.ix = c2.ix -1
    where c1.codepoint >= 0xD800 and c1.codepoint <=0xDBFF
    and c2.codepoint >= 0xDC00 and c2.codepoint <=0xDFFF
    -- Get rid of the trailing half of the pair where found
    delete c2 
    from @chars c1 inner join @chars c2 on c1.ix = c2.ix -1
    where c1.codepoint >= 0x10000

    -- Now we utf-8 encode each codepoint.
    -- Lone surrogate halves will still be here
    -- so they will be encoded as if they were not surrogate pairs.
    update c 
    set utf8 = 
    case 
    -- One-byte encodings (modified UTF8 outputs zero as a two-byte encoding)
    when codepoint <= 0x7f and (@modified = 0 OR codepoint <> 0)
    then cast(substring(cast(codepoint as binary(4)), 4, 1) as varbinary(6))
    -- Two-byte encodings
    when codepoint <= 0x07ff
    then substring(cast((0x00C0 + ((codepoint/0x40) & 0x1f)) as binary(4)),4,1)
    + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)
    -- Three-byte encodings
    when codepoint <= 0x0ffff
    then substring(cast((0x00E0 + ((codepoint/0x1000) & 0x0f)) as binary(4)),4,1)
    + substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)),4,1)
    + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)
    -- Four-byte encodings 
    when codepoint <= 0x1FFFFF
    then substring(cast((0x00F0 + ((codepoint/0x00040000) & 0x07)) as binary(4)),4,1)
    + substring(cast((0x0080 + ((codepoint/0x1000) & 0x3f)) as binary(4)),4,1)
    + substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)),4,1)
    + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)

    end
    from @chars c

    -- Finally concatenate them all and return.
    declare @ret varbinary(max)
    set @ret = cast('' as varbinary(max))
    select @ret = @ret + utf8 from @chars c order by ix
    return  @ret

end
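The UDF's logic — pair up surrogates, then hand-encode each code point — can be sketched in Python and checked against a real UTF-8 encoder. The function name here is illustrative, not part of the original:

```python
def nchar_to_utf8_binary(txt: str) -> bytes:
    """Sketch of what dbo.NCharToUTF8Binary computes (non-modified mode):
    walk the UTF-16 code units, combine surrogate pairs, then UTF-8
    encode each code point by hand."""
    # Shred into UTF-16 code units, as the UDF's unicode()/substring loop does.
    raw = txt.encode("utf-16-le")
    units = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

    codepoints = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Combine the pair: ((hi & 0x3FF) << 10) + (lo & 0x3FF) + 0x10000
            codepoints.append(((u & 0x3FF) << 10) + (units[i + 1] & 0x3FF) + 0x10000)
            i += 2
        else:
            codepoints.append(u)
            i += 1

    out = bytearray()
    for cp in codepoints:
        if cp <= 0x7F:        # one-byte encoding
            out.append(cp)
        elif cp <= 0x7FF:     # two-byte encoding
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp <= 0xFFFF:    # three-byte encoding
            out += bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
        else:                 # four-byte encoding
            out += bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    return bytes(out)

# Agrees with Python's own encoder, including an astral (surrogate-pair) character:
assert nchar_to_utf8_binary("abc\u00e9\u20ac\U00010437") == "abc\u00e9\u20ac\U00010437".encode("utf-8")
```

This also makes the surrogate math easy to verify: for U+10437 the UTF-16 units are D801/DC37, and ((0x001 << 10) + 0x037) + 0x10000 recovers 0x10437.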
