VBA hash string


Problem description


How do I get a short hash of a long string using Excel VBA?

What's given

  • The input string is no longer than 80 characters
  • Valid input characters are: [0..9] [A..Z] . _ /
  • Valid output characters are [0..9] [A..Z] [a..z] (lower and upper case can both be used)
  • The output hash shouldn't be longer than ~12 characters (shorter is even better)
  • The hash does not need to be unique, since requiring uniqueness would make it too long

What I have done so far

I thought this SO answer was a good start, since it generates a 4-digit hex code (CRC16).

But 4 digits are too few. In my test with 400 strings, 20% had a duplicate somewhere else.
The chance of generating a collision is too high.

Sub tester()
    For i = 2 To 433
        Cells(i, 2) = CRC16(Cells(i, 1))
    Next i
End Sub


Function CRC16(txt As String)
Dim x As Long
Dim mask, i, j, nC, Crc As Integer
Dim c As String

Crc = &HFFFF

For nC = 1 To Len(txt)
    j = Val("&H" + Mid(txt, nC, 2))
    Crc = Crc Xor j
    For j = 1 To 8
        mask = 0
        If Crc / 2 <> Int(Crc / 2) Then mask = &HA001
        Crc = Int(Crc / 2) And &H7FFF: Crc = Crc Xor mask
    Next j
Next nC

CRC16 = Hex$(Crc)
End Function

How to reproduce

You can copy these 400 test strings from pastebin.
Paste them into column A of a new Excel workbook and execute the code above.

Q: How do I get a string hash that is short enough (at most 12 characters) yet long enough to produce only a small percentage of duplicates?

Solution

Split your string into three shorter strings (if not divisible by three, the last one will be longer than the other two). Run your "short" algorithm on each, and concatenate the results.

I could write the code but based on the quality of the question I think you can take it from here!

EDIT: It turns out that that advice is not enough. There is a serious flaw in your original CRC16 code - namely the line that says:

j = Val("&H" + Mid(txt, nC, 2))

This only handles text that can be interpreted as hex values: lowercase and uppercase letters are treated the same, and anything after F in the alphabet is ignored (as far as I can tell). It also reads two characters at a time while the loop advances by one. It is a miracle that anything good comes out at all. If you replace the line with

j = Asc(Mid(txt, nC, 1))

things work better - every character at least starts out life as its own ASCII value.
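
To see the difference for yourself, a small sketch along these lines prints both expressions side by side for a few characters from the allowed input set. The Sub name and the sample string are made up for illustration and are not part of the original answer; run it and look at the Immediate window.

```vba
Sub CompareByteSources()
    ' Hypothetical helper for illustration only.
    ' Column 1: the character, column 2: what the original line feeds
    ' into the CRC, column 3: what the replacement line feeds into it.
    Dim txt As String, nC As Integer
    txt = "AB_XY/9"   ' arbitrary sample using allowed input characters
    For nC = 1 To Len(txt)
        Debug.Print Mid(txt, nC, 1), _
                    Val("&H" + Mid(txt, nC, 2)), _
                    Asc(Mid(txt, nC, 1))
    Next nC
End Sub
```

The third column is different for every distinct character, which is what a checksum needs as input; the second column is not.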

Combining this change with the proposal I made earlier, you get the following code:

Function hash12(s As String)
' create a 12 character hash from string s

Dim l As Integer, l3 As Integer
Dim s1 As String, s2 As String, s3 As String

l = Len(s)
l3 = Int(l / 3)
s1 = Mid(s, 1, l3)      ' first part
s2 = Mid(s, l3 + 1, l3) ' middle part
s3 = Mid(s, 2 * l3 + 1) ' the rest of the string...

hash12 = hash4(s1) + hash4(s2) + hash4(s3)

End Function

Function hash4(txt)
' CRC-16 of txt as 4 hex digits; copied from the example, with two changes
Dim mask As Integer, j As Integer, nC As Integer, crc As Integer
Dim c As String

crc = &HFFFF

For nC = 1 To Len(txt)
    j = Asc(Mid(txt, nC, 1)) ' <<<<<<< new line of code - makes all the difference
    ' instead of j = Val("&H" + Mid(txt, nC, 2))
    crc = crc Xor j
    For j = 1 To 8
        mask = 0
        If crc / 2 <> Int(crc / 2) Then mask = &HA001 ' low bit set: apply the polynomial
        crc = Int(crc / 2) And &H7FFF: crc = crc Xor mask ' shift right one bit, then Xor
    Next j
Next nC

c = Hex$(crc)

' <<<<< new section: make sure returned string is always 4 characters long >>>>>
' pad to always have length 4:
While Len(c) < 4
  c = "0" & c
Wend

hash4 = c

End Function

You can use this code in your spreadsheet as =hash12(A2) etc. (no quotes around A2, otherwise the literal text "A2" is hashed instead of the cell's contents). For fun, you can also use the "new, improved" hash4 algorithm on its own and see how the two compare. I created a pivot table to count collisions - there were none for the hash12 algorithm, and only 3 for hash4. I'm sure you can figure out how to create hash8, ... from this. The "no need to be unique" from your question suggests that maybe the "improved" hash4 is all you need.
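
The collision count does not have to come from a pivot table. Below is a minimal sketch of a macro that does the same job, assuming the hashes sit in column B starting at row 2 (as in the tester Sub from the question). The Sub name is hypothetical, and Scripting.Dictionary is a Windows-only library, so this will not run on Excel for Mac.

```vba
Sub CountCollisions()
    ' Hypothetical macro: counts rows in column B whose hash value
    ' already appeared in an earlier row. Assumes data starts at B2.
    Dim seen As Object, r As Long, lastRow As Long
    Dim h As String, dupes As Long

    Set seen = CreateObject("Scripting.Dictionary")   ' Windows-only
    lastRow = Cells(Rows.Count, 2).End(xlUp).Row      ' last used row in column B

    For r = 2 To lastRow
        h = CStr(Cells(r, 2).Value)
        If seen.Exists(h) Then
            dupes = dupes + 1          ' this hash was seen before
        Else
            seen.Add h, r              ' remember the first row for this hash
        End If
    Next r

    Debug.Print dupes & " row(s) repeat an earlier hash"
End Sub
```

The result appears in the Immediate window. Note that it counts repeated rows, so three strings sharing one hash count as two.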

In principle, a four-character hex string has 64k (65,536) possible values, so the chance of two random strings having the same hash is 1 in 64k. With 400 strings there are 400 x 399 / 2 = 79,800 "possible collision pairs", roughly 80k opportunities (assuming highly random strings), which works out to an expected 1.2 or so colliding pairs. Observing three collisions in the sample dataset is therefore not an unreasonable score. As the number of strings N goes up, the expected number of collisions grows as the square of N. With the extra 32 bits of information in hash12, you would expect to start seeing collisions when N > 20 M or so (handwaving, in-my-head math).
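
The back-of-the-envelope estimate above can be written as a one-line function. This is only a sketch under the same assumption of perfectly random hashes; the function name and its parameters are made up for illustration.

```vba
' Hypothetical helper: expected number of colliding pairs among n random
' strings when the hash consists of hexDigits hexadecimal digits.
' It is the number of pairs, n(n-1)/2, divided by the number of
' possible hash values, 16 ^ hexDigits.
Function ExpectedCollisions(ByVal n As Double, ByVal hexDigits As Integer) As Double
    ExpectedCollisions = n * (n - 1) / 2 / 16 ^ hexDigits
End Function
```

For example, =ExpectedCollisions(400, 4) gives about 1.2 for hash4, and =ExpectedCollisions(20000000, 12) gives about 0.7 for hash12, which is where the "20 M or so" figure comes from.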

You can make the hash12 code a little bit more compact, obviously - and it should be easy to see how to extend it to any length.
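
One possible way to do both is sketched below: a function that splits the string into any number of pieces and concatenates a 4-digit hash of each. The names hashN and crcHex4 are hypothetical. crcHex4 repeats the CRC loop from hash4 only so that the sketch stands alone, and the code assumes parts is at least 1.

```vba
' Hypothetical generalisation of hash12: split s into "parts" pieces and
' concatenate a 4-hex-digit CRC of each piece. With parts = 3 it uses the
' same split as hash12; parts = 2 would give a "hash8".
Function hashN(s As String, parts As Integer) As String
    Dim k As Integer, chunkLen As Integer
    Dim piece As String, result As String

    chunkLen = Len(s) \ parts                  ' length of all but the last piece
    For k = 0 To parts - 1
        If k < parts - 1 Then
            piece = Mid(s, k * chunkLen + 1, chunkLen)
        Else
            piece = Mid(s, k * chunkLen + 1)   ' last piece takes the remainder
        End If
        result = result & crcHex4(piece)
    Next k
    hashN = result
End Function

' Same CRC-16 loop as hash4, repeated so this sketch is self-contained.
Function crcHex4(txt As String) As String
    Dim crc As Integer, mask As Integer, nC As Integer, bit As Integer
    crc = &HFFFF
    For nC = 1 To Len(txt)
        crc = crc Xor Asc(Mid(txt, nC, 1))
        For bit = 1 To 8
            mask = 0
            If crc / 2 <> Int(crc / 2) Then mask = &HA001
            crc = (Int(crc / 2) And &H7FFF) Xor mask
        Next bit
    Next nC
    crcHex4 = Right$("0000" & Hex$(crc), 4)    ' left-pad to 4 characters
End Function
```

In a worksheet this would be called as =hashN(A2, 3). Padding with Right$ is a shorter alternative to the While loop, not a change in behaviour.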

Oh - and one last thing. If you have RC addressing enabled, using =CRC16("string") as a spreadsheet formula gives a hard-to-track #REF error, which is why I renamed it hash4.
