Hamming distance on binary strings in SQL


Problem description


I have a table in my DB where I store SHA256 hashes in a BINARY(32) column. I'm looking for a way to compute the Hamming distance of the entries in the column to a supplied value, i.e. something like:

SELECT * FROM table 
  ORDER BY HAMMINGDISTANCE(hash, UNHEX(<insert supplied sha256 hash here>)) ASC 
  LIMIT 10

(in case you're wondering, the Hamming distance of strings A and B is defined as BIT_COUNT(A^B), where ^ is the bitwise XOR operator and BIT_COUNT returns the number of 1s in the binary string).
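As a reference point outside the database, the definition above can be sketched in a few lines of Python. This is only an illustration of the formula, not something from the original post: the function name `hamming_distance` and the sample inputs are made up here, and it assumes both values are byte strings of equal length.

```python
import hashlib


def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    # XOR the two values as big integers, then count the 1 bits,
    # which mirrors BIT_COUNT(A ^ B).
    x = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return bin(x).count("1")


# Hypothetical sample inputs: two SHA-256 digests (32 bytes each).
h1 = hashlib.sha256(b"foo").digest()
h2 = hashlib.sha256(b"bar").digest()
print(hamming_distance(h1, h2))
```

A helper like this can be handy for spot-checking the values an SQL implementation returns for a handful of rows.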

Now, I know that both the ^ operator and the BIT_COUNT function only work on INTEGERs, so I'd say that probably the only way to do it would be to break up the binary strings into substrings, cast each binary substring to an integer, compute the Hamming distance substring-wise and then add the results. The problem with this is that it sounds terribly complicated, not efficient and definitely not elegant. My question therefore is: could you suggest any better way? (please note that I'm on shared hosting and therefore I can't modify the DB server or load libraries)
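The substring-wise idea works because XOR and bit counting act on each bit independently, so the per-chunk counts simply add up. A minimal Python sketch of that reasoning follows; the function name `hamming_distance_chunked` and the 8-byte default chunk size (chosen to match a 64-bit integer) are assumptions for illustration.

```python
def hamming_distance_chunked(a: bytes, b: bytes, chunk: int = 8) -> int:
    """Sum the Hamming distances of consecutive fixed-size chunks."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    total = 0
    for i in range(0, len(a), chunk):
        # Each chunk fits in a 64-bit integer when chunk == 8.
        x = int.from_bytes(a[i:i + chunk], "big") ^ int.from_bytes(b[i:i + chunk], "big")
        total += bin(x).count("1")
    return total


# Two arbitrary 32-byte values, used only to exercise the function.
a = bytes(range(32))
b = bytes(reversed(range(32)))
print(hamming_distance_chunked(a, b))
```

For 32-byte inputs the loop visits byte offsets 0, 8, 16 and 24, which correspond to the 1-based SUBSTRING positions 1, 9, 17 and 25 in SQL.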

edit(1): Obviously loading the whole table in PHP and doing the computations there would be possible but I'd rather avoid it because this table will probably grow quite large.

edit(2): The DB server is MySQL 5.1

edit(3): My answer below contains the code that I just described above.

edit(4): I just found out that using 4 BIGINTs to store the hash instead of a BINARY(32) yields massive speed improvements (more than 100 times faster). See the comments to my answer below.

Solution

It appears that storing the data in a BINARY column is an approach bound to perform poorly. The only way to get decent performance is to split the content of the BINARY column into multiple BIGINT columns, each containing an 8-byte substring of the original data.

In my case (32 bytes) this would mean using 4 BIGINT columns and using this function:

CREATE FUNCTION HAMMINGDISTANCE(
  A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT, 
  B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT
)
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(A0 ^ B0) +
  BIT_COUNT(A1 ^ B1) +
  BIT_COUNT(A2 ^ B2) +
  BIT_COUNT(A3 ^ B3);

Using this approach, in my testing, is over 100 times faster than using the BINARY approach.
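The post does not say how the 32-byte hash is split before it is written to the four BIGINT columns. One possible client-side sketch in Python is shown below; it assumes big-endian byte order and signed 64-bit values (a plain MySQL BIGINT is signed), and the names `split_digest` and `hamming_distance_parts` are hypothetical.

```python
import hashlib
import struct


def split_digest(digest: bytes) -> tuple:
    """Split a 32-byte digest into four signed 64-bit integers (big-endian)."""
    if len(digest) != 32:
        raise ValueError("expected a 32-byte digest")
    return struct.unpack(">4q", digest)


def hamming_distance_parts(a: tuple, b: tuple) -> int:
    """Hamming distance computed from two 4-tuples of signed 64-bit integers."""
    mask = (1 << 64) - 1  # reinterpret each XOR result as an unsigned 64-bit value
    return sum(bin((x ^ y) & mask).count("1") for x, y in zip(a, b))


# Hypothetical sample inputs.
p1 = split_digest(hashlib.sha256(b"foo").digest())
p2 = split_digest(hashlib.sha256(b"bar").digest())
print(p1)
print(hamming_distance_parts(p1, p2))
```

Using signed values keeps every part inside the range of a signed BIGINT. If the columns are declared BIGINT UNSIGNED instead, the format string `">4Q"` would be the matching choice, and the function parameters would need the same type.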


FWIW, this is the code I was hinting at while explaining the problem. Better ways to accomplish the same thing are welcome (I especially don't like the binary > hex > decimal conversions):

CREATE FUNCTION HAMMINGDISTANCE(A BINARY(32), B BINARY(32))
RETURNS INT DETERMINISTIC
RETURN 
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 1,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 1,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 9,  8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 9,  8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 17, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 17, 8)), 16, 10)
  ) +
  BIT_COUNT(
    CONV(HEX(SUBSTRING(A, 25, 8)), 16, 10) ^ 
    CONV(HEX(SUBSTRING(B, 25, 8)), 16, 10)
  );
