Hamming distance on binary strings in SQL
Problem description
I have a table in my DB where I store SHA256 hashes in a BINARY(32) column. I'm looking for a way to compute the Hamming distance of the entries in the column to a supplied value, i.e. something like:
SELECT * FROM table
ORDER BY HAMMINGDISTANCE(hash, UNHEX(<insert supplied sha256 hash here>)) ASC
LIMIT 10
(in case you're wondering, the Hamming distance of strings A and B is defined as BIT_COUNT(A ^ B), where ^ is the bitwise XOR operator and BIT_COUNT returns the number of 1s in the binary string).
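To make the definition concrete, here is a minimal Python sketch of the same XOR-then-count idea, applied to whole byte strings. The function name `hamming_distance` is only an illustrative choice, and Python is used purely to show the arithmetic, not as a replacement for the SQL.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    # XOR leaves a 1 bit exactly where the two inputs differ.
    xor = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    # Counting the 1 bits is the job BIT_COUNT does on the SQL side.
    return bin(xor).count("1")


# 0x0f ^ 0xf0 == 0xff, which has eight 1 bits.
print(hamming_distance(b"\x0f", b"\xf0"))  # prints 8
```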
Now, I know that both the ^ operator and the BIT_COUNT function only work on INTEGERs, so I'd say that probably the only way to do it would be to break up the binary strings into substrings, cast each binary substring to an integer, compute the Hamming distance substring-wise and then add the results. The problem with this is that it sounds terribly complicated, inefficient and definitely not elegant. My question therefore is: could you suggest a better way? (Please note that I'm on shared hosting and therefore can't modify the DB server or load libraries.)
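The substring-wise idea works because XOR and bit counting act on each bit independently, so the per-chunk counts simply add up. A small Python sketch of that reasoning follows; the helper names and the 8-byte chunk size are assumptions chosen to match 64-bit integers.

```python
import hashlib


def popcount(value: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")


def hamming_distance_chunked(a: bytes, b: bytes, chunk_size: int = 8) -> int:
    """Sum the per-chunk Hamming distances of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    total = 0
    for offset in range(0, len(a), chunk_size):
        # Each chunk fits in a 64-bit integer when chunk_size is 8.
        x = int.from_bytes(a[offset:offset + chunk_size], "big")
        y = int.from_bytes(b[offset:offset + chunk_size], "big")
        total += popcount(x ^ y)
    return total


a = hashlib.sha256(b"first example").digest()
b = hashlib.sha256(b"second example").digest()

# The chunked sum agrees with the distance computed over all 256 bits at once.
whole = popcount(int.from_bytes(a, "big") ^ int.from_bytes(b, "big"))
assert hamming_distance_chunked(a, b) == whole
```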
edit(1): Obviously, loading the whole table into PHP and doing the computations there would be possible, but I'd rather avoid it because this table will probably grow quite large.
edit(2): The DB server is MySQL 5.1
edit(3): My answer below contains the code that I just described above.
edit(4): I just found out that using 4 BIGINTs to store the hash instead of a BINARY(32) yields massive speed improvements (more than 100 times faster). See the comments to my answer below.
Solution
It appears that storing the data in a BINARY column is an approach bound to perform poorly. The only way to get decent performance is to split the content of the BINARY column into multiple BIGINT columns, each containing an 8-byte substring of the original data.
In my case (32 bytes) this would mean using 4 BIGINT columns and using this function:
CREATE FUNCTION HAMMINGDISTANCE(
A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT,
B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT
)
RETURNS INT DETERMINISTIC
RETURN
BIT_COUNT(A0 ^ B0) +
BIT_COUNT(A1 ^ B1) +
BIT_COUNT(A2 ^ B2) +
BIT_COUNT(A3 ^ B3);
In my testing, this approach is over 100 times faster than the BINARY approach.
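For anyone preparing the four values on the application side, here is a rough Python sketch of the split and of the same four-term sum. The helper names, the big-endian byte order and the in-memory `rows` list are assumptions for illustration only. Note that the chunks are produced as unsigned 64-bit values, and a value of 2^63 or more does not fit a signed BIGINT, so how it is stored needs a decision that the answer above does not spell out.

```python
import hashlib
import struct


def split_digest(digest: bytes) -> tuple:
    """Split a 32-byte digest into four unsigned 64-bit integers (big-endian)."""
    if len(digest) != 32:
        raise ValueError("expected a 32-byte digest")
    return struct.unpack(">4Q", digest)


def hamming_distance_4x64(a0, a1, a2, a3, b0, b1, b2, b3) -> int:
    """Same shape as the SQL function: four XORs, four bit counts, one sum."""
    pairs = ((a0, b0), (a1, b1), (a2, b2), (a3, b3))
    return sum(bin(x ^ y).count("1") for x, y in pairs)


def nearest(rows, target, limit=10):
    """Mirror of ORDER BY HAMMINGDISTANCE(...) ASC LIMIT 10 on in-memory rows."""
    return sorted(rows, key=lambda row: hamming_distance_4x64(*row, *target))[:limit]


# Hypothetical data set: digests of the strings "0" .. "99".
rows = [split_digest(hashlib.sha256(str(i).encode()).digest()) for i in range(100)]
target = split_digest(hashlib.sha256(b"42").digest())

closest = nearest(rows, target)
# The row built from "42" has distance 0 to the target, so it sorts first.
assert closest[0] == target
```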
FWIW, this is the code I was hinting at while explaining the problem. Better ways to accomplish the same thing are welcome (I especially don't like the binary > hex > decimal conversions):
CREATE FUNCTION HAMMINGDISTANCE(A BINARY(32), B BINARY(32))
RETURNS INT DETERMINISTIC
RETURN
BIT_COUNT(
CONV(HEX(SUBSTRING(A, 1, 8)), 16, 10) ^
CONV(HEX(SUBSTRING(B, 1, 8)), 16, 10)
) +
BIT_COUNT(
CONV(HEX(SUBSTRING(A, 9, 8)), 16, 10) ^
CONV(HEX(SUBSTRING(B, 9, 8)), 16, 10)
) +
BIT_COUNT(
CONV(HEX(SUBSTRING(A, 17, 8)), 16, 10) ^
CONV(HEX(SUBSTRING(B, 17, 8)), 16, 10)
) +
BIT_COUNT(
CONV(HEX(SUBSTRING(A, 25, 8)), 16, 10) ^
CONV(HEX(SUBSTRING(B, 25, 8)), 16, 10)
);
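To show what each term in that function is doing, here is a hedged Python sketch of the binary > hex > decimal chain for one 8-byte slice. It assumes CONV returns the unsigned decimal value as text; `conv_hex_substring` is a made-up name, not a MySQL or Python API.

```python
def conv_hex_substring(data: bytes, pos: int, length: int = 8) -> str:
    """Imitate CONV(HEX(SUBSTRING(data, pos, length)), 16, 10) for in-range arguments."""
    chunk = data[pos - 1:pos - 1 + length]  # SUBSTRING positions start at 1
    hex_text = chunk.hex()                  # binary -> hexadecimal text
    return str(int(hex_text, 16))           # hexadecimal text -> decimal text


data = bytes(range(32))
for pos in (1, 9, 17, 25):
    # Reading the slice directly as a big-endian integer gives the same number,
    # which is why the detour through hex text is only a workaround.
    direct = int.from_bytes(data[pos - 1:pos + 7], "big")
    assert conv_hex_substring(data, pos) == str(direct)
```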