Similarity of two Hexadecimal numbers


Question

I am trying to find similar hashes (hexadecimal hashes) using Hamming and Levenshtein distance. Let's say two hashes are similar if their Hamming distance (the number of differing bits) is less than 10.

Hash 1= ffffff (base 16)
Hash 2= fffff0 (base 16)

The Hamming distance between the two hashes is 4, so they are similar, because:

Hash 1= 11111111 11111111 11111111 (base 2)
Hash 2= 11111111 11111111 11110000 (base 2)

I have 8 million such hashes and am wondering what a suitable data structure for storing them would be. I initially tried a trie, but consider the following scenario:

Hash 1 = 0fabde (00001111 10101011 11011110)
Hash 2 = aacbfe (10101010 11001011 11111110)

The Hamming distance is 7, so the two hashes count as similar, yet they already differ in their very first bit, so a prefix search cannot find one from the other.

I know that I can use XOR and Integer.bitCount() to get the number of differing bits, but I have one target hash and 8 million hashes to search against, i.e. given a hash, I have to find all the similar hashes among the 8 million in our repository.
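For reference, the XOR plus Integer.bitCount() check and the brute-force scan it implies can be sketched as below. The class name, method names and the int[] repository are assumptions made for illustration; the sketch only shows the baseline that a better data structure would have to beat.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative baseline: names and the int[] repository are hypothetical.
final class HammingBaseline {
    /** Parses a hex string such as "fffff0" into an int. */
    static int parseHash(String hex) {
        return Integer.parseInt(hex, 16);
    }

    /** Number of differing bits between two hashes. */
    static int hammingDistance(int a, int b) {
        return Integer.bitCount(a ^ b);
    }

    /** Linear scan: returns every stored hash whose distance to target is below limit. */
    static List<Integer> scan(int[] repository, int target, int limit) {
        List<Integer> similar = new ArrayList<>();
        for (int h : repository) {
            if (hammingDistance(h, target) < limit) {
                similar.add(h);
            }
        }
        return similar;
    }

    public static void main(String[] args) {
        int h1 = parseHash("ffffff");
        int h2 = parseHash("fffff0");
        System.out.println(hammingDistance(h1, h2)); // prints 4
    }
}
```

Every query with this approach touches all 8 million entries, which is the cost the question is trying to avoid.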

Is there any way to store the hashes so that the search space is reduced?

Recommended answer

If the hashes are as small as shown, you can index them "directly": put them in a big array and just do some math on the index.
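A minimal sketch of such a direct index, assuming the hashes are 6 hex digits (24 bits) and that a boolean[] presence table is acceptable. The class and method names are hypothetical, not taken from the answer.

```java
// Illustrative sketch: DirectIndex and build are hypothetical names.
final class DirectIndex {
    static final int HASH_BITS = 24; // assumption: hashes are 6 hex digits

    /** Builds a presence table where table[h] is true if hash h is stored. */
    static boolean[] build(Iterable<String> hexHashes) {
        boolean[] table = new boolean[1 << HASH_BITS];
        for (String hex : hexHashes) {
            table[Integer.parseInt(hex, 16)] = true;
        }
        return table;
    }

    public static void main(String[] args) {
        boolean[] table = build(java.util.Arrays.asList("ffffff", "fffff0"));
        System.out.println(table[0xfffff0]); // prints true
    }
}
```

The table has one slot per possible 24-bit value, so a lookup is a single array access. If memory matters, java.util.BitSet would store the same information in one bit per slot; that is a substitution, not something the answer specifies.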

It's fairly simple to generate only the indexes that may correspond to hashes within the requested Hamming distance d: just XOR the key with all masks that contain up to d set bits (see below). Since there are 8 million hashes but only about 16 million (2^24) could exist, about half of the visited indexes are expected to be "useful", i.e. there will be something there to find.

To generate the masks, you can use the old NextBitPermutation trick, which has been posted on StackOverflow several times before. For Java, just use the logical right shift (>>>) and replace __builtin_ctz with Integer.numberOfTrailingZeros to get (not tested):

int t = v | (v - 1);
int w = (t + 1) | (((~t & -~t) - 1) >>> (Integer.numberOfTrailingZeros(v) + 1));

Here w will be the next bit permutation after v, i.e. the next larger integer with the same number of set bits.
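To make the behaviour of the trick concrete, the sketch below wraps the two lines in a method and lists every mask of a given width with exactly k set bits. The class name and the masksWithKBits helper are assumptions for illustration; the helper expects 1 <= k <= width <= 30 so that the int arithmetic does not overflow.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: BitPermutations and masksWithKBits are hypothetical names.
final class BitPermutations {
    /** Next larger integer with the same number of set bits as v (v must be non-zero). */
    static int nextBitPermutation(int v) {
        int t = v | (v - 1); // set every bit below the lowest set bit of v
        return (t + 1) | (((~t & -~t) - 1) >>> (Integer.numberOfTrailingZeros(v) + 1));
    }

    /** All values below 2^width with exactly k set bits, in increasing order. */
    static List<Integer> masksWithKBits(int k, int width) {
        List<Integer> masks = new ArrayList<>();
        int limit = 1 << width;
        for (int v = (1 << k) - 1; v < limit; v = nextBitPermutation(v)) {
            masks.add(v);
        }
        return masks;
    }

    public static void main(String[] args) {
        // The 2-bit masks within 5 bits: 00011, 00101, 00110, 01001, ...
        System.out.println(masksWithKBits(2, 5)); // prints [3, 5, 6, 9, 10, 12, 17, 18, 20, 24]
    }
}
```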

The overall structure would be something like this (not tested):

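// Assumes hashes is a boolean[1 << 24] where hashes[h] is true if hash h is stored.
// The loop starts at k = 1, so check hashes[key] separately for an exact match.
for (int k = 1; k <= d; k++) {
    int diff = (1 << k) - 1; // smallest mask with exactly k set bits
    while (diff <= 0xFFFFFF) {
        if (hashes[key ^ diff]) {
            // do something with it
        }
        // Kept outside the if so the loop advances even when nothing is found.
        diff = nextBitPermutation(diff);
    }
}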
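Putting the pieces together, a self-contained sketch of the lookup might look as follows. It assumes 24-bit hashes and a boolean[] presence table; the class name, findSimilar and its parameters are hypothetical, and this is one reading of the answer rather than a tested implementation from it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: class, method and parameter names are hypothetical.
final class HammingNeighbourSearch {
    static final int HASH_BITS = 24;                  // assumption: 6 hex digits
    static final int MAX_HASH = (1 << HASH_BITS) - 1; // 0xFFFFFF

    /** Next larger integer with the same number of set bits as v (v must be non-zero). */
    static int nextBitPermutation(int v) {
        int t = v | (v - 1);
        return (t + 1) | (((~t & -~t) - 1) >>> (Integer.numberOfTrailingZeros(v) + 1));
    }

    /** Returns every stored hash that differs from key in at most maxDistance bits. */
    static List<Integer> findSimilar(boolean[] present, int key, int maxDistance) {
        List<Integer> result = new ArrayList<>();
        if (present[key]) {
            result.add(key); // distance 0: the key itself
        }
        for (int k = 1; k <= maxDistance; k++) {
            // Walk all masks with exactly k set bits that fit in 24 bits.
            for (int diff = (1 << k) - 1; diff <= MAX_HASH; diff = nextBitPermutation(diff)) {
                int candidate = key ^ diff; // flips exactly k bits of key
                if (present[candidate]) {
                    result.add(candidate);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        boolean[] present = new boolean[1 << HASH_BITS];
        present[0xffffff] = true;
        present[0xfffff0] = true;
        present[0x000000] = true;
        // "Less than 10 differing bits" corresponds to maxDistance = 9.
        for (int h : findSimilar(present, 0xffffff, 9)) {
            System.out.printf("%06x%n", h);
        }
    }
}
```

With maxDistance = 9 the loops visit 2,579,130 indexes per query (the sum of the binomial coefficients C(24, k) for k from 0 to 9), compared with 8 million comparisons for a full scan. The saving shrinks quickly as the allowed distance grows.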
