为所有字谜生成相同的唯一哈希码 [英] Generate same unique hash code for all anagrams

查看:19
本文介绍了为所有字谜生成相同的唯一哈希码的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

最近,我参加了一个面试,遇到了一个关于哈希冲突的好问题.

Recently, I attended an interview and faced a good question regarding hash collisions.

问题:给定一个字符串列表,将字谜一起打印出来.

Question : Given a list of strings, print out the anagrams together.

示例:

i/p:         {行为,上帝,动物,狗,猫}

o/p :           行为,猫,狗,上帝

Example :

i/p :              {act, god, animal, dog, cat}

o/p :                 act, cat, dog, god


我想创建 hashmap 并将单词作为键和值作为字谜列表


I want to create hashmap and put the word as key and value as list of anagrams

为了避免冲突,我想为字谜生成唯一的哈希码,而不是排序并使用排序后的单词作为关键字.

To avoid collision, I want to generate unique hash code for anagrams instead of sorting and using the sorted word as key.

我正在寻找除使用链接之外的处理冲突的哈希算法.我希望算法为 act 和 cat 生成相同的哈希码......以便它将下一个单词添加到值列表中

I am looking for hash algorithm which take care of collision other than using chaining. I want algorithm to generate same hash code for both act and cat... so that it will add next word to the value list

谁能推荐一个好的算法?

Can anyone suggest a good algorithm ?

推荐答案

对排序后的字符串进行散列非常好,我可能会这样做,但它确实可能很慢而且很麻烦.这是另一个想法,不确定它是否有效 - 选择一组素数,只要你喜欢,大小与你的字符集相同,然后构建一个从你的字符到它的快速映射函数.然后对于给定的单词,将每个字符映射到匹配的素数,然后相乘.最后,使用结果散列.

Hashing with the sorted string is pretty nice, i'd have done that probably, but it could indeed be slow and cumbersome. Here's another thought, not sure if it works - pick a set of prime numbers, as small as you like, the same size as your character set, and build a fast mapping function from your chars to that. Then for a given word, map each character into the matching prime, and multiply. finally, hash using the result.

这与 Heuster 建议的非常相似,只是碰撞较少(实际上,我相信不会有虚假碰撞,因为任何数的素数分解都是唯一的).

This is very similar to what Heuster suggested, only with less collisions (actually, I believe there will be no false collisions, given the uniqueness of the prime decomposition of any number).

简单例如-

int primes[] = {2, 3, 5, 7, ...} // can be auto generated with a simple code

inline int prime_map(char c) {
    // check c is in legal char set bounds
    return primes[c - first_char];
}

...
char* word = get_next_word();
char* ptr = word;
int key = 1;
while (*ptr != NULL) {
    key *= prime_map(*ptr);
    ptr++;
}
hash[key].add_to_list(word); 

关于唯一性的几句话 - 任何整数都可以分解为素数的乘法,因此给定散列中的整数键,您实际上可以重建所有可能的散列到它的字符串,并且只有这些词.只需分解为素数 p1^n1*p2^n2*... 并将每个素数转换为匹配的字符.p1 的字符将出现 n1 次,依此类推.你不能得到任何你没有明确使用的新素数,作为素数意味着你不能通过任何其他素数的乘法得到它.

A few words about the uniqueness - any integer number has a single breakdown to multiplications of primes, so given an integer key in the hash you can actually reconstruct all possible strings that would hash to it, and only these words. Just break into primes, p1^n1*p2^n2*... and convert each prime to the matching char. the char for p1 would appear n1 times, and so on. You can't get any new prime you didn't explicitly used, being prime means you can't get it by any multiplication of other primes.

这带来了另一个可能的改进——如果你可以构造字符串,你只需要标记你在填充哈希时看到的排列.由于排列可以按字典顺序排列,您可以用数字替换每个排列.这将节省在散列中存储实际字符串的空间,但需要更多的计算,因此它不一定是一个好的设计选择.尽管如此,它还是使最初的面试问题变得很复杂:)

This brings another possible improvement - if you can construct the string, you just need to mark the permutations you saw when populating the hash. since the permutations can be ordered by lexicographic order, you can replace each one with a number. This would save the space of storing the actual strings in the hash, but would require more computations so it's not necessarily a good design choice. Still, it makes a nice complication of the original question for interviews :)

这篇关于为所有字谜生成相同的唯一哈希码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆