Anagrams - Hashing with chaining and probing in C
Question
My title got edited, so I want to make sure everyone knows this is homework. The assignment is only to optimize the program; the hashing is my idea.
-
I'm working on optimizing a C program that groups together words that are anagrams of each other and then prints them out.
Currently the program is basically a linked list of linked lists. Each link in the outer list is a group of words that are anagrams of each other.
The profile for the program shows that by far the largest portion of execution time is spent in the function wordLookup. This is because it has to search every node, and with a possible 100k words read in from a file, this can take a very long time. For instance, here is the gprof output for reading in 40k words:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
100.31      1.48     1.48    40000    37.12    37.12  wordLookup
  0.00      1.48     0.00    78235     0.00     0.00  newnode
  0.00      1.48     0.00    40000     0.00     0.00  sort_string
  0.00      1.48     0.00    38235     0.00     0.00  wordInsert
  0.00      1.48     0.00     1996     0.00     0.00  swap_words
  0.00      1.48     0.00     1765     0.00     0.00  wordAppend
My idea for making this faster is to change the data structure to a hash table that chains all words that are anagrams of each other in the same slot.
Based on things my professor has said and things that I've read here, I'm thinking of something like this for my hash function. (Note: the primes are assigned so that the most frequently used letters get small values and the least used letters get large ones.)
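#include <stddef.h>

/* Primes indexed by letter ('a'..'z'); common letters get small primes. */
static const unsigned long long alpha_primes[26] = {
    5, 71, 37, 29, 2, 53, 59, 19, 11, 83, 79, 31, 43,
    13, 7, 67, 97, 23, 17, 3, 41, 73, 47, 89, 61, 101
};

/* word: lowercase a-z only, already passed through sort_string().
   The product wraps on overflow but is still identical for anagrams. */
size_t hash(const char *word, size_t tablesize) {
    unsigned long long h = 1;
    for (; *word != '\0'; word++)
        h *= alpha_primes[*word - 'a'];
    return (size_t)(h % tablesize);
}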
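Is there a hash table size for this problem that will distribute the values properly, so that each group of anagrams gets a unique index in the table?
If that is not possible, then should I:
- Chain the words together in the table (a list of lists)
- Use a probing (linear or quadratic) solution
- For either case, what are the upsides/downsides compared to the other?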
Recommended answer
There is no way to guarantee that hashes will be unique. The probability of a collision can be calculated via the birthday problem, and your best bet is to minimize it.
The probability that at least two groups hash to the same value can be approximated as 1 - e^(-k(k-1)/(2n)), where k is the total number of groups you have (roughly the same as your word count) and n is the size of your hash's value space (2^b for a b-bit hash).
My dictionary has roughly 100,000 words, making a 32-bit hash very good (about 2% collisions). However, a hash table with one slot per 32-bit value would use at least 4 GB of RAM. Using a smaller table means more collisions. Chaining or probing won't make a huge difference in time.
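A minimal sketch of the chaining variant with a small table might look like the following. Everything here is an assumption for illustration: the names (anagram_insert, struct group, TABLE_SIZE), the table size, and the djb2-style string hash on the sorted letters, which is swapped in for the prime product from the question. Colliding groups share a slot and are told apart by comparing their sorted keys.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 1024  /* hypothetical size; tune to the word count */

struct word  { char *text; struct word *next; };
struct group { char *key; struct word *words; size_t count; struct group *next; };

static struct group *table[TABLE_SIZE];  /* each slot chains groups */

static void *xmalloc(size_t n) {
    void *p = malloc(n);
    if (p == NULL) abort();  /* sketch: no recovery on allocation failure */
    return p;
}

static int cmp_char(const void *a, const void *b) {
    return *(const unsigned char *)a - *(const unsigned char *)b;
}

/* Returns a copy of s with its letters sorted; identical for all anagrams. */
static char *sorted_copy(const char *s) {
    size_t n = strlen(s);
    char *k = xmalloc(n + 1);
    memcpy(k, s, n + 1);
    qsort(k, n, 1, cmp_char);
    return k;
}

/* djb2-style string hash over the sorted key, reduced to a slot index. */
static size_t hash_key(const char *key) {
    unsigned long h = 5381;
    for (; *key != '\0'; key++)
        h = h * 33 + (unsigned char)*key;
    return h % TABLE_SIZE;
}

/* Adds word to its anagram group, creating the group if needed.
   Returns the group so callers can inspect or print it. */
struct group *anagram_insert(const char *word) {
    assert(word != NULL);
    char *key = sorted_copy(word);
    size_t slot = hash_key(key);

    /* Walk the chain; groups that collided differ in their keys. */
    struct group *g = table[slot];
    while (g != NULL && strcmp(g->key, key) != 0)
        g = g->next;

    if (g == NULL) {             /* first word with these letters */
        g = xmalloc(sizeof *g);
        g->key = key;
        g->words = NULL;
        g->count = 0;
        g->next = table[slot];
        table[slot] = g;
    } else {
        free(key);               /* group already owns an equal key */
    }

    struct word *w = xmalloc(sizeof *w);
    w->text = xmalloc(strlen(word) + 1);
    strcpy(w->text, word);
    w->next = g->words;          /* prepend to the group's word list */
    g->words = w;
    g->count++;
    return g;
}
```

With this layout a lookup only scans the groups in one slot instead of every group, so its cost depends on the chain length rather than on the total number of groups. A probing table would store the groups in the array itself and step to another slot on a key mismatch; the key comparison stays the same.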
As recommended in the comments on your question, a trie will end up as a smaller data structure overall.
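One way a trie could be used here is to key it on each word's letters in sorted order, so that all anagrams end at the same node. The sketch below is only an illustration under assumptions: the names (trie_insert, struct trie_node) are invented, input is limited to lowercase a-z, and the fixed 26-pointer node layout is the simplest choice rather than the most compact one, so actual memory use depends on how nodes are represented.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct trie_word { char *text; struct trie_word *next; };

struct trie_node {
    struct trie_node *child[26];  /* one branch per letter 'a'..'z' */
    struct trie_word *words;      /* anagrams whose sorted letters end here */
    size_t count;
};

/* Zeroed allocation; relies on all-zero bytes reading back as NULL
   pointers, which holds on mainstream platforms. */
static void *xcalloc(size_t n, size_t size) {
    void *p = calloc(n, size);
    if (p == NULL) abort();  /* sketch: no recovery on allocation failure */
    return p;
}

/* Inserts word along the path spelled by its letters in sorted order.
   Returns the node that holds the word's anagram group. */
struct trie_node *trie_insert(struct trie_node *root, const char *word) {
    size_t freq[26] = {0};

    /* Counting the letters gives the sorted order without a sort call. */
    for (const char *p = word; *p != '\0'; p++) {
        assert(*p >= 'a' && *p <= 'z');
        freq[*p - 'a']++;
    }

    struct trie_node *node = root;
    for (int c = 0; c < 26; c++) {
        for (size_t i = 0; i < freq[c]; i++) {
            if (node->child[c] == NULL)
                node->child[c] = xcalloc(1, sizeof *node);
            node = node->child[c];
        }
    }

    struct trie_word *w = xcalloc(1, sizeof *w);
    w->text = xcalloc(strlen(word) + 1, 1);
    strcpy(w->text, word);
    w->next = node->words;        /* prepend to this node's word list */
    node->words = w;
    node->count++;
    return node;
}
```

Starting from a zero-initialized root (struct trie_node root = {0};), inserting "listen", "silent" and "enlist" returns the same node each time, while "google" ends at a different one. Printing the groups is then a walk over all nodes with a non-empty word list.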