Basic Hashtable algorithm - removing duplicates


Problem description

I just had an interview this morning and I was given the question "Give an algorithm for removing duplicates from a list of integers". This is a fairly standard question so I was pretty confident I could answer it.

I am paraphrasing, but I said something along the lines of "You could use a hashtable. Start with the first integer and insert it into the hashtable. Then for each successive integer do a hashtable lookup to check if the integer is already in the hashtable; if not, insert it; if it's already there, throw it away because it is a duplicate. Iterate through the list in this way. If the hashtable is designed correctly, the lookups and inserts should be constant time on average."
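
A minimal sketch of that approach in Java, using HashSet (a hash-table-backed set); the class and method names here are just illustrative:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RemoveDuplicates {
    // Single pass over the input: each integer is looked up and inserted
    // at most once, so with expected O(1) set operations the whole
    // algorithm runs in expected O(n) time.
    public static List<Integer> removeDuplicates(List<Integer> input) {
        Set<Integer> seen = new HashSet<>();
        List<Integer> result = new ArrayList<>();
        for (int x : input) {
            // add() returns false when x is already in the set,
            // which is exactly the "throw it away" case.
            if (seen.add(x)) {
                result.add(x);
            }
        }
        return result;
    }
}
```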

Then the interviewer responded (again I am paraphrasing) "But hashtable lookups are not constant time, they depend on how many elements are already in it. The algorithm you described would be O(n^2)"

Then I responded "Really? I thought that if you designed a good hash-function, it would be constant time? Making it O(n) typically"

Then the interviewer responded "So you are saying that the lookup time would be the same for a hash table with many entries and a hashtable with few entries"

Then I said "Yes. If it is designed correctly."

Then the interviewer said "This is not true"

So I am very confused right now. If someone could point out where I am wrong, I would be very grateful.

Answer

If someone could point out where I am wrong

You are not wrong at all: a properly designed hash table gives you expected lookups in O(1) and inserts in amortized O(1), so your algorithm is O(N). Lookups in a heavily loaded hash table are indeed a little slower because of collision resolution, but the expected lookup time is still O(1). This may not be good enough for real-time systems, where "amortized" does not count, but in all practical situations it is enough.
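
To make the "amortized" point concrete, here is a toy chained hash set (illustrative only, not any standard library's API): the table doubles whenever the load factor crosses a threshold, which keeps the expected chain length constant, and the occasional O(size) rehash averages out to O(1) per insert.

```java
import java.util.LinkedList;

public class ToyIntHashSet {
    private LinkedList<Integer>[] buckets = newTable(16);
    private int size = 0;

    @SuppressWarnings("unchecked")
    private static LinkedList<Integer>[] newTable(int capacity) {
        LinkedList<Integer>[] table = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) {
            table[i] = new LinkedList<>();
        }
        return table;
    }

    private static int indexFor(int x, int capacity) {
        return Math.floorMod(Integer.hashCode(x), capacity);
    }

    public boolean contains(int x) {
        // Expected O(1): with the load factor kept below a constant,
        // the chain at this index has O(1) expected length.
        return buckets[indexFor(x, buckets.length)].contains(x);
    }

    public boolean add(int x) {
        if (contains(x)) return false;
        // Keep load factor <= 0.75 by doubling the table.
        if (size + 1 > buckets.length * 3 / 4) resize();
        buckets[indexFor(x, buckets.length)].add(x);
        size++;
        return true;
    }

    private void resize() {
        // Rehashing everything costs O(size), but the capacity doubles
        // each time, so the cost amortizes to O(1) per insert: this is
        // the "amortized" part of the answer above.
        LinkedList<Integer>[] old = buckets;
        buckets = newTable(old.length * 2);
        for (LinkedList<Integer> chain : old) {
            for (int x : chain) {
                buckets[indexFor(x, buckets.length)].add(x);
            }
        }
    }
}
```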

Of course you could always use a balanced tree for the items you've seen, giving a worst-case O(N*logN) algorithm; or, if the numbers have reasonable bounds (say, between 0 and 100,000), you could use a boolean array to test membership in O(1) worst case, which is also a potential improvement over a hash table because of its smaller constant multiplier.
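
A sketch of that boolean-array variant, assuming the stated bounds (0 to 100,000 inclusive) hold for every input value; the bound constant and class name are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class RemoveDuplicatesBounded {
    // Assumed bound, taken from the answer's example range of 0 to 100,000.
    private static final int BOUND = 100_001;

    public static List<Integer> removeDuplicates(List<Integer> input) {
        boolean[] seen = new boolean[BOUND]; // O(BOUND) space, O(1) worst-case lookup
        List<Integer> result = new ArrayList<>();
        for (int x : input) {
            if (!seen[x]) { // assumes 0 <= x < BOUND
                seen[x] = true;
                result.add(x);
            }
        }
        return result;
    }
}
```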

