找到十大搜索字词的算法 [英] Algorithm to find top 10 search terms
问题描述
你被要求设计一些软件,以在Google上不断显示前10个搜索字词,您可以访问一个可以在Google上搜索的搜索词无尽的实时流的资料,描述您将要使用什么算法和数据结构用于实现这一点,您将设计两种变体:
(i)显示所有时间的前10个搜索字词(即自您开始阅读Feed以来) / p>
(ii)仅显示过去一个月的前10个搜索字词,每小时更新一次。
您可以使用获得前十名单的近似值,但您必须证明您的选择。
我在这次采访中遭到轰炸,仍然不知如何实现这一点。
第一部分要求在无限列表中不断增长的子序列中的10个最常用项目。我研究了选择算法,但找不到任何在线版本来解决这个问题。
第二部分使用有限列表,但由于正在处理大量数据,您无法真正将整个月的搜索字词存储在内存中并计算每小时一个柱状图
由于前10个列表不断更新,所以问题变得更加困难,所以您需要在滑动窗口中计算前10名。 / p>
任何想法?
嗯,看起来很糟糕数据,存储所有频率可能是非常昂贵的成本。 当数据量如此之大时,我们无法希望将其全部存储,我们输入数据流算法的域。
在这方面有用的书:
Muthukrishnan - Data Streams:Algorithms and Applications / a>
从上面我选择的问题的密切相关的引用:
Manku,Motwani - 数据流近似频率计数[pdf]
顺便说一下,斯坦福大学的Motwani(编辑)是非常重要的作者,作者是http://rads.stackoverflow.com/amzn/点击/ 0521474655rel =noreferrer>随机算法s书。 呃,好的面试问题。 I'm currently preparing for an interview, and it reminded me of a question I was once asked in a previous interview that went something like this: "You have been asked to design some software to continuously display the top 10 search terms on Google. You are given access to a feed that provides an endless real-time stream of search terms currently being searched on Google. Describe what algorithm and data structures you would use to implement this. You are to design two variations: (i) Display the top 10 search terms of all time (i.e. since you started reading the feed). (ii) Display only the top 10 search terms for the past month, updated hourly. You can use an approximation to obtain the top 10 list, but you must justify your choices."
The first part asks for the 10 most frequent items in a continuously growing sub-sequence of an infinite list. I looked into selection algorithms, but couldn't find any online versions to solve this problem. The second part uses a finite list, but due to the large amount of data being processed, you can't really store the whole month of search terms in memory and calculate a histogram every hour. The problem is made more difficult by the fact that the top 10 list is being continuously updated, so somehow you need to be calculating your top 10 over a sliding window. Any ideas? Well, looks like an awful lot of data, with a perhaps prohibitive cost to store all frequencies. When the amount of data is so large that we cannot hope to store it all, we enter the domain of data stream algorithms. Useful book in this area:
Muthukrishnan - "Data Streams: Algorithms and Applications" Closely related reference to the problem at hand which I picked from the above:
Manku, Motwani - "Approximate Frequency Counts over Data Streams" [pdf] By the way, Motwani, of Stanford, (edit) was an author of the very important "Randomized Algorithms" book. Heh, nice interview question. 这篇关于找到十大搜索字词的算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋! 本书第11章涉及此问题 。 编辑:对不起,不好的参考,那个特定的章节是一个不同的问题。检查后,我建议
$
I bombed in this interview and still have really no idea how to implement this. The 11th chapter of this book deals with this problem. Edit: Sorry, bad reference, that particular chapter is on a different problem. After checking, I instead recommend section 5.1.2 of Muthukrishnan's book, available online.