找到十大搜索字词的算法 [英] Algorithm to find top 10 search terms

查看:99
本文介绍了找到十大搜索字词的算法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在准备面试,这让我想起了在以前的一次面试中曾经问过的一个问题:



你被要求设计一些软件,以在Google上不断显示前10个搜索字词,您可以访问一个可以在Google上搜索的搜索词无尽的实时流的资料,描述您将要使用什么算法和数据结构用于实现这一点,您将设计两种变体:



(i)显示所有时间的前10个搜索字词(即自您开始阅读Feed以来) / p>

(ii)仅显示过去一个月的前10个搜索字词,每小时更新一次。



您可以使用获得前十名单的近似值,但您必须证明您的选择。

我在这次采访中遭到轰炸,仍然不知如何实现这一点。



第一部分要求在无限列表中不断增长的子序列中的10个最常用项目。我研究了选择算法,但找不到任何在线版本来解决这个问题。



第二部分使用有限列表,但由于正在处理大量数据,您无法真正将整个月的搜索字词存储在内存中并计算每小时一个柱状图



由于前10个列表不断更新,所以问题变得更加困难,所以您需要在滑动窗口中计算前10名。 / p>

任何想法?

解决方案

嗯,看起来很糟糕数据,存储所有频率可能是非常昂贵的成本。 当数据量如此之大时,我们无法希望将其全部存储,我们输入数据流算法的域。



在这方面有用的书:
Muthukrishnan - Data Streams:Algorithms and Applications / a>



从上面我选择的问题的密切相关的引用:
Manku,Motwani - 数据流近似频率计数[pdf]



顺便说一下,斯坦福大学的Motwani(编辑)是非常重要的作者,作者是http://rads.stackoverflow.com/amzn/点击/ 0521474655rel =noreferrer>随机算法s书。 本书第11章涉及此问题 编辑:对不起,不好的参考,那个特定的章节是一个不同的问题。检查后,我建议
$

呃,好的面试问题。


I'm currently preparing for an interview, and it reminded me of a question I was once asked in a previous interview that went something like this:

"You have been asked to design some software to continuously display the top 10 search terms on Google. You are given access to a feed that provides an endless real-time stream of search terms currently being searched on Google. Describe what algorithm and data structures you would use to implement this. You are to design two variations:

(i) Display the top 10 search terms of all time (i.e. since you started reading the feed).

(ii) Display only the top 10 search terms for the past month, updated hourly.

You can use an approximation to obtain the top 10 list, but you must justify your choices."
I bombed in this interview and still have really no idea how to implement this.

The first part asks for the 10 most frequent items in a continuously growing sub-sequence of an infinite list. I looked into selection algorithms, but couldn't find any online versions to solve this problem.

The second part uses a finite list, but due to the large amount of data being processed, you can't really store the whole month of search terms in memory and calculate a histogram every hour.

The problem is made more difficult by the fact that the top 10 list is being continuously updated, so somehow you need to be calculating your top 10 over a sliding window.

Any ideas?

解决方案

Well, looks like an awful lot of data, with a perhaps prohibitive cost to store all frequencies. When the amount of data is so large that we cannot hope to store it all, we enter the domain of data stream algorithms.

Useful book in this area: Muthukrishnan - "Data Streams: Algorithms and Applications"

Closely related reference to the problem at hand which I picked from the above: Manku, Motwani - "Approximate Frequency Counts over Data Streams" [pdf]

By the way, Motwani, of Stanford, (edit) was an author of the very important "Randomized Algorithms" book. The 11th chapter of this book deals with this problem. Edit: Sorry, bad reference, that particular chapter is on a different problem. After checking, I instead recommend section 5.1.2 of Muthukrishnan's book, available online.

Heh, nice interview question.

这篇关于找到十大搜索字词的算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆