算法找到前10个搜索词 [英] Algorithm to find top 10 search terms

查看:200
本文介绍了算法找到前10个搜索词的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前preparing接受记者采访时,并想起一个问题,我曾经问了一句这样的previous采访我:

I'm currently preparing for an interview, and it reminded me of a question I was once asked in a previous interview that went something like this:

你被要求设计一些软件不断地显示在谷歌排名前10位的搜索条件。您将获得访问饲料,可提供当前正在搜索在谷歌的搜索字词无尽的实时流。说明什么算法以及数据结构,你会用它来实现这一点,你要设计两个变化:

"You have been asked to design some software to continuously display the top 10 search terms on Google. You are given access to a feed that provides an endless real-time stream of search terms currently being searched on Google. Describe what algorithm and data structures you would use to implement this. You are to design two variations:

(I)显示的所有时间前10名搜索词(即从你开始阅读的饲料)。

(i) Display the top 10 search terms of all time (i.e. since you started reading the feed).

(二)只显示前10个搜索条件在过去一个月,每小时更新一次。

(ii) Display only the top 10 search terms for the past month, updated hourly.

您可以使用近似获得前10名,但你必须证明你的选择。
我轰炸了这次采访,而且还有真的不知道如何实现这一点。

You can use an approximation to obtain the top 10 list, but you must justify your choices."
I bombed in this interview and still have really no idea how to implement this.

第一部分要求在一个无限列表中不断增长的子序列的10个最常见的物品。我看着选择算法,但找不到任何网上的版本来解决这个问题。

The first part asks for the 10 most frequent items in a continuously growing sub-sequence of an infinite list. I looked into selection algorithms, but couldn't find any online versions to solve this problem.

第二部分采用的是有限的名单,但由于正在处理的数据量很大,你不能真正全月的搜索字词存储在内存中,每隔一小时计算的直方图。

The second part uses a finite list, but due to the large amount of data being processed, you can't really store the whole month of search terms in memory and calculate a histogram every hour.

您需要的问题变得更加困难的是,排名前10位名单不断更新,所以在某种程度上要计算你的前10名在一个滑动窗口。

The problem is made more difficult by the fact that the top 10 list is being continuously updated, so somehow you need to be calculating your top 10 over a sliding window.

任何想法?

推荐答案

嗯,看起来像一个可怕的大量数据,有可能是成本过高来存储所有频率。 :当数据量是如此之大,我们不能希望保存这一切,我们进入数据流算法的域

Well, looks like an awful lot of data, with a perhaps prohibitive cost to store all frequencies. When the amount of data is so large that we cannot hope to store it all, we enter the domain of data stream algorithms.

在这方面有用的书: <一href="http://books.google.com/books?id=415loiMd_c0C&lpg=PP1&dq=Data%20Streams%3a%20Algorithms%20and%20Application&hl=el&pg=PP1#v=onepage&q=Data%20Streams%3A%20Algorithms%20and%20Application&f=false">Muthukrishnan - 数据流:算法及应用

Useful book in this area: Muthukrishnan - "Data Streams: Algorithms and Applications"

密切相关的参考手头这是我从上面挑问题: <一href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.8594&rep=rep1&type=pdf">Manku, Motwani - 近似频率计数的数据流[PDF]

Closely related reference to the problem at hand which I picked from the above: Manku, Motwani - "Approximate Frequency Counts over Data Streams" [pdf]

顺便说一句,Motwani斯坦福大学(编辑)的,是非常重要的随机算法<的作者/一>的书。 <击> 这本书涉及这一问题的第11章 修改对不起,错误引用,特定的章是一个不同的问题。检查完毕后,我反而建议<一href="http://books.google.com/books?id=415loiMd_c0C&lpg=PP1&dq=Data%20Streams%3A%20Algorithms%20and%20Application&hl=el&pg=PA33#v=onepage&q&f=false">section 5.1.2的 Muthukrishnan的的书,可在网上。

By the way, Motwani, of Stanford, (edit) was an author of the very important "Randomized Algorithms" book. The 11th chapter of this book deals with this problem. Sorry, bad reference, that particular chapter is on a different problem. After checking, I instead recommend section 5.1.2 of Muthukrishnan's book, available online.

嘿,漂亮的面试问题。

Heh, nice interview question.

这篇关于算法找到前10个搜索词的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆