查找十大最频繁访问的URL,数据存储翻过网络 [英] Find Top 10 Most Frequent visited URl , data is stored accross network
问题描述
来源:谷歌面试问题
由于大型网络电脑,访问过的URL的每个饲养日志文件,寻找十大最常访问的网址。
Given a large network of computers, each keeping log files of visited urls, find the top ten most visited URLs.
有很多大<字符串(URL) - > INT(访问)>图
。
计算<字符串(URL) - > INT(所有分布图中访问的总和)
,并获得合并后的地图前十名。
Calculate < string (url) -> int (sum of visits among all distributed maps)
, and get the top ten in the combined map.
主要制约因素:地图是太大,通过网络传输。也不能使用麻preduce直接
我现在已经走到翻过这一类型,其中processiong必须在大型分布式系统做了相当多的问题。我想不出或找到合适的答案。
I have now come accross quite a few questions of this type ,where processiong needs to be done over large Distributed systems. I cant think or find a suitable answer .
所有我能想到的是蛮力,这在一些或其他方式,违反了给定约束。的
任何想法/方法的人吗?
Any ideas/approaches people ?
推荐答案
它说,你不能使用的map-reduce的直接的是一个提示问题的作者想你想怎么地图减少的作品,所以我们只会模仿的动作的map-reduce
It says you can't use map-reduce directly which is a hint the author of the question wants you to think how map reduce works, so we will just mimic the actions of map-reduce:
- pre-处理:令R为集群服务器的数量,给每个 从0,1,2服务器的唯一ID,...,R-1
- (图)对于每一个(串号) - 发送元组到具有该ID的服务器
哈希(字符串)%R
- (减少)一旦第2步完成(简单的控制通信),产生
(字符串,计数)
每个服务器排名前10位的字符串。需要注意的是,其中那些在步骤2到这个特定的服务器发送的元组。 - (图)每个服务器将发送所有的排名前10位为1的服务器(让它成为服务器0)。它应该被罚款,只有10 *这些记录R上。
- (减少)服务器0将产生前10名在网络上。
- pre-processing: let R be the number of servers in cluster, give each server unique id from 0,1,2,...,R-1
- (map) For each (string,id) - send the tuple to the server which has the id
hash(string) % R
. - (reduce) Once step 2 is done (simple control communication), produce the
(string,count)
of the top 10 strings per server. Note that the tuples where those sent in step2 to this particular server. - (map) Each server will send all his top 10 to 1 server (let it be server 0). It should be fine, there are only 10*R of those records.
- (reduce) Server 0 will yield the top 10 across the network.
注:
- 与算法的问题,最喜欢的大数据算法 不要使用框架正在处理发生故障的服务器。马preduce花费 照顾它给你。
- 在上面的算法可以转化为一个2阶段的map-reduce算法pretty的直线前进。
- The problem with the algorithm, like most big-data algorithms that don't use frameworks is handling failing servers. MapReduce takes care of it for you.
- The above algorithm can be translated to a 2 phases map-reduce algorithm pretty straight forward.
这篇关于查找十大最频繁访问的URL,数据存储翻过网络的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!